36:[["$","audio",null,{"id":"tts"}],["$","$L3b",null,{"paperID":"2405.18795","publisher":"arxiv","paperJSON":{"title":"Federated Q-Learning with Reference-Advantage Decomposition: Almost Optimal Regret and Logarithmic Communication Cost","paperID":"2405.18795","avgLineHeight":10.96,"imgScale":4,"sections":[{"heading":"ABSTRACT","paragraphs":[[{"text":"In this paper, we consider model-free federated reinforcement learning for tabular episodic Markov decision processes. Under the coordination of a central server, multiple agents collaboratively explore the environment and learn an optimal policy without sharing their raw data. Despite recent advances in federated ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":"-learning algorithms achieving near-linear regret speedup with low communication cost, existing algorithms only attain suboptimal regrets compared to the information bound. We propose a novel model-free federated ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":"-learning algorithm, termed FedQ-Advantage. Our algorithm leverages reference-advantage decomposition for variance reduction and adopts three novel designs: separate event-triggered communication and policy switching, heterogeneous communication triggering conditions, and optional forced synchronization. 
We prove that our algorithm not only requires a lower logarithmic communication cost but also achieves an almost optimal regret, reaching the information bound up to a logarithmic factor and near-linear regret speedup compared to its single-agent counterpart when the time horizon is sufficiently large.","element":"span"}]]},{"heading":"1 INTRODUCTION","paragraphs":[[{"text":"Federated reinforcement learning (FRL) is a distributed learning framework that combines the principles of reinforcement learning (RL) ","element":"span"},{"href":"#id-0","referenceIndex":58,"text":"(Sutton & Barto, ","element":"a"},{"href":"#id-0","referenceIndex":58,"text":"2018) ","element":"a"},{"text":"and federated learning (FL) ","element":"span"},{"href":"#id-1","referenceIndex":46,"text":"(McMahan et al., ","element":"a"},{"href":"#id-1","referenceIndex":46,"text":"2017)","element":"a"},{"text":". Focusing on sequential decision-making, FRL aims to learn an optimal policy through parallel explorations by multiple agents under the coordination of a central server. Often modeled as a Markov decision process (MDP), multiple agents independently interact with an initially unknown environment and collaboratively train their decision-making models with limited information exchange between the agents. This approach accelerates the learning process with low communication costs. Some model-based algorithms (e.g., ","element":"span"},{"href":"#id-2","referenceIndex":12,"text":"Chen et al. ","element":"a"},{"href":"#id-2","referenceIndex":12,"text":"(2023)","element":"a"},{"text":") and policy-based algorithms (e.g., ","element":"span"},{"href":"#id-3","referenceIndex":22,"text":"Fan et al. ","element":"a"},{"href":"#id-3","referenceIndex":22,"text":"(2021)","element":"a"},{"text":") have shown speedup with respect to the number of agents in terms of learning regret or convergence rate. 
Recent progress has been made in FRL algorithms based on model-free value-based approaches, which directly learn the value functions and the optimal policy without estimating the underlying model (e.g., ","element":"span"},{"href":"#id-4","referenceIndex":65,"text":"Woo et al. ","element":"a"},{"href":"#id-4","referenceIndex":65,"text":"(2023)","element":"a"},{"text":"). However, limiting our focus to tabular MDPs, most existing model-free federated algorithms do not actively update the exploration policies for local agents and fail to provide low regret. A comprehensive literature review is provided in Appendix ","element":"span"},{"href":"#id-5","referenceIndex":114,"text":"B.","element":"a"}],[{"text":"1.1 ","element":"span"},{"text":"F","element":"span"},{"text":"EDERATED ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":"-","element":"span"},{"text":"LEARNING","element":"span"},{"text":": ","element":"span"},{"text":"PRIOR WORKS AND LIMITATIONS","element":"span"}],[{"text":"In this paper, we focus on model-free FRL based on the classic ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":"-learning algorithm ","element":"span"},{"href":"#id-6","referenceIndex":64,"text":"(Watkins, ","element":"a"},{"href":"#id-6","referenceIndex":64,"text":"1989)","element":"a"},{"text":", tailored for episodic tabular MDPs with inhomogeneous transition kernels. Specifically, we assume the presence of a central server and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"local agents in the system. 
Each agent interacts independently with an episodic MDP consisting of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"S ","element":"span"},{"text":"states, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"text":"actions, and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H ","element":"span"},{"text":"steps per episode.","element":"span"}],[{"text":"Let ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"denote the number of steps for each agent. Under the single-agent setting of the episodic MDP, ","element":"span"},{"href":"#id-7","referenceIndex":18,"text":"Domingues et al. ","element":"a"},{"href":"#id-7","referenceIndex":18,"text":"(2021) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-8","referenceIndex":30,"text":"Jin et al. ","element":"a"},{"href":"#id-8","referenceIndex":30,"text":"(2018) ","element":"a"},{"text":"established a lower bound for the expected total regret of ","element":"span"},{"style":{"height":18.75},"width":233.17,"height":46.87,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/1-0.png","element":"img","alt":" Ω(√H2SAT)","inline":true},{"text":". An algorithm is considered almost optimal when it achieves a regret upper bound of ","element":"span"},{"style":{"height":18.83},"width":249.4,"height":47.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/1-1.png","element":"img","alt":"˜O(√H2SAT)1 ","inline":true,"padRight":true},{"text":"for large values of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":". Multiple model-based algorithms (e.g., ","element":"span"},{"href":"#id-9","referenceIndex":80,"text":"Zhang et al. ","element":"a"},{"href":"#id-9","referenceIndex":80,"text":"(2023)","element":"a"},{"text":") have been shown to be almost optimal. 
Research on provably efficient model-free algorithms began with ","element":"span"},{"href":"#id-8","referenceIndex":30,"text":"Jin et al. ","element":"a"},{"href":"#id-8","referenceIndex":30,"text":"(2018) ","element":"a"},{"text":"and was further advanced by ","element":"span"},{"href":"#id-10","referenceIndex":8,"text":"Bai et al. ","element":"a"},{"href":"#id-10","referenceIndex":8,"text":"(2019)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-11","referenceIndex":77,"text":"Zhang et al. ","element":"a"},{"href":"#id-11","referenceIndex":77,"text":"(2020)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-12","referenceIndex":40,"text":"Li et al. ","element":"a"},{"href":"#id-12","referenceIndex":40,"text":"(2021)","element":"a"},{"text":". Specifically, ","element":"span"},{"href":"#id-11","referenceIndex":77,"text":"Zhang et al. ","element":"a"},{"href":"#id-11","referenceIndex":77,"text":"(2020)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-12","referenceIndex":40,"text":"Li et al. ","element":"a"},{"href":"#id-12","referenceIndex":40,"text":"(2021) ","element":"a"},{"text":"proposed almost optimal algorithms that utilized reference-advantage decomposition for variance reduction.","element":"span"}],[{"text":"For the federated setting, the information bound naturally translates to ","element":"span"},{"style":{"height":18.75},"width":276.18,"height":46.87,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/1-2.png","element":"img","alt":" Ω(√H2SAMT)","inline":true},{"text":", allowing us to define almost optimal federated algorithms similarly. However, the literature on federated model-free algorithms is quite limited. ","element":"span"},{"href":"#id-10","referenceIndex":8,"text":"Bai et al. ","element":"a"},{"href":"#id-10","referenceIndex":8,"text":"(2019) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-11","referenceIndex":77,"text":"Zhang et al. 
","element":"a"},{"href":"#id-11","referenceIndex":77,"text":"(2020) ","element":"a"},{"text":"proposed concurrent algorithms where multiple agents generate episodes simultaneously and share their original data with the central server. These designs achieved low policy-switching costs ","element":"span"},{"style":{"height":18.75},"width":235.9,"height":46.87,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/1-3.png","element":"img","alt":" O(√H3SAT)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":18.75},"width":235.89,"height":46.87,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/1-4.png","element":"img","alt":"O(√H2SAT)","inline":true,"padRight":true},{"text":"respectively but incurred a high communication cost of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"MT","element":"span"},{"text":")","element":"span"},{"text":". ","element":"span"},{"href":"#id-13","referenceIndex":82,"text":"Zheng et al. ","element":"a"},{"href":"#id-13","referenceIndex":82,"text":"(2024a) ","element":"a"},{"text":"proposed federated algorithms with near-linear regret speedup compared to ","element":"span"},{"href":"#id-8","referenceIndex":30,"text":"Jin et al. ","element":"a"},{"href":"#id-8","referenceIndex":30,"text":"(2018) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-10","referenceIndex":8,"text":"Bai ","element":"a"},{"href":"#id-10","referenceIndex":8,"text":"et al. ","element":"a"},{"href":"#id-10","referenceIndex":8,"text":"(2019) ","element":"a"},{"text":"and logarithmic communication cost, but they only achieved a suboptimal regret of ","element":"span"},{"style":{"height":18.83},"width":278.9,"height":47.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/1-5.png","element":"img","alt":"˜O(√MH3SAT)","inline":true},{"text":". 
This raises the following question:","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Is it possible to design an almost optimal federated model-free RL algorithm that enjoys a logarithmic communication cost?","element":"span"}],[{"text":"1.2 ","element":"span"},{"text":"S","element":"span"},{"text":"UMMARY OF OUR CONTRIBUTIONS","element":"span"}],[{"text":"We answer this question affirmatively by proposing the FedQ-Advantage algorithm, which achieves almost optimal regret with logarithmic communication cost. Our contributions are summarized below.","element":"span"}],[{"text":"• ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Algorithmic design. ","element":"span"},{"text":"In FedQ-Advantage, the server coordinates the agents by actively updating their policies, while the agents execute these policies, collect trajectories, and periodically share local aggregations with the server. We adopt upper confidence bounds (UCB) to promote exploration and use the reference-advantage decomposition when updating the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":"-function. The algorithm design incorporates the following key elements that are crucial for achieving near-optimal regret. (1) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Aligned round-wise communication and unaligned stage-wise updates. ","element":"span"},{"text":"We design a new mechanism based on event-triggered communication and policy switching, both of which are triggered when specific conditions are satisfied. This structure divides the learning process into aligned communication rounds, which are grouped into stages. These stages are unaligned across different state-action-step tuples. Communication takes place after each round, while policy switching occurs only at the end of each stage. This mechanism employs the unaligned stage design from ","element":"span"},{"href":"#id-11","referenceIndex":77,"text":"Zhang et al. 
","element":"a"},{"href":"#id-11","referenceIndex":77,"text":"(2020)","element":"a"},{"text":", which is not used in ","element":"span"},{"href":"#id-13","referenceIndex":82,"text":"Zheng et al. ","element":"a"},{"href":"#id-13","referenceIndex":82,"text":"(2024a)","element":"a"},{"text":". During communication, local agents share the round-wise aggregated sums of function values over visits to each tuple rather than entire trajectories. The central server then constructs global estimates within the reference-advantage decomposition framework, maintaining low communication costs.","element":"span"}],[{"text":"(2) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Heterogeneous event-triggered communication. ","element":"span"},{"text":"An agent terminates its exploration and requests communication in a round when the number of visits to any state-action-step tuple reaches a threshold, which guarantees sufficient exploration under restrictions. We adopt a heterogeneous design for the threshold that encourages more visits in the early rounds of a stage and limits the visits in later rounds to form desired stage renewals. This differs from the condition in ","element":"span"},{"href":"#id-13","referenceIndex":82,"text":"Zheng et al. ","element":"a"},{"href":"#id-13","referenceIndex":82,"text":"(2024a) ","element":"a"},{"text":"that always imposes strict limits.","element":"span"}],[{"text":"(3) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"An optional forced synchronization mechanism. ","element":"span"},{"text":"Under this mechanism offered by FedQ-Advantage, when one agent triggers the communication condition, the central server terminates the exploration for all agents and initiates a new round. This approach enhances robustness to heterogeneity in agents’ exploration speeds and eliminates waiting time. 
In the absence of forced synchronization, the central server waits for each agent to individually meet the communication condition, thereby reducing the number of communication rounds required.","element":"span"}],[{"id":"id-14","text":"Table 1: Comparison of regrets and communication costs for multi-agent RL algorithms.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"99%"},"width":1570,"height":426,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/2-0.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"text":": number of steps per episode; ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":": total number of steps; ","element":"span"},{"style":{"fontStyle":"italic"},"text":"S","element":"span"},{"text":": number of states; ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A","element":"span"},{"text":": number of actions; ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M","element":"span"},{"text":": number of agents. -: not discussed. ","element":"span"},{"style":{"height":10.4},"width":191.18,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/2-1.png","element":"img","alt":"fM equals M","inline":true,"padRight":true},{"text":"if the forced synchronization design is used and equals 1 otherwise.","element":"span"}],[{"text":"• ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Performance guarantees. 
","element":"span"},{"text":"FedQ-Advantage provably achieves an almost optimal regret and near-linear speedup in the number of agents compared with its single-agent counterparts ","element":"span"},{"href":"#id-11","referenceIndex":77,"text":"(Zhang ","element":"a"},{"href":"#id-11","referenceIndex":77,"text":"et al., ","element":"a"},{"href":"#id-11","referenceIndex":77,"text":"2020) ","element":"a"},{"text":"when ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"is sufficiently large. The regret bound holds regardless of whether forced synchronization is used. Its communication cost scales logarithmically with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":", outperforming the federated algorithms in ","element":"span"},{"href":"#id-13","referenceIndex":82,"text":"Zheng et al. ","element":"a"},{"href":"#id-13","referenceIndex":82,"text":"(2024a) ","element":"a"},{"text":"and matching the policy switching cost in ","element":"span"},{"href":"#id-11","referenceIndex":77,"text":"Zhang et al. ","element":"a"},{"href":"#id-11","referenceIndex":77,"text":"(2020)","element":"a"},{"text":", which is the best cost for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":"-learning in the literature. To the best of our knowledge, it is the first model-free federated RL algorithm to achieve almost optimal regret with logarithmic communication cost. We compare the regret and communication costs under multi-agent tabular episodic MDPs in Table ","element":"span"},{"href":"#id-14","text":"1. ","element":"a"},{"text":"Numerical experiments also demonstrate that FedQ-Advantage has better regret and communication cost compared to the federated algorithms in ","element":"span"},{"href":"#id-13","referenceIndex":82,"text":"Zheng et al. 
","element":"a"},{"href":"#id-13","referenceIndex":82,"text":"(2024a)","element":"a"},{"text":".","element":"span"}],[{"text":"• ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Technical novelty. ","element":"span"},{"text":"We highlight two technical contributions here. (1) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Stage-wise approximations in non-martingale analysis. ","element":"span"},{"text":"The event-triggered stage renewal presents a non-trivial challenge involving the concentration of the sum of non-martingale difference sequences. The weight assigned to each visit of a given tuple ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a, h","element":"span"},{"text":") ","element":"span"},{"text":"depends on the total number of visits between two model aggregation points, which is not causally known during the visitation. This paper proves the concentration by relating the sequence to a martingale difference sequence and bounding their stage-wise gap, which is different from ","element":"span"},{"href":"#id-4","referenceIndex":65,"text":"Woo et al. ","element":"a"},{"href":"#id-4","referenceIndex":65,"text":"(2023) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-15","referenceIndex":66,"text":"Woo et al. ","element":"a"},{"href":"#id-15","referenceIndex":66,"text":"(2024) ","element":"a"},{"text":"that used static behavior policies and bound similar gaps element-wisely. Our approach does not rely on a stationary visiting probability or the estimation of visiting numbers. (2) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Heterogeneous triggering conditions for synchronization. 
","element":"span"},{"text":"For different rounds (of synchronization) in a given stage (of policy update), we use different triggering conditions that allow more visits of a tuple ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a, h","element":"span"},{"text":") ","element":"span"},{"text":"in early rounds. This reduces the number of synchronizations within a stage to ","element":"span"},{"style":{"height":16},"width":219.01,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/2-2.png","element":"img","alt":" O(fM log H)","inline":true,"padRight":true},{"text":"from ","element":"span"},{"style":{"height":16},"width":154.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/2-3.png","element":"img","alt":"O(fMH)","inline":true},{"text":", which would occur under homogeneous triggering conditions in ","element":"span"},{"href":"#id-13","referenceIndex":82,"text":"Zheng et al. ","element":"a"},{"href":"#id-13","referenceIndex":82,"text":"(2024a)","element":"a"},{"text":". This is key to improving the communication cost of ","element":"span"},{"href":"#id-13","referenceIndex":82,"text":"Zheng et al. ","element":"a"},{"href":"#id-13","referenceIndex":82,"text":"(2024a) ","element":"a"},{"text":"and matching the policy switching cost of ","element":"span"},{"href":"#id-11","referenceIndex":77,"text":"Zhang et al. ","element":"a"},{"href":"#id-11","referenceIndex":77,"text":"(2020)","element":"a"},{"text":".","element":"span"}],[{"text":"The rest of this paper is organized as follows. Section ","element":"span"},{"text":"2 ","element":"span"},{"text":"provides the background and problem formulation. Section ","element":"span"},{"text":"3 ","element":"span"},{"text":"presents the algorithm design of FedQ-Advantage. Section ","element":"span"},{"text":"4 ","element":"span"},{"text":"studies the performance guarantees in terms of regret and communication cost. 
Section ","element":"span"},{"text":"5 ","element":"span"},{"text":"concludes the paper. Related works, proofs, numerical experiments, and more details are presented in the appendices.","element":"span"}]]},{"heading":"2 BACKGROUND AND PROBLEM FORMULATION","paragraphs":[[{"text":"2.1 ","element":"span"},{"text":"P","element":"span"},{"text":"RELIMINARIES","element":"span"}],[{"text":"We first introduce the mathematical model and background on Markov decision processes. Throughout this paper, we assume that ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":"/","element":"span"},{"text":"0 = 0","element":"span"},{"text":". For any ","element":"span"},{"style":{"height":11.6},"width":109.88,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/2-4.png","element":"img","alt":" C ∈ N","inline":true},{"text":", we use ","element":"span"},{"text":"[","element":"span"},{"style":{"fontStyle":"italic"},"text":"C","element":"span"},{"text":"] ","element":"span"},{"text":"to denote the set ","element":"span"},{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"2","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . . C","element":"span"},{"style":{"fontStyle":"italic"},"text":"}","element":"span"},{"text":". We use ","element":"span"},{"text":"I","element":"span"},{"text":"[","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":"] ","element":"span"},{"text":"to denote the indicator function, which equals 1 when the event ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"is true and 0 otherwise.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Tabular episodic Markov decision process (MDP). 
","element":"span"},{"text":"A tabular episodic MDP is denoted as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":":= (","element":"span"},{"style":{"fontStyle":"italic"},"text":"S","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A","element":"span"},{"style":{"fontStyle":"italic"},"text":", H, ","element":"span"},{"text":"P","element":"span"},{"style":{"fontStyle":"italic"},"text":", r","element":"span"},{"text":")","element":"span"},{"text":", where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"S ","element":"span"},{"text":"is the set of states with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"|S| ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"S, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"text":"is the set of actions with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"|A| ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A","element":"span"},{"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H ","element":"span"},{"text":"is the number of steps in each episode, ","element":"span"},{"style":{"height":17.9},"width":239.38,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/2-5.png","element":"img","alt":" P := {Ph}Hh=1","inline":true,"padRight":true},{"text":"is the transition kernel so that ","element":"span"},{"style":{"height":16},"width":185.09,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/2-6.png","element":"img","alt":" Ph(· | s, a)","inline":true,"padRight":true},{"text":"characterizes the distribution over the next state given the state action pair ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a","element":"span"},{"text":") 
","element":"span"},{"text":"at step ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h","element":"span"},{"text":", and ","element":"span"},{"style":{"height":17.9},"width":220.89,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/3-0.png","element":"img","alt":"r := {rh}Hh=1 ","inline":true,"padRight":true},{"text":"is the collection of reward functions. We assume that ","element":"span"},{"style":{"height":16},"width":255.42,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/3-1.png","element":"img","alt":" rh(s, a) ∈ [0, 1]","inline":true,"padRight":true},{"text":"is a deterministic ","element":"span"},{"text":"function of ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a","element":"span"},{"text":")","element":"span"},{"text":", while the results can be easily extended to the case when ","element":"span"},{"style":{"height":9.19},"width":36.98,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/3-2.png","element":"img","alt":" rh","inline":true,"padRight":true},{"text":"is random.","element":"span"}],[{"text":"In each episode of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M","element":"span"},{"text":", an initial state ","element":"span"},{"style":{"height":9.19},"width":34.68,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/3-3.png","element":"img","alt":" s1","inline":true,"padRight":true},{"text":"is selected arbitrarily by an adversary. 
Then, at each step ","element":"span"},{"style":{"height":16},"width":145.02,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/3-4.png","element":"img","alt":" h ∈ [H]","inline":true},{"text":", an agent observes a state ","element":"span"},{"style":{"height":13.19},"width":128.97,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/3-5.png","element":"img","alt":" sh ∈ S","inline":true},{"text":", picks an action ","element":"span"},{"style":{"height":13.99},"width":137.36,"height":34.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/3-6.png","element":"img","alt":" ah ∈ A","inline":true},{"text":", receives the reward ","element":"span"},{"style":{"height":16},"width":260.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/3-7.png","element":"img","alt":"rh = rh(sh, ah)","inline":true,"padRight":true},{"text":"and then transits to the next state ","element":"span"},{"style":{"height":10.79},"width":77.8,"height":26.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/3-8.png","element":"img","alt":" sh+1","inline":true},{"text":". The episode ends when an absorbing state ","element":"span"},{"style":{"height":10.79},"width":87.38,"height":26.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/3-9.png","element":"img","alt":"sH+1","inline":true,"padRight":true},{"text":"is reached. 
Later on, for ease of presentation, we use “for any ","element":"span"},{"style":{"height":16},"width":191.59,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/3-10.png","element":"img","alt":" (∀) (s, a, h)","inline":true},{"text":"” to represent “for any ","element":"span"},{"style":{"height":16},"width":460.63,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/3-11.png","element":"img","alt":" (∀) (s, a, h) ∈ S × A × [H]","inline":true},{"text":"” and denote ","element":"span"},{"style":{"height":18.17},"width":844.31,"height":45.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/3-12.png","element":"img","alt":" P_{s,a,h} f = E_{s_{h+1}∼P_h(·|s,a)}(f(s_{h+1}) | s_h = s, a_h = a)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":19.94},"width":370.9,"height":49.85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/3-13.png","element":"img","alt":" 1_s f = f(s), ∀(s, a, h)","inline":true,"padRight":true},{"text":"for any function ","element":"span"},{"style":{"height":14},"width":184.9,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/3-14.png","element":"img","alt":" f : S → R.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Policies, state value functions, and action value functions. 
","element":"span"},{"text":"A policy ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/3-15.png","element":"img","alt":" π","inline":true,"padRight":true},{"text":"is a collection of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H ","element":"span"},{"text":"functions ","element":"span"},{"style":{"height":23.17},"width":359.36,"height":57.92,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/3-16.png","element":"img","alt":"{π_h : S → ∆A}_{h∈[H]}","inline":true},{"text":", where ","element":"span"},{"style":{"height":14.18},"width":59.21,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/3-17.png","element":"img","alt":" ∆A","inline":true,"padRight":true},{"text":"is the set of probability distributions over ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A","element":"span"},{"text":". A policy ","element":"span"},{"text":"is deterministic if for any ","element":"span"},{"style":{"height":16},"width":208.04,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/3-18.png","element":"img","alt":" s ∈ S, π_h(s)","inline":true,"padRight":true},{"text":"concentrates all the probability mass on an action ","element":"span"},{"style":{"height":12.4},"width":157.26,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/3-19.png","element":"img","alt":" a ∈ A. 
In","inline":true,"padRight":true},{"text":"this case, we denote ","element":"span"},{"style":{"height":16},"width":177.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/3-20.png","element":"img","alt":" π_h(s) = a.","inline":true}],[{"text":"Let ","element":"span"},{"style":{"height":16.11},"width":566.52,"height":40.28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/3-21.png","element":"img","alt":" V^π_h : S → R and Q^π_h : S × A → R","inline":true,"padRight":true},{"text":"denote the state value function and the action value function ","element":"span"},{"text":"at step ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h ","element":"span"},{"text":"under policy ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/3-22.png","element":"img","alt":" π","inline":true},{"text":". Mathematically, ","element":"span"},{"style":{"height":21.42},"width":920.4,"height":53.56,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/3-23.png","element":"img","alt":" V^π_h(s) := ∑_{h′=h}^{H} E_{(s_{h′},a_{h′})∼(P,π)} [r_{h′}(s_{h′}, a_{h′}) | s_h = s].","inline":true}],[{"text":"We also use ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"style":{"height":22},"width":1362.97,"height":54.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/3-24.png","element":"img","alt":"^π_h(s, a) := r_h(s, a) + ∑_{h′=h+1}^{H} E_{(s_{h′},a_{h′})∼(P,π)} [r_{h′}(s_{h′}, a_{h′}) | s_h = s, a_h = a]. 
Since","inline":true}],[{"text":"the state and action spaces and the horizon are all finite, there always exists an optimal policy ","element":"span"},{"style":{"height":10.8},"width":111.26,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/3-25.png","element":"img","alt":" π⋆ that","inline":true,"padRight":true},{"text":"achieves the optimal value ","element":"span"},{"style":{"height":19.11},"width":521.12,"height":47.78,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/3-26.png","element":"img","alt":" V ⋆h (s) = supπ V πh (s) = V π∗h (s)","inline":true,"padRight":true},{"text":"for all ","element":"span"},{"style":{"height":11.6},"width":94.67,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/3-27.png","element":"img","alt":" s ∈ S","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":131.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/3-28.png","element":"img","alt":" h ∈ [H]","inline":true,"padRight":true},{"href":"#id-16","referenceIndex":7,"text":"(Azar et al.,","element":"a"}],[{"href":"#id-16","referenceIndex":7,"text":"2017)","element":"a"},{"text":". 
The Bellman equation and the Bellman optimality equation can be expressed as ","element":"span"},{"style":{"height":14.8},"width":36,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/3-29.png","element":"img","alt":"","inline":true},{"style":{"height":29.6},"width":36,"height":74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/3-30.png","element":"img","alt":"","inline":true}],[{"id":"id-130","style":{"width":"92%"},"width":1467,"height":68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/3-31.png","element":"img"}],[{"text":"2.2 ","element":"span"},{"text":"T","element":"span"},{"text":"HE FEDERATED ","element":"span"},{"text":"RL ","element":"span"},{"text":"FRAMEWORK","element":"span"}],[{"text":"We consider an FRL setting with a central server and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"agents, each interacting with an independent copy of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M","element":"span"},{"text":". The agents communicate with the server periodically: after receiving local information, the central server aggregates it and broadcasts certain information to the agents to coordinate their exploration. We assume that the central server knows the reward functions ","element":"span"},{"style":{"height":17.9},"width":413.42,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/3-32.png","element":"img","alt":" {rh}Hh=1 beforehand2. We","inline":true,"padRight":true},{"text":"define the communication cost of an algorithm as the number of scalars (integers or real numbers) communicated between the server and agents similar to ","element":"span"},{"href":"#id-13","referenceIndex":82,"text":"Zheng et al. 
","element":"a"},{"href":"#id-13","referenceIndex":82,"text":"(2024a)","element":"a"},{"text":".","element":"span"}],[{"text":"For agent ","element":"span"},{"style":{"height":13.2},"width":157.83,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/3-33.png","element":"img","alt":" m, let Um","inline":true,"padRight":true},{"text":"be the number of generated episodes, ","element":"span"},{"style":{"height":10.58},"width":80.88,"height":26.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/3-34.png","element":"img","alt":" πm,u ","inline":true,"padRight":true},{"text":"be the policy in the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"u","element":"span"},{"text":"-th episode, and ","element":"span"},{"style":{"height":16.71},"width":79.52,"height":41.78,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/3-35.png","element":"img","alt":"xm,u1","inline":true,"padRight":true},{"text":"be the corresponding initial state. 
The regret of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"agents over ","element":"span"},{"style":{"height":20.4},"width":299.5,"height":50.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/3-36.png","element":"img","alt":"ˆT = H ∑Mm=1 Um","inline":true,"padRight":true},{"text":"total steps is","element":"span"}],[{"style":{"width":"54%"},"width":871,"height":127,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/3-37.png","element":"img"}],[{"text":"Here, ","element":"span"},{"style":{"height":18.83},"width":181.53,"height":47.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/3-38.png","element":"img","alt":" T := ˆT/M","inline":true,"padRight":true},{"text":"is the average number of steps per agent across the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"agents.","element":"span"}]]},{"heading":"3 ALGORITHM DESIGN","paragraphs":[[{"text":"In this section, we elaborate on our model-free federated RL algorithm termed FedQ-Advantage.","element":"span"}],[{"text":"3.1 ","element":"span"},{"text":"B","element":"span"},{"text":"ASIC ","element":"span"},{"text":"S","element":"span"},{"text":"TRUCTURE","element":"span"},{"text":": A","element":"span"},{"text":"LIGNED ","element":"span"},{"text":"R","element":"span"},{"text":"OUNDS AND ","element":"span"},{"text":"U","element":"span"},{"text":"NALIGNED ","element":"span"},{"text":"S","element":"span"},{"text":"TAGES","element":"span"}],[{"text":"We first review the single-agent algorithm in ","element":"span"},{"href":"#id-11","referenceIndex":77,"text":"Zhang et al. ","element":"a"},{"href":"#id-11","referenceIndex":77,"text":"(2020)","element":"a"},{"text":". 
The agent generates episodes and splits them into ","element":"span"},{"style":{"fontWeight":"bold"},"text":"stages ","element":"span"},{"text":"for each ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a, h","element":"span"},{"text":")","element":"span"},{"text":". Denoting ","element":"span"},{"style":{"height":16},"width":163.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/3-39.png","element":"img","alt":" yt(s, a, h)","inline":true,"padRight":true},{"text":"as the number of visits to ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a, h","element":"span"},{"text":") ","element":"span"},{"text":"in the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":"-th stage for ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a, h","element":"span"},{"text":")","element":"span"},{"text":", it requires that ","element":"span"},{"style":{"height":16},"width":627.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/3-40.png","element":"img","alt":" yt+1(s, a, h) = ⌊(1 + 1/H)yt(s, a, h)⌋","inline":true},{"text":", and the updates of estimated ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":"-functions at ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a, h","element":"span"},{"text":") ","element":"span"},{"text":"only happen at the end of each stage. Due to the randomness of the visits, stage renewals for different triples might not happen simultaneously, resulting in unaligned stages. 
The exponential increase of the stage size leads to a low policy switching cost: the number of ","element":"span"},{"text":"different implemented policies is upper bounded by ","element":"span"},{"style":{"height":17.38},"width":267.44,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/4-0.png","element":"img","alt":" O(H2SA log T)","inline":true},{"text":". This provides the potential to parallelize the episodes generated under the same policy to multiple agents.","element":"span"}],[{"text":"However, the simple design in ","element":"span"},{"href":"#id-13","referenceIndex":82,"text":"Zheng et al. ","element":"a"},{"href":"#id-13","referenceIndex":82,"text":"(2024a) ","element":"a"},{"text":"does not accommodate the unaligned stages. Thus, FedQ-Advantage designs novel aligned rounds for unaligned stages. Next, we introduce our algorithm design, which is also visually shown in Figure ","element":"span"},{"href":"#id-17","text":"1. ","element":"a"},{"text":"It proceeds in rounds indexed by ","element":"span"},{"style":{"height":16},"width":290.41,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/4-1.png","element":"img","alt":" k ∈ [K] and agent","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"m ","element":"span"},{"text":"generates ","element":"span"},{"style":{"height":13.39},"width":78.66,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/4-2.png","element":"img","alt":" nm,k","inline":true,"padRight":true},{"text":"episodes in round ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":". The communication between agents and the central server occurs at the end of each round. 
For each ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a, h","element":"span"},{"text":")","element":"span"},{"text":", we divide rounds ","element":"span"},{"style":{"height":16},"width":129.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/4-3.png","element":"img","alt":" k ∈ [K]","inline":true,"padRight":true},{"text":"into stages ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"2","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . .","element":"span"},{"text":". Each stage contains multiple consecutive rounds: denote ","element":"span"},{"style":{"height":17.5},"width":130.35,"height":43.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/4-4.png","element":"img","alt":" kth(s, a)","inline":true,"padRight":true},{"text":"as the index of the first round that ","element":"span"},{"text":"belongs to stage ","element":"span"},{"style":{"height":19.23},"width":248.28,"height":48.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/4-5.png","element":"img","alt":" t, so kth < kt+1h","inline":true,"padRight":true},{"text":", and stage ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"is composed of rounds ","element":"span"},{"style":{"height":19.23},"width":502.06,"height":48.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/4-6.png","element":"img","alt":" kth, kth + 1, . . . , kt+1h − 1. 
Note","inline":true,"padRight":true},{"text":"that the definition of stages is specific to ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a, h","element":"span"},{"text":")","element":"span"},{"text":", meaning that a given round may belong to different stages for different ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a, h","element":"span"},{"text":")","element":"span"},{"text":". Each round equips the agents with a common policy ","element":"span"},{"style":{"height":13.38},"width":41.15,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/4-7.png","element":"img","alt":" πk ","inline":true,"padRight":true},{"text":"for independent explorations and an event-triggered termination condition that will be explained later. At the end of each round, stage renewal is checked for each ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a, h","element":"span"},{"text":") ","element":"span"},{"text":"separately, resulting in unaligned stages.","element":"span"}],[{"id":"id-17","style":{"width":"84%"},"width":1337,"height":293,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/4-8.png","element":"img"}],[{"text":"Figure 1: The relationship between rounds and stages for different triples ","element":"figcaption","subtype":"caption"},{"style":{"height":17.38},"width":449.43,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/4-9.png","element":"img","alt":" (s1, a1, h1) and (s2, a2, h2).","inline":true,"padRight":true},{"text":"Each square represents a round, and the number inside indicates the round index. A stage is composed of consecutive rounds. 
Communication occurs at the end of each round and the estimated ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"Q","element":"figcaption","subtype":"caption"},{"text":"-function is updated at the end of each stage. We can find from the figure that a round may belong to different stages for different triples. For example, the round ","element":"figcaption","subtype":"caption"},{"style":{"height":13.19},"width":52.65,"height":32.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/4-10.png","element":"img","alt":" k11","inline":true,"padRight":true},{"text":"is in stage ","element":"figcaption","subtype":"caption"},{"text":"1 ","element":"figcaption","subtype":"caption"},{"text":"of ","element":"figcaption","subtype":"caption"},{"style":{"height":17.39},"width":183.26,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/4-11.png","element":"img","alt":" (s1, a1, h1)","inline":true},{"text":", while in stage ","element":"figcaption","subtype":"caption"},{"text":"2 ","element":"figcaption","subtype":"caption"},{"text":"of ","element":"figcaption","subtype":"caption"},{"style":{"height":17.39},"width":183.26,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/4-12.png","element":"img","alt":" (s2, a2, h2)","inline":true},{"text":". 
Here, ","element":"figcaption","subtype":"caption"},{"style":{"height":19.98},"width":378.51,"height":49.94,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/4-13.png","element":"img","alt":" kit = kt+1hi (si, ai) − 1","inline":true,"padRight":true},{"text":"represents the index of the last round in stage ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"t ","element":"figcaption","subtype":"caption"},{"text":"for ","element":"figcaption","subtype":"caption"},{"style":{"height":16.99},"width":497.72,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/4-14.png","element":"img","alt":" (si, ai, hi), t ∈ {1, 2, · · · , Ti}","inline":true,"padRight":true},{"text":"and ","element":"figcaption","subtype":"caption"},{"style":{"height":16},"width":243.75,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/4-15.png","element":"img","alt":" i ∈ {1, 2}. T1","inline":true,"padRight":true},{"text":"and ","element":"figcaption","subtype":"caption"},{"style":{"height":13.19},"width":39.29,"height":32.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/4-16.png","element":"img","alt":" T2","inline":true,"padRight":true},{"text":"are the total number of stages for ","element":"figcaption","subtype":"caption"},{"style":{"height":17.38},"width":443.48,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/4-17.png","element":"img","alt":"(s1, a1, h1) and (s2, a2, h2)","inline":true,"padRight":true},{"text":"respectively.","element":"figcaption","subtype":"caption"}],[{"text":"FedQ-Advantage updates the estimated ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":"-function at ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a, h","element":"span"},{"text":") ","element":"span"},{"text":"only at the end of each stage using stage-wise or 
global mean values regarding the next states of visits to ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a, h","element":"span"},{"text":")","element":"span"},{"text":". Thus, agents only need to prepare and share corresponding local round-wise means for global aggregations. It results in an ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"MHS","element":"span"},{"text":") ","element":"span"},{"text":"communication cost within each round that is independent of the number of episodes.","element":"span"}],[{"text":"3.2 ","element":"span"},{"text":"A","element":"span"},{"text":"LGORITHM DETAILS","element":"span"}],[{"text":"We provide a notation table in Appendix ","element":"span"},{"href":"#id-18","referenceIndex":90,"text":"A ","element":"a"},{"text":"to facilitate understanding of this section. For the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":"-th (","element":"span"},{"style":{"height":17.39},"width":192.35,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/4-18.png","element":"img","alt":"j ∈ [nm,k]","inline":true},{"text":") episode in the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":"-th round, let ","element":"span"},{"style":{"height":19.51},"width":96.48,"height":48.78,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/4-19.png","element":"img","alt":" sk,m,j1","inline":true,"padRight":true},{"text":"be the initial state for the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"m","element":"span"},{"text":"-th agent, and ","element":"span"},{"style":{"height":20.07},"width":469.06,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/4-20.png","element":"img","alt":" {(sk,m,jh , ak,m,jh , rk,m,jh 
)}Hh=1","inline":true,"padRight":true},{"text":"be the corresponding trajectory. Define ","element":"span"},{"style":{"height":19.23},"width":343.66,"height":48.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/4-21.png","element":"img","alt":" {V kh : S → R}H+1h=1","inline":true,"padRight":true},{"text":", ","element":"span"},{"style":{"height":19.23},"width":414.66,"height":48.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/4-22.png","element":"img","alt":"{Qkh : S × A → R}H+1h=1","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":20.07},"width":370.39,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/4-23.png","element":"img","alt":" {V ref,kh : S → R}H+1h=1","inline":true,"padRight":true},{"text":"as the estimated ","element":"span"},{"style":{"fontStyle":"italic"},"text":"V ","element":"span"},{"text":"-function, the estimated ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":"-function and the reference function at the beginning of round ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":". Here, ","element":"span"},{"style":{"height":21.54},"width":398.31,"height":53.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/4-24.png","element":"img","alt":" QkH+1, V kH+1, V ref,kH+1 = 0","inline":true},{"text":". ","element":"span"},{"text":"We use ","element":"span"},{"style":{"height":20.47},"width":367.03,"height":51.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/4-25.png","element":"img","alt":" V adv,kh = V kh − V ref,kh","inline":true,"padRight":true},{"text":"to denote the estimated advantage function. 
For any predefined functions ","element":"span"},{"style":{"height":14.8},"width":272.04,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/4-26.png","element":"img","alt":" g : S × A → R","inline":true,"padRight":true},{"text":"or ","element":"span"},{"style":{"height":14},"width":191.58,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/4-27.png","element":"img","alt":" f : S → R","inline":true},{"text":", we will use ","element":"span"},{"style":{"fontStyle":"italic"},"text":"g ","element":"span"},{"text":"or ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"in place of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"g","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a","element":"span"},{"text":") ","element":"span"},{"text":"or ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":") ","element":"span"},{"text":"for simplicity when there is no ambiguity. 
We also denote ","element":"span"},{"style":{"height":17.9},"width":123.99,"height":44.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/4-28.png","element":"img","alt":" tkh(s, a)","inline":true,"padRight":true},{"text":"as the stage index in round ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"and ","element":"span"},{"style":{"height":20.07},"width":422.3,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/4-29.png","element":"img","alt":"Iren,kh (s, a) = I[tkh > tk−1h ]","inline":true,"padRight":true},{"text":"as a stage renewal indicator with ","element":"span"},{"style":{"height":20.07},"width":330.49,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/4-30.png","element":"img","alt":" Iren,1h = 1, ∀(s, a, h).","inline":true}],[{"text":"Then we briefly explain each component of the algorithm in round ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"as follows.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Step 1. Coordinated exploration for agents. ","element":"span"},{"text":"At the beginning of round ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":", the server holds the values on all states, actions, and steps for functions ","element":"span"},{"style":{"height":20.07},"width":469.6,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/4-31.png","element":"img","alt":" {Qkh, V kh , V ref,kh , N kh, nkh, ˜nkh}","inline":true},{"text":". 
Here, ","element":"span"},{"style":{"height":17.9},"width":144.93,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/4-32.png","element":"img","alt":" N kh(s, a)","inline":true,"padRight":true},{"text":"is the total number of visits for all agents up to but not including stage ","element":"span"},{"style":{"height":17.9},"width":193.25,"height":44.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/4-33.png","element":"img","alt":" tkh, nkh(s, a)","inline":true,"padRight":true},{"text":"is the total ","element":"span"},{"text":"number of visits for all agents in the stage ","element":"span"},{"style":{"height":17.9},"width":110.49,"height":44.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/4-34.png","element":"img","alt":" tkh − 1","inline":true},{"text":", and ","element":"span"},{"style":{"height":17.9},"width":133.53,"height":44.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/4-35.png","element":"img","alt":" ˜nkh(s, a)","inline":true,"padRight":true},{"text":"is the number of visits for all ","element":"span"},{"text":"the agents in the stage ","element":"span"},{"style":{"height":17.9},"width":33.39,"height":44.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/4-36.png","element":"img","alt":" tkh","inline":true,"padRight":true},{"text":"before the start of round ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":". Here, ","element":"span"},{"style":{"height":17.9},"width":132,"height":44.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/4-37.png","element":"img","alt":" nkh = 0","inline":true,"padRight":true},{"text":"if ","element":"span"},{"style":{"height":17.9},"width":122.48,"height":44.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/4-38.png","element":"img","alt":" tkh = 1","inline":true},{"text":". 
When ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"= 1","element":"span"},{"text":", ","element":"span"},{"style":{"height":20.07},"width":613.98,"height":50.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/4-39.png","element":"img","alt":"Q1h = V 1h = V ref,1h = H, ∀(s, a, h)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.39},"width":40.14,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/4-40.png","element":"img","alt":" π1","inline":true,"padRight":true},{"text":"is an arbitrary deterministic policy. It also holds ","element":"span"},{"text":"the following global values ","element":"span"},{"style":{"height":20.47},"width":1124.25,"height":51.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/5-0.png","element":"img","alt":" µref,kh (s, a), σref,kh (s, a), µadv,kh (s, a), σadv,kh (s, a), µval,kh (s, a), ∀(s, a, h)","inline":true},{"text":". When ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"they are initialized as 0. Further explanations will be provided in their updates in Step 4. 
Next, the central server decides a deterministic policy ","element":"span"},{"style":{"height":17.9},"width":239.62,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/5-1.png","element":"img","alt":" πk = {πkh}Hh=1","inline":true},{"text":", and then broadcasts ","element":"span"},{"style":{"height":17.9},"width":141.82,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/5-2.png","element":"img","alt":" πkh along","inline":true,"padRight":true},{"text":"with ","element":"span"},{"style":{"height":18.17},"width":511.82,"height":45.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/5-3.png","element":"img","alt":" {nkh(s, πkh(s)), ˜nkh(s, πkh(s))}s,h","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":20.07},"width":244.9,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/5-4.png","element":"img","alt":" {V kh , V ref,kh }s,h","inline":true,"padRight":true},{"text":"to all of the agents. Once receiving such information, the agents will execute policy ","element":"span"},{"style":{"height":13.38},"width":41.15,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/5-5.png","element":"img","alt":" πk ","inline":true,"padRight":true},{"text":"and start collecting trajectories.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Step 2. Event-triggered termination of exploration. ","element":"span"},{"text":"We introduce a Boolean variable ","element":"span"},{"style":{"height":11.59},"width":75,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/5-6.png","element":"img","alt":" usyn","inline":true,"padRight":true},{"text":"as an input to the algorithm for the forced synchronization. 
During the exploration under ","element":"span"},{"style":{"height":13.39},"width":41.15,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/5-7.png","element":"img","alt":" πk","inline":true},{"text":", every agent will monitor its total number of visits to each ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a, h","element":"span"},{"text":") ","element":"span"},{"text":"triple within the current round. Define","element":"span"}],[{"id":"id-25","style":{"width":"91%"},"width":1452,"height":116,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/5-8.png","element":"img"}],[{"text":"If ","element":"span"},{"style":{"height":15.19},"width":236.14,"height":37.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/5-9.png","element":"img","alt":" usyn = TRUE","inline":true},{"text":", for any agent ","element":"span"},{"style":{"fontStyle":"italic"},"text":"m","element":"span"},{"text":", at the end of each episode, if any ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a, h","element":"span"},{"text":") ","element":"span"},{"text":"has been visited ","element":"span"},{"style":{"height":17.9},"width":126.85,"height":44.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/5-10.png","element":"img","alt":"ckh(s, a)","inline":true,"padRight":true},{"text":"times, the agent will stop exploration and send a signal to the server that requests all agents ","element":"span"},{"text":"to abort the exploration. 
If ","element":"span"},{"style":{"height":15.59},"width":251.66,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/5-11.png","element":"img","alt":" usyn = FALSE","inline":true},{"text":", the central server will wait until, for each agent, there exists a triple ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a, h","element":"span"},{"text":") ","element":"span"},{"text":"that has been visited ","element":"span"},{"style":{"height":17.9},"width":232.66,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/5-12.png","element":"img","alt":" ckh(s, a) times.","inline":true}],[{"text":"During this process, each agent ","element":"span"},{"style":{"height":13.39},"width":235.62,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/5-13.png","element":"img","alt":" m collects nm,k ","inline":true,"padRight":true},{"text":"trajectories ","element":"span"},{"style":{"height":20.07},"width":659.39,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/5-14.png","element":"img","alt":" {(sk,m,jh , ak,m,jh , rk,m,jh )}Hh=1, j ∈ [nm,k]","inline":true,"padRight":true},{"text":"and calculates the following local quantities:","element":"span"}],[{"id":"id-20","style":{"width":"88%"},"width":1410,"height":57,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/5-15.png","element":"img"}],[{"text":"Here, ","element":"span"},{"style":{"height":20.07},"width":170.22,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/5-16.png","element":"img","alt":" nm,kh (s, a)","inline":true,"padRight":true},{"text":"is the number of visits to ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a, h","element":"span"},{"text":") ","element":"span"},{"text":"for agent ","element":"span"},{"style":{"fontStyle":"italic"},"text":"m 
","element":"span"},{"text":"in round ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":". Thus, we have","element":"span"}],[{"id":"id-28","style":{"width":"83%"},"width":1330,"height":224,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/5-17.png","element":"img"}],[{"text":"Other quantities correspond to the summation of the values of five different functions applied to the next states of all the visits to ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a, h","element":"span"},{"text":") ","element":"span"},{"text":"for agent ","element":"span"},{"style":{"fontStyle":"italic"},"text":"m ","element":"span"},{"text":"in round ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":". These five functions are ","element":"span"},{"style":{"height":22.07},"width":620.42,"height":55.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/5-18.png","element":"img","alt":"V ref,kh+1 , V adv,kh+1 , V kh+1, [V ref,kh+1 ]2, [V adv,kh+1 ]2","inline":true},{"text":". 
Mathematically, for ","element":"span"},{"style":{"height":14},"width":203.84,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/5-19.png","element":"img","alt":" f : S → R","inline":true},{"text":", letting ","element":"span"},{"style":{"height":20.3},"width":242.48,"height":50.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/5-20.png","element":"img","alt":" Akm,s,a,h(f) =","inline":true},{"style":{"height":25.61},"width":750.12,"height":64.02,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/5-21.png","element":"img","alt":"�nm,kj=1 f(sk,m,jh+1 )× I[(sk,m,jh , ak,m,jh ) = (s, a)]","inline":true,"padRight":true},{"text":"as the summation of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"on the next states for all the visits to ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a, h","element":"span"},{"text":") ","element":"span"},{"text":"for agent ","element":"span"},{"style":{"fontStyle":"italic"},"text":"m ","element":"span"},{"text":"in round ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":". When there is no ambiguity, we will use the simplified notation ","element":"span"},{"style":{"height":20.3},"width":378.98,"height":50.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/5-22.png","element":"img","alt":" Akm(f) = Akm,s,a,h(f)","inline":true},{"text":". 
Then, ","element":"span"},{"style":{"height":22.87},"width":910.18,"height":57.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/5-23.png","element":"img","alt":" µm,kh,ref(s, a) = Akm(V ref,kh+1 ), µm,kh,adv(s, a) = Akm(V adv,kh+1 ),","inline":true},{"style":{"height":22.47},"width":895.78,"height":56.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/5-24.png","element":"img","alt":"µm,kh,val(s, a) = Akm(V kh+1), σm,kh,ref(s, a) = Akm([V ref,kh+1 ]2)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":22.87},"width":477.09,"height":57.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/5-25.png","element":"img","alt":" σm,kh,adv(s, a) = Akm([V adv,kh+1 ]2)","inline":true},{"text":". These ","element":"span"},{"text":"quantities correspond to local aggregations of different types of value functions and can be adaptively calculated when collecting the trajectories as shown in Algorithm ","element":"span"},{"href":"#id-19","text":"2.","element":"a"}],[{"style":{"fontWeight":"bold"},"text":"Step 3. Stage renewal. ","element":"span"},{"text":"After the exploration in round ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":", agents share local quantities in Equation ","element":"span"},{"href":"#id-20","text":"(3) ","element":"a"},{"text":"on all ","element":"span"},{"style":{"height":17.9},"width":459.96,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/5-26.png","element":"img","alt":" (s, a, h) such that a = πkh(s)","inline":true,"padRight":true},{"text":"to the central server. 
Then it finds existing visits in stage ","element":"span"},{"style":{"height":17.9},"width":78.7,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/5-27.png","element":"img","alt":" tkh as","inline":true}],[{"style":{"width":"76%"},"width":1208,"height":117,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/5-28.png","element":"img"}],[{"text":"and renews the stages for triples that are sufficiently visited: ","element":"span"},{"style":{"height":16},"width":162.26,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/5-29.png","element":"img","alt":" ∀(s, a, h),","inline":true}],[{"id":"id-21","style":{"width":"97%"},"width":1540,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/5-30.png","element":"img"}],[{"text":"In Equation ","element":"span"},{"href":"#id-21","text":"(7)","element":"a"},{"text":", when ","element":"span"},{"style":{"height":17.9},"width":139.49,"height":44.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/5-31.png","element":"img","alt":" nkh = 0","inline":true},{"text":", i.e., ","element":"span"},{"style":{"height":17.9},"width":129.96,"height":44.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/5-32.png","element":"img","alt":" tkh = 1","inline":true},{"text":", the stage renewal threshold is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"MH","element":"span"},{"text":". Otherwise, it is ","element":"span"},{"style":{"height":17.9},"width":218.75,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/5-33.png","element":"img","alt":"(1 + 1/H)nkh","inline":true},{"text":". 
With Equation ","element":"span"},{"href":"#id-21","text":"(7)","element":"a"},{"text":", the central server can determine the stage renewal indicator ","element":"span"},{"style":{"height":20.07},"width":188.16,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/5-34.png","element":"img","alt":" Iren,k+1h and","inline":true,"padRight":true},{"text":"the counts ","element":"span"},{"style":{"height":19.23},"width":297.62,"height":48.06,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/5-35.png","element":"img","alt":" N k+1h , nk+1h , ˜nk+1h","inline":true,"padRight":true},{"text":"as shown in lines 13 and 18 in Algorithm ","element":"span"},{"href":"#id-22","text":"1.","element":"a"}],[{"style":{"fontWeight":"bold"},"text":"Step 4. Updates of estimated value functions and policies. ","element":"span"},{"text":"According to the stage renewal, the central server updates the global values ","element":"span"},{"style":{"height":20.47},"width":740.65,"height":51.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/5-36.png","element":"img","alt":" µref,kh , σref,kh , µadv,kh , σadv,kh , µval,kh at all (s, a, h)","inline":true,"padRight":true},{"text":"as follows:","element":"span"}],[{"id":"id-23","style":{"width":"71%"},"width":1131,"height":85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/5-37.png","element":"img"}],[{"style":{"width":"92%"},"width":1463,"height":86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/6-0.png","element":"img"}],[{"text":"Equation ","element":"span"},{"href":"#id-23","text":"(8) ","element":"a"},{"text":"implies that ","element":"span"},{"style":{"height":20.07},"width":286.26,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/6-1.png","element":"img","alt":" (µ, σ)ref,k+1h (s, a)","inline":true,"padRight":true},{"text":"gives the sum of the estimated reference functions and squared reference functions 
at the next states for all agents and all visits to ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a, h","element":"span"},{"text":") ","element":"span"},{"text":"up to the end of round ","element":"span"},{"style":{"height":19.23},"width":460.62,"height":48.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/6-2.png","element":"img","alt":" k. (µadv, µval, σadv)k+1h (s, a)","inline":true,"padRight":true},{"text":"gives the sum of the estimated advantage functions, value functions, and squared advantage functions at the next states for all agents and all visits to ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a, h","element":"span"},{"text":") ","element":"span"},{"text":"in stage ","element":"span"},{"style":{"height":17.9},"width":33.39,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/6-3.png","element":"img","alt":" tkh","inline":true,"padRight":true},{"text":"and up to the end of round ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":". 
Here, ","element":"span"},{"style":{"height":20.07},"width":180.32,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/6-4.png","element":"img","alt":" (1 − Iren,kh )","inline":true,"padRight":true},{"text":"clears the historical accumulation if ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"and ","element":"span"},{"style":{"height":10.8},"width":90.7,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/6-5.png","element":"img","alt":"k − 1","inline":true,"padRight":true},{"text":"belong to different stages.","element":"span"}],[{"text":"Next, the central server updates the estimated ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":"-function for all ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a, h","element":"span"},{"text":") ","element":"span"},{"text":"triples with a stage renewal while keeping others unchanged:","element":"span"}],[{"id":"id-113","style":{"width":"89%"},"width":1418,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/6-6.png","element":"img"}],[{"text":"Here, ","element":"span"},{"style":{"height":20.07},"width":249.46,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/6-7.png","element":"img","alt":" Qk+1,1h , Qk+1,2h","inline":true,"padRight":true},{"text":"represent the Hoeffding-type update used in ","element":"span"},{"href":"#id-13","referenceIndex":82,"text":"Zheng et al. ","element":"a"},{"href":"#id-13","referenceIndex":82,"text":"(2024a)","element":"a"},{"text":" and the reference-advantage-type update used in ","element":"span"},{"href":"#id-11","referenceIndex":77,"text":"Zhang et al. 
","element":"a"},{"href":"#id-11","referenceIndex":77,"text":"(2020)","element":"a"},{"text":", respectively:","element":"span"}],[{"id":"id-27","style":{"width":"88%"},"width":1396,"height":130,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/6-8.png","element":"img"}],[{"text":"In these updates where stage renewal happens, ","element":"span"},{"style":{"height":19.23},"width":196.03,"height":48.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/6-9.png","element":"img","alt":" N k+1h , nk+1h","inline":true,"padRight":true},{"text":"count all historical visits and visits in stage ","element":"span"},{"style":{"height":17.9},"width":33.39,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/6-10.png","element":"img","alt":" tkh","inline":true},{"text":", respectively. Thus, ","element":"span"},{"style":{"height":20.47},"width":228.74,"height":51.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/6-11.png","element":"img","alt":" µval,k+1h /nk+1h","inline":true,"padRight":true},{"text":"is the stage-wise mean of the estimated value function, ","element":"span"},{"style":{"height":20.47},"width":235.64,"height":51.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/6-12.png","element":"img","alt":"µadv,k+1h /nk+1h","inline":true,"padRight":true},{"text":"is the stage-wise mean of the estimated advantage function, and ","element":"span"},{"style":{"height":20.07},"width":329.09,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/6-13.png","element":"img","alt":" µref,k+1h /N k+1h gives","inline":true,"padRight":true},{"text":"the all-history estimated mean of reference function. 
","element":"span"},{"style":{"height":20.07},"width":220.65,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/6-14.png","element":"img","alt":" bk+1,1h , bk+1,2h","inline":true,"padRight":true},{"text":"are upper confidence bounds (UCB) that dominate the variances in the above empirical mean estimations. Their expressions are provided in line 15 of Algorithm ","element":"span"},{"href":"#id-22","text":"1.","element":"a"}],[{"text":"Next, the central server updates the estimated ","element":"span"},{"style":{"fontStyle":"italic"},"text":"V ","element":"span"},{"text":"-function and the policy as follows:","element":"span"}],[{"id":"id-114","style":{"width":"94%"},"width":1492,"height":73,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/6-15.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Step 5. Updates of the reference function. ","element":"span"},{"text":"With a constant ","element":"span"},{"style":{"height":14.79},"width":152.38,"height":36.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/6-16.png","element":"img","alt":" N0 ∈ R+","inline":true},{"text":", the central server conducts","element":"span"}],[{"id":"id-24","style":{"width":"90%"},"width":1435,"height":54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/6-17.png","element":"img"}],[{"text":"Here, ","element":"span"},{"style":{"height":20},"width":820.07,"height":50,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/6-18.png","element":"img","alt":" ks,h = inf{k ∈ N+ : �a′∈A N k+1h (s, a′) ≥ N0}","inline":true},{"text":". 
Equation ","element":"span"},{"href":"#id-24","text":"(14) ","element":"a"},{"text":"means that at the end of round ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":", for all ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, h","element":"span"},{"text":") ","element":"span"},{"text":"such that the stage for ","element":"span"},{"style":{"height":17.9},"width":201.99,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/6-19.png","element":"img","alt":" (s, πkh(s), h)","inline":true,"padRight":true},{"text":"is renewed, we will update the reference ","element":"span"},{"text":"function at ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, h","element":"span"},{"text":") ","element":"span"},{"text":"based on the updated value function ","element":"span"},{"style":{"height":19.23},"width":90.18,"height":48.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/6-20.png","element":"img","alt":" V k+1h","inline":true,"padRight":true},{"text":"if round ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"is the first round such that the global visiting number to ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, h","element":"span"},{"text":") ","element":"span"},{"text":"across all complete stages reaches ","element":"span"},{"style":{"height":13.19},"width":48.02,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/6-21.png","element":"img","alt":" N0","inline":true},{"text":". 
“First round” indicates that the reference update on each ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, h","element":"span"},{"text":") ","element":"span"},{"text":"happens at most once during the whole learning process with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"2 ","element":"span"},{"style":{"fontStyle":"italic"},"text":". . .","element":"span"},{"text":", and the reference function on ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, h","element":"span"},{"text":") ","element":"span"},{"text":"will be settled after its update. This design matches the single-agent algorithms in ","element":"span"},{"href":"#id-11","referenceIndex":77,"text":"Zhang et al. ","element":"a"},{"href":"#id-11","referenceIndex":77,"text":"(2020) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-12","referenceIndex":40,"text":"Li et al. ","element":"a"},{"href":"#id-12","referenceIndex":40,"text":"(2021)","element":"a"},{"text":".","element":"span"}],[{"text":"We now present FedQ-Advantage in Algorithms ","element":"span"},{"href":"#id-22","text":"1 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-19","text":"2 ","element":"a"},{"text":"for the central server and the agents, respectively. In Algorithm ","element":"span"},{"href":"#id-22","text":"1, ","element":"a"},{"style":{"height":13.19},"width":39.29,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/6-22.png","element":"img","alt":" T0","inline":true,"padRight":true},{"text":"limits the total number of steps for all agents. 
Line 20 in Algorithm ","element":"span"},{"href":"#id-22","text":"1 ","element":"a"},{"text":"updates the reference function at ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, h","element":"span"},{"text":") ","element":"span"},{"text":"when the visiting number exceeds ","element":"span"},{"style":{"height":13.19},"width":48.02,"height":32.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/6-23.png","element":"img","alt":" N0","inline":true,"padRight":true},{"text":"for the first time and keeps it unchanged for other situations, coinciding with Equation ","element":"span"},{"href":"#id-24","text":"(14)","element":"a"},{"text":".","element":"span"}],[{"text":"3.3 ","element":"span"},{"text":"I","element":"span"},{"text":"NTUITION BEHIND THE ","element":"span"},{"text":"A","element":"span"},{"text":"LGORITHM ","element":"span"},{"text":"D","element":"span"},{"text":"ESIGN","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Exponentially increasing stage sizes: infrequent policy switching. ","element":"span"},{"text":"FedQ-Advantage guarantees that ","element":"span"},{"style":{"height":16},"width":678.86,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/6-24.png","element":"img","alt":" yt+1(s, a, h) = (1 + Θ(1)/H)yt(s, a, h)","inline":true},{"text":", where ","element":"span"},{"style":{"height":16},"width":163.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/6-25.png","element":"img","alt":" yt(s, a, h)","inline":true,"padRight":true},{"text":"represents the number of visits to ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a, h","element":"span"},{"text":") ","element":"span"},{"text":"in stage ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":". 
The exponential increasing rate is controlled by the threshold ","element":"span"},{"style":{"height":17.9},"width":126.85,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/6-26.png","element":"img","alt":" ckh(s, a)","inline":true,"padRight":true},{"text":"in ","element":"span"},{"text":"Equation ","element":"span"},{"href":"#id-25","text":"(2) ","element":"a"},{"text":"and stage renewal condition in Equation ","element":"span"},{"href":"#id-21","text":"(7)","element":"a"},{"text":". By analyzing the two cases of Equation ","element":"span"},{"href":"#id-25","text":"(2)","element":"a"},{"text":", we can prove that ","element":"span"},{"style":{"height":19.23},"width":365.29,"height":48.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/6-27.png","element":"img","alt":" ˆnk+1h ≤ (1 + 2/H)nkh","inline":true},{"text":", which implies that ","element":"span"},{"style":{"height":16},"width":343.96,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/6-28.png","element":"img","alt":" yt+1 ≤ (1 + 2/H)yt","inline":true},{"text":". Equation ","element":"span"},{"href":"#id-21","text":"(7) ","element":"a"},{"text":"further implies that ","element":"span"},{"style":{"height":16},"width":334.49,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/6-29.png","element":"img","alt":" yt+1 ≥ (1 + 1/H)yt","inline":true},{"text":". The details are provided in (d) and (e) of Lemma ","element":"span"},{"href":"#id-26","referenceIndex":246,"text":"D.1. 
","element":"a"},{"text":"Our visits in stages satisfy a similar exponential increasing pattern as ","element":"span"},{"style":{"height":16},"width":372.57,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/6-30.png","element":"img","alt":" yt+1 = ⌊(1 + 1/H)yt⌋","inline":true,"padRight":true},{"text":"in ","element":"span"},{"href":"#id-11","referenceIndex":77,"text":"Zhang ","element":"a"},{"href":"#id-11","referenceIndex":77,"text":"et al. ","element":"a"},{"href":"#id-11","referenceIndex":77,"text":"(2020)","element":"a"},{"text":", and FedQ-Advantage switches policies infrequently since estimated ","element":"span"},{"style":{"height":14},"width":62.51,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/6-31.png","element":"img","alt":" Q−","inline":true},{"text":"functions and the policies are only updated after stage renewals.","element":"span"}],[{"id":"id-22","style":{"width":"100%"},"width":1592,"height":1201,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/7-0.png","element":"img"}],[{"text":"16: ","element":"span"},{"style":{"fontWeight":"bold"},"text":"else","element":"span"}],[{"text":"17: ","element":"span"},{"text":"(Stage unchanged) ","element":"span"},{"style":{"height":20.07},"width":1085.73,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/7-1.png","element":"img","alt":" Qk+1h = Qkh, Iren,k+1h = 0, nk+1h = nkh, ˜nk+1h = ˆnk+1h , N k+1h = N kh.","inline":true,"padRight":true},{"text":"18: ","element":"span"},{"style":{"fontWeight":"bold"},"text":"end if","element":"span"}],[{"style":{"width":"93%"},"width":1489,"height":247,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/7-2.png","element":"img"}],[{"text":"23: ","element":"span"},{"style":{"fontWeight":"bold"},"text":"end 
while","element":"span"}],[{"id":"id-19","style":{"width":"100%"},"width":1586,"height":717,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/7-3.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Reference-advantage decompositions: the key to the almost optimal regret. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":"-learning algorithms ","element":"span"},{"href":"#id-8","referenceIndex":30,"text":"(Jin et al., ","element":"a"},{"href":"#id-8","referenceIndex":30,"text":"2018; ","element":"a"},{"href":"#id-10","referenceIndex":8,"text":"Bai et al., ","element":"a"},{"href":"#id-10","referenceIndex":8,"text":"2019; ","element":"a"},{"href":"#id-11","referenceIndex":77,"text":"Zhang et al., ","element":"a"},{"href":"#id-11","referenceIndex":77,"text":"2020; ","element":"a"},{"href":"#id-12","referenceIndex":40,"text":"Li et al., ","element":"a"},{"href":"#id-12","referenceIndex":40,"text":"2021; ","element":"a"},{"href":"#id-13","referenceIndex":82,"text":"Zheng et al., ","element":"a"},{"href":"#id-13","referenceIndex":82,"text":"2024a) ","element":"a"},{"text":"update the estimated ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":"-function in the following form: ","element":"span"},{"style":{"height":18.12},"width":830.47,"height":45.29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/8-0.png","element":"img","alt":" Qh(s, a) ← rh(s, a) + EST(Ps,a,hV ⋆h+1) + b. 
Here,","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"b > ","element":"span"},{"text":"0 ","element":"span"},{"text":"is the upper confidence bound (UCB) that promotes exploration, and EST","element":"span"},{"style":{"height":16},"width":42.57,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/8-1.png","element":"img","alt":"(·)","inline":true,"padRight":true},{"text":"represents the empirical estimation, which takes the form of a weighted sum of the historically estimated value functions for the next states following the visits to ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a, h","element":"span"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":". ","element":"span"},{"text":"This update is motivated by the Bellman optimality equation. The error EST","element":"span"},{"style":{"height":18.12},"width":438.8,"height":45.29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/8-2.png","element":"img","alt":"(Ps,a,hV ⋆h+1) − Ps,a,hV ⋆h+1 ","inline":true,"padRight":true},{"text":"can be decomposed into the variance ","element":"span"},{"text":"from the random transitions to the next states and the bias in the estimated value functions. To handle the bias that is more severe in the early visits, ","element":"span"},{"href":"#id-8","referenceIndex":30,"text":"Jin et al. ","element":"a"},{"href":"#id-8","referenceIndex":30,"text":"(2018)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-10","referenceIndex":8,"text":"Bai et al. ","element":"a"},{"href":"#id-10","referenceIndex":8,"text":"(2019)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-13","referenceIndex":82,"text":"Zheng ","element":"a"},{"href":"#id-13","referenceIndex":82,"text":"et al. 
","element":"a"},{"href":"#id-13","referenceIndex":82,"text":"(2024a) ","element":"a"},{"text":"required that the weights concentrate on the most recent ","element":"span"},{"style":{"height":16},"width":138.7,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/8-3.png","element":"img","alt":" Θ(1/H)","inline":true,"padRight":true},{"text":"proportion of visits like Equation ","element":"span"},{"href":"#id-27","text":"(11) ","element":"a"},{"text":"that only use visits in the current stage, which causes sample inefficiency and suboptimal regret.","element":"span"}],[{"text":"To address this issue, FedQ-Advantage uses the reference-advantage decomposition adopted by ","element":"span"},{"href":"#id-11","referenceIndex":77,"text":"Zhang et al. ","element":"a"},{"href":"#id-11","referenceIndex":77,"text":"(2020)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-12","referenceIndex":40,"text":"Li et al. ","element":"a"},{"href":"#id-12","referenceIndex":40,"text":"(2021)","element":"a"},{"text":". Equation ","element":"span"},{"href":"#id-27","text":"(12) ","element":"a"},{"text":"in Step 4 represents the decomposition. We decompose the estimation of ","element":"span"},{"style":{"height":16.91},"width":178.62,"height":42.28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/8-4.png","element":"img","alt":" Ps,a,hV ⋆h+1 ","inline":true,"padRight":true},{"text":"into the reference part ","element":"span"},{"style":{"height":20.07},"width":236.55,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/8-5.png","element":"img","alt":" µref,k+1h /N k+1h","inline":true,"padRight":true},{"text":"and the advantage part ","element":"span"},{"style":{"height":20.47},"width":235.64,"height":51.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/8-6.png","element":"img","alt":"µadv,k+1h /nk+1h","inline":true,"padRight":true},{"text":". 
For the advantage part, we use stage-wise mean to eliminate the large biases in early value estimations. For the reference part, since the reference function will settle after ","element":"span"},{"style":{"height":18.83},"width":189.18,"height":47.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/8-7.png","element":"img","alt":" N0 = ˜O(1)","inline":true,"padRight":true},{"text":"visits as shown in Step 5, we can neglect the bias and use the mean of all historical visits. This design reduces the error in the empirical estimation by improving the sample efficiency in the reference part and restricting the error ranges in the advantage part. The reference-advantage decomposition, together with the exponentially increasing rate of stage sizes, is the key to our improved regret compared to ","element":"span"},{"href":"#id-13","referenceIndex":82,"text":"Zheng et al. ","element":"a"},{"href":"#id-13","referenceIndex":82,"text":"(2024a) ","element":"a"},{"text":"and our linear regret speedup compared to the almost optimal regret given in ","element":"span"},{"href":"#id-11","referenceIndex":77,"text":"Zhang et al. ","element":"a"},{"href":"#id-11","referenceIndex":77,"text":"(2020)","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Heterogeneous event-triggered communication: the key to our improved communication cost. 
","element":"span"},{"text":"FedQ-Advantage uses ","element":"span"},{"style":{"height":17.9},"width":36.24,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/8-8.png","element":"img","alt":" ckh","inline":true,"padRight":true},{"text":"given in Equation ","element":"span"},{"href":"#id-25","text":"(2) ","element":"a"},{"text":"to limit the number of new visits for agent ","element":"span"},{"style":{"fontStyle":"italic"},"text":"m ","element":"span"},{"text":"in ","element":"span"},{"text":"round ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":". While the first case in Equation ","element":"span"},{"href":"#id-25","text":"(2) ","element":"a"},{"text":"is similar to the homogeneous condition in ","element":"span"},{"href":"#id-13","referenceIndex":82,"text":"Zheng ","element":"a"},{"href":"#id-13","referenceIndex":82,"text":"et al. ","element":"a"},{"href":"#id-13","referenceIndex":82,"text":"(2024a)","element":"a"},{"text":", we design the second case to allow more new visits when the number of existing visits in the current stage is small. Specifically, when ","element":"span"},{"style":{"height":17.9},"width":182.63,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/8-9.png","element":"img","alt":" nkh ≥ MH","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":17.9},"width":431.45,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/8-10.png","element":"img","alt":" ˜nkh ≤ (1 − 1/H)nkh, ckh =","inline":true},{"style":{"height":19.2},"width":540.65,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/8-11.png","element":"img","alt":"�(nkh − ˜nkh)/M� ≥ �nkh/(MH)�","inline":true},{"text":". 
Thus, FedQ-Advantage allows more visits in the early rounds ","element":"span"},{"text":"of each stage compared to ","element":"span"},{"href":"#id-13","referenceIndex":82,"text":"Zheng et al. ","element":"a"},{"href":"#id-13","referenceIndex":82,"text":"(2024a) ","element":"a"},{"text":"and reduces the number of communication rounds, which is key to our improved communication cost shown in Table ","element":"span"},{"href":"#id-14","text":"1.","element":"a"}],[{"style":{"fontWeight":"bold"},"text":"Optional forced synchronization: accommodating heterogeneous exploration speeds. ","element":"span"},{"text":"Section ","element":"span"},{"href":"#id-20","text":"3.2 ","element":"a"},{"text":"and eqs. ","element":"span"},{"href":"#id-28","text":"(4) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-28","text":"(5) ","element":"a"},{"text":"highlight the effect of the optional forced synchronization used in Step 2. Section ","element":"span"},{"href":"#id-20","text":"3.2 ","element":"a"},{"text":"shows a common limitation of new visits, which is sufficient for our linear regret speedup compared to the single-agent algorithm in ","element":"span"},{"href":"#id-11","referenceIndex":77,"text":"Zhang et al. ","element":"a"},{"href":"#id-11","referenceIndex":77,"text":"(2020)","element":"a"},{"text":". 
Next, we discuss the robustness and trade-offs of optional forced synchronization when agents explore at heterogeneous speeds.","element":"span"}],[{"text":"When optional forced synchronization is enabled (i.e., ","element":"span"},{"style":{"height":15.19},"width":241.72,"height":37.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/8-12.png","element":"img","alt":" usyn = TRUE","inline":true},{"text":"), exploration and communication occur as soon as ","element":"span"},{"style":{"fontWeight":"bold"},"text":"one ","element":"span"},{"text":"agent reaches the threshold ","element":"span"},{"style":{"height":17.9},"width":126.85,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/8-13.png","element":"img","alt":" ckh(s, a)","inline":true},{"text":". This allows faster agents to ","element":"span"},{"text":"avoid waiting for slower ones, minimizing waiting time. However, Equation ","element":"span"},{"href":"#id-28","text":"(4) ","element":"a"},{"text":"guarantees sufficient exploration by only one agent, resulting in varied episode counts across agents. This configuration is suitable for tasks sensitive to waiting time.","element":"span"}],[{"text":"When optional forced synchronization is disabled (i.e., ","element":"span"},{"style":{"height":15.59},"width":248.33,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/8-14.png","element":"img","alt":" usyn = FALSE","inline":true},{"text":"), communication occurs only after ","element":"span"},{"style":{"fontWeight":"bold"},"text":"all ","element":"span"},{"text":"agents meet the threshold ","element":"span"},{"style":{"height":17.9},"width":126.85,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/8-15.png","element":"img","alt":" ckh(s, a)","inline":true},{"text":". 
Equation ","element":"span"},{"href":"#id-28","text":"(5) ","element":"a"},{"text":"ensures sufficient exploration by all agents, ","element":"span"},{"text":"with episode counts being roughly balanced. This allows for more extensive exploration within a round, reducing communication costs but potentially increasing waiting time for faster agents.","element":"span"}]]},{"heading":"4 PERFORMANCE GUARANTEES","paragraphs":[[{"text":"Next, we provide the following regret upper bound for FedQ-Advantage.","element":"span"}],[{"id":"id-31","style":{"fontWeight":"bold"},"text":"Theorem 4.1 ","element":"span"},{"text":"(Regret of FedQ-Advantage)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"href":"#id-19","style":{"height":24.58},"width":848.41,"height":61.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/8-16.png","element":"img","alt":" ι = log(2/p) with p ∈ (0, 1) and N0 = 5184 SAH5ιβ2 +","inline":true}],[{"style":{"height":24.58},"width":428.31,"height":61.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/8-17.png","element":"img","alt":"16 MSAH3β with β ∈ (0, H]","inline":true},{"style":{"fontStyle":"italic"},"text":". 
For Algorithms ","element":"span"},{"href":"#id-22","style":{"fontStyle":"italic"},"text":"1 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"href":"#id-19","style":{"fontStyle":"italic"},"text":"2, ","element":"a"},{"style":{"fontStyle":"italic"},"text":"with probability at least ","element":"span"},{"style":{"height":17.78},"width":394.98,"height":44.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/8-18.png","element":"img","alt":" 1−(4SAT 51 +SAHT 41 +","inline":true}],[{"style":{"width":"81%"},"width":1298,"height":147,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/8-19.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Here, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"K ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is the total number of rounds, ","element":"span"},{"style":{"height":20.4},"width":274.54,"height":50.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/9-0.png","element":"img","alt":" T = H �Kk=1 nk","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is the total number of steps for each agent, ","element":"span"},{"style":{"height":20.32},"width":680.72,"height":50.81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/9-1.png","element":"img","alt":"T1 = (2+ 2H )T0 +MSAH(H +1), and ˜O","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"hides logarithmic multipliers on ","element":"span"},{"style":{"height":16},"width":387.69,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/9-2.png","element":"img","alt":" T0, M, H, S, A, 1/p and","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"poly represents some polynomial. 
This result does not depend on the value of ","element":"span"},{"style":{"height":11.59},"width":75,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/9-3.png","element":"img","alt":" usyn","inline":true},{"style":{"fontStyle":"italic"},"text":". See Equation ","element":"span"},{"href":"#id-29","text":"(26) ","element":"a"},{"style":{"fontStyle":"italic"},"text":"in Appendix ","element":"span"},{"href":"#id-30","referenceIndex":344,"style":{"fontStyle":"italic"},"text":"E ","element":"a"},{"style":{"fontStyle":"italic"},"text":"for the complete upper bound.","element":"span"}],[{"text":"Theorem ","element":"span"},{"href":"#id-31","text":"4.1 ","element":"a"},{"text":"indicates that the total regret scales as ","element":"span"},{"style":{"height":18.83},"width":278.9,"height":47.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/9-4.png","element":"img","alt":"˜O(√H2MTSA)","inline":true,"padRight":true},{"text":"when ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"is larger than some polynomial of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"MHSA ","element":"span"},{"text":"and ","element":"span"},{"style":{"height":16},"width":160.56,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/9-5.png","element":"img","alt":" β = Ω(1)","inline":true,"padRight":true},{"text":"in ","element":"span"},{"style":{"height":13.19},"width":48.02,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/9-6.png","element":"img","alt":" N0","inline":true},{"text":". 
This is almost optimal compared to the information bound ","element":"span"},{"style":{"height":18.75},"width":276.18,"height":46.87,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/9-7.png","element":"img","alt":" Ω(√H2SAMT)","inline":true,"padRight":true},{"text":"and is better than ","element":"span"},{"style":{"height":18.83},"width":278.9,"height":47.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/9-8.png","element":"img","alt":"˜O(√H3SAMT)","inline":true,"padRight":true},{"text":"for algorithms in ","element":"span"},{"href":"#id-13","referenceIndex":82,"text":"Zheng et al. ","element":"a"},{"href":"#id-13","referenceIndex":82,"text":"(2024a)","element":"a"},{"text":". When ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"= 1","element":"span"},{"text":", our regret bound becomes ","element":"span"},{"style":{"height":18.83},"width":362.4,"height":47.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/9-9.png","element":"img","alt":"˜O((1 + β)√H2TSA)","inline":true,"padRight":true},{"text":"when ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"is large, which is better than ","element":"span"},{"style":{"height":18.83},"width":429.77,"height":47.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/9-10.png","element":"img","alt":"˜O((1 + β√H)√H2TSA)","inline":true,"padRight":true},{"text":"in ","element":"span"},{"href":"#id-11","referenceIndex":77,"text":"Zhang et al. ","element":"a"},{"href":"#id-11","referenceIndex":77,"text":"(2020) ","element":"a"},{"text":"thanks to our tighter regret analysis. This also means that to reach an almost optimal regret bound, ","element":"span"},{"href":"#id-11","referenceIndex":77,"text":"Zhang et al. 
","element":"a"},{"href":"#id-11","referenceIndex":77,"text":"(2020) ","element":"a"},{"text":"requires ","element":"span"},{"style":{"height":18.3},"width":255.58,"height":45.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/9-11.png","element":"img","alt":" β ≤ O(1/√H)","inline":true,"padRight":true},{"text":"whereas FedQ-Advantage imposes only the weaker condition ","element":"span"},{"style":{"height":16},"width":157.97,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/9-12.png","element":"img","alt":" β ≤ Ω(1)","inline":true},{"text":". When ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M > ","element":"span"},{"text":"1","element":"span"},{"text":", focusing on the dominant term ","element":"span"},{"style":{"height":24.4},"width":420.34,"height":61,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/9-13.png","element":"img","alt":"˜O((1 + β)√MSAH2T)","inline":true},{"text":"when ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"is large, our algorithm achieves a near-linear regret speedup, ","element":"span"},{"text":"while the overhead term ","element":"span"},{"style":{"height":18.83},"width":383.05,"height":47.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/9-14.png","element":"img","alt":"˜O(Mpoly(HSA, 1/β))","inline":true,"padRight":true},{"text":"results from the burn-in cost of using reference-advantage decomposition ","element":"span"},{"href":"#id-11","referenceIndex":77,"text":"(Zhang et al., ","element":"a"},{"href":"#id-11","referenceIndex":77,"text":"2020)","element":"a"},{"text":", and from the ","element":"span"},{"style":{"height":16},"width":139.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/9-15.png","element":"img","alt":" Ω(HM)","inline":true,"padRight":true},{"text":"visits collected in the first stage for each 
","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a, h","element":"span"},{"text":")","element":"span"},{"text":", which serve as the multi-agent burn-in cost that is common in federated algorithms (see, e.g., ","element":"span"},{"href":"#id-13","referenceIndex":82,"text":"Zheng et al. ","element":"a"},{"href":"#id-13","referenceIndex":82,"text":"(2024a)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-4","referenceIndex":65,"text":"Woo et al. ","element":"a"},{"href":"#id-4","referenceIndex":65,"text":"(2023; ","element":"a"},{"href":"#id-15","referenceIndex":66,"text":"2024)","element":"a"},{"text":").","element":"span"}],[{"text":"Next, we discuss the improved communication cost compared to ","element":"span"},{"href":"#id-13","referenceIndex":82,"text":"Zheng et al. ","element":"a"},{"href":"#id-13","referenceIndex":82,"text":"(2024a) ","element":"a"},{"text":"as follows.","element":"span"}],[{"id":"id-32","style":{"fontWeight":"bold"},"text":"Theorem 4.2 ","element":"span"},{"text":"(Communication rounds of FedQ-Advantage)","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under Algorithms ","element":"span"},{"href":"#id-22","style":{"fontStyle":"italic"},"text":"1 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"href":"#id-19","style":{"fontStyle":"italic"},"text":"2 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"with ","element":"span"},{"style":{"height":11.59},"width":118.84,"height":28.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/9-16.png","element":"img","alt":" usyn =","inline":true,"padRight":true},{"text":"TRUE","element":"span"},{"style":{"fontStyle":"italic"},"text":", the number of communication rounds ","element":"span"},{"style":{"fontStyle":"italic"},"text":"K ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and the total number of steps ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"style":{"fontStyle":"italic"},"text":"satisfy","element":"span"}],[{"style":{"width":"63%"},"width":1005,"height":85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/9-17.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"If ","element":"span"},{"style":{"height":15.59},"width":248.33,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/9-18.png","element":"img","alt":" usyn = FALSE","inline":true},{"style":{"fontStyle":"italic"},"text":", the number of communication rounds ","element":"span"},{"style":{"fontStyle":"italic"},"text":"K ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and the total number of steps ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"style":{"fontStyle":"italic"},"text":"satisfy","element":"span"}],[{"style":{"width":"57%"},"width":919,"height":85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/9-19.png","element":"img"}],[{"text":"Theorem ","element":"span"},{"href":"#id-32","text":"4.2 
","element":"a"},{"text":"implies that if ","element":"span"},{"style":{"height":15.99},"width":338.58,"height":39.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/9-20.png","element":"img","alt":" usyn = TRUE and T","inline":true,"padRight":true},{"text":"is sufficiently large, ","element":"span"},{"style":{"height":19.2},"width":547.46,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/9-21.png","element":"img","alt":" K = O(MH2SA(log H) log T).","inline":true,"padRight":true},{"text":"Since the total number of communicated scalars is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"MHS","element":"span"},{"text":") ","element":"span"},{"text":"in each round, the total communication cost scales as ","element":"span"},{"style":{"height":17.38},"width":471.68,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/9-22.png","element":"img","alt":" O(M 2H3S2A(log H) log T)","inline":true},{"text":". Thanks to the heterogeneous design in ","element":"span"},{"style":{"height":17.9},"width":126.85,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/9-23.png","element":"img","alt":" ckh(s, a)","inline":true},{"text":", this is ","element":"span"},{"text":"better than the ","element":"span"},{"style":{"height":17.38},"width":346.2,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/9-24.png","element":"img","alt":" O(M 2H4S2A log T)","inline":true,"padRight":true},{"text":"cost of FedQ-Hoeffding and FedQ-Bernstein in ","element":"span"},{"href":"#id-13","referenceIndex":82,"text":"Zheng et al. ","element":"a"},{"href":"#id-13","referenceIndex":82,"text":"(2024a)","element":"a"},{"text":". 
If ","element":"span"},{"style":{"height":15.99},"width":394.36,"height":39.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/9-25.png","element":"img","alt":"usyn = FALSE, when T","inline":true,"padRight":true},{"text":"is sufficiently large, ","element":"span"},{"style":{"height":19.2},"width":494.16,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/9-26.png","element":"img","alt":" K = O(H2SA(log H) log T)","inline":true},{"text":", which is independent of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M","element":"span"},{"text":". Since the total number of communicated scalars is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"MHS","element":"span"},{"text":") ","element":"span"},{"text":"in each round, the total communication cost scales as ","element":"span"},{"style":{"height":17.39},"width":463.29,"height":43.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/9-27.png","element":"img","alt":" O(MH3S2A(log H) log T).","inline":true}],[{"text":"Our result for ","element":"span"},{"style":{"height":15.59},"width":259.54,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/9-28.png","element":"img","alt":" usyn = FALSE","inline":true,"padRight":true},{"text":"also implies a low policy switching cost, which is defined as the number of policy switches. Given that the cost of ","element":"span"},{"href":"#id-11","referenceIndex":77,"text":"Zhang et al. 
","element":"a"},{"href":"#id-11","referenceIndex":77,"text":"(2020) ","element":"a"},{"text":"is ","element":"span"},{"style":{"height":17.39},"width":291.8,"height":43.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/9-29.png","element":"img","alt":" O(H2SA log(T))","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":19.2},"width":494.16,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/9-30.png","element":"img","alt":"K = O(H2SA(log H) log T)","inline":true,"padRight":true},{"text":"for FedQ-Advantage, our number of communication rounds matches that policy switching cost up to a logarithmic factor under restrictions on sharing original trajectories. We also remark that Equation ","element":"span"},{"href":"#id-33","text":"(102) ","element":"a"},{"text":"in Appendix ","element":"span"},{"href":"#id-34","referenceIndex":635,"text":"F ","element":"a"},{"text":"shows that FedQ-Advantage can also reach the same local switching cost as ","element":"span"},{"href":"#id-11","referenceIndex":77,"text":"Zhang et al. ","element":"a"},{"href":"#id-11","referenceIndex":77,"text":"(2020)","element":"a"},{"text":". We refer readers to ","element":"span"},{"href":"#id-11","referenceIndex":77,"text":"Zhang et al. 
","element":"a"},{"href":"#id-11","referenceIndex":77,"text":"(2020) ","element":"a"},{"text":"for more information.","element":"span"}],[{"text":"We will provide the complete proofs of Theorems ","element":"span"},{"href":"#id-31","text":"4.1 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-32","text":"4.2 ","element":"a"},{"text":"in Appendices ","element":"span"},{"href":"#id-30","referenceIndex":344,"text":"E ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-34","referenceIndex":635,"text":"F ","element":"a"},{"text":"respectively.","element":"span"}]]},{"heading":"5 CONCLUSION","paragraphs":[[{"text":"This paper develops the model-free FRL algorithm FedQ-Advantage with provably almost optimal regret and logarithmic communication cost. Specifically, it achieves an almost-optimal regret, reaching the information bound up to a logarithmic factor and near-linear regret speedup compared to its single-agent counterpart when the time horizon is sufficiently large. Our algorithm also improves the logarithmic communication cost in the literature. Technically, our algorithm uses the UCB, reference-advantage decomposition and designs separate mechanisms for synchronization and policy switching, which can find broader applications for other RL problems.","element":"span"}]]},{"heading":"ACKNOWLEDGMENT","paragraphs":[[{"text":"The work of Z. Zheng, H. Zhang, and L. Xue was supported by the U.S. National Science Foundation under the grants DMS-1811552, DMS-1953189, and CCF-2007823 and by the U.S. National Institutes of Health under the grant 1R01GM152812.","element":"span"}]]},{"heading":"REFERENCES","paragraphs":[[{"id":"id-40","text":"Alekh Agarwal, Sham Kakade, and Lin F Yang. Model-based reinforcement learning with a generative ","element":"span"},{"text":"model is minimax optimal. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Conference on Learning Theory","element":"span"},{"text":", pp. 67–83. 
PMLR, 2020.","element":"span"}],[{"id":"id-73","text":"Mridul Agarwal, Bhargav Ganguly, and Vaneet Aggarwal. Communication efficient parallel ","element":"span"},{"text":"reinforcement learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Uncertainty in Artificial Intelligence","element":"span"},{"text":", pp. 247–256. PMLR, 2021.","element":"span"}],[{"id":"id-38","text":"Shipra Agrawal and Randy Jia. Optimistic posterior sampling for reinforcement learning: worst-case ","element":"span"},{"text":"regret bounds. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 30, 2017.","element":"span"}],[{"id":"id-79","text":"Aqeel Anwar and Arijit Raychowdhury. Multi-task federated reinforcement learning with adversaries. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2103.06473","element":"span"},{"text":", 2021.","element":"span"}],[{"id":"id-97","text":"Mahmoud Assran, Joshua Romoff, Nicolas Ballas, Joelle Pineau, and Michael Rabbat. Gossip-based ","element":"span"},{"text":"actor-learner architectures for deep reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 32, 2019.","element":"span"}],[{"id":"id-37","text":"Peter Auer, Thomas Jaksch, and Ronald Ortner. Near-optimal regret bounds for reinforcement ","element":"span"},{"text":"learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 21, 2008.","element":"span"}],[{"id":"id-16","text":"Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for ","element":"span"},{"text":"reinforcement learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pp. 263–272. 
PMLR, 2017.","element":"span"}],[{"id":"id-10","text":"Yu Bai, Tengyang Xie, Nan Jiang, and Yu-Xiang Wang. Provably efficient q-learning with low ","element":"span"},{"text":"switching cost. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 32, 2019.","element":"span"}],[{"id":"id-100","text":"Soumya Banerjee, Samia Bouzefrane, and Amar Abane. Identity management with hybrid blockchain ","element":"span"},{"text":"approach: A deliberate extension with federated-inverse-reinforcement learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"2021 IEEE 22nd International Conference on High Performance Switching and Routing (HPSR)","element":"span"},{"text":", pp. 1–6. IEEE, 2021.","element":"span"}],[{"id":"id-75","text":"Ali Beikmohammadi, Sarit Khirirat, and Sindri Magnússon. Compressed federated reinforcement ","element":"span"},{"text":"learning with a generative model. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Joint European Conference on Machine Learning and Knowledge Discovery in Databases","element":"span"},{"text":", pp. 20–37. Springer, 2024.","element":"span"}],[{"id":"id-91","text":"Tianyi Chen, Kaiqing Zhang, Georgios B Giannakis, and Tamer Başar. Communication-efficient ","element":"span"},{"text":"policy gradient methods for distributed reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE Transactions on Control of Network Systems","element":"span"},{"text":", 9(2):917–929, 2021a.","element":"span"}],[{"id":"id-2","text":"Yiding Chen, Xuezhou Zhang, Kaiqing Zhang, Mengdi Wang, and Xiaojin Zhu. Byzantine-robust ","element":"span"},{"text":"online and offline distributed reinforcement learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Artificial Intelligence and Statistics","element":"span"},{"text":", pp. 3230–3269. 
PMLR, 2023.","element":"span"}],[{"id":"id-85","text":"Ziyi Chen, Yi Zhou, and Rongrong Chen. Multi-agent off-policy tdc with near-optimal sample ","element":"span"},{"text":"and communication complexity. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"2021 55th Asilomar Conference on Signals, Systems, and Computers","element":"span"},{"text":", pp. 504–508. IEEE, 2021b.","element":"span"}],[{"id":"id-96","text":"Ziyi Chen, Yi Zhou, Rong-Rong Chen, and Shaofeng Zou. Sample and communication-efficient ","element":"span"},{"text":"decentralized actor-critic algorithms with finite-time analysis. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pp. 3794–3834. PMLR, 2022.","element":"span"}],[{"id":"id-41","text":"Christoph Dann, Lihong Li, Wei Wei, and Emma Brunskill. Policy certificates: Towards accountable ","element":"span"},{"text":"reinforcement learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pp. 1507–1516. PMLR, 2019.","element":"span"}],[{"id":"id-83","text":"Thinh Doan, Siva Maguluri, and Justin Romberg. Finite-time analysis of distributed TD(0) with linear ","element":"span"},{"text":"function approximation on multi-agent reinforcement learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pp. 1626–1635. PMLR, 2019.","element":"span"}],[{"id":"id-84","text":"Thinh T Doan, Siva Theja Maguluri, and Justin Romberg. Finite-time performance of distributed ","element":"span"},{"text":"temporal-difference learning with linear function approximation. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"SIAM Journal on Mathematics of Data Science","element":"span"},{"text":", 3(1):298–320, 2021.","element":"span"}],[{"id":"id-7","text":"Omar Darwiche Domingues, Pierre Ménard, Emilie Kaufmann, and Michal Valko. Episodic ","element":"span"},{"text":"reinforcement learning in finite mdps: Minimax lower bounds revisited. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Algorithmic Learning Theory","element":"span"},{"text":", pp. 578–598. PMLR, 2021.","element":"span"}],[{"id":"id-53","text":"Simon S Du, Jianshu Chen, Lihong Li, Lin Xiao, and Dengyong Zhou. Stochastic variance reduction ","element":"span"},{"text":"methods for policy evaluation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pp. 1049–1058. PMLR, 2017.","element":"span"}],[{"id":"id-68","text":"Abhimanyu Dubey and Alex Pentland. Provably efficient cooperative multi-agent reinforcement ","element":"span"},{"text":"learning with function approximation. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2103.04972","element":"span"},{"text":", 2021.","element":"span"}],[{"id":"id-98","text":"Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Vlad Mnih, Tom Ward, Yotam Doron, ","element":"span"},{"text":"Vlad Firoiu, Tim Harley, Iain Dunning, et al. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pp. 1407–1416. PMLR, 2018.","element":"span"}],[{"id":"id-3","text":"Flint Xiaofeng Fan, Yining Ma, Zhongxiang Dai, Wei Jing, Cheston Tan, and Bryan Kian Hsiang ","element":"span"},{"text":"Low. Fault-tolerant federated reinforcement learning with theoretical guarantee. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 34:1007–1021, 2021.","element":"span"}],[{"id":"id-78","text":"Flint Xiaofeng Fan, Yining Ma, Zhongxiang Dai, Cheston Tan, and Bryan Kian Hsiang Low. Fedhql: ","element":"span"},{"text":"Federated heterogeneous q-learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems","element":"span"},{"text":", pp. 2810–2812, 2023.","element":"span"}],[{"id":"id-93","text":"Swetha Ganesh, Jiayu Chen, Gugan Thoppe, and Vaneet Aggarwal. Global convergence guarantees ","element":"span"},{"text":"for federated policy gradient methods with adversaries. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2403.09940","element":"span"},{"text":", 2024.","element":"span"}],[{"id":"id-62","text":"Zijun Gao, Yanjun Han, Zhimei Ren, and Zhengqing Zhou. Batched multi-armed bandits problem. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 32, 2019.","element":"span"}],[{"id":"id-101","text":"Wei Gong, Linxiao Cao, Yifei Zhu, Fang Zuo, Xin He, and Haoquan Zhou. Federated inverse ","element":"span"},{"text":"reinforcement learning for smart icus with differential privacy. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE Internet of Things Journal","element":"span"},{"text":", 10(21):19117–19124, 2023.","element":"span"}],[{"id":"id-47","text":"Robert M Gower, Mark Schmidt, Francis Bach, and Peter Richtárik. Variance-reduced methods for ","element":"span"},{"text":"machine learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the IEEE","element":"span"},{"text":", 108(11):1968–1983, 2020.","element":"span"}],[{"id":"id-72","text":"Zhaohan Guo and Emma Brunskill. Concurrent pac rl. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the AAAI Conference on Artificial Intelligence","element":"span"},{"text":", volume 29, pp. 2624–2630, 2015.","element":"span"}],[{"id":"id-71","text":"Hao-Lun Hsu, Weixin Wang, Miroslav Pajic, and Pan Xu. Randomized exploration in cooperative ","element":"span"},{"text":"multi-agent reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2404.10728","element":"span"},{"text":", 2024.","element":"span"}],[{"id":"id-8","text":"Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is q-learning provably efficient? ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 31, 2018.","element":"span"}],[{"id":"id-70","text":"Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement ","element":"span"},{"text":"learning with linear function approximation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Conference on learning theory","element":"span"},{"text":", pp. 2137–2143. PMLR, 2020.","element":"span"}],[{"id":"id-76","text":"Hao Jin, Yang Peng, Wenhao Yang, Shusen Wang, and Zhihua Zhang. Federated reinforcement ","element":"span"},{"text":"learning with environment heterogeneity. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Artificial Intelligence and Statistics","element":"span"},{"text":", pp. 18–37. PMLR, 2022.","element":"span"}],[{"id":"id-48","text":"Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance ","element":"span"},{"text":"reduction. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 26, 2013.","element":"span"}],[{"id":"id-39","text":"Sham Kakade, Mengdi Wang, and Lin F Yang. 
Variance reduction methods for sublinear ","element":"span"},{"text":"reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1802.09184","element":"span"},{"text":", 2018.","element":"span"}],[{"id":"id-54","text":"Koulik Khamaru, Ashwin Pananjady, Feng Ruan, Martin J Wainwright, and Michael I Jordan. ","element":"span"},{"text":"Is temporal difference learning optimal? An instance-dependent analysis. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"SIAM Journal on Mathematics of Data Science","element":"span"},{"text":", 3(4):1013–1040, 2021.","element":"span"}],[{"id":"id-77","text":"Sajad Khodadadian, Pranay Sharma, Gauri Joshi, and Siva Theja Maguluri. Federated reinforcement ","element":"span"},{"text":"learning: Linear speedup under markovian sampling. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pp. 10997–11057. PMLR, 2022.","element":"span"}],[{"id":"id-92","text":"Guangchen Lan, Han Wang, James Anderson, Christopher Brinton, and Vaneet Aggarwal. Improved ","element":"span"},{"text":"communication efficiency in federated natural policy gradient via admm-based gradient updates. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2310.19807","element":"span"},{"text":", 2023.","element":"span"}],[{"id":"id-94","text":"Guangchen Lan, Dong-Jun Han, Abolfazl Hashemi, Vaneet Aggarwal, and Christopher G Brinton. ","element":"span"},{"text":"Asynchronous federated reinforcement learning with policy gradient updates: Algorithm design and convergence analysis. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2404.08003","element":"span"},{"text":", 2024.","element":"span"}],[{"id":"id-59","text":"Gen Li, Yuting Wei, Yuejie Chi, Yuantao Gu, and Yuxin Chen. 
Sample complexity of asynchronous ","element":"span"},{"text":"q-learning: Sharper analysis and variance reduction. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 33:7031–7043, 2020.","element":"span"}],[{"id":"id-12","text":"Gen Li, Laixi Shi, Yuxin Chen, Yuantao Gu, and Yuejie Chi. Breaking the sample complexity barrier ","element":"span"},{"text":"to regret-optimal model-free reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 34:17762–17776, 2021.","element":"span"}],[{"id":"id-90","text":"Rui Liu and Alex Olshevsky. Distributed TD(0) with almost no communication. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE Control Systems Letters","element":"span"},{"text":", 2023.","element":"span"}],[{"id":"id-102","text":"Shicheng Liu and Minghui Zhu. Distributed inverse constrained reinforcement learning for multi-","element":"span"},{"text":"agent systems. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 35:33444–33456, 2022.","element":"span"}],[{"id":"id-103","text":"Shicheng Liu and Minghui Zhu. Meta inverse constrained reinforcement learning: Convergence ","element":"span"},{"text":"guarantee and generalization analysis. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The Twelfth International Conference on Learning Representations","element":"span"},{"text":", 2023.","element":"span"}],[{"id":"id-104","text":"Shicheng Liu and Minghui Zhu. Learning multi-agent behaviors from distributed and streaming ","element":"span"},{"text":"demonstrations. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 36, 2024.","element":"span"}],[{"id":"id-105","text":"Shicheng Liu and Minghui Zhu. In-trajectory inverse reinforcement learning: Learn incrementally ","element":"span"},{"text":"before an ongoing trajectory terminates. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The Thirty-eighth Annual Conference on Neural Information Processing Systems","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-1","text":"Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. ","element":"span"},{"text":"Communication-efficient learning of deep networks from decentralized data. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 20th International Conference on Artificial Intelligence and Statistics","element":"span"},{"text":", volume 54, pp. 1273–1282. PMLR, 2017.","element":"span"}],[{"id":"id-46","text":"Pierre Ménard, Omar Darwiche Domingues, Xuedong Shang, and Michal Valko. Ucb momentum q-","element":"span"},{"text":"learning: Correcting the bias without forgetting. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pp. 7609–7618. PMLR, 2021.","element":"span"}],[{"id":"id-69","text":"Yifei Min, Jiafan He, Tianhao Wang, and Quanquan Gu. Cooperative multi-agent reinforcement ","element":"span"},{"text":"learning: Asynchronous communication and linear function approximation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pp. 24785–24811. PMLR, 2023.","element":"span"}],[{"id":"id-99","text":"Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim ","element":"span"},{"text":"Harley, David Silver, and Koray Kavukcuoglu. 
Asynchronous methods for deep reinforcement learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pp. 1928–1937. PMLR, 2016.","element":"span"}],[{"id":"id-49","text":"Lam M Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč. Sarah: A novel method for machine ","element":"span"},{"text":"learning problems using stochastic recursive gradient. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pp. 2613–2621. PMLR, 2017.","element":"span"}],[{"id":"id-61","text":"Vianney Perchet, Philippe Rigollet, Sylvain Chassang, and Erik Snowberg. Batched bandit problems. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The Annals of Statistics","element":"span"},{"text":", 44(2):660–681, 2016.","element":"span"}],[{"id":"id-65","text":"Dan Qiao, Ming Yin, Ming Min, and Yu-Xiang Wang. Sample-efficient reinforcement learning with ","element":"span"},{"text":"loglog(T) switching cost. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pp. 18031–18061. PMLR, 2022.","element":"span"}],[{"id":"id-95","text":"Han Shen, Kaiqing Zhang, Mingyi Hong, and Tianyi Chen. Towards understanding asynchronous ","element":"span"},{"text":"advantage actor-critic: Convergence and linear speedup. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE Transactions on Signal Processing","element":"span"},{"text":", 2023.","element":"span"}],[{"id":"id-57","text":"Laixi Shi, Gen Li, Yuting Wei, Yuxin Chen, and Yuejie Chi. Pessimistic q-learning for offline ","element":"span"},{"text":"reinforcement learning: Towards optimal sample complexity. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pp. 19967–20025. 
PMLR, 2022.","element":"span"}],[{"id":"id-50","text":"Aaron Sidford, Mengdi Wang, Xian Wu, Lin Yang, and Yinyu Ye. Near-optimal time and sample ","element":"span"},{"text":"complexities for solving markov decision processes with a generative model. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 31, 2018.","element":"span"}],[{"id":"id-51","text":"Aaron Sidford, Mengdi Wang, Xian Wu, and Yinyu Ye. Variance reduced value iteration and faster ","element":"span"},{"text":"algorithms for solving markov decision processes. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Naval Research Logistics (NRL)","element":"span"},{"text":", 70(5):423–442, 2023.","element":"span"}],[{"id":"id-86","text":"Jun Sun, Gang Wang, Georgios B Giannakis, Qinmin Yang, and Zaiyue Yang. Finite-time analysis of ","element":"span"},{"text":"decentralized temporal-difference learning with linear function approximation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Artificial Intelligence and Statistics","element":"span"},{"text":", pp. 4485–4495. PMLR, 2020.","element":"span"}],[{"id":"id-0","text":"R Sutton and A Barto. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Reinforcement Learning: An Introduction","element":"span"},{"text":". MIT Press, 2018.","element":"span"}],[{"id":"id-87","text":"Hoi-To Wai. On the convergence of consensus algorithms with markovian noise and gradient bias. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"2020 59th IEEE Conference on Decision and Control (CDC)","element":"span"},{"text":", pp. 4897–4902. IEEE, 2020.","element":"span"}],[{"id":"id-55","text":"Hoi-To Wai, Mingyi Hong, Zhuoran Yang, Zhaoran Wang, and Kexin Tang. Variance reduced policy ","element":"span"},{"text":"evaluation with smooth function approximation. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 32, 2019.","element":"span"}],[{"id":"id-52","text":"Martin J Wainwright. ","element":"span"},{"text":"Variance-reduced ","element":"span"},{"style":{"fontStyle":"italic"},"text":"q","element":"span"},{"text":"-learning is minimax optimal. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1906.04697","element":"span"},{"text":", 2019.","element":"span"}],[{"id":"id-88","text":"Gang Wang, Songtao Lu, Georgios Giannakis, Gerald Tesauro, and Jian Sun. Decentralized td ","element":"span"},{"text":"tracking with linear function approximation and its finite-time analysis. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 33:13762–13772, 2020.","element":"span"}],[{"id":"id-63","text":"Tianhao Wang, Dongruo Zhou, and Quanquan Gu. Provably efficient reinforcement learning with ","element":"span"},{"text":"linear function approximation under adaptivity constraints. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 34:13524–13536, 2021.","element":"span"}],[{"id":"id-6","text":"C. J. C. H. Watkins. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Learning from Delayed Rewards","element":"span"},{"text":". PhD thesis, King’s College, Oxford, 1989.","element":"span"}],[{"id":"id-4","text":"Jiin Woo, Gauri Joshi, and Yuejie Chi. The blessing of heterogeneity in federated q-learning: Linear ","element":"span"},{"text":"speedup and beyond. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pp. 37157–37216, 2023.","element":"span"}],[{"id":"id-15","text":"Jiin Woo, Laixi Shi, Gauri Joshi, and Yuejie Chi. 
Federated offline reinforcement learning: Collaborative ","element":"span"},{"text":"single-policy coverage suffices. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", 2024.","element":"span"}],[{"id":"id-74","text":"Zhaoxian Wu, Han Shen, Tianyi Chen, and Qing Ling. Byzantine-resilient decentralized policy ","element":"span"},{"text":"evaluation with linear function approximation. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE Transactions on Signal Processing","element":"span"},{"text":", 69:3839–3853, 2021.","element":"span"}],[{"id":"id-56","text":"Tengyu Xu, Zhe Wang, Yi Zhou, and Yingbin Liang. Reanalysis of variance reduced temporal ","element":"span"},{"text":"difference learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2001.01898","element":"span"},{"text":", 2020.","element":"span"}],[{"id":"id-60","text":"Yuling Yan, Gen Li, Yuxin Chen, and Jianqing Fan. The efficacy of pessimism in asynchronous ","element":"span"},{"text":"q-learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE Transactions on Information Theory","element":"span"},{"text":", 2023.","element":"span"}],[{"id":"id-45","text":"Kunhe Yang, Lin Yang, and Simon Du. Q-learning with logarithmic regret. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Artificial Intelligence and Statistics","element":"span"},{"text":", pp. 1576–1584. PMLR, 2021.","element":"span"}],[{"id":"id-81","text":"Tong Yang, Shicong Cen, Yuting Wei, Yuxin Chen, and Yuejie Chi. Federated natural policy gradient ","element":"span"},{"text":"methods for multi-task reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2311.00201","element":"span"},{"text":", 2023.","element":"span"}],[{"id":"id-58","text":"Ming Yin, Yu Bai, and Yu-Xiang Wang. 
Near-optimal offline reinforcement learning via double ","element":"span"},{"text":"variance reduction. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 34:7677–7688, 2021.","element":"span"}],[{"id":"id-42","text":"Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement ","element":"span"},{"text":"learning without domain knowledge using value function bounds. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pp. 7304–7312. PMLR, 2019.","element":"span"}],[{"id":"id-89","text":"Sihan Zeng, Thinh T Doan, and Justin Romberg. Finite-time analysis of decentralized stochastic ","element":"span"},{"text":"approximation with applications in multi-agent and multi-task learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"2021 60th IEEE Conference on Decision and Control (CDC)","element":"span"},{"text":", pp. 2641–2646. IEEE, 2021.","element":"span"}],[{"id":"id-82","text":"Chenyu Zhang, Han Wang, Aritra Mitra, and James Anderson. Finite-time analysis of on-policy ","element":"span"},{"text":"heterogeneous federated reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2401.15273","element":"span"},{"text":", 2024.","element":"span"}],[{"id":"id-67","text":"Haochen Zhang, Zhong Zheng, and Lingzhou Xue. Gap-dependent bounds for federated ","element":"span"},{"style":{"fontStyle":"italic"},"text":"q","element":"span"},{"text":"-learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2502.02859","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-11","text":"Zihan Zhang, Yuan Zhou, and Xiangyang Ji. Almost optimal model-free reinforcement learning ","element":"span"},{"text":"via reference-advantage decomposition. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 33: 15198–15207, 2020.","element":"span"}],[{"id":"id-43","text":"Zihan Zhang, Xiangyang Ji, and Simon Du. Is reinforcement learning more difficult than bandits? ","element":"span"},{"text":"a near-optimal algorithm escaping the curse of horizon. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Conference on Learning Theory","element":"span"},{"text":", pp. 4528–4531. PMLR, 2021.","element":"span"}],[{"id":"id-64","text":"Zihan Zhang, Yuhang Jiang, Yuan Zhou, and Xiangyang Ji. Near-optimal regret bounds for multi- ","element":"span"},{"text":"batch reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 35:24586– 24596, 2022.","element":"span"}],[{"id":"id-9","text":"Zihan Zhang, Yuxin Chen, Jason D Lee, and Simon S Du. Settling the sample complexity of online ","element":"span"},{"text":"reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2307.13586","element":"span"},{"text":", 2023.","element":"span"}],[{"id":"id-80","text":"Fangyuan Zhao, Xuebin Ren, Shusen Yang, Peng Zhao, Rui Zhang, and Xinxin Xu. Federated ","element":"span"},{"text":"multi-objective reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Information Sciences","element":"span"},{"text":", 624:811–832, 2023.","element":"span"}],[{"id":"id-13","text":"Zhong Zheng, Fengyu Gao, Lingzhou Xue, and Jing Yang. Federated q-learning: Linear regret ","element":"span"},{"text":"speedup with low communication cost. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The Twelfth International Conference on Learning Representations","element":"span"},{"text":", 2024a.","element":"span"}],[{"id":"id-66","text":"Zhong Zheng, Haochen Zhang, and Lingzhou Xue. 
Gap-dependent bounds for q-learning using ","element":"span"},{"text":"reference-advantage decomposition. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2410.07574","element":"span"},{"text":", 2024b.","element":"span"}],[{"id":"id-44","text":"Runlong Zhou, Zihan Zhang, and Simon Shaolei Du. Sharp variance-dependent bounds in reinforcement ","element":"span"},{"text":"learning: Best of both worlds in stochastic and deterministic environments. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pp. 42878–42914. PMLR, 2023.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Organization of the appendix. ","element":"span"},{"text":"In the appendix, we first provide two tables to summarize the","element":"span"}],[{"text":"notations in Section ","element":"span"},{"href":"#id-18","referenceIndex":90,"text":"A. ","element":"a"},{"text":"Section ","element":"span"},{"href":"#id-5","referenceIndex":114,"text":"B ","element":"a"},{"text":"provides related works. Section ","element":"span"},{"href":"#id-35","referenceIndex":184,"text":"C ","element":"a"},{"text":"presents numerical results.","element":"span"}],[{"text":"Section ","element":"span"},{"href":"#id-36","referenceIndex":229,"text":"D ","element":"a"},{"text":"provides basic facts of FedQ-Advantage and a lemma of concentration inequalities for","element":"span"}],[{"text":"regret analysis. Section ","element":"span"},{"href":"#id-30","referenceIndex":344,"text":"E ","element":"a"},{"text":"presents the proof of Theorem ","element":"span"},{"href":"#id-31","text":"4.1 ","element":"a"},{"text":"(regret). 
Section ","element":"span"},{"href":"#id-34","referenceIndex":635,"text":"F ","element":"a"},{"text":"presents the proof of","element":"span"}],[{"text":"Theorem ","element":"span"},{"href":"#id-32","text":"4.2 ","element":"a"},{"text":"(communication cost).","element":"span"}],[{"id":"id-18","text":"A ","element":"span"},{"text":"N","element":"span"},{"text":"OTATION TABLES","element":"span"}],[{"text":"In this section, we provide two notation tables to enhance the readability of the paper. The notations","element":"span"}],[{"text":"are categorized into two groups: one group consists of global variables utilized for central server","element":"span"}],[{"text":"aggregation, while the other group consists of local variables employed for agent training. First,","element":"span"}],[{"text":"Table 2: Global Variables","element":"figcaption","subtype":"caption"}],[{"style":{"width":"100%"},"width":1586,"height":1762,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/15-0.png","element":"img"}],[{"text":"we introduce some local quantities. 
For ","element":"span"},{"style":{"height":14},"width":199.63,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/15-1.png","element":"img","alt":" f : S → R","inline":true},{"text":", we define ","element":"span"},{"style":{"height":25.61},"width":566.12,"height":64.02,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/15-2.png","element":"img","alt":" A^k_{m,s,a,h}(f) = ∑_{j=1}^{n^{m,k}} f(s_{h+1}^{k,m,j}) ×","inline":true}],[{"style":{"height":20.07},"width":349.22,"height":50.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/15-3.png","element":"img","alt":"I[(s, a)_h^{k,m,j} = (s, a)]","inline":true,"padRight":true},{"text":"as the summation of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"on the next states for all the visits to ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a, h","element":"span"},{"text":") ","element":"span"},{"text":"in round ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"}],[{"text":"Table 3: Local Variables","element":"figcaption","subtype":"caption"}],[{"style":{"width":"100%"},"width":1586,"height":631,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/16-0.png","element":"img"}],[{"text":"for agent ","element":"span"},{"style":{"fontStyle":"italic"},"text":"m","element":"span"},{"text":". 
When there is no ambiguity, we will use the simplified notation ","element":"span"},{"style":{"height":20.3},"width":370.23,"height":50.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/16-1.png","element":"img","alt":" Akm(f) = Akm,s,a,h(f).","inline":true}],[{"text":"Then, we let ","element":"span"},{"style":{"height":22.87},"width":1344.02,"height":57.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/16-2.png","element":"img","alt":" nm,kh (s, a) = Akm(1), µm,kh,ref(s, a) = Akm(V ref,kh+1 ), µm,kh,adv(s, a) = Akm(V adv,kh+1 ),","inline":true}],[{"style":{"height":22.47},"width":916.66,"height":56.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/16-3.png","element":"img","alt":"µm,kh,val(s, a) = Akm(V kh+1), σm,kh,ref(s, a) = Akm([V ref,kh+1 ]2)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":22.87},"width":486.02,"height":57.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/16-4.png","element":"img","alt":" σm,kh,adv(s, a) = Akm([V adv,kh+1 ]2)","inline":true},{"text":". For","element":"span"}],[{"text":"these functions of ","element":"span"},{"style":{"height":20.07},"width":187.07,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/16-5.png","element":"img","alt":" (s, a), nm,kh","inline":true,"padRight":true},{"text":"is the local count of visits for agent ","element":"span"},{"style":{"fontStyle":"italic"},"text":"m ","element":"span"},{"text":"in round ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":", and the remaining","element":"span"}],[{"text":"ones are local summations related to the reference function, the estimated advantage functions, and","element":"span"}],[{"text":"the estimated value functions for visits.","element":"span"}],[{"text":"Accordingly, we define some global quantities. ","element":"span"},{"text":"First, we focus on visiting counts. 
","element":"span"},{"text":"We let","element":"span"}],[{"style":{"height":24.25},"width":561.4,"height":60.62,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/16-6.png","element":"img","alt":"N kh(s, a) = �k′:tk′h 0","inline":true},{"text":", we know ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"style":{"height":17.9},"width":185.36,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/21-20.png","element":"img","alt":"kh(s, a) > 1","inline":true,"padRight":true},{"text":"and then ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"style":{"height":17.9},"width":438.92,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/21-21.png","element":"img","alt":"kh(s, a) ≥ Y 1h (s, a) ≥ MH","inline":true},{"text":". There- ","element":"span"},{"text":"fore, ","element":"span"},{"style":{"height":19.92},"width":783.77,"height":49.79,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/21-22.png","element":"img","alt":" ˆnk+1h (s, a) ≤ nkh(s, a) + M ≤ (1 + 2H )nkh(s, a).","inline":true}],[{"text":"If ","element":"span"},{"style":{"height":17.9},"width":853.61,"height":44.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/21-23.png","element":"img","alt":" (1 − 1/H)nkh(s, a) < ˜nkh(s, a) ≤ (1 + 1/H)nkh(s, a)","inline":true},{"text":", then according to Equation ","element":"span"},{"href":"#id-25","text":"(2) ","element":"a"},{"text":"we ","element":"span"},{"text":"have:","element":"span"}],[{"style":{"width":"90%"},"width":1439,"height":162,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/21-24.png","element":"img"}],[{"text":"(e) For ","element":"span"},{"style":{"height":16},"width":255.3,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/22-0.png","element":"img","alt":" t ≤ Th(s, a)−2","inline":true},{"text":", there exists a round 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"satisfying ","element":"span"},{"style":{"height":19.23},"width":614.68,"height":48.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/22-1.png","element":"img","alt":" tkh(s, a) = t+1 and tk+1h (s, a) = t+2.","inline":true,"padRight":true},{"text":"Then according to Equation ","element":"span"},{"href":"#id-21","text":"(7)","element":"a"},{"text":", we have ","element":"span"},{"style":{"height":19.23},"width":793.29,"height":48.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/22-2.png","element":"img","alt":" yt+1h (s, a) = ˆnk+1h (s, a) ≥ (1 + 1/H)nkh(s, a) =","inline":true},{"style":{"height":17.5},"width":312.8,"height":43.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/22-3.png","element":"img","alt":"(1 + 1/H)yth(s, a)","inline":true},{"text":". Moreover, according to (d), we have ","element":"span"},{"style":{"height":19.23},"width":460.65,"height":48.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/22-4.png","element":"img","alt":" yt+1h (s, a) = ˆnk+1h (s, a) ≤","inline":true},{"style":{"height":17.9},"width":676.47,"height":44.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/22-5.png","element":"img","alt":"(1 + 2/H)nkh(s, a) = (1 + 2/H)yth(s, a).","inline":true}],[{"text":"For ","element":"span"},{"style":{"height":16},"width":283.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/22-6.png","element":"img","alt":" t = Th(s, a) − 1","inline":true},{"text":", we have ","element":"span"},{"style":{"height":19.59},"width":924.06,"height":48.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/22-7.png","element":"img","alt":" yThh (s, a) = ˆnK+1h (s, a) ≤ (1 + 2/H)nKh (s, a) = (1 
+","inline":true,"padRight":true},{"text":"2","element":"span"},{"style":{"height":19.59},"width":271.29,"height":48.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/22-8.png","element":"img","alt":"/H)yTh−1h (s, a).","inline":true}],[{"style":{"width":"91%"},"width":1456,"height":409,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/22-9.png","element":"img"}],[{"text":"If ","element":"span"},{"style":{"height":30.89},"width":787.55,"height":77.22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/22-10.png","element":"img","alt":"Y t−1h (s,a)yt−1h (s,a) ≤ 2H, then for t (2 ≤ t ≤ Th(s, a) − 1","inline":true},{"text":"), according to (e), we have ","element":"span"},{"style":{"height":17.5},"width":170.71,"height":43.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/22-11.png","element":"img","alt":" yth(s, a) ≥","inline":true},{"style":{"height":19.23},"width":455.29,"height":48.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/22-12.png","element":"img","alt":"(1 + 1/H)yt−1h (s, a). Then:","inline":true}],[{"style":{"width":"84%"},"width":1339,"height":774,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/22-13.png","element":"img"}],[{"text":"For Equation ","element":"span"},{"href":"#id-112","text":"(17)","element":"a"},{"text":", according to (e), we have ","element":"span"},{"style":{"height":19.23},"width":742.44,"height":48.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/22-14.png","element":"img","alt":" yth(s, a) ≤ (1 + 2/H)h−1yt−h+1h (s, a) for any","inline":true},{"style":{"height":16},"width":245.38,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/22-15.png","element":"img","alt":"h ∈ [H]. 
Then:","inline":true}],[{"style":{"width":"95%"},"width":1511,"height":747,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/22-16.png","element":"img"}],[{"style":{"width":"84%"},"width":1342,"height":246,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/23-0.png","element":"img"}],[{"text":"The last inequality holds because ","element":"span"},{"style":{"height":17.9},"width":539.12,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/23-1.png","element":"img","alt":" y_h^1(s, a) = Y_h^1(s, a) ≤ MH + M","inline":true,"padRight":true},{"text":"according to (c). Moreover, ","element":"span"},{"text":"according to Equation ","element":"span"},{"href":"#id-112","text":"(17)","element":"a"},{"text":", we have","element":"span"}],[{"style":{"width":"94%"},"width":1502,"height":1552,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/23-2.png","element":"img"}],[{"text":"The last inequality holds because ","element":"span"},{"style":{"height":21.97},"width":427.61,"height":54.92,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/23-3.png","element":"img","alt":"∑_{s,a,h} Y_h^{T_h−1}(s, a) ≤ T_0","inline":true,"padRight":true},{"text":"according to the algorithm. 
Because ","element":"span"},{"style":{"height":17.23},"width":123.3,"height":43.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/23-4.png","element":"img","alt":" T0 ≤ ˆT","inline":true,"padRight":true},{"text":"from (a), we have:","element":"span"}],[{"style":{"width":"42%"},"width":679,"height":82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/23-5.png","element":"img"}],[{"text":"(j) According to Equation ","element":"span"},{"href":"#id-113","text":"(10)","element":"a"},{"text":", we know ","element":"span"},{"style":{"height":17.9},"width":141.1,"height":44.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/23-6.png","element":"img","alt":" Qkh(s, a)","inline":true,"padRight":true},{"text":"is non-increasing with respect to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":". Then ","element":"span"},{"text":"based on the update rule Equation ","element":"span"},{"href":"#id-114","text":"(13) ","element":"a"},{"text":"for ","element":"span"},{"style":{"height":17.9},"width":140.66,"height":44.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/23-7.png","element":"img","alt":" V kh (s, a)","inline":true,"padRight":true},{"text":"and the update rule Equation ","element":"span"},{"href":"#id-24","text":"(14) ","element":"a"},{"text":"for ","element":"span"},{"style":{"height":20.07},"width":181.09,"height":50.16,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/23-8.png","element":"img","alt":"V ref,kh (s, a)","inline":true},{"text":", we find that they are also non-increasing with respect to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":".","element":"span"}],[{"style":{"width":"1%"},"width":28,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/23-9.png","element":"img"}],[{"id":"id-115","text":"Next, we provide Lemma 
","element":"span"},{"href":"#id-115","referenceIndex":251,"text":"D.2 ","element":"a"},{"text":"that discusses the weighted sum of all the steps.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma D.2. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"For any non-negative weight sequence ","element":"span"},{"style":{"height":19.18},"width":551.54,"height":47.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/24-0.png","element":"img","alt":" {ωh(s, a)}s,a,h and any α ∈ (0, 1)","inline":true},{"style":{"fontStyle":"italic"},"text":", it holds that","element":"span"}],[{"style":{"width":"83%"},"width":1323,"height":124,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/24-1.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"and","element":"span"}],[{"style":{"width":"83%"},"width":1327,"height":124,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/24-2.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"For ","element":"span"},{"style":{"height":10.8},"width":98.79,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/24-3.png","element":"img","alt":" α = 1","inline":true},{"style":{"fontStyle":"italic"},"text":", it holds that","element":"span"}],[{"style":{"width":"78%"},"width":1242,"height":124,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/24-4.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"and","element":"span"}],[{"style":{"width":"79%"},"width":1267,"height":124,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/24-5.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Here, ","element":"span"},{"style":{"height":20.07},"width":445.86,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/24-6.png","element":"img","alt":" tk,m,jh = tkh(sk,m,jh , ak,m,jh ).","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof of Lemma 
","element":"span"},{"href":"#id-115","referenceIndex":251,"style":{"fontStyle":"italic"},"text":"D.2. ","element":"a"},{"text":"According to Equation ","element":"span"},{"href":"#id-116","text":"(15)","element":"a"},{"text":", for any ","element":"span"},{"style":{"height":20.07},"width":561.59,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/24-7.png","element":"img","alt":" (sk,m,jh , ak,m,jh , h) ∈ S × A × [H]","inline":true,"padRight":true},{"text":"and","element":"span"}],[{"style":{"height":20.07},"width":179.06,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/24-8.png","element":"img","alt":"tk,m,jh > 1,","inline":true}],[{"style":{"width":"56%"},"width":903,"height":137,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/24-9.png","element":"img"}],[{"text":"Therefore, we only need to prove the first and the third inequalities.","element":"span"}],[{"text":"We first provide two conclusions. For ","element":"span"},{"style":{"height":13.6},"width":418.2,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/24-10.png","element":"img","alt":" 1 ≤ x ≤ 4 and 0 < α < 1","inline":true},{"text":", it holds:","element":"span"}],[{"id":"id-117","style":{"width":"78%"},"width":1239,"height":214,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/24-11.png","element":"img"}],[{"text":"Next, we go back to the proof. 
For any ","element":"span"},{"href":"#id-26","referenceIndex":246,"style":{"height":16},"width":425.3,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/24-12.png","element":"img","alt":" (s, a, h) ∈ S × A × [H]","inline":true,"padRight":true},{"text":"and ","element":"span"},{"href":"#id-117","style":{"height":16},"width":309.79,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/24-13.png","element":"img","alt":" 2 ≤ t ≤ Th(s, a)","inline":true},{"text":", let","element":"span"}],[{"style":{"height":29.46},"width":228.87,"height":73.64,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/24-14.png","element":"img","alt":"x = Y th(s,a)Y t−1h (s,a)","inline":true},{"text":". According to (f) in Lemma ","element":"span"},{"href":"#id-26","referenceIndex":246,"text":"D.1, ","element":"a"},{"text":"we have ","element":"span"},{"style":{"height":13.2},"width":184.9,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/24-15.png","element":"img","alt":" 1 ≤ x ≤ 4","inline":true},{"text":". 
Using Equation ","element":"span"},{"href":"#id-117","text":"(18)","element":"a"},{"text":" and","element":"span"}],[{"text":"Equation ","element":"span"},{"href":"#id-117","text":"(19)","element":"a"},{"text":", it holds:","element":"span"}],[{"id":"id-118","style":{"width":"79%"},"width":1267,"height":102,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/24-16.png","element":"img"}],[{"text":"and","element":"span"}],[{"id":"id-121","style":{"width":"74%"},"width":1188,"height":102,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/24-17.png","element":"img"}],[{"text":"Now,","element":"span"}],[{"style":{"width":"84%"},"width":1339,"height":693,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/25-0.png","element":"img"}],[{"text":"In the last equality, ","element":"span"},{"style":{"height":20.07},"width":727.17,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/25-1.png","element":"img","alt":" I[(sk,m,jh , ak,m,jh ) = (s, a), tkh(s, a) = t] = 1","inline":true,"padRight":true},{"text":"if and only if ","element":"span"},{"style":{"height":20.07},"width":296.15,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/25-2.png","element":"img","alt":" (sk,m,jh , ak,m,jh ) =","inline":true}],[{"style":{"height":22.44},"width":1589.12,"height":56.11,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/25-3.png","element":"img","alt":"(s, a) and k in stage t of (s, a, h), so ∑k,m,j I[(sk,m,jh , ak,m,jh ) = (s, a), tkh(s, a) = t] = yth(s, a). 
In","inline":true}],[{"text":"this case, ","element":"span"},{"style":{"height":19.23},"width":373.32,"height":48.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/25-4.png","element":"img","alt":" N kh(s, a) = Y t−1h (s, a)","inline":true,"padRight":true},{"text":"and then we have:","element":"span"}],[{"id":"id-119","style":{"width":"87%"},"width":1389,"height":124,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/25-5.png","element":"img"}],[{"text":"Summing Equation ","element":"span"},{"href":"#id-118","text":"(20) ","element":"a"},{"text":"for ","element":"span"},{"style":{"height":16},"width":749.19,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/25-6.png","element":"img","alt":" 2 ≤ t ≤ Th(s, a), for any 0 < α < 1, we have:","inline":true}],[{"id":"id-120","style":{"width":"95%"},"width":1517,"height":117,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/25-7.png","element":"img"}],[{"text":"Combining Equation ","element":"span"},{"href":"#id-119","text":"(22) ","element":"a"},{"text":"and Equation ","element":"span"},{"href":"#id-120","text":"(23)","element":"a"},{"text":", we finish the proof of the first inequality.","element":"span"}],[{"text":"In Equation ","element":"span"},{"href":"#id-119","text":"(22)","element":"a"},{"text":", let ","element":"span"},{"style":{"height":13.6},"width":259.9,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/25-8.png","element":"img","alt":" α = 1, we have:","inline":true}],[{"id":"id-122","style":{"width":"88%"},"width":1399,"height":124,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/25-9.png","element":"img"}],[{"text":"Summing Equation ","element":"span"},{"href":"#id-121","text":"(21) ","element":"a"},{"text":"for 
","element":"span"},{"style":{"height":16},"width":273.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/25-10.png","element":"img","alt":" 2 ≤ t ≤ Th(s, a)","inline":true,"padRight":true},{"text":", we have:","element":"span"}],[{"id":"id-123","style":{"width":"90%"},"width":1429,"height":117,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/25-11.png","element":"img"}],[{"text":"Combining Equation ","element":"span"},{"href":"#id-122","text":"(24) ","element":"a"},{"text":"with Equation ","element":"span"},{"href":"#id-123","text":"(25)","element":"a"},{"text":", we finish the proof of the third inequality. Then we","element":"span"}],[{"text":"finish the proof.","element":"span"}],[{"text":"Next, we provide auxiliary lemmas.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma D.3. ","element":"span"},{"text":"(Azuma-Hoeffding Inequality) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Suppose ","element":"span"},{"style":{"height":18},"width":150.55,"height":44.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/25-12.png","element":"img","alt":" {Xk}∞k=0 ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is a martingale and ","element":"span"},{"style":{"height":16},"width":256.18,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/25-13.png","element":"img","alt":" |Xk − Xk−1| ≤","inline":true}],[{"style":{"height":14.79},"width":203.2,"height":36.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/25-14.png","element":"img","alt":"ck, ∀k ∈ N+","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"almost surely. 
Then for any positive integer ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and any positive real number ","element":"span"},{"style":{"height":13.6},"width":152.09,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/25-15.png","element":"img","alt":" ϵ, it holds","inline":true}],[{"style":{"fontStyle":"italic"},"text":"that:","element":"span"}],[{"style":{"width":"47%"},"width":755,"height":100,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/25-16.png","element":"img"}],[{"id":"id-125","style":{"fontWeight":"bold"},"text":"Lemma D.4. ","element":"span"},{"text":"(Lemma 10 of ","element":"span"},{"href":"#id-11","referenceIndex":77,"text":"Zhang et al. ","element":"a"},{"href":"#id-11","referenceIndex":77,"text":"(2020)","element":"a"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":18},"width":160.37,"height":44.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/25-17.png","element":"img","alt":" {Mn}∞n=0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be a martingale such that","element":"span"}],[{"style":{"height":13.19},"width":166.14,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/25-18.png","element":"img","alt":"M0 = 0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":16},"width":354.54,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/25-19.png","element":"img","alt":" |Mn − Mn−1| ≤ c","inline":true},{"style":{"fontStyle":"italic"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"href":"#id-11","referenceIndex":77,"style":{"height":18.17},"width":709.35,"height":45.42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/25-20.png","element":"img","alt":" V arn = �nk=1 E[(Mk − Mk−1)2|Fk−1]","inline":true},{"style":{"fontStyle":"italic"},"text":", where","element":"span"}],[{"style":{"height":16},"width":496.35,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/25-21.png","element":"img","alt":"Fk−1 = σ(M0, M1, ..., Mk−1)","inline":true},{"style":{"fontStyle":"italic"},"text":". Then for any positive integer ","element":"span"},{"style":{"height":14.4},"width":298.2,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/25-22.png","element":"img","alt":" n and any ϵ, p > 0","inline":true},{"style":{"fontStyle":"italic"},"text":", we have that:","element":"span"}],[{"style":{"width":"83%"},"width":1317,"height":73,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/25-23.png","element":"img"}],[{"text":"At the end of this section, we provide a lemma of concentration inequalities.","element":"span"}],[{"id":"id-124","style":{"fontWeight":"bold"},"text":"Lemma D.5. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":16},"width":227.29,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/26-0.png","element":"img","alt":" ι = log(2/p)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"with ","element":"span"},{"style":{"height":16},"width":174.99,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/26-1.png","element":"img","alt":" p ∈ (0, 1)","inline":true},{"style":{"fontStyle":"italic"},"text":". 
Using ","element":"span"},{"style":{"height":16},"width":191.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/26-2.png","element":"img","alt":" ∀(s, a, h, k)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"as the simplified notation","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"for ","element":"span"},{"style":{"height":16},"width":590.69,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/26-3.png","element":"img","alt":" ∀(s, a, h, k) ∈ S × A × [H] × [K]","inline":true},{"style":{"fontStyle":"italic"},"text":". For any function ","element":"span"},{"style":{"height":14},"width":201.2,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/26-4.png","element":"img","alt":" f : S → R","inline":true},{"style":{"fontStyle":"italic"},"text":", we denote ","element":"span"},{"style":{"height":16.79},"width":204.05,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/26-5.png","element":"img","alt":" Vs,a,h(f) =","inline":true}],[{"style":{"height":18.18},"width":353.7,"height":45.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/26-6.png","element":"img","alt":"Ps,a,hf 2 − (Ps,a,hf)2","inline":true},{"style":{"fontStyle":"italic"},"text":". 
Next, we define the following events.","element":"span"}],[{"style":{"width":"80%"},"width":1270,"height":344,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/26-7.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"in which ","element":"span"},{"style":{"height":10},"width":40.94,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/26-8.png","element":"img","alt":" χ1","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is the abbreviation for","element":"span"}],[{"style":{"width":"87%"},"width":1387,"height":665,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/26-9.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"in which ","element":"span"},{"style":{"height":10},"width":40.94,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/26-10.png","element":"img","alt":" χ2","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is the abbreviation for","element":"span"}],[{"style":{"width":"92%"},"width":1472,"height":641,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/26-11.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Here, ","element":"span"},{"style":{"height":17.9},"width":396.5,"height":44.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/26-12.png","element":"img","alt":" λkh(s) = I[N kh(s) < N0]","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"with ","element":"span"},{"style":{"height":18.97},"width":420.54,"height":47.42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/26-13.png","element":"img","alt":" N kh(s) = ∑a∈A N kh(s, a)","inline":true},{"style":{"fontStyle":"italic"},"text":". In particular, ","element":"span"},{"style":{"height":20.57},"width":362.16,"height":51.42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/26-14.png","element":"img","alt":" λkH+1(s) = 0. 
∑k,m,j","inline":true}],[{"style":{"fontStyle":"italic"},"text":"is the abbreviation of ","element":"span"},{"style":{"height":25.61},"width":338.35,"height":64.02,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/26-15.png","element":"img","alt":"∑Kk=1∑Mm=1∑nm,kj=1 ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":". We will also use the abbreviation later.","element":"span"}],[{"style":{"width":"85%"},"width":1362,"height":368,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/26-16.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Here,","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"V ","element":"span"},{"style":{"height":22.4},"width":283.39,"height":56,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/27-0.png","element":"img","alt":" (s, a, h, t) = ∑","inline":true}],[{"style":{"width":"98%"},"width":1558,"height":1449,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/27-1.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-124","referenceIndex":283,"style":{"fontStyle":"italic"},"text":"D.5. ","element":"a"},{"text":"First, we will prove with probability at least ","element":"span"},{"style":{"height":17.39},"width":308.41,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/27-2.png","element":"img","alt":" 1 − SAT 21 /Hp, E1","inline":true,"padRight":true},{"text":"holds. 
The","element":"span"}],[{"text":"sequence ","element":"span"},{"style":{"height":24.56},"width":610.06,"height":61.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/27-3.png","element":"img","alt":" {V ⋆h+1(s(k,m,j)lih+1 ) − Ps,a,hV ⋆h+1}i∈N+","inline":true,"padRight":true},{"text":"is a martingale sequence with its absolute values","element":"span"}],[{"text":"bounded by ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"text":". Then according to the Azuma-Hoeffding inequality, for any ","element":"span"},{"style":{"height":16},"width":157.82,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/27-4.png","element":"img","alt":" p ∈ (0, 1)","inline":true},{"text":", with probability","element":"span"}],[{"text":"at least ","element":"span"},{"style":{"height":14},"width":90.64,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/27-5.png","element":"img","alt":" 1 − p","inline":true},{"text":", it holds for a given ","element":"span"},{"style":{"height":17.9},"width":394.76,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/27-6.png","element":"img","alt":" nkh(s, a) = n ∈ N+ that:","inline":true}],[{"style":{"width":"54%"},"width":867,"height":119,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/27-7.png","element":"img"}],[{"text":"For any ","element":"span"},{"style":{"height":16},"width":129.47,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/27-8.png","element":"img","alt":" k ∈ [K]","inline":true},{"text":", we have ","element":"span"},{"style":{"height":19.63},"width":247.74,"height":49.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/27-9.png","element":"img","alt":" nkh(s, a) ∈ [ T1H ]","inline":true},{"text":". 
Considering all the possible combinations ","element":"span"},{"style":{"height":16},"width":206.91,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/27-10.png","element":"img","alt":" (s, a, h, k) ∈","inline":true}],[{"style":{"height":19.63},"width":654.8,"height":49.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/27-11.png","element":"img","alt":"S × A × [H] × [ T1H ] and nkh(s, a) ∈ [ T1H ]","inline":true},{"text":", with probability at least ","element":"span"},{"style":{"height":17.38},"width":250.17,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/27-12.png","element":"img","alt":" 1 − SAT 21 /Hp","inline":true},{"text":", it holds simultaneously","element":"span"}],[{"text":"for all ","element":"span"},{"style":{"height":16},"width":620.67,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/27-13.png","element":"img","alt":" (s, a, h, k) ∈ S × A × [H] × [K] that:","inline":true}],[{"style":{"width":"64%"},"width":1021,"height":142,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/27-14.png","element":"img"}],[{"text":"This conclusion also holds for ","element":"span"},{"style":{"height":13.19},"width":37.03,"height":32.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/27-15.png","element":"img","alt":" E6","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.19},"width":37.03,"height":32.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/27-16.png","element":"img","alt":" E7","inline":true,"padRight":true},{"text":"as ","element":"span"},{"style":{"height":32.49},"width":733.29,"height":81.23,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/27-17.png","element":"img","alt":" {(Ps,a,h − 1s(k,m,j)lih+1 )(Vklih+1 − Vref,klih+1 )}i∈N+","inline":true,"padRight":true},{"text":"is a","element":"span"}],[{"text":"martingale 
sequence with its absolute values bounded by ","element":"span"},{"text":"2","element":"span"},{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"text":", and ","element":"span"},{"style":{"height":32.49},"width":497.42,"height":81.23,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/27-18.png","element":"img","alt":" {(Ps,a,h − 1s(k,m,j)lih+1 )(Vklih+1 −","inline":true}],[{"style":{"height":23.76},"width":242.85,"height":59.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/27-19.png","element":"img","alt":"Vref,klih+1 )2}i∈N+","inline":true,"padRight":true},{"text":"is a martingale sequence with its absolute values bounded by ","element":"span"},{"style":{"height":13.79},"width":84.16,"height":34.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/27-20.png","element":"img","alt":" 2H2.","inline":true}],[{"text":"Next, ","element":"span"},{"text":"we will prove with probability at least ","element":"span"},{"style":{"height":14.8},"width":299.5,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/28-0.png","element":"img","alt":" 1 − SAT1p, E3","inline":true,"padRight":true},{"text":"holds. 
","element":"span"},{"style":{"height":16.79},"width":186.69,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/28-1.png","element":"img","alt":"{(Ps,a,h −","inline":true}],[{"text":"1","element":"span"}],[{"style":{"width":"86%"},"width":1366,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/28-2.png","element":"img"}],[{"text":"Hoeffding inequality, for any ","element":"span"},{"style":{"height":16},"width":179.58,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/28-3.png","element":"img","alt":" p ∈ (0, 1)","inline":true},{"text":", with probability at least ","element":"span"},{"style":{"height":14},"width":99.34,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/28-4.png","element":"img","alt":" 1 − p","inline":true},{"text":", it holds for a given","element":"span"}],[{"style":{"height":17.9},"width":418.6,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/28-5.png","element":"img","alt":"N kh(s, a) = N ∈ N+ that:","inline":true}],[{"style":{"width":"49%"},"width":786,"height":120,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/28-6.png","element":"img"}],[{"text":"For any ","element":"span"},{"style":{"height":19.63},"width":546.04,"height":49.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/28-7.png","element":"img","alt":" k ∈ [K], we have N kh(s, a) ∈ [ T1H ]","inline":true},{"text":". 
Considering all the possible combinations ","element":"span"},{"style":{"height":16},"width":221.26,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/28-8.png","element":"img","alt":" (s, a, h, N) ∈","inline":true}],[{"style":{"height":19.63},"width":309.83,"height":49.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/28-9.png","element":"img","alt":"S ×A×[H]×[ T1H ]","inline":true},{"text":", with probability at least ","element":"span"},{"style":{"height":14.8},"width":181.84,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/28-10.png","element":"img","alt":" 1−SAT1p","inline":true},{"text":", it holds simultaneously for all ","element":"span"},{"style":{"height":16},"width":206.91,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/28-11.png","element":"img","alt":" (s, a, h, k) ∈","inline":true}],[{"style":{"height":16},"width":403.13,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/28-12.png","element":"img","alt":"S × A × [H] × [K] that:","inline":true}],[{"style":{"width":"56%"},"width":901,"height":142,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/28-13.png","element":"img"}],[{"text":"This conclusion also holds for ","element":"span"},{"style":{"height":13.2},"width":118.4,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/28-14.png","element":"img","alt":" E4, E12","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.19},"width":52.94,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/28-15.png","element":"img","alt":" E15","inline":true,"padRight":true},{"text":"because of the similar martingale structures as","element":"span"}],[{"text":"follows. 
For ","element":"span"},{"style":{"height":13.19},"width":37.03,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/28-16.png","element":"img","alt":" E4","inline":true},{"text":", the sequence ","element":"span"},{"style":{"height":32.32},"width":633.08,"height":80.79,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/28-17.png","element":"img","alt":" {(Ps,a,h − 1s(k,m,j)Lih+1 )(Vref,kLih+1 )2}i∈N+","inline":true,"padRight":true},{"text":"is a martingale sequence with","element":"span"}],[{"text":"its absolute values bounded by ","element":"span"},{"style":{"height":13.38},"width":52.37,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/28-18.png","element":"img","alt":" H2","inline":true},{"text":". For ","element":"span"},{"style":{"height":13.19},"width":52.93,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/28-19.png","element":"img","alt":" E12","inline":true},{"text":", the sequence ","element":"span"},{"style":{"height":32.32},"width":549.02,"height":80.79,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/28-20.png","element":"img","alt":" {(Ps,a,h − 1s(k,m,j)Lih+1 )λkLih+1}i∈N+","inline":true,"padRight":true},{"text":"is","element":"span"}],[{"text":"a martingale sequence with its absolute values bounded by ","element":"span"},{"text":"1","element":"span"},{"text":". 
For ","element":"span"},{"style":{"height":13.19},"width":52.94,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/28-21.png","element":"img","alt":" E15","inline":true},{"text":", the sequence ","element":"span"},{"style":{"height":16.79},"width":173.75,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/28-22.png","element":"img","alt":" {(Ps,a,h −","inline":true}],[{"text":"1","element":"span"}],[{"style":{"height":32.32},"width":545.92,"height":80.79,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/28-23.png","element":"img","alt":"s(k,m,j)Lih+1 )(Vref,kLih+1 − V ⋆h+1)}i∈N+","inline":true,"padRight":true},{"text":"is a martingale sequence with its absolute values bounded by ","element":"span"},{"text":"2","element":"span"},{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"text":".","element":"span"}],[{"text":"Now, we will prove, with probability at least ","element":"span"},{"style":{"height":14},"width":160.07,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/28-24.png","element":"img","alt":" 1 − p, E8","inline":true,"padRight":true},{"text":"holds. Because of (i) in Lemma ","element":"span"},{"href":"#id-26","referenceIndex":246,"text":"D.1,","element":"a"}],[{"text":"we can append multiple 0s to the summation such that there are ","element":"span"},{"style":{"height":13.19},"width":39.29,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/28-25.png","element":"img","alt":" T1","inline":true,"padRight":true},{"text":"terms. 
Since the sequence","element":"span"}],[{"style":{"height":26.84},"width":797.33,"height":67.09,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/28-26.png","element":"img","alt":"{(Psk,m,jh ,ak,m,jh ,h − 1sk,m,jh+1 )λkh+1(sk,m,jh+1 )}h,k,m,j","inline":true,"padRight":true},{"text":"can be reordered chronologically to a martingale","element":"span"}],[{"text":"sequence with its absolute values bounded by ","element":"span"},{"text":"1","element":"span"},{"text":", it is still a martingale sequence with its absolute","element":"span"}],[{"text":"values bounded by ","element":"span"},{"text":"1 ","element":"span"},{"text":"after appending some 0 terms. According to the Azuma-Hoeffding inequality, for","element":"span"}],[{"text":"any ","element":"span"},{"style":{"height":16},"width":157.82,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/28-27.png","element":"img","alt":" p ∈ (0, 1)","inline":true},{"text":", with probability at least ","element":"span"},{"style":{"height":14},"width":90.64,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/28-28.png","element":"img","alt":" 1 − p","inline":true},{"text":", it holds that:","element":"span"}],[{"style":{"width":"64%"},"width":1025,"height":142,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/28-29.png","element":"img"}],[{"text":"Similarly, the conclusion also holds for ","element":"span"},{"style":{"height":13.2},"width":316.61,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/28-30.png","element":"img","alt":" E9, E11, E13 and E14","inline":true,"padRight":true},{"text":"because of their similar martingale structures","element":"span"}],[{"text":"as follows. 
For ","element":"span"},{"style":{"height":13.19},"width":37.03,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/28-31.png","element":"img","alt":" E9","inline":true},{"text":", the sequence ","element":"span"},{"style":{"height":26.84},"width":992.24,"height":67.09,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/28-32.png","element":"img","alt":" {(1+2/H)h(Psk,m,jh ,ak,m,jh ,h−1sk,m,jh+1 )(V kh+1−V ⋆h+1)}k,m,j,h","inline":true}],[{"text":"can be reordered to a martingale sequence with the absolute values bounded by ","element":"span"},{"text":"18","element":"span"},{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"text":". For ","element":"span"},{"style":{"height":13.19},"width":52.93,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/28-33.png","element":"img","alt":" E11","inline":true},{"text":",","element":"span"}],[{"text":"the sequence ","element":"span"},{"style":{"height":27.09},"width":1291.22,"height":67.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/28-34.png","element":"img","alt":" {(1 + 2/H)h−1I[tk,m,jh > 1](Psk,m,jh ,ak,m,jh ,h − 1sk,m,jh+1 )(V ⋆h+1 − V πkh+1)}k,m,j,h","inline":true,"padRight":true},{"text":"can","element":"span"}],[{"text":"be reordered to a martingale sequence with its absolute values bounded by ","element":"span"},{"text":"9","element":"span"},{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"text":". 
For ","element":"span"},{"style":{"height":13.19},"width":52.94,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/28-35.png","element":"img","alt":" E13","inline":true},{"text":", the","element":"span"}],[{"text":"sequence ","element":"span"},{"style":{"height":24.56},"width":779.12,"height":61.39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/28-36.png","element":"img","alt":" {Psk,m,jh ,ak,m,jh ,h(V ⋆h+1) − V ⋆h+1(sk,m,jh+1 )}k,m,j,h","inline":true,"padRight":true},{"text":"can be reordered to a martingale sequence","element":"span"}],[{"text":"with its absolute values bounded by ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"text":". For ","element":"span"},{"style":{"height":13.19},"width":52.93,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/28-37.png","element":"img","alt":" E14","inline":true},{"text":", the sequence ","element":"span"},{"style":{"height":22.68},"width":435.49,"height":56.7,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/28-38.png","element":"img","alt":" {Psk,m,jh ,ak,m,jh ,h(V ⋆h+1)2 −","inline":true}],[{"style":{"height":21.67},"width":390.67,"height":54.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/28-39.png","element":"img","alt":"(V ⋆h+1(sk,m,jh+1 ))2}k,m,j,h","inline":true,"padRight":true},{"text":"can be reordered to a martingale sequence with its absolute values bounded","element":"span"}],[{"text":"by ","element":"span"},{"style":{"height":13.78},"width":64.24,"height":34.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/28-40.png","element":"img","alt":" H2.","inline":true}],[{"href":"#id-125","referenceIndex":279,"text":"Now","element":"a"},{"text":", we will prove with probability at least 
","element":"span"},{"style":{"height":17.39},"width":398.29,"height":43.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/28-41.png","element":"img","alt":" 1−SAT1(HT 31 +1)p, E2 ","inline":true,"padRight":true},{"text":"holds. According to the Lemma","element":"span"}],[{"href":"#id-125","referenceIndex":279,"text":"D.4 ","element":"a"},{"text":"with ","element":"span"},{"style":{"height":23.29},"width":111.78,"height":58.22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/28-42.png","element":"img","alt":" ϵ = 1T 21","inline":true,"padRight":true},{"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"c ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H ","element":"span"},{"text":"and ","element":"span"},{"style":{"height":17.41},"width":104.4,"height":43.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/28-43.png","element":"img","alt":" p ← p2","inline":true},{"text":", we have that with probability at least ","element":"span"},{"style":{"height":17.39},"width":328.02,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/28-44.png","element":"img","alt":" 1 − (NH2T 21 + 1)p","inline":true},{"text":", it","element":"span"}],[{"text":"holds for a given ","element":"span"},{"style":{"height":17.9},"width":418.6,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/28-45.png","element":"img","alt":" N kh(s, a) = N ∈ N+ that:","inline":true}],[{"style":{"width":"53%"},"width":842,"height":150,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/28-46.png","element":"img"}],[{"text":"For any ","element":"span"},{"style":{"height":19.63},"width":548.4,"height":49.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/28-47.png","element":"img","alt":" k ∈ [K], we have N kh(s, a) ∈ [ T1H ]","inline":true},{"text":". 
Considering all possible combinations ","element":"span"},{"style":{"height":16},"width":221.26,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/28-48.png","element":"img","alt":" (s, a, h, N) ∈","inline":true}],[{"style":{"height":19.63},"width":329.56,"height":49.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/28-49.png","element":"img","alt":"S × A × [H] × [ T1H ]","inline":true},{"text":", then with probability at least ","element":"span"},{"style":{"height":17.39},"width":371.09,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/28-50.png","element":"img","alt":" 1 − SAT1(HT 31 + 1)p","inline":true},{"text":", it holds simultaneously","element":"span"}],[{"text":"for all ","element":"span"},{"style":{"height":16},"width":620.67,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/29-0.png","element":"img","alt":" (s, a, h, k) ∈ S × A × [H] × [K] that:","inline":true}],[{"style":{"width":"60%"},"width":961,"height":169,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/29-1.png","element":"img"}],[{"text":"Similarly, with probability at least ","element":"span"},{"style":{"height":17.39},"width":616.06,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/29-2.png","element":"img","alt":" 1 − SAT 21 (4HT 31 + 1)/Hp, E5 holds.","inline":true}],[{"text":"Finally, we will prove that, with probability at least ","element":"span"},{"style":{"height":17.38},"width":334.2,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/29-3.png","element":"img","alt":" 1 − SAT 21 /Hp, E10","inline":true,"padRight":true},{"text":"holds. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"V ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a, h, t","element":"span"},{"text":") ","element":"span"},{"text":"is the","element":"span"}],[{"text":"sum over all visits to ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a, h","element":"span"},{"text":") ","element":"span"},{"text":"in stage ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", which is a martingale sequence with the order","element":"span"}],[{"text":"assigned chronologically. According to the Azuma-Hoeffding inequality, for any ","element":"span"},{"style":{"height":16},"width":175.22,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/29-4.png","element":"img","alt":" p ∈ (0, 1)","inline":true},{"text":", with","element":"span"}],[{"text":"probability at least ","element":"span"},{"style":{"height":14},"width":90.63,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/29-5.png","element":"img","alt":" 1 − p","inline":true},{"text":", it holds for a given ","element":"span"},{"style":{"height":17.5},"width":387.42,"height":43.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/29-6.png","element":"img","alt":" yth(s, a) = y ∈ N+ that:","inline":true}],[{"text":"For any ","element":"span"},{"style":{"height":19.63},"width":480.87,"height":49.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/29-7.png","element":"img","alt":" t ∈ [Th(s, a)], yth(s, a) ∈ [ T1H ]","inline":true},{"text":". 
Considering all combinations of ","element":"span"},{"style":{"height":16},"width":449.15,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/29-8.png","element":"img","alt":" (s, a, h, y) ∈ S × A × H ×","inline":true}],[{"style":{"height":19.63},"width":66.01,"height":49.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/29-9.png","element":"img","alt":"[ T1H ]","inline":true},{"text":", with probability at least ","element":"span"},{"style":{"height":17.38},"width":246.88,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/29-10.png","element":"img","alt":" 1 − SAT 21 /Hp","inline":true},{"text":", it holds simultaneously for any ","element":"span"},{"style":{"height":16},"width":363.49,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/29-11.png","element":"img","alt":" (s, a, h) ∈ S × A × H","inline":true}],[{"text":"and any ","element":"span"},{"style":{"height":16},"width":263.42,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/29-12.png","element":"img","alt":" t ∈ [T1/H] that:","inline":true}],[{"style":{"width":"66%"},"width":1055,"height":94,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/29-13.png","element":"img"}],[{"id":"id-30","text":"E ","element":"span"},{"text":"P","element":"span"},{"text":"ROOF OF ","element":"span"},{"text":"T","element":"span"},{"text":"HEOREM ","element":"span"},{"href":"#id-31","text":"4.1","element":"a"}],[{"text":"In this section, we provide the proof of Theorem ","element":"span"},{"href":"#id-31","text":"4.1. 
","element":"a"},{"text":"Throughout this section, we will discuss under","element":"span"}],[{"text":"the event ","element":"span"},{"style":{"height":20.8},"width":288.59,"height":51.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/29-14.png","element":"img","alt":"�15i=1 Ei and show","inline":true}],[{"id":"id-29","style":{"width":"98%"},"width":1566,"height":356,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/29-15.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":13.19},"width":32.03,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/29-16.png","element":"img","alt":" Ei","inline":true},{"text":"s are the events in Lemma ","element":"span"},{"href":"#id-124","referenceIndex":283,"text":"D.5 ","element":"a"},{"text":"which shows that ","element":"span"},{"style":{"height":20.8},"width":671.72,"height":51.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/29-17.png","element":"img","alt":" P(�15i=1 Ei) ≥ 1 − (4SAT 51 + SAHT 41 +","inline":true}],[{"style":{"height":17.39},"width":473.25,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/29-18.png","element":"img","alt":"5SAT 21 /H + 5SAT1 + 5)p","inline":true},{"text":". ","element":"span"},{"text":"Thus, showing Equation ","element":"span"},{"href":"#id-29","text":"(26) ","element":"a"},{"text":"will complete the proof. ","element":"span"},{"text":"Before","element":"span"}],[{"text":"we start, we introduce some stage-wise notations. 
","element":"span"},{"text":"Let ","element":"span"},{"style":{"height":24.25},"width":609.18,"height":60.62,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/29-19.png","element":"img","alt":" ˜µref,kh (s, a) = �k′:tk′h tkh","inline":true},{"text":", we have the following relationships:","element":"span"}],[{"style":{"width":"31%"},"width":506,"height":255,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/30-2.png","element":"img"}],[{"text":"In this case, we have ","element":"span"},{"style":{"height":20.07},"width":486.93,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/30-3.png","element":"img","alt":"˜Qk+1,1h (s, a) = Qk+1,1h (s, a)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":20.07},"width":486.94,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/30-4.png","element":"img","alt":"˜Qk+1,2h (s, a) = Qk+1,2h (s, a)","inline":true},{"text":". There-","element":"span"}],[{"text":"fore, based on the update rule Equation ","element":"span"},{"href":"#id-21","text":"(7)","element":"a"},{"text":", for ","element":"span"},{"style":{"height":19.23},"width":357.8,"height":48.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/30-5.png","element":"img","alt":" tk+1h (s, a) > tkh(s, a)","inline":true},{"text":", we have ","element":"span"},{"style":{"height":19.23},"width":230.92,"height":48.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/30-6.png","element":"img","alt":" Qk+1h (s, a) =","inline":true}],[{"style":{"height":20.07},"width":692.89,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/30-7.png","element":"img","alt":"min{ ˜Qk+1,1h (s, a), ˜Qk+1,2h (s, a), Qkh(s, a)}","inline":true},{"text":". 
Since these stage-wise quantities ","element":"span"},{"style":{"height":20.47},"width":355.04,"height":51.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/30-8.png","element":"img","alt":" ˜µref,kh , ˜µadv,kh and ˜µval,kh","inline":true}],[{"text":"have the same value for different rounds in the same stage, for ","element":"span"},{"style":{"height":19.23},"width":354.92,"height":48.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/30-9.png","element":"img","alt":" tk+1h (s, a) = tkh(s, a)","inline":true},{"text":", we have","element":"span"}],[{"href":"#id-21","style":{"height":20.07},"width":439.03,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/30-10.png","element":"img","alt":"˜Qk+1,1h (s, a) = ˜Qk,1h (s, a)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":20.07},"width":439.04,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/30-11.png","element":"img","alt":" ˜Qk+1,2h (s, a) = ˜Qk,2h (s, a)","inline":true},{"text":". According to the update rule in","element":"span"}],[{"text":"Equation ","element":"span"},{"href":"#id-21","text":"(7)","element":"a"},{"text":", in this case we have ","element":"span"},{"style":{"height":19.23},"width":374.15,"height":48.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/30-12.png","element":"img","alt":" Qk+1h (s, a) = Qkh(s, a)","inline":true},{"text":". 
In each stage, using mathematical induction,","element":"span"}],[{"text":"we can find that for any ","element":"span"},{"style":{"height":14.79},"width":124.49,"height":36.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/30-13.png","element":"img","alt":" k ∈ N+","inline":true},{"text":", it holds that: ","element":"span"},{"id":"id-129","style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"style":{"height":24.4},"width":1516.71,"height":61,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/30-14.png","element":"img","alt":"k+1h (s, a) = I[tk+1h = 1]H + I[tk+1h > 1] min{ ˜Qk+1,1h (s, a), ˜Qk+1,2h (s, a), Qkh(s, a)}. (27)","inline":true}],[{"text":"Here, ","element":"span"},{"style":{"height":19.23},"width":72.48,"height":48.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/30-15.png","element":"img","alt":" tk+1h","inline":true,"padRight":true},{"text":"is an abbreviation of ","element":"span"},{"style":{"height":19.23},"width":163.3,"height":48.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/30-16.png","element":"img","alt":" tk+1h (s, a)","inline":true},{"text":". 
Since ","element":"span"},{"style":{"height":17.9},"width":141.11,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/30-17.png","element":"img","alt":" Qkh(s, a)","inline":true,"padRight":true},{"text":"is non-increasing with respect to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":", in","element":"span"}],[{"text":"the following Lemma ","element":"span"},{"href":"#id-126","referenceIndex":368,"text":"E.1, ","element":"a"},{"text":"we will give a lower bound of ","element":"span"},{"style":{"height":17.9},"width":150.6,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/30-18.png","element":"img","alt":" Qkh(s, a).","inline":true}],[{"id":"id-126","style":{"fontWeight":"bold"},"text":"Lemma E.1. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under the event ","element":"span"},{"style":{"height":20.8},"width":125.5,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/30-19.png","element":"img","alt":"⋂7i=1 Ei","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"in Lemma ","element":"span"},{"href":"#id-124","referenceIndex":283,"style":{"fontStyle":"italic"},"text":"D.5, ","element":"a"},{"style":{"fontStyle":"italic"},"text":"it holds that for any ","element":"span"},{"style":{"height":16},"width":365.32,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/30-20.png","element":"img","alt":" (s, a, h, k) ∈ S × A ×","inline":true}],[{"style":{"height":16},"width":179.04,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/30-21.png","element":"img","alt":"[H] × [K]:","inline":true}],[{"style":{"width":"21%"},"width":341,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/30-22.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Then we have 
","element":"span"},{"style":{"height":17.9},"width":938.49,"height":44.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/30-23.png","element":"img","alt":" V kh (s) ≥ V ⋆h (s) for any (s, a, h, k) ∈ S × A × [H] × [K].","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"We first claim that based on the event ","element":"span"},{"style":{"height":13.19},"width":120.21,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/30-24.png","element":"img","alt":" E3 ∩ E4","inline":true,"padRight":true},{"text":"in Lemma ","element":"span"},{"href":"#id-124","referenceIndex":283,"text":"D.5, ","element":"a"},{"text":"it holds for any ","element":"span"},{"style":{"height":16},"width":206.91,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/30-25.png","element":"img","alt":" (s, a, h, k) ∈","inline":true}],[{"style":{"height":16},"width":393.05,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/30-26.png","element":"img","alt":"S × A × [H] × [K] that","inline":true}],[{"id":"id-127","style":{"width":"82%"},"width":1301,"height":128,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/30-27.png","element":"img"}],[{"text":"and based on the event ","element":"span"},{"style":{"height":16},"width":966.49,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/30-28.png","element":"img","alt":" E6 ∩ E7, for any (s, a, h, k) ∈ S × A × [H] × [K], we have:","inline":true}],[{"id":"id-128","style":{"width":"86%"},"width":1366,"height":128,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/30-29.png","element":"img"}],[{"text":"We will prove Equation ","element":"span"},{"href":"#id-127","text":"(28) ","element":"a"},{"text":"and Equation ","element":"span"},{"href":"#id-128","text":"(29) ","element":"a"},{"text":"at the end of the proof for Lemma 
","element":"span"},{"href":"#id-126","referenceIndex":368,"text":"E.1. ","element":"a"},{"text":"Combining","element":"span"}],[{"text":"Equation ","element":"span"},{"href":"#id-127","text":"(28) ","element":"a"},{"text":"with the event ","element":"span"},{"style":{"height":16},"width":883.31,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/31-0.png","element":"img","alt":" E2, for any (s, a, h, k) ∈ S × A × [H] × [K], we have:","inline":true}],[{"style":{"width":"79%"},"width":1268,"height":582,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/31-1.png","element":"img"}],[{"text":"Similarly, combining Equation ","element":"span"},{"href":"#id-128","text":"(29) ","element":"a"},{"text":"with the event ","element":"span"},{"style":{"height":13.19},"width":37.03,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/31-2.png","element":"img","alt":" E5","inline":true,"padRight":true},{"text":"in Lemma ","element":"span"},{"href":"#id-124","referenceIndex":283,"text":"D.5, ","element":"a"},{"text":"for any ","element":"span"},{"style":{"height":16},"width":365.09,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/31-3.png","element":"img","alt":" (s, a, h, k) ∈ S × A ×","inline":true}],[{"style":{"height":16},"width":327.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/31-4.png","element":"img","alt":"[H] × [K], we have:","inline":true}],[{"style":{"width":"62%"},"width":995,"height":123,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/31-5.png","element":"img"}],[{"text":"Therefore, according to the definition of ","element":"span"},{"style":{"height":20.07},"width":952.18,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/31-6.png","element":"img","alt":"˜bk,2h (s, a), for any (s, a, h, k) ∈ S × A × [H] × [K], it 
holds","inline":true}],[{"text":"that:","element":"span"}],[{"id":"id-135","style":{"width":"62%"},"width":988,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/31-7.png","element":"img"}],[{"text":"Now we use mathematical induction on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"to prove ","element":"span"},{"style":{"height":17.9},"width":351.8,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/31-8.png","element":"img","alt":" Qkh(s, a) ≥ Q⋆h(s, a)","inline":true,"padRight":true},{"text":"for any ","element":"span"},{"style":{"height":16},"width":215.38,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/31-9.png","element":"img","alt":" (s, a, h, k) ∈","inline":true}],[{"style":{"height":16},"width":326.1,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/31-10.png","element":"img","alt":"S × A × [H] × [K]","inline":true},{"text":". For ","element":"span"},{"style":{"height":17.9},"width":550.05,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/31-11.png","element":"img","alt":" k = 1, Q1h(s, a) = H ≥ Q⋆h(s, a)","inline":true,"padRight":true},{"text":"for any ","element":"span"},{"style":{"height":16},"width":398.29,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/31-12.png","element":"img","alt":" (s, a, h) ∈ S × A × [H]","inline":true},{"text":". 
For","element":"span"}],[{"style":{"height":13.2},"width":95.13,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/31-13.png","element":"img","alt":"k ≥ 2","inline":true},{"text":", assume we already have ","element":"span"},{"style":{"height":17.9},"width":1087.25,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/31-14.png","element":"img","alt":" Qk′h (s, a) ≥ Q⋆h(s, a) for any (s, a, h, k′) ∈ S × A × [H] × [k − 1],","inline":true}],[{"text":"then we will prove for any ","element":"span"},{"style":{"height":17.9},"width":720.19,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/31-15.png","element":"img","alt":" (s, a, h) ∈ S×A×[H], Qkh(s, a) ≥ Q⋆h(s, a)","inline":true},{"text":". According to Equation ","element":"span"},{"href":"#id-129","referenceIndex":365,"text":"(27)","element":"a"},{"text":",","element":"span"}],[{"text":"the following relationship holds:","element":"span"}],[{"style":{"width":"80%"},"width":1278,"height":54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/31-16.png","element":"img"}],[{"text":"Then for any given ","element":"span"},{"style":{"height":16},"width":392.62,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/31-17.png","element":"img","alt":" (s, a, h) ∈ S × A × [H]","inline":true},{"text":", we have the following four cases:","element":"span"}],[{"text":"(a) If ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"style":{"height":17.9},"width":714.51,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/31-18.png","element":"img","alt":"kh(s, a) = 1, then Qkh(s, a) = H ≥ Q⋆h(s, a).","inline":true}],[{"text":"(b) If ","element":"span"},{"style":{"height":19.23},"width":648.61,"height":48.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/31-19.png","element":"img","alt":" tkh(s, a) > 1 and Qkh(s, a) = 
Qk−1h (s, a)","inline":true},{"text":", then the conclusion holds.","element":"span"}],[{"text":"(c) If ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"style":{"height":20.47},"width":1207.38,"height":51.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/31-20.png","element":"img","alt":"kh(s, a) > 1 and Qkh(s, a) = ˜Qk,1h (s, a) = rh(s, a) + ˜µval,kh /nkh + ˜bk,1h (s, a).","inline":true}],[{"text":"Because of Equation ","element":"span"},{"href":"#id-130","text":"(1)","element":"a"},{"text":", we have the following equality:","element":"span"}],[{"style":{"width":"35%"},"width":557,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/31-21.png","element":"img"}],[{"text":"Then we have:","element":"span"}],[{"id":"id-131","style":{"width":"89%"},"width":1422,"height":200,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/31-22.png","element":"img"}],[{"text":"According to the definition of ","element":"span"},{"style":{"height":16},"width":194.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/31-23.png","element":"img","alt":" li(s, a, h, k)","inline":true},{"text":", we know ","element":"span"},{"style":{"height":14.79},"width":119.59,"height":36.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/31-24.png","element":"img","alt":" kli < k","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"height":17.9},"width":217.53,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/31-25.png","element":"img","alt":" i ∈ [nkh(s, a)]","inline":true},{"text":". 
Then ","element":"span"},{"href":"#id-114","style":{"height":22.16},"width":205.15,"height":55.39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/31-26.png","element":"img","alt":" Qklih (s, a) ≥","inline":true}],[{"style":{"height":16.52},"width":141.11,"height":41.29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/31-27.png","element":"img","alt":"Q⋆h(s, a)","inline":true,"padRight":true},{"text":"based on the induction. Therefore, according to the update rule Equation ","element":"span"},{"href":"#id-114","text":"(13) ","element":"a"},{"text":"and Equa-","element":"span"}],[{"text":"tion ","element":"span"},{"href":"#id-130","text":"(1)","element":"a"},{"text":", for any ","element":"span"},{"style":{"height":17.9},"width":796.55,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/31-28.png","element":"img","alt":" (s, h) ∈ S × [H] and any i ∈ [nkh(s, a)], we have:","inline":true}],[{"id":"id-132","style":{"width":"80%"},"width":1272,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/31-29.png","element":"img"}],[{"text":"and for any ","element":"span"},{"style":{"height":17.9},"width":357.03,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/31-30.png","element":"img","alt":" i ∈ [nkh(s, a)] it holds:","inline":true}],[{"id":"id-134","style":{"width":"63%"},"width":1005,"height":60,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/31-31.png","element":"img"}],[{"text":"Combining Equation ","element":"span"},{"href":"#id-131","text":"(31) ","element":"a"},{"text":"and Equation ","element":"span"},{"href":"#id-132","text":"(32)","element":"a"},{"text":", we have:","element":"span"}],[{"style":{"width":"89%"},"width":1417,"height":128,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/32-0.png","element":"img"}],[{"text":"The last inequality is because 
","element":"span"},{"style":{"height":24.4},"width":307.57,"height":61,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/32-1.png","element":"img","alt":"˜bk,1h =»2H2ι/nkh ","inline":true,"padRight":true},{"text":"and the event ","element":"span"},{"style":{"height":13.19},"width":37.03,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/32-2.png","element":"img","alt":" E1","inline":true,"padRight":true},{"text":"in Lemma ","element":"span"},{"href":"#id-124","referenceIndex":283,"text":"D.5.","element":"a"}],[{"text":"(d) If ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"style":{"height":17.9},"width":196.29,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/32-3.png","element":"img","alt":"kh(s, a) > 1","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"style":{"height":20.47},"width":1162.74,"height":51.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/32-4.png","element":"img","alt":"kh(s, a) = ˜Qk,2h (s, a) = rh(s, a) + ˜µref,k+1h /N k+1h + ˜µadv,k+1h /nk+1h +","inline":true}],[{"style":{"height":20.07},"width":152.42,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/32-5.png","element":"img","alt":"˜bk,2h (s, a)","inline":true},{"text":". 
We have that","element":"span"}],[{"style":{"width":"20%"},"width":325,"height":47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/32-6.png","element":"img"}],[{"style":{"height":21.2},"width":832.9,"height":53.01,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/32-7.png","element":"img","alt":"= ˜µref,kh /N kh + ˜µadv,kh /nkh + ˜bk,2h (s, a) − Ps,a,hV ⋆h+1","inline":true,"padRight":true},{"text":"=","element":"span"}],[{"id":"id-136","style":{"width":"97%"},"width":1547,"height":394,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/32-8.png","element":"img"}],[{"text":"As ","element":"span"},{"style":{"height":21.67},"width":89.53,"height":54.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/32-9.png","element":"img","alt":" V ref,kh+1 ","inline":true,"padRight":true},{"text":"is non-increasing with regard to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"based on (j) in Lemma ","element":"span"},{"href":"#id-26","referenceIndex":246,"text":"D.1, ","element":"a"},{"text":"we have:","element":"span"}],[{"id":"id-133","style":{"width":"79%"},"width":1257,"height":129,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/32-10.png","element":"img"}],[{"text":"Based on Equation ","element":"span"},{"href":"#id-133","text":"(35)","element":"a"},{"text":", Equation ","element":"span"},{"href":"#id-134","text":"(33)","element":"a"},{"text":", and Equation ","element":"span"},{"href":"#id-135","text":"(30)","element":"a"},{"text":", we know each term in Equation ","element":"span"},{"href":"#id-136","text":"(34) ","element":"a"},{"text":"is","element":"span"}],[{"text":"nonnegative. 
Therefore, in this case ","element":"span"},{"href":"#id-134","style":{"height":17.9},"width":412.98,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/32-11.png","element":"img","alt":" Qkh(s, a) − Q⋆h(s, a) ≥ 0.","inline":true}],[{"text":"In summary, we prove the conclusion that ","element":"span"},{"href":"#id-127","style":{"height":17.9},"width":924.6,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/32-12.png","element":"img","alt":" Qkh(s, a) ≥ Q⋆h(s, a) for any (s, a, h, k) ∈ S × A × [H] ×","inline":true}],[{"text":"[","element":"span"},{"style":{"fontStyle":"italic"},"text":"K","element":"span"},{"text":"]","element":"span"},{"text":". The only thing left is to prove Equation ","element":"span"},{"href":"#id-127","text":"(28) ","element":"a"},{"text":"and Equation ","element":"span"},{"href":"#id-128","text":"(29)","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Proof of Equation ","element":"span"},{"href":"#id-127","style":{"fontWeight":"bold"},"text":"(28) ","element":"a"},{"style":{"fontWeight":"bold"},"text":"and Equation ","element":"span"},{"href":"#id-128","style":{"fontWeight":"bold"},"text":"(29)","element":"a"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"text":"For any given ","element":"span"},{"style":{"height":16},"width":610.71,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/32-13.png","element":"img","alt":" (s, a, h, k) ∈ S × A × [H] × [K], let:","inline":true}],[{"id":"id-137","style":{"width":"88%"},"width":1400,"height":502,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/32-14.png","element":"img"}],[{"text":"Without ambiguity, we will use the abbreviations ","element":"span"},{"style":{"height":14.4},"width":233.92,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/32-15.png","element":"img","alt":" χ3, χ4, and χ5","inline":true,"padRight":true},{"text":"in the following proof.","element":"span"}],[{"text":"First, we focus on bounding ","element":"span"},{"style":{"height":16},"width":64.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/32-16.png","element":"img","alt":" |χ3|","inline":true},{"text":". 
Using the definition of ","element":"span"},{"style":{"height":20.07},"width":330.43,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/32-17.png","element":"img","alt":" ˜vref,kh (s, a), we have:","inline":true}],[{"id":"id-138","style":{"width":"94%"},"width":1505,"height":146,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/32-18.png","element":"img"}],[{"text":"Summing Equation ","element":"span"},{"href":"#id-137","text":"(36)","element":"a"},{"text":", Equation ","element":"span"},{"href":"#id-137","text":"(37)","element":"a"},{"text":", Equation ","element":"span"},{"href":"#id-137","text":"(38) ","element":"a"},{"text":"and Equation ","element":"span"},{"href":"#id-138","text":"(39)","element":"a"},{"text":", we can find that:","element":"span"}],[{"id":"id-141","style":{"width":"84%"},"width":1343,"height":199,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/33-0.png","element":"img"}],[{"text":"Because of the event ","element":"span"},{"style":{"height":13.19},"width":37.03,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/33-1.png","element":"img","alt":" E4","inline":true,"padRight":true},{"text":"in Lemma ","element":"span"},{"href":"#id-124","referenceIndex":283,"text":"D.5, ","element":"a"},{"text":"we know:","element":"span"}],[{"id":"id-139","style":{"width":"62%"},"width":990,"height":61,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/33-2.png","element":"img"}],[{"text":"Next, we focus on bounding ","element":"span"},{"style":{"height":16},"width":64.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/33-3.png","element":"img","alt":" |χ4|","inline":true},{"text":". 
Using the absolute value inequality, it holds that: ","element":"span"},{"style":{"height":56.63},"width":71.28,"height":141.57,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/33-4.png","element":"img","alt":"��N kh�","inline":true}],[{"style":{"width":"57%"},"width":905,"height":129,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/33-5.png","element":"img"}],[{"text":"Then we have:","element":"span"}],[{"id":"id-140","style":{"width":"85%"},"width":1353,"height":461,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/33-6.png","element":"img"}],[{"text":"The last inequality follows from the event ","element":"span"},{"style":{"height":13.19},"width":37.03,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/33-7.png","element":"img","alt":" E3","inline":true,"padRight":true},{"text":"in Lemma ","element":"span"},{"href":"#id-124","referenceIndex":283,"text":"D.5.","element":"a"}],[{"text":"For ","element":"span"},{"style":{"height":10},"width":40.93,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/33-8.png","element":"img","alt":" χ5","inline":true},{"text":", according to the Cauchy-Schwarz inequality, we have ","element":"span"},{"style":{"height":14},"width":125.86,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/33-9.png","element":"img","alt":" χ5 ≤ 0.","inline":true}],[{"text":"Applying the upper bounds of ","element":"span"},{"style":{"height":10},"width":40.94,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/33-10.png","element":"img","alt":" χ3","inline":true,"padRight":true},{"text":"Equation ","element":"span"},{"href":"#id-139","text":"(41)","element":"a"},{"text":", ","element":"span"},{"style":{"height":10},"width":40.93,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/33-11.png","element":"img","alt":" 
χ4","inline":true,"padRight":true},{"text":"Equation ","element":"span"},{"href":"#id-140","text":"(42) ","element":"a"},{"text":"and ","element":"span"},{"style":{"height":10},"width":40.93,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/33-12.png","element":"img","alt":" χ5","inline":true,"padRight":true},{"text":"to Equation ","element":"span"},{"href":"#id-141","text":"(40)","element":"a"},{"text":", we have:","element":"span"}],[{"style":{"width":"58%"},"width":928,"height":129,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/33-13.png","element":"img"}],[{"text":"Then we finish the proof of Equation ","element":"span"},{"href":"#id-127","text":"(28)","element":"a"},{"text":". The proof for Equation ","element":"span"},{"href":"#id-128","text":"(29) ","element":"a"},{"text":"is similar, in which we just","element":"span"}],[{"text":"need to substitute ","element":"span"},{"href":"#id-127","style":{"height":17.9},"width":674.88,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/33-14.png","element":"img","alt":" N kh(s, a) with nkh(s, a) and H2 with 2H2.","inline":true}],[{"style":{"width":"1%"},"width":28,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/33-15.png","element":"img"}],[{"text":"With Lemma ","element":"span"},{"href":"#id-126","referenceIndex":368,"text":"E.1, ","element":"a"},{"text":"Equation ","element":"span"},{"href":"#id-114","text":"(13) ","element":"a"},{"text":"and Equation ","element":"span"},{"href":"#id-130","text":"(1)","element":"a"},{"text":", for any ","element":"span"},{"style":{"height":16},"width":581.83,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/33-16.png","element":"img","alt":" (s, h, k) ∈ S × [H] × [K], we 
have:","inline":true}],[{"id":"id-143","style":{"width":"76%"},"width":1211,"height":64,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/33-17.png","element":"img"}],[{"text":"The following lemma gives a viable value of ","element":"span"},{"style":{"height":13.19},"width":48.02,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/33-18.png","element":"img","alt":" N0","inline":true,"padRight":true},{"text":"to learn the reference function ","element":"span"},{"style":{"height":20.07},"width":142.31,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/33-19.png","element":"img","alt":" V ref,kh (s)","inline":true},{"text":". Denote","element":"span"}],[{"style":{"height":20.07},"width":381.84,"height":50.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/33-20.png","element":"img","alt":"V REFh (s) = V ref,K+1h (s)","inline":true,"padRight":true},{"text":"as the final value of the reference function ","element":"span"},{"style":{"height":20.07},"width":151.81,"height":50.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/33-21.png","element":"img","alt":" V ref,kh (s).","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Lemma E.2. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under the event ","element":"span"},{"style":{"height":20.8},"width":125.5,"height":51.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/33-22.png","element":"img","alt":"�7i=1 Ei","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"in Lemma ","element":"span"},{"href":"#id-124","referenceIndex":283,"style":{"fontStyle":"italic"},"text":"D.5, ","element":"a"},{"style":{"fontStyle":"italic"},"text":"it holds for any ","element":"span"},{"style":{"height":16},"width":467.99,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/33-23.png","element":"img","alt":" h ∈ [H] and β ∈ (0, H] that:","inline":true}],[{"style":{"width":"73%"},"width":1161,"height":114,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/33-24.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"In addition, letting","element":"span"}],[{"style":{"width":"48%"},"width":769,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/34-0.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"we have that for any ","element":"span"},{"style":{"height":16},"width":284.38,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/34-1.png","element":"img","alt":" (s, h) ∈ S × [H],","inline":true}],[{"style":{"width":"50%"},"width":801,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/34-2.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"We claim that for any non-negative weight sequence ","element":"span"},{"style":{"height":19.76},"width":512.43,"height":49.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/34-3.png","element":"img","alt":" {ωk,m,j}k,m,j and any h ∈ [H],","inline":true}],[{"id":"id-142","style":{"width":"95%"},"width":1509,"height":101,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/34-4.png","element":"img"}],[{"text":"Here, ","element":"span"},{"style":{"height":19.18},"width":875.1,"height":47.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/34-5.png","element":"img","alt":" ||ω||∞ = maxk,m,j ωk,m,j and ||ω||1 = �k,m,j ωk,m,j","inline":true},{"text":". If we have proved Equation ","element":"span"},{"href":"#id-142","text":"(44)","element":"a"},{"text":", then","element":"span"}],[{"text":"letting ","element":"span"},{"style":{"height":20.07},"width":697.18,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/34-6.png","element":"img","alt":" ωk,m,j = I[V kh (sk,m,jh ) − V ⋆h (sk,m,jh ) ≥ β]","inline":true},{"text":", according to Equation ","element":"span"},{"href":"#id-142","text":"(44) ","element":"a"},{"text":"and Equation ","element":"span"},{"href":"#id-143","text":"(43)","element":"a"},{"text":",","element":"span"}],[{"text":"we have:","element":"span"}],[{"style":{"width":"82%"},"width":1312,"height":329,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/34-7.png","element":"img"}],[{"text":"Letting ","element":"span"},{"style":{"height":16.7},"width":787.46,"height":41.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/34-8.png","element":"img","alt":" b = 72H2√SAHι and c = 8MSAH3, we have:","inline":true}],[{"style":{"width":"28%"},"width":454,"height":61,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/34-9.png","element":"img"}],[{"text":"Solving the inequality, we 
have:","element":"span"}],[{"style":{"width":"32%"},"width":522,"height":99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/34-10.png","element":"img"}],[{"text":"Then:","element":"span"}],[{"style":{"width":"79%"},"width":1266,"height":100,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/34-11.png","element":"img"}],[{"text":"Therefore, for any ","element":"span"},{"style":{"height":16},"width":130.1,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/34-12.png","element":"img","alt":" h ∈ [H]","inline":true},{"text":", it holds that:","element":"span"}],[{"style":{"width":"91%"},"width":1443,"height":325,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/34-13.png","element":"img"}],[{"text":"Since ","element":"span"},{"style":{"height":17.9},"width":101.89,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/34-14.png","element":"img","alt":" V kh (s)","inline":true,"padRight":true},{"text":"is non-increasing in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"under the event ","element":"span"},{"style":{"height":20.8},"width":125.5,"height":51.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/34-15.png","element":"img","alt":" ⋂7i=1 Ei","inline":true,"padRight":true},{"text":"according to Lemma ","element":"span"},{"href":"#id-126","referenceIndex":368,"text":"E.1 ","element":"a"},{"text":"and","element":"span"}],[{"text":"(j) in Lemma ","element":"span"},{"href":"#id-26","referenceIndex":246,"text":"D.1, ","element":"a"},{"style":{"height":17.9},"width":250.7,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/34-16.png","element":"img","alt":" V kh (s) − V ⋆h (s)","inline":true,"padRight":true},{"text":"is also non-increasing. 
Before we update the reference function at","element":"span"}],[{"style":{"height":20.07},"width":502.41,"height":50.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/34-17.png","element":"img","alt":"(s, h), V ref,kh (s) = V 1h (s) = H","inline":true},{"text":". Therefore, if the reference function at ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, h","element":"span"},{"text":") ","element":"span"},{"text":"is not updated in the","element":"span"}],[{"text":"algorithm, ","element":"span"},{"style":{"height":17.9},"width":228.66,"height":44.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/34-18.png","element":"img","alt":" V REFh (s) = H","inline":true},{"text":". Next, we consider the case in which the reference function at ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, h","element":"span"},{"text":")","element":"span"}],[{"text":"is updated in FedQ-Advantage. When we update the reference function at the end of round ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":", we","element":"span"}],[{"text":"have ","element":"span"},{"style":{"height":24.4},"width":1116.68,"height":61,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/34-19.png","element":"img","alt":"∑k′,m,j:k′≤k I[sk′,m,jh = s] ≥ N0 and thus 0 ≤ V kh (s) − V ⋆h (s) < β","inline":true},{"text":". 
Therefore, for the final","element":"span"}],[{"text":"value ","element":"span"},{"style":{"height":20.07},"width":570.6,"height":50.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/34-20.png","element":"img","alt":" V REFh (s) = V ref,k+1h (s) = V k+1h (s)","inline":true},{"text":", it holds that ","element":"span"},{"style":{"height":17.9},"width":438.94,"height":44.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/34-21.png","element":"img","alt":" 0 ≤ V REFh (s) − V ⋆h (s) < β","inline":true},{"text":". Then we have","element":"span"}],[{"style":{"height":17.9},"width":226.75,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/34-22.png","element":"img","alt":"V REFh (s) = H","inline":true,"padRight":true},{"text":"or ","element":"span"},{"style":{"height":17.9},"width":519.56,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/34-23.png","element":"img","alt":" V ⋆h (s) ≤ V REFh (s) ≤ V ⋆h (s) + β","inline":true,"padRight":true},{"text":"under the event ","element":"span"},{"style":{"height":20.8},"width":125.5,"height":51.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/34-24.png","element":"img","alt":" ⋂7i=1 Ei","inline":true},{"text":". It remains to","element":"span"}],[{"text":"prove Equation ","element":"span"},{"href":"#id-142","text":"(44)","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Proof of Equation ","element":"span"},{"href":"#id-142","style":{"fontWeight":"bold"},"text":"(44)","element":"a"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"text":"According to the update rule Equation ","element":"span"},{"href":"#id-114","text":"(13) ","element":"a"},{"text":"and Equation ","element":"span"},{"href":"#id-129","referenceIndex":365,"text":"(27)","element":"a"},{"text":", for ","element":"span"},{"style":{"height":16},"width":140.17,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/35-0.png","element":"img","alt":" h ∈ [H],","inline":true}],[{"text":"we have:","element":"span"}],[{"style":{"width":"88%"},"width":1402,"height":384,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/35-1.png","element":"img"}],[{"text":"and according to the Bellman equality Equation ","element":"span"},{"href":"#id-130","text":"(1)","element":"a"},{"text":", we have:","element":"span"}],[{"style":{"width":"97%"},"width":1545,"height":151,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/35-2.png","element":"img"}],[{"text":"Combined these two inequalities, it holds for any ","element":"span"},{"style":{"height":16},"width":210.91,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/35-3.png","element":"img","alt":" h ∈ [H] that:","inline":true}],[{"id":"id-144","style":{"width":"100%"},"width":1586,"height":821,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/35-4.png","element":"img"}],[{"text":"For the first term in Equation ","element":"span"},{"href":"#id-144","text":"(46)","element":"a"},{"text":", we have:","element":"span"}],[{"style":{"width":"97%"},"width":1545,"height":102,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/35-5.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"k,m,j","element":"span"}],[{"id":"id-161","style":{"width":"57%"},"width":914,"height":95,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/35-6.png","element":"img"}],[{"text":"For any 
","element":"span"},{"style":{"height":20.07},"width":1206.08,"height":50.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/35-7.png","element":"img","alt":" (s, a, h) ∈ S × A × [H], I[(sk,m,jh , ak,m,jh ) = (s, a), tkh(s, a) = 1] = 1","inline":true,"padRight":true},{"text":"if and only if","element":"span"}],[{"style":{"height":20.07},"width":661.98,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/35-8.png","element":"img","alt":"(sk,m,jh , ak,m,jh ) = (s, a) and tkh(s, a) = 1","inline":true},{"text":". Therefore, ","element":"span"},{"style":{"height":22.44},"width":733.44,"height":56.1,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/35-9.png","element":"img","alt":" �k,m,j I[(sk,m,jh , ak,m,jh ) = (s, a), tkh(s, a) =","inline":true}],[{"style":{"height":17.9},"width":222.95,"height":44.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/35-10.png","element":"img","alt":"0] = Y 1h (s, a)","inline":true},{"text":". 
Because of (c) in Lemma ","element":"span"},{"href":"#id-26","referenceIndex":246,"text":"D.1, ","element":"a"},{"text":"we have:","element":"span"}],[{"style":{"width":"89%"},"width":1414,"height":95,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/35-11.png","element":"img"}],[{"text":"Then it holds for any ","element":"span"},{"style":{"height":16},"width":210.91,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/35-12.png","element":"img","alt":" h ∈ [H] that:","inline":true}],[{"id":"id-151","style":{"width":"89%"},"width":1427,"height":93,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/35-13.png","element":"img"}],[{"text":"For the second term in Equation ","element":"span"},{"href":"#id-144","text":"(46)","element":"a"},{"text":", we have:","element":"span"}],[{"style":{"width":"98%"},"width":1562,"height":602,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/36-0.png","element":"img"}],[{"text":"Let:","element":"span"}],[{"style":{"width":"87%"},"width":1389,"height":292,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/36-1.png","element":"img"}],[{"text":"We have:","element":"span"}],[{"id":"id-152","style":{"width":"79%"},"width":1267,"height":254,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/36-2.png","element":"img"}],[{"text":"Next, we will explore the relationship between the norm of ","element":"span"},{"style":{"height":19.76},"width":637.94,"height":49.39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/36-3.png","element":"img","alt":" {ωk,m,j}k,m,j and {˜ωk,m,j}k,m,j. 
For a","inline":true}],[{"text":"given triple ","element":"span"},{"style":{"height":16},"width":176.14,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/36-4.png","element":"img","alt":" (k′, m′, j′)","inline":true},{"text":", according to the definition of ","element":"span"},{"style":{"height":20.07},"width":357.72,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/36-5.png","element":"img","alt":" li(sk,m,jh , ak,m,jh , h, k)","inline":true},{"text":", ","element":"span"},{"style":{"height":23.53},"width":337.76,"height":58.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/36-6.png","element":"img","alt":"�nkhi=1 I[(k, m, j)li =","inline":true}],[{"style":{"height":20.07},"width":1418.12,"height":50.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/36-7.png","element":"img","alt":"(k′, m′, j′), tk,m,jh > 1]) = 1 if and only if (sk,m,jh , ak,m,jh ) = (sk′,m′,j′h , ak′,m′,j′h )","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":11.6},"width":76.63,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/36-8.png","element":"img","alt":" 1 <","inline":true}],[{"style":{"height":20.07},"width":725.91,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/36-9.png","element":"img","alt":"tkh(sk,m,jh , ak,m,jh ) = tk′h (sk,m,jh , ak,m,jh ) + 1","inline":true},{"text":". 
In this case, we have ","element":"span"},{"style":{"height":20.07},"width":384.52,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/36-10.png","element":"img","alt":" tk′h (sk,m,jh , ak,m,jh ) > 0","inline":true,"padRight":true},{"text":"and","element":"span"}],[{"style":{"height":23.9},"width":729.43,"height":59.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/36-11.png","element":"img","alt":"nkh(sk,m,jh , ak,m,jh ) = ytk′hh (sk′,m′,j′h , ak′,m′,j′h )","inline":true},{"text":". Then for a given triple ","element":"span"},{"style":{"height":16},"width":176.15,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/36-12.png","element":"img","alt":" (k′, m′, j′)","inline":true},{"text":", it holds that: ","element":"span"},{"text":"� ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k,m,j","element":"span"}],[{"style":{"width":"95%"},"width":1520,"height":97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/36-13.png","element":"img"}],[{"style":{"height":23.9},"width":472.96,"height":59.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/36-14.png","element":"img","alt":"= ytk′h +1h (sk′,m′,j′h , ak′,m′,j′h ).","inline":true}],[{"text":"Then according to (e) in Lemma ","element":"span"},{"href":"#id-26","referenceIndex":246,"text":"D.1, ","element":"a"},{"text":"it holds that:","element":"span"}],[{"style":{"width":"93%"},"width":1475,"height":139,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/36-15.png","element":"img"}],[{"text":"Then we have:","element":"span"}],[{"style":{"width":"87%"},"width":1393,"height":130,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/36-16.png","element":"img"}],[{"text":"We also 
have:","element":"span"}],[{"style":{"width":"95%"},"width":1522,"height":1403,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/37-0.png","element":"img"}],[{"text":"and","element":"span"}],[{"id":"id-145","style":{"width":"73%"},"width":1159,"height":121,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/37-1.png","element":"img"}],[{"text":"Then we have:","element":"span"}],[{"id":"id-146","style":{"width":"91%"},"width":1447,"height":128,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/37-2.png","element":"img"}],[{"text":"For the coefficient ","element":"span"},{"style":{"fontStyle":"italic"},"text":"q","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a, t","element":"span"},{"text":")","element":"span"},{"text":", we have the following properties:","element":"span"}],[{"style":{"width":"97%"},"width":1548,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/37-3.png","element":"img"}],[{"text":"and","element":"span"}],[{"style":{"width":"93%"},"width":1475,"height":359,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/37-4.png","element":"img"}],[{"text":"Because ","element":"span"},{"style":{"height":17.5},"width":129.14,"height":43.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/38-0.png","element":"img","alt":" yth(s, a)","inline":true,"padRight":true},{"text":"is increasing for ","element":"span"},{"style":{"height":16},"width":341.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/38-1.png","element":"img","alt":" 1 ≤ t ≤ Th(s, a) − 1","inline":true},{"text":", given Equation ","element":"span"},{"href":"#id-145","text":"(50)","element":"a"},{"text":", when the","element":"span"}],[{"text":"weights 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"q","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a, t","element":"span"},{"text":") ","element":"span"},{"text":"concentrates on former terms, we can obtain the larger value of th","element":"span"},{"href":"#id-145","text":"e rig","element":"a"},{"text":"ht term in","element":"span"}],[{"text":"Equation ","element":"span"},{"href":"#id-146","text":"(51)","element":"a"},{"text":". There exists some positive integer ","element":"span"},{"style":{"height":16},"width":286.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/38-2.png","element":"img","alt":" t0 ≤ Th(s, a) − 1","inline":true,"padRight":true},{"text":"satisfying:","element":"span"}],[{"id":"id-147","style":{"width":"79%"},"width":1256,"height":117,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/38-3.png","element":"img"}],[{"text":"and","element":"span"}],[{"style":{"width":"31%"},"width":504,"height":106,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/38-4.png","element":"img"}],[{"text":"Then according to Equation ","element":"span"},{"href":"#id-145","text":"(52)","element":"a"},{"text":", we have","element":"span"}],[{"id":"id-150","style":{"width":"82%"},"width":1314,"height":121,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/38-5.png","element":"img"}],[{"text":"Since ","element":"span"},{"style":{"height":16},"width":286.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/38-6.png","element":"img","alt":" t0 ≤ Th(s, a) − 1","inline":true},{"text":", according to (e) and Equation ","element":"span"},{"href":"#id-112","text":"(16) ","element":"a"},{"text":"in Lemma ","element":"span"},{"href":"#id-26","referenceIndex":246,"text":"D.1, ","element":"a"},{"text":"we 
have:","element":"span"}],[{"id":"id-148","style":{"width":"98%"},"width":1564,"height":109,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/38-7.png","element":"img"}],[{"text":"If ","element":"span"},{"style":{"height":13.2},"width":105.4,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/38-8.png","element":"img","alt":" t0 ≥ 2","inline":true},{"text":", we also have:","element":"span"}],[{"id":"id-149","style":{"width":"92%"},"width":1467,"height":220,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/38-9.png","element":"img"}],[{"text":"The last inequality is because of (f) in Lemma ","element":"span"},{"href":"#id-26","referenceIndex":246,"text":"D.1. ","element":"a"},{"text":"Then according to Equation ","element":"span"},{"href":"#id-147","text":"(54)","element":"a"},{"text":", it holds that:","element":"span"}],[{"style":{"width":"54%"},"width":862,"height":116,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/38-10.png","element":"img"}],[{"text":"Applying inequalities Equation ","element":"span"},{"href":"#id-148","text":"(56) ","element":"a"},{"text":"and Equation ","element":"span"},{"href":"#id-149","text":"(57) ","element":"a"},{"text":"to Equation ","element":"span"},{"href":"#id-150","text":"(55)","element":"a"},{"text":", we have:","element":"span"}],[{"style":{"width":"60%"},"width":957,"height":436,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/38-11.png","element":"img"}],[{"text":"Here, the first inequality is because of Equation ","element":"span"},{"href":"#id-112","text":"(16)","element":"a"},{"text":". 
The last inequality uses Equation ","element":"span"},{"href":"#id-149","text":"(57) ","element":"a"},{"text":"and","element":"span"}],[{"style":{"height":19.37},"width":189.49,"height":48.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/38-12.png","element":"img","alt":"1 + 2/H ≤ 3.","inline":true}],[{"text":"If ","element":"span"},{"style":{"height":17.9},"width":598.05,"height":44.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/38-13.png","element":"img","alt":" t0 = 1, then q(s, a) ≤ ||ω||∞y2h(s, a)","inline":true},{"text":". Therefore, according to Equation ","element":"span"},{"href":"#id-150","text":"(55)","element":"a"},{"text":", we have:","element":"span"}],[{"style":{"width":"79%"},"width":1258,"height":122,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/38-14.png","element":"img"}],[{"text":"Based on (e) in Lemma ","element":"span"},{"href":"#id-26","referenceIndex":246,"text":"D.1, ","element":"a"},{"text":"it holds that:","element":"span"}],[{"style":{"width":"82%"},"width":1308,"height":122,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/38-15.png","element":"img"}],[{"text":"Therefore, for any ","element":"span"},{"style":{"height":16},"width":286.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/39-0.png","element":"img","alt":" t0 ≤ Th(s, a) − 1","inline":true,"padRight":true},{"text":"defined in Equation ","element":"span"},{"href":"#id-147","text":"(54)","element":"a"},{"text":", we have:","element":"span"}],[{"id":"id-153","style":{"width":"99%"},"width":1584,"height":389,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/39-1.png","element":"img"}],[{"text":"The last inequality uses the Cauchy-Schwarz inequality.","element":"span"}],[{"text":"Based on Equation ","element":"span"},{"href":"#id-144","text":"(45)","element":"a"},{"text":", by applying Equation 
","element":"span"},{"href":"#id-151","text":"(48)","element":"a"},{"text":", Equation ","element":"span"},{"href":"#id-152","text":"(49) ","element":"a"},{"text":"and Equation ","element":"span"},{"href":"#id-153","text":"(58)","element":"a"},{"text":", for any ","element":"span"},{"style":{"height":16},"width":140.17,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/39-2.png","element":"img","alt":" h ∈ [H],","inline":true}],[{"text":"we have:","element":"span"}],[{"id":"id-154","style":{"width":"90%"},"width":1433,"height":214,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/39-3.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"||","element":"span"},{"style":{"height":19.37},"width":705.42,"height":48.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/39-4.png","element":"img","alt":"˜ω||∞ ≤ (1 + 2H )||ω||∞, and ||˜ω||1 = ||ω||1.","inline":true}],[{"text":"Using Equation ","element":"span"},{"href":"#id-154","text":"(59)","element":"a"},{"text":", with induction on ","element":"span"},{"style":{"height":14},"width":321.58,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/39-5.png","element":"img","alt":" h = H, H − 1, ..., 1","inline":true},{"text":", we can prove that for any ","element":"span"},{"style":{"height":16},"width":141.17,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/39-6.png","element":"img","alt":" h ∈ [H]:","inline":true}],[{"id":"id-155","style":{"width":"96%"},"width":1525,"height":101,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/39-7.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":19.37},"width":532.23,"height":48.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/39-8.png","element":"img","alt":" Ch = (1 + H2 )(1 + 2H )H−h − H2","inline":true,"padRight":true},{"text":". 
Noting that ","element":"span"},{"style":{"height":13.2},"width":158.57,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/39-9.png","element":"img","alt":" Ch ≤ 4H","inline":true},{"text":", based on Equation ","element":"span"},{"href":"#id-155","text":"(60)","element":"a"},{"text":", it holds for","element":"span"}],[{"text":"any ","element":"span"},{"style":{"height":16},"width":210.91,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/39-10.png","element":"img","alt":" h ∈ [H] that:","inline":true}],[{"style":{"width":"83%"},"width":1319,"height":99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/39-11.png","element":"img"}],[{"text":"This completes the proof of Equation ","element":"span"},{"href":"#id-142","text":"(44)","element":"a"},{"text":".","element":"span"}],[{"style":{"width":"1%"},"width":28,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/39-12.png","element":"img"}],[{"text":"Next, we return to the proof of Equation ","element":"span"},{"href":"#id-29","text":"(26)","element":"a"},{"text":". In what follows, ","element":"span"},{"style":{"height":19.14},"width":119.86,"height":47.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/39-13.png","element":"img","alt":"∑k,m,j ","inline":true,"padRight":true},{"text":"is the simplified notation","element":"span"}],[{"text":"of ","element":"span"},{"href":"#id-29","style":{"height":25.61},"width":586.09,"height":64.02,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/39-14.png","element":"img","alt":"∑Kk=1∑Mm=1∑nm,kj=1 . 
N kh, nkh, Li, li","inline":true,"padRight":true},{"text":"represent simplified notations for ","element":"span"},{"style":{"height":20.07},"width":317.65,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/39-15.png","element":"img","alt":" N kh(sk,m,jh , ak,m,jh ),","inline":true}],[{"style":{"height":20.07},"width":1124.77,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/39-16.png","element":"img","alt":"nkh(sk,m,jh , ak,m,jh ), Li(sk,m,jh , ak,m,jh , h) and li(sk,m,jh , ak,m,jh , h, h, k)","inline":true,"padRight":true},{"text":"respectively.","element":"span"}],[{"text":"For ","element":"span"},{"style":{"height":16},"width":335.97,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/39-17.png","element":"img","alt":" h ∈ [H + 1], denote:","inline":true}],[{"style":{"width":"38%"},"width":604,"height":269,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/39-18.png","element":"img"}],[{"text":"Here, ","element":"span"},{"href":"#id-126","referenceIndex":368,"style":{"height":19.37},"width":1487.92,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/39-19.png","element":"img","alt":" δkH+1 = ζkH+1 = 0. Because V ⋆h (s) = supπ V πh (s), we have δkh ≤ ζkh for any h ∈ [H + 1]. 
In","inline":true}],[{"text":"addition, as ","element":"span"},{"style":{"height":17.9},"width":791.38,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/39-20.png","element":"img","alt":" V kh (s) ≥ V ⋆h (s) for all (s, h, k) ∈ S × [H] × [K]","inline":true},{"text":", according to Lemma ","element":"span"},{"href":"#id-126","referenceIndex":368,"text":"E.1, ","element":"a"},{"text":"we have:","element":"span"}],[{"style":{"width":"1%"},"width":23,"height":16,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/39-21.png","element":"img"}],[{"text":"Regret","element":"span"},{"style":{"height":22.4},"width":181.98,"height":56,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/39-22.png","element":"img","alt":"(T) = �","inline":true}],[{"style":{"width":"80%"},"width":1284,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/39-23.png","element":"img"}],[{"text":"Thus, we only need to bound ","element":"span"},{"style":{"height":20.4},"width":235.08,"height":50.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/39-24.png","element":"img","alt":"�Kk=1 ζk1 . 
Let:","inline":true}],[{"id":"id-170","style":{"width":"94%"},"width":1503,"height":134,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/39-25.png","element":"img"}],[{"style":{"width":"70%"},"width":1122,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/40-0.png","element":"img"}],[{"id":"id-177","style":{"height":19.72},"width":119.24,"height":49.3,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/40-1.png","element":"img","alt":"ϵkh+1 =","inline":true}],[{"id":"id-183","style":{"width":"97%"},"width":1548,"height":204,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/40-2.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":19.37},"width":467.93,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/40-3.png","element":"img","alt":" ψkH+1 = ϵkH+1 = ϕkH+1 = 0.","inline":true,"padRight":true},{"text":"According to the update rule Equation ","element":"span"},{"href":"#id-129","referenceIndex":365,"text":"(27)","element":"a"},{"text":", we have:","element":"span"}],[{"style":{"width":"88%"},"width":1396,"height":441,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/40-4.png","element":"img"}],[{"text":"Also using Equation ","element":"span"},{"href":"#id-130","text":"(1)","element":"a"},{"text":", we have:","element":"span"}],[{"style":{"height":26.29},"width":1569.6,"height":65.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/40-5.png","element":"img","alt":"V πkh (sk,m,jh ) = Qπkh (sk,m,jh , ak,m,jh ) ≥ I[tk,m,jh > 1]Ärh(sk,m,jh , ak,m,jh ) + Psk,m,jh ,ak,m,jh ,hV πkh+1ä.","inline":true}],[{"text":"Then with Equation ","element":"span"},{"href":"#id-135","text":"(30)","element":"a"},{"text":", it holds 
that:","element":"span"}],[{"style":{"width":"7%"},"width":123,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/40-6.png","element":"img"}],[{"style":{"height":18.12},"width":82.05,"height":45.3,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/40-7.png","element":"img","alt":"ζkh =","inline":true}],[{"id":"id-160","style":{"width":"96%"},"width":1530,"height":656,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/40-8.png","element":"img"}],[{"text":"Here ","element":"span"},{"text":"P ","element":"span"},{"text":"is simplified notation for ","element":"span"},{"style":{"height":20.1},"width":240.7,"height":50.24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/40-9.png","element":"img","alt":" Psk,m,jh ,ak,m,jh ,h","inline":true},{"text":". Because the reference function is non-increasing","element":"span"}],[{"text":"by (j) in Lemma ","element":"span"},{"href":"#id-26","referenceIndex":246,"text":"D.1, ","element":"a"},{"text":"we have ","element":"span"},{"style":{"height":23.76},"width":579.77,"height":59.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/40-10.png","element":"img","alt":" Vref,klih+1 (s) ≥ V REFh+1(s) for any s ∈ S","inline":true,"padRight":true},{"text":"and any positive integer ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":",","element":"span"}],[{"style":{"height":26.64},"width":806.33,"height":66.61,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/40-11.png","element":"img","alt":"Psk,m,jh ,ak,m,jh ,hVref,klih+1 ≥ Psk,m,jh ,ak,m,jh ,hV REFh+1 and","inline":true}],[{"id":"id-156","style":{"width":"76%"},"width":1220,"height":134,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/40-12.png","element":"img"}],[{"text":"According to the definitions of 
","element":"span"},{"style":{"height":19.5},"width":385.48,"height":48.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/40-13.png","element":"img","alt":" δkh+1, ψkh+1, ϵkh+1, ϕkh+1 ","inline":true,"padRight":true},{"text":"and Equation ","element":"span"},{"href":"#id-156","text":"(65)","element":"a"},{"text":", we have:","element":"span"}],[{"id":"id-157","style":{"width":"95%"},"width":1521,"height":214,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/40-14.png","element":"img"}],[{"id":"id-158","style":{"width":"85%"},"width":1353,"height":296,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/41-0.png","element":"img"}],[{"text":"and","element":"span"}],[{"id":"id-159","style":{"width":"90%"},"width":1441,"height":129,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/41-1.png","element":"img"}],[{"text":"Summing Equation ","element":"span"},{"href":"#id-157","text":"(66)","element":"a"},{"text":", Equation ","element":"span"},{"href":"#id-158","text":"(67) ","element":"a"},{"text":"and Equation ","element":"span"},{"href":"#id-159","text":"(68)","element":"a"},{"text":", we can bound the second term in","element":"span"}],[{"text":"Equation ","element":"span"},{"href":"#id-160","text":"(64) ","element":"a"},{"text":"as follows:","element":"span"}],[{"style":{"width":"91%"},"width":1455,"height":102,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/41-2.png","element":"img"}],[{"style":{"width":"98%"},"width":1568,"height":118,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/41-3.png","element":"img"}],[{"text":"Together with Equation ","element":"span"},{"href":"#id-160","text":"(64)","element":"a"},{"text":", we 
have","element":"span"}],[{"style":{"width":"85%"},"width":1362,"height":280,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/41-4.png","element":"img"}],[{"text":"Summing the above inequality for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"2","element":"span"},{"style":{"fontStyle":"italic"},"text":", ..., K","element":"span"},{"text":", we have:","element":"span"}],[{"id":"id-163","style":{"width":"100%"},"width":1587,"height":614,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/41-5.png","element":"img"}],[{"text":"The first conclusion has been proved in Equation ","element":"span"},{"href":"#id-161","text":"(47)","element":"a"},{"text":", and we will prove the second conclusion in","element":"span"}],[{"text":"Lemma ","element":"span"},{"href":"#id-162","referenceIndex":523,"text":"E.3 ","element":"a"},{"text":"in the last subsection. Applying the two conclusions to Equation ","element":"span"},{"href":"#id-163","text":"(69)","element":"a"},{"text":", it holds that:","element":"span"}],[{"style":{"width":"88%"},"width":1409,"height":489,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/41-6.png","element":"img"}],[{"text":"Here, the last inequality holds because ","element":"span"},{"style":{"height":19.5},"width":215.61,"height":48.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/42-0.png","element":"img","alt":" δkh+1 ≤ ζkh+1","inline":true},{"text":". By recursion on ","element":"span"},{"style":{"height":14},"width":382.81,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/42-1.png","element":"img","alt":" H, H − 1, H − 2 . . . 
, 1","inline":true},{"text":", with","element":"span"}],[{"style":{"height":19.37},"width":322.26,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/42-2.png","element":"img","alt":"ζKH+1 = 0, we have:","inline":true}],[{"id":"id-168","style":{"width":"92%"},"width":1472,"height":838,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/42-3.png","element":"img"}],[{"text":"Here, the second inequality holds because ","element":"span"},{"href":"#id-164","style":{"height":19.37},"width":585.38,"height":48.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/42-4.png","element":"img","alt":" (1 + 2/H)^{h−1} ≤ (1 + 2/H)^H ≤ e^2 < 9","inline":true},{"text":". Based on Lemma","element":"span"}],[{"href":"#id-165","referenceIndex":532,"text":"E.4, ","element":"a"},{"text":"Lemma ","element":"span"},{"href":"#id-166","text":"E.5, ","element":"a"},{"text":"Lemma ","element":"span"},{"href":"#id-167","referenceIndex":592,"text":"E.6, ","element":"a"},{"text":"and Lemma ","element":"span"},{"href":"#id-164","text":"E.7 ","element":"a"},{"text":"provided in the last subsection, we have:","element":"span"}],[{"style":{"width":"85%"},"width":1348,"height":619,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/42-5.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"style":{"height":29.36},"width":1540.11,"height":73.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/42-6.png","element":"img","alt":"��SAH2T1ι +�βSAH2T1ι +�β2SAH2T1ι + SAH114 T141 ι34 + S32 AH3�N0 log(T1)ι�.","inline":true}],[{"text":"Inserting these relationships into Equation ","element":"span"},{"href":"#id-168","text":"(71)","element":"a"},{"text":", we 
have","element":"span"}],[{"style":{"width":"97%"},"width":1543,"height":462,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/42-7.png","element":"img"}],[{"text":"In the last step, we use ","element":"span"},{"style":{"height":16},"width":648.32,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/42-8.png","element":"img","alt":" T1 ≤ (2 + 2/H)MT + MSAH(H + 1)","inline":true,"padRight":true},{"text":"according to (i) in Lemma ","element":"span"},{"href":"#id-26","referenceIndex":246,"text":"D.1. ","element":"a"},{"text":"This","element":"span"}],[{"text":"completes the proof of Theorem ","element":"span"},{"href":"#id-31","text":"4.1.","element":"a"}],[{"text":"E.1 ","element":"span"},{"text":"P","element":"span"},{"text":"ROOFS OF SOME INDIVIDUAL COMPONENTS","element":"span"}],[{"text":"This subsection collects the proofs of the individual components used in the proof of Theorem ","element":"span"},{"href":"#id-31","text":"4.1.","element":"a"}],[{"id":"id-162","style":{"fontWeight":"bold"},"text":"Lemma E.3 ","element":"span"},{"text":"(Proof of Equation ","element":"span"},{"href":"#id-163","text":"(70)","element":"a"},{"text":")","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under the event ","element":"span"},{"style":{"height":20.8},"width":125.5,"height":51.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/43-0.png","element":"img","alt":"�15i=1 Ei","inline":true},{"style":{"fontStyle":"italic"},"text":", we have that Equation ","element":"span"},{"href":"#id-163","text":"(70) ","element":"a"},{"style":{"fontStyle":"italic"},"text":"holds.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof.","element":"span"}],[{"style":{"width":"95%"},"width":1514,"height":782,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/43-1.png","element":"img"}],[{"text":"For ","element":"span"},{"text":"a ","element":"span"},{"text":"given ","element":"span"},{"text":"triple ","element":"span"},{"style":{"height":16},"width":176.14,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/43-2.png","element":"img","alt":"(k′, m′, j′)","inline":true},{"text":", ","element":"span"},{"text":"according ","element":"span"},{"text":"to ","element":"span"},{"text":"the ","element":"span"},{"text":"definition ","element":"span"},{"text":"of ","element":"span"},{"style":{"height":20.07},"width":357.72,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/43-3.png","element":"img","alt":"li(sk,m,jh , ak,m,jh , h, k)","inline":true},{"text":",","element":"span"}],[{"style":{"height":23.9},"width":1583.99,"height":59.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/43-4.png","element":"img","alt":"�nkhi=1 I[(k, m, j)li = (k′, m′, j′), tk,m,jh > 1]) = 1 if and only if (sk,m,jh , ak,m,jh ) =","inline":true}],[{"style":{"height":20.07},"width":316.86,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/43-5.png","element":"img","alt":"(sk′,m′,j′h , ak′,m′,j′h )","inline":true,"padRight":true},{"text":"and 
","element":"span"},{"style":{"height":20.07},"width":806.44,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/43-6.png","element":"img","alt":" 1 < tkh(sk,m,jh , ak,m,jh ) = tk′h (sk,m,jh , ak,m,jh ) + 1","inline":true},{"text":". In this case, we have","element":"span"}],[{"style":{"height":23.9},"width":796.51,"height":59.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/43-7.png","element":"img","alt":"nkh(sk,m,jh , ak,m,jh ) = ytk′hh (sk′,m′,j′h , ak′,m′,j′h ) and","inline":true,"padRight":true},{"text":"� ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k,m,j","element":"span"}],[{"style":{"width":"95%"},"width":1520,"height":97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/43-8.png","element":"img"}],[{"style":{"height":23.9},"width":472.96,"height":59.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/43-9.png","element":"img","alt":"= ytk′h +1h (sk′,m′,j′h , ak′,m′,j′h ).","inline":true}],[{"text":"Then according to (f) in Lemma ","element":"span"},{"href":"#id-26","referenceIndex":246,"text":"D.1:","element":"a"}],[{"style":{"width":"94%"},"width":1495,"height":143,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/43-10.png","element":"img"}],[{"text":"Therefore, since ","element":"span"},{"style":{"height":19.5},"width":219.75,"height":48.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/43-11.png","element":"img","alt":" V k′h+1 ≥ V ⋆h+1 ","inline":true,"padRight":true},{"text":"according to Lemma ","element":"span"},{"href":"#id-126","referenceIndex":368,"text":"E.1, ","element":"a"},{"text":"we have:","element":"span"}],[{"style":{"width":"77%"},"width":1224,"height":469,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/43-12.png","element":"img"}],[{"text":"Next, we will give lemmas on the upper bounds of each term in Equation 
","element":"span"},{"href":"#id-168","text":"(71)","element":"a"},{"text":".","element":"span"}],[{"id":"id-165","style":{"fontWeight":"bold"},"text":"Lemma E.4. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under the event ","element":"span"},{"style":{"height":13.19},"width":37.03,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/44-0.png","element":"img","alt":" E8","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"in Lemma ","element":"span"},{"href":"#id-124","referenceIndex":283,"style":{"fontStyle":"italic"},"text":"D.5, ","element":"a"},{"style":{"fontStyle":"italic"},"text":"it holds that:","element":"span"}],[{"style":{"width":"71%"},"width":1135,"height":118,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/44-1.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"For any ","element":"span"},{"style":{"height":18.97},"width":976.14,"height":47.42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/44-2.png","element":"img","alt":" (s, h, k) ∈ S ×[H]×[K], if N kh(s) = �a∈A N kh(s, a) ≥ N0","inline":true},{"text":", the reference function","element":"span"}],[{"style":{"height":20.07},"width":142.31,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/44-3.png","element":"img","alt":"V ref,kh (s)","inline":true,"padRight":true},{"text":"is updated to its final value with ","element":"span"},{"style":{"height":20.07},"width":595.57,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/44-4.png","element":"img","alt":" V ref,kh (s) = V REFh (s). 
If N kh(s) < N0","inline":true},{"text":", since the reference","element":"span"}],[{"text":"function is non-increasing and ","element":"span"},{"style":{"height":20.07},"width":229.23,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/44-5.png","element":"img","alt":" V ref,1h (s) = H","inline":true},{"text":", we have ","element":"span"},{"style":{"height":20.07},"width":487.68,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/44-6.png","element":"img","alt":" 0 ≤ V ref,kh (s) − V REFh (s) ≤ H","inline":true},{"text":". Combining","element":"span"}],[{"text":"two cases, for any ","element":"span"},{"style":{"height":16},"width":435.29,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/44-7.png","element":"img","alt":" (s, h, k) ∈ S × [H] × [K]","inline":true},{"text":", it holds that ","element":"span"},{"style":{"height":20.07},"width":601.7,"height":50.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/44-8.png","element":"img","alt":" 0 ≤ V ref,kh (s) − V REFh (s) ≤ Hλkh(s)","inline":true},{"text":",","element":"span"}],[{"text":"where ","element":"span"},{"style":{"height":17.9},"width":393.08,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/44-9.png","element":"img","alt":" λkh(s) = I[N kh(s) < N0]","inline":true,"padRight":true},{"text":"is defined in the event ","element":"span"},{"style":{"height":13.19},"width":37.03,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/44-10.png","element":"img","alt":" E8","inline":true,"padRight":true},{"text":"in Lemma ","element":"span"},{"href":"#id-124","referenceIndex":283,"text":"D.5. 
","element":"a"},{"text":"The conclusion also holds","element":"span"}],[{"text":"for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H ","element":"span"},{"text":"+ 1 ","element":"span"},{"text":"because ","element":"span"},{"style":{"height":21.54},"width":413.88,"height":53.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/44-11.png","element":"img","alt":" V ref,kH+1(s) = V REFH+1(s) = 0","inline":true},{"text":". Then for any ","element":"span"},{"style":{"height":16},"width":540.54,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/44-12.png","element":"img","alt":" (s, a, h, k) ∈ S × A × [H] × [K]","inline":true}],[{"text":"we have:","element":"span"}],[{"id":"id-169","style":{"width":"72%"},"width":1154,"height":47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/44-13.png","element":"img"}],[{"text":"Applying Equation ","element":"span"},{"href":"#id-169","text":"(72) ","element":"a"},{"text":"to the definition of ","element":"span"},{"style":{"height":19.5},"width":85.08,"height":48.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/44-14.png","element":"img","alt":" ψkh+1 ","inline":true,"padRight":true},{"text":"Equation ","element":"span"},{"href":"#id-170","text":"(61)","element":"a"},{"text":", we have:","element":"span"}],[{"style":{"width":"94%"},"width":1498,"height":757,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/44-15.png","element":"img"}],[{"text":"According ","element":"span"},{"text":"to ","element":"span"},{"text":"the ","element":"span"},{"text":"definition ","element":"span"},{"text":"of ","element":"span"},{"style":{"height":20.07},"width":333.24,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/44-16.png","element":"img","alt":"Li(sk,m,jh , ak,m,jh , h)","inline":true},{"text":", 
","element":"span"},{"text":"for ","element":"span"},{"text":"a ","element":"span"},{"text":"given ","element":"span"},{"text":"triple ","element":"span"},{"style":{"height":16},"width":176.14,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/44-17.png","element":"img","alt":"(k′, m′, j′)","inline":true},{"text":",","element":"span"}],[{"style":{"height":23.9},"width":963.53,"height":59.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/44-18.png","element":"img","alt":"�N khi=1 I[(k, m, j)Li = (k′, m′, j′), tk,m,jh > 1] = 1","inline":true,"padRight":true},{"text":"if and only if ","element":"span"},{"style":{"height":20.07},"width":319.78,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/44-19.png","element":"img","alt":" (sk,m,jh , ak,m,jh ) =","inline":true}],[{"style":{"height":20.07},"width":316.86,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/44-20.png","element":"img","alt":"(sk′,m′,j′h , ak′,m′,j′h )","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":20.07},"width":831.01,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/44-21.png","element":"img","alt":" 1 ≤ tk′h (sk,m,jh , ak,m,jh ) < tkh(sk,m,jh , ak,m,jh )","inline":true},{"text":". ","element":"span"},{"text":"Then we have","element":"span"}],[{"style":{"height":20.07},"width":1089.35,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/44-22.png","element":"img","alt":"tk,m,jh = tkh(sk′,m′,j′h , ak′,m′,j′h ). 
For t > tk′h (sk′,m′,j′h , ak′,m′,j′h ) ≥ 1","inline":true},{"text":", we also have:","element":"span"}],[{"id":"id-171","style":{"width":"89%"},"width":1412,"height":325,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/44-23.png","element":"img"}],[{"text":"Let:","element":"span"}],[{"style":{"width":"65%"},"width":1041,"height":117,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/44-24.png","element":"img"}],[{"text":"Then, since ","element":"span"},{"style":{"height":19.37},"width":743.47,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/44-25.png","element":"img","alt":" (1 + 2H )h−1 ≤ (1 + 2H )H < e2 < 9, we have:","inline":true}],[{"id":"id-172","style":{"width":"90%"},"width":1427,"height":124,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/44-26.png","element":"img"}],[{"text":"Applying Equation ","element":"span"},{"href":"#id-171","text":"(73)","element":"a"},{"text":", we have:","element":"span"}],[{"style":{"width":"96%"},"width":1524,"height":633,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/45-0.png","element":"img"}],[{"text":"According to (f) in Lemma ","element":"span"},{"href":"#id-26","referenceIndex":246,"text":"D.1, ","element":"a"},{"text":"for any ","element":"span"},{"style":{"height":16},"width":688.67,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/45-1.png","element":"img","alt":" (s, a, h) ∈ S × A × [H], t ∈ [2, Th(s, a)]","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14},"width":145.06,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/45-2.png","element":"img","alt":" 1 ≤ p ≤","inline":true}],[{"style":{"height":17.5},"width":289.84,"height":43.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/45-3.png","element":"img","alt":"yth(s, a), we 
have:","inline":true}],[{"style":{"width":"51%"},"width":814,"height":82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/45-4.png","element":"img"}],[{"text":"Then it holds that:","element":"span"}],[{"style":{"width":"64%"},"width":1020,"height":615,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/45-5.png","element":"img"}],[{"text":"Applying the inequality of the coefficient ","element":"span"},{"style":{"height":15.59},"width":148.82,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/45-6.png","element":"img","alt":" Wk′,m′,j′","inline":true,"padRight":true},{"text":"to Equation ","element":"span"},{"href":"#id-172","text":"(74)","element":"a"},{"text":", we have:","element":"span"}],[{"id":"id-173","style":{"width":"96%"},"width":1534,"height":720,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/45-7.png","element":"img"}],[{"text":"The last inequality is because of the event ","element":"span"},{"style":{"height":13.19},"width":37.03,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/46-0.png","element":"img","alt":" E8","inline":true,"padRight":true},{"text":"in Lemma ","element":"span"},{"href":"#id-124","referenceIndex":283,"text":"D.5. ","element":"a"},{"text":"Next, we will bound the term","element":"span"}],[{"style":{"height":22.8},"width":456.06,"height":56.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/46-1.png","element":"img","alt":"∑_{h=1}^{H} ∑_{k,m,j} λ^k_{h+1}(s^{k,m,j}_{h+1})","inline":true,"padRight":true},{"text":"in Equation ","element":"span"},{"href":"#id-173","text":"(75)","element":"a"},{"text":". 
We have:","element":"span"}],[{"id":"id-176","style":{"width":"86%"},"width":1367,"height":417,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/46-2.png","element":"img"}],[{"text":"For any state ","element":"span"},{"style":{"height":16},"width":282.32,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/46-3.png","element":"img","alt":" (s, h) ∈ S × [H]","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":135.91,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/46-4.png","element":"img","alt":" k ∈ [K]","inline":true},{"text":", there exists the largest positive integer ","element":"span"},{"style":{"height":13.19},"width":36.75,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/46-5.png","element":"img","alt":" k0","inline":true,"padRight":true},{"text":"such that","element":"span"}],[{"style":{"height":19.49},"width":221.68,"height":48.73,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/46-6.png","element":"img","alt":"N k0h (s) < N0","inline":true},{"text":". 
. Then for any ","element":"span"},{"style":{"height":16},"width":198.73,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/46-7.png","element":"img","alt":" h ∈ [H + 1]","inline":true},{"text":", it holds that","element":"span"}],[{"id":"id-174","style":{"width":"88%"},"width":1401,"height":510,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/46-8.png","element":"img"}],[{"text":"However, according to the definition of ","element":"span"},{"style":{"height":19.49},"width":281.72,"height":48.73,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/46-9.png","element":"img","alt":" N k0h (s), we have:","inline":true}],[{"style":{"width":"68%"},"width":1089,"height":226,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/46-10.png","element":"img"}],[{"text":"The last inequality follows from (f) in Lemma ","element":"span"},{"href":"#id-26","referenceIndex":246,"text":"D.1. ","element":"a"},{"text":"Combined with Equation ","element":"span"},{"href":"#id-174","text":"(77)","element":"a"},{"text":", we have: ","element":"span"},{"style":{"height":22.4},"width":58,"height":56,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/46-11.png","element":"img","alt":"∑","inline":true},{"style":{"height":10},"width":77.8,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/46-12.png","element":"img","alt":"k,m,j","inline":true},{"style":{"height":24.4},"width":456.35,"height":61,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/46-13.png","element":"img","alt":"I[N^k_h(s) < N_0, s^{k,m,j}_h = s]","inline":true}],[{"id":"id-175","style":{"width":"83%"},"width":1317,"height":164,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/46-14.png","element":"img"}],[{"text":"Here, the last inequality holds because 
","element":"span"},{"style":{"height":14.4},"width":123.71,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/46-15.png","element":"img","alt":" β ≤ H","inline":true},{"text":". Applying Equation ","element":"span"},{"href":"#id-175","text":"(78) ","element":"a"},{"text":"to","element":"span"}],[{"text":"Equation ","element":"span"},{"href":"#id-176","text":"(76)","element":"a"},{"text":", we have:","element":"span"}],[{"style":{"width":"51%"},"width":823,"height":124,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/46-16.png","element":"img"}],[{"text":"This bounds the term ","element":"span"},{"style":{"height":22.8},"width":456.06,"height":56.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/46-17.png","element":"img","alt":"∑_{h=1}^{H} ∑_{k,m,j} λ^k_{h+1}(s^{k,m,j}_{h+1})","inline":true},{"text":". Returning to Equation ","element":"span"},{"href":"#id-173","text":"(75)","element":"a"},{"text":", we have:","element":"span"}],[{"id":"id-166","style":{"width":"85%"},"width":1358,"height":254,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/46-18.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Lemma E.5. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under the event ","element":"span"},{"style":{"height":13.19},"width":136.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/47-0.png","element":"img","alt":" E9 ∩ E10","inline":true},{"style":{"fontStyle":"italic"},"text":", it holds that:","element":"span"}],[{"style":{"width":"65%"},"width":1041,"height":118,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/47-1.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"According to the definition of ","element":"span"},{"style":{"height":19.5},"width":75.3,"height":48.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/47-2.png","element":"img","alt":" ϵkh+1 ","inline":true,"padRight":true},{"text":"Equation ","element":"span"},{"href":"#id-177","referenceIndex":493,"text":"(62)","element":"a"},{"text":", we have:","element":"span"}],[{"id":"id-178","style":{"width":"93%"},"width":1489,"height":666,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/47-3.png","element":"img"}],[{"text":"For ","element":"span"},{"text":"a ","element":"span"},{"text":"given ","element":"span"},{"text":"triple ","element":"span"},{"style":{"height":16},"width":176.14,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/47-4.png","element":"img","alt":"(k′, m′, j′)","inline":true},{"text":", ","element":"span"},{"text":"according ","element":"span"},{"text":"to ","element":"span"},{"text":"the ","element":"span"},{"text":"definition ","element":"span"},{"text":"of ","element":"span"},{"style":{"height":20.07},"width":357.72,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/47-5.png","element":"img","alt":"li(sk,m,jh , ak,m,jh , h, k)","inline":true},{"text":",","element":"span"}],[{"style":{"height":23.9},"width":1589.16,"height":59.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/47-6.png","element":"img","alt":"�nkhi=1 I[(k, m, j)li = (k′, m′, j′)]) = 1 if and only if (sk,m,jh , ak,m,jh ) = (sk′,m′,j′h , ak′,m′,j′h )","inline":true}],[{"text":"and ","element":"span"},{"style":{"height":20.07},"width":738.21,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/47-7.png","element":"img","alt":" tkh(sk,m,jh , ak,m,jh ) = tk′h (sk,m,jh , ak,m,jh ) + 1","inline":true},{"text":". 
In this case, we have ","element":"span"},{"style":{"height":20.07},"width":350.16,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/47-8.png","element":"img","alt":" nkh(sk,m,jh , ak,m,jh ) =","inline":true}],[{"style":{"height":23.9},"width":458.06,"height":59.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/47-9.png","element":"img","alt":"ytk′hh (sk′,m′,j′h , ak′,m′,j′h ) and:","inline":true}],[{"style":{"width":"88%"},"width":1407,"height":335,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/47-10.png","element":"img"}],[{"text":"Let:","element":"span"}],[{"style":{"width":"54%"},"width":864,"height":99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/47-11.png","element":"img"}],[{"text":"Substituting this definition into Equation ","element":"span"},{"href":"#id-178","text":"(79)","element":"a"},{"text":", it holds that:","element":"span"}],[{"id":"id-179","style":{"width":"90%"},"width":1430,"height":656,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/47-12.png","element":"img"}],[{"text":"The term in Equation ","element":"span"},{"href":"#id-179","text":"(80) ","element":"a"},{"text":"is not a sum of martingale differences, so we cannot directly apply the","element":"span"}],[{"text":"Azuma-Hoeffding inequality to bound it. Therefore, we split off the term with the constant coefficient","element":"span"}],[{"style":{"height":19.37},"width":100.46,"height":48.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/48-0.png","element":"img","alt":"1 + 1/H ","inline":true,"padRight":true},{"text":", which can be bounded directly by the Azuma-Hoeffding inequality. 
According to the event ","element":"span"},{"style":{"height":13.19},"width":79.8,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/48-1.png","element":"img","alt":" E9 in","inline":true}],[{"text":"Lemma ","element":"span"},{"href":"#id-124","referenceIndex":283,"text":"D.5, ","element":"a"},{"text":"we can bound the second term in Equation ","element":"span"},{"href":"#id-179","text":"(81) ","element":"a"},{"text":"with ","element":"span"},{"style":{"height":16},"width":194.62,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/48-2.png","element":"img","alt":" 18H√2T1ι.","inline":true}],[{"text":"We claim that for any ","element":"span"},{"href":"#id-26","referenceIndex":246,"style":{"height":16},"width":233.22,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/48-3.png","element":"img","alt":" t ∈ [Th(s, a)]","inline":true},{"text":", it holds that ","element":"span"},{"style":{"height":16},"width":293.13,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/48-4.png","element":"img","alt":" |y(s, a, h, t)| ≤ 9","inline":true,"padRight":true},{"text":"with the proof as follows.","element":"span"}],[{"text":"According to (e) in Lemma ","element":"span"},{"href":"#id-26","referenceIndex":246,"text":"D.1, ","element":"a"},{"text":"since ","element":"span"},{"style":{"height":19.37},"width":983.38,"height":48.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/48-5.png","element":"img","alt":" (1 + 2H )H ≤ 9, for any h ∈ [H] and 1 ≤ t ≤ Th(s, a) − 2, we","inline":true}],[{"text":"have ","element":"span"},{"href":"#id-26","referenceIndex":246,"style":{"height":19.37},"width":1510.97,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/48-6.png","element":"img","alt":" 0 ≤ y(s, a, h, t) ≤ (1+ 2H )h−1 1H ≤ 9H . 
For t = Th(s, a)−1, we have −9 ≤ −(1+ 2H )h−1(1+","inline":true}],[{"text":"1","element":"span"}],[{"style":{"height":19.37},"width":756.69,"height":48.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/48-7.png","element":"img","alt":"H ) ≤ y(s, a, h, Th − 1) ≤ (1 + 2H )h−1 1H ≤ 9H","inline":true,"padRight":true},{"text":". For ","element":"span"},{"style":{"height":16},"width":200.42,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/48-8.png","element":"img","alt":" t = Th(s, a)","inline":true},{"text":", since ","element":"span"},{"style":{"height":19.59},"width":262.19,"height":48.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/48-9.png","element":"img","alt":" yTh+1h (s, a) = 0","inline":true},{"text":", we have","element":"span"}],[{"style":{"height":19.37},"width":801.55,"height":48.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/48-10.png","element":"img","alt":"y(s, a, h, Th) = −(1 + 2H )h−1(1 + 1H ) ∈ [−9, 0].","inline":true}],[{"text":"Now we will deal with the first term in Equation ","element":"span"},{"href":"#id-179","text":"(81)","element":"a"},{"text":":","element":"span"}],[{"style":{"width":"90%"},"width":1434,"height":530,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/48-11.png","element":"img"}],[{"text":"where","element":"span"}],[{"style":{"width":"47%"},"width":755,"height":374,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/48-12.png","element":"img"}],[{"text":"Here, ","element":"span"},{"href":"#id-124","referenceIndex":283,"style":{"height":26.83},"width":1489.6,"height":67.09,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/48-13.png","element":"img","alt":" V (s, a, h, t) = �k,m,j(Ps,a,h−1sk,m,jh+1 )(V kh+1−V ⋆h+1)I[(sk,m,jh , ak,m,jh ) = (s, a), tkh(s, a) =","inline":true}],[{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":"]","element":"span"},{"text":", 
which is defined in the event ","element":"span"},{"style":{"height":13.19},"width":52.94,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/48-14.png","element":"img","alt":" E10","inline":true,"padRight":true},{"text":"in Lemma ","element":"span"},{"href":"#id-124","referenceIndex":283,"text":"D.5. ","element":"a"},{"text":"Then based on the event ","element":"span"},{"style":{"height":13.6},"width":216,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/48-15.png","element":"img","alt":" E10, we have:","inline":true}],[{"style":{"width":"33%"},"width":531,"height":61,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/48-16.png","element":"img"}],[{"text":"Since ","element":"span"},{"style":{"height":16},"width":277.47,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/48-17.png","element":"img","alt":" |y(s, a, h, t)| ≤ 9","inline":true},{"text":", it holds that:","element":"span"}],[{"style":{"width":"82%"},"width":1313,"height":129,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/48-18.png","element":"img"}],[{"text":"Because of Equation ","element":"span"},{"href":"#id-112","text":"(16) ","element":"a"},{"text":"in Lemma ","element":"span"},{"href":"#id-26","referenceIndex":246,"text":"D.1, ","element":"a"},{"text":"we have:","element":"span"}],[{"id":"id-181","style":{"width":"83%"},"width":1317,"height":455,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/48-19.png","element":"img"}],[{"text":"The second inequality uses the Cauchy-Schwarz inequality. Similarly, it also holds that:","element":"span"}],[{"id":"id-180","style":{"width":"78%"},"width":1250,"height":516,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/49-0.png","element":"img"}],[{"text":"Here, the third inequality uses the Cauchy-Schwarz inequality. 
Inequality ","element":"span"},{"href":"#id-180","text":"(83) ","element":"a"},{"text":"follows from (h)","element":"span"}],[{"text":"in Lemma ","element":"span"},{"href":"#id-26","referenceIndex":246,"text":"D.1. ","element":"a"},{"text":"Similarly, since ","element":"span"},{"href":"#id-180","style":{"height":19.37},"width":784.78,"height":48.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/49-1.png","element":"img","alt":" y(s, a, t, Th) = −(1 + 2H )h−1(1 + 1H ), we have:","inline":true}],[{"id":"id-182","style":{"width":"74%"},"width":1175,"height":443,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/49-2.png","element":"img"}],[{"text":"Here, the last inequality uses the Cauchy-Schwarz inequality and (h) in","element":"span"}],[{"text":"Lemma ","element":"span"},{"href":"#id-26","referenceIndex":246,"text":"D.1.","element":"a"}],[{"text":"Using the upper bounds of ","element":"span"},{"style":{"height":13.19},"width":44.48,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/49-3.png","element":"img","alt":" C1","inline":true,"padRight":true},{"text":"in Equation ","element":"span"},{"href":"#id-181","text":"(82)","element":"a"},{"text":", ","element":"span"},{"style":{"height":13.19},"width":44.48,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/49-4.png","element":"img","alt":" C2","inline":true,"padRight":true},{"text":"in Equation ","element":"span"},{"href":"#id-180","text":"(84) ","element":"a"},{"text":"and ","element":"span"},{"style":{"height":13.19},"width":44.48,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/49-5.png","element":"img","alt":" C3","inline":true,"padRight":true},{"text":"in Equation ","element":"span"},{"href":"#id-182","text":"(85)","element":"a"},{"text":", we can bound","element":"span"}],[{"text":"the first term in Equation 
","element":"span"},{"href":"#id-179","text":"(81) ","element":"a"},{"text":"with ","element":"span"},{"href":"#id-181","style":{"height":19.67},"width":538.42,"height":49.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/49-6.png","element":"img","alt":" O(√SAH2T1ι + SAH52 (Mι)12 )","inline":true},{"text":". Then combined with the event","element":"span"}],[{"style":{"height":13.19},"width":37.03,"height":32.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/49-7.png","element":"img","alt":"E9","inline":true,"padRight":true},{"text":"in Lemma ","element":"span"},{"href":"#id-26","referenceIndex":246,"text":"D.1, ","element":"a"},{"text":"it hold","element":"span"},{"href":"#id-179","text":"s th","element":"a"},{"text":"at:","element":"span"}],[{"style":{"width":"82%"},"width":1311,"height":189,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/49-8.png","element":"img"}],[{"id":"id-167","style":{"fontWeight":"bold"},"text":"Lemma E.6. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under the event ","element":"span"},{"style":{"height":13.19},"width":52.93,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/49-9.png","element":"img","alt":" E11","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"in Lemma ","element":"span"},{"href":"#id-124","referenceIndex":283,"style":{"fontStyle":"italic"},"text":"D.5, ","element":"a"},{"style":{"fontStyle":"italic"},"text":"it holds that:","element":"span"}],[{"style":{"width":"42%"},"width":669,"height":118,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/49-10.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"Based on the definition of ","element":"span"},{"style":{"height":19.5},"width":82.87,"height":48.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/49-11.png","element":"img","alt":" ϕkh+1 ","inline":true,"padRight":true},{"text":"Equation ","element":"span"},{"href":"#id-183","text":"(63) ","element":"a"},{"text":"and the event ","element":"span"},{"style":{"height":13.19},"width":52.94,"height":32.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/49-12.png","element":"img","alt":" E11","inline":true,"padRight":true},{"text":"in Lemma ","element":"span"},{"href":"#id-124","referenceIndex":283,"text":"D.5, ","element":"a"},{"text":"we have:","element":"span"}],[{"id":"id-164","style":{"width":"92%"},"width":1460,"height":427,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/49-13.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Lemma E.7. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under the event ","element":"span"},{"style":{"height":20.8},"width":125.5,"height":51.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/50-0.png","element":"img","alt":"�15i=1 Ei","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"in Lemma ","element":"span"},{"href":"#id-124","referenceIndex":283,"style":{"fontStyle":"italic"},"text":"D.5, ","element":"a"},{"style":{"fontStyle":"italic"},"text":"we have:","element":"span"}],[{"style":{"width":"46%"},"width":736,"height":89,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/50-1.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"h","element":"span"},{"text":"=1","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"style":{"height":29.36},"width":1540.11,"height":73.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/50-2.png","element":"img","alt":"��SAH2T1ι +�βSAH2T1ι +�β2SAH2T1ι + SAH114 T141 
ι34 + S32 AH3�N0 log(T1)ι�.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof.","element":"span"}],[{"style":{"width":"43%"},"width":692,"height":90,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/50-3.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"h","element":"span"},{"text":"=1","element":"span"}],[{"id":"id-184","style":{"width":"98%"},"width":1568,"height":347,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/50-4.png","element":"img"}],[{"text":"Next, we will bound the first term in Equation ","element":"span"},{"href":"#id-184","text":"(86)","element":"a"},{"text":". Based on Equation ","element":"span"},{"href":"#id-141","text":"(40)","element":"a"},{"text":", we have:","element":"span"}],[{"id":"id-189","style":{"width":"78%"},"width":1246,"height":121,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/50-5.png","element":"img"}],[{"text":"According to the upper bound given in Equation ","element":"span"},{"href":"#id-139","text":"(41) ","element":"a"},{"text":"and Equation ","element":"span"},{"href":"#id-140","text":"(42)","element":"a"},{"text":", we have:","element":"span"}],[{"id":"id-187","style":{"width":"85%"},"width":1348,"height":109,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/50-6.png","element":"img"}],[{"text":"Since ","element":"span"},{"style":{"height":20.07},"width":332.47,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/50-7.png","element":"img","alt":" V ref,kh (s) ≥ V REFh (s)","inline":true},{"text":", we have ","element":"span"},{"href":"#id-137","style":{"height":21.67},"width":423.64,"height":54.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/50-8.png","element":"img","alt":" Ps,a,hV ref,kh+1 ≥ Ps,a,hV REFh+1","inline":true},{"text":". 
Then according to Equation ","element":"span"},{"href":"#id-169","text":"(72)","element":"a"},{"text":",","element":"span"}],[{"text":"using the definition of ","element":"span"},{"style":{"height":10},"width":40.93,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/50-9.png","element":"img","alt":" χ5","inline":true,"padRight":true},{"text":"Equation ","element":"span"},{"href":"#id-137","text":"(38)","element":"a"},{"text":", it holds that:","element":"span"}],[{"id":"id-186","style":{"width":"78%"},"width":1244,"height":750,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/50-10.png","element":"img"}],[{"text":"The last inequality is because of Equation ","element":"span"},{"href":"#id-169","text":"(72)","element":"a"},{"text":". According to Equation ","element":"span"},{"href":"#id-176","text":"(76) ","element":"a"},{"text":"and Equation ","element":"span"},{"href":"#id-175","text":"(78)","element":"a"},{"text":", we","element":"span"}],[{"text":"have:","element":"span"}],[{"id":"id-185","style":{"width":"97%"},"width":1548,"height":180,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/50-11.png","element":"img"}],[{"text":"Because of the event ","element":"span"},{"style":{"height":13.19},"width":52.93,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/51-0.png","element":"img","alt":" E12","inline":true,"padRight":true},{"text":"in Lemma ","element":"span"},{"href":"#id-124","referenceIndex":283,"text":"D.5 ","element":"a"},{"text":"and Equation ","element":"span"},{"href":"#id-185","text":"(90)","element":"a"},{"text":", back to Equation ","element":"span"},{"href":"#id-186","text":"(89)","element":"a"},{"text":", we have:","element":"span"}],[{"id":"id-188","style":{"width":"74%"},"width":1179,"height":104,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/51-1.png","element":"img"}],[{"text":"Applying inequalities Equation 
","element":"span"},{"href":"#id-187","text":"(88) ","element":"a"},{"text":"and Equation ","element":"span"},{"href":"#id-188","text":"(91) ","element":"a"},{"text":"to Equation ","element":"span"},{"href":"#id-189","text":"(87)","element":"a"},{"text":", we have:","element":"span"}],[{"id":"id-190","style":{"width":"99%"},"width":1585,"height":563,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/51-2.png","element":"img"}],[{"text":"Because for any ","element":"span"},{"style":{"height":20.07},"width":758.73,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/51-3.png","element":"img","alt":" s ∈ S, V ref,kh (s) ≥ V REFh (s) ≥ V ⋆h (s), we have:","inline":true}],[{"text":"0 ","element":"span"},{"style":{"height":21.67},"width":1552.94,"height":54.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/51-4.png","element":"img","alt":" ≤ V ref,kh+1 (s) − V ⋆h+1(s) = (V ref,kh+1 (s) − V ⋆h+1(s))λkh+1(s) + (V ref,kh+1 (s) − V ⋆h+1(s))(1 − λkh+1(s)).","inline":true}],[{"text":"If ","element":"span"},{"style":{"height":19.5},"width":228.28,"height":48.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/51-5.png","element":"img","alt":" λkh+1(s) = 0","inline":true},{"text":", the reference function is updated and we have ","element":"span"},{"style":{"height":21.67},"width":430.29,"height":54.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/51-6.png","element":"img","alt":" V ref,kh+1 (s) − V ⋆h+1(s) ≤ β","inline":true},{"text":"; if","element":"span"}],[{"style":{"height":19.5},"width":207.04,"height":48.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/51-7.png","element":"img","alt":"λkh+1(s) = 1","inline":true},{"text":", then we have ","element":"span"},{"style":{"height":21.67},"width":637.85,"height":54.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/51-8.png","element":"img","alt":" V ref,kh+1 (s) − V 
⋆h+1(s) ≤ H = Hλkh+1(s)","inline":true},{"text":". Therefore, we have:","element":"span"}],[{"id":"id-193","style":{"width":"72%"},"width":1144,"height":54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/51-9.png","element":"img"}],[{"text":"Combined with the inequality Equation ","element":"span"},{"href":"#id-185","text":"(90)","element":"a"},{"text":", we have:","element":"span"}],[{"style":{"width":"91%"},"width":1452,"height":385,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/51-10.png","element":"img"}],[{"text":"and then back to Equation ","element":"span"},{"href":"#id-190","text":"(92)","element":"a"},{"text":", it holds:","element":"span"}],[{"style":{"width":"100%"},"width":1586,"height":936,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/51-11.png","element":"img"}],[{"text":"In the last inequality, we use Cauchy-Schwarz Inequality.","element":"span"}],[{"text":"Next we will bound ","element":"span"},{"style":{"height":21.96},"width":496.77,"height":54.9,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/52-0.png","element":"img","alt":"�s,a,h Vs,a,h(V ⋆h+1)Y Thh (s, a)","inline":true},{"text":". 
Because ","element":"span"},{"style":{"height":17.99},"width":229.85,"height":44.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/52-1.png","element":"img","alt":" V ⋆H+1(s) = 0","inline":true},{"text":", removing the term","element":"span"}],[{"style":{"height":22.44},"width":328.02,"height":56.11,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/52-2.png","element":"img","alt":"∑k,m,j V ⋆1 (sk,m,j1 )2","inline":true},{"text":", we have the following inequality:","element":"span"}],[{"style":{"width":"57%"},"width":911,"height":270,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/52-3.png","element":"img"}],[{"text":"Because of the event ","element":"span"},{"style":{"height":13.19},"width":52.93,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/52-4.png","element":"img","alt":" E14","inline":true,"padRight":true},{"text":"in Lemma ","element":"span"},{"href":"#id-124","referenceIndex":283,"text":"D.5, ","element":"a"},{"text":"we have:","element":"span"}],[{"id":"id-191","style":{"width":"83%"},"width":1317,"height":123,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/52-5.png","element":"img"}],[{"text":"According to Equation ","element":"span"},{"href":"#id-130","text":"(1)","element":"a"},{"text":", for any ","element":"span"},{"style":{"height":14},"width":262.78,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/52-6.png","element":"img","alt":" a ∈ A, we have:","inline":true}],[{"style":{"width":"59%"},"width":943,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/52-7.png","element":"img"}],[{"text":"Therefore, we have:","element":"span"}],[{"id":"id-192","style":{"width":"95%"},"width":1521,"height":631,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/52-8.png","element":"img"}],[{"text":"Here, the first inequality holds because 
","element":"span"},{"style":{"height":24.56},"width":654.16,"height":61.39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/52-9.png","element":"img","alt":" V ⋆h (sk,m,jh ), Psk,m,jh ,ak,m,jh ,h(V ⋆h+1) ≤ H","inline":true},{"text":". The last step is because","element":"span"}],[{"text":"of the event ","element":"span"},{"style":{"height":13.19},"width":52.94,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/52-10.png","element":"img","alt":" E13","inline":true,"padRight":true},{"text":"in Lemma ","element":"span"},{"href":"#id-124","referenceIndex":283,"text":"D.5. ","element":"a"},{"text":"Summing Equation ","element":"span"},{"href":"#id-191","text":"(96) ","element":"a"},{"text":"and Equation ","element":"span"},{"href":"#id-192","text":"(97) ","element":"a"},{"text":"up, we have:","element":"span"}],[{"id":"id-195","style":{"width":"100%"},"width":1586,"height":822,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/52-11.png","element":"img"}],[{"text":"Now we successfully bound the first term in Equation ","element":"span"},{"href":"#id-184","text":"(86)","element":"a"},{"text":". For the second term, according to the","element":"span"}],[{"text":"definition of ","element":"span"},{"style":{"height":20.47},"width":339.75,"height":51.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/53-0.png","element":"img","alt":" ˜vadv,kh (s, a), we have:","inline":true}],[{"style":{"width":"91%"},"width":1444,"height":279,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/53-1.png","element":"img"}],[{"text":"The last inequality is because for any ","element":"span"},{"style":{"height":20.07},"width":564.52,"height":50.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/53-2.png","element":"img","alt":" s ∈ S, V ref,kh (s) ≥ V kh (s) ≥ V ⋆h (s)","inline":true},{"text":". 
Using Equation ","element":"span"},{"href":"#id-193","text":"(94) ","element":"a"},{"text":"and","element":"span"}],[{"text":"Cauchy-Schwarz inequality, we have:","element":"span"}],[{"id":"id-194","style":{"width":"91%"},"width":1448,"height":128,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/53-3.png","element":"img"}],[{"text":"Similar to Equation ","element":"span"},{"href":"#id-185","text":"(90)","element":"a"},{"text":", we have:","element":"span"}],[{"style":{"width":"52%"},"width":840,"height":135,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/53-4.png","element":"img"}],[{"text":"Back to Equation ","element":"span"},{"href":"#id-194","text":"(99)","element":"a"},{"text":", we have:","element":"span"}],[{"style":{"width":"51%"},"width":820,"height":107,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/53-5.png","element":"img"}],[{"text":"Then using Lemma ","element":"span"},{"href":"#id-115","referenceIndex":251,"text":"D.2, ","element":"a"},{"text":"we have:","element":"span"}],[{"id":"id-196","style":{"width":"88%"},"width":1405,"height":557,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/53-6.png","element":"img"}],[{"text":"For the third term in Equation ","element":"span"},{"href":"#id-189","text":"(87)","element":"a"},{"text":", according to Lemma ","element":"span"},{"href":"#id-115","referenceIndex":251,"text":"D.2, ","element":"a"},{"text":"we have:","element":"span"}],[{"style":{"width":"85%"},"width":1352,"height":704,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/53-7.png","element":"img"}],[{"text":"and","element":"span"}],[{"style":{"width":"81%"},"width":1298,"height":188,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/54-0.png","element":"img"}],[{"text":"Summing the four inequalities, we can bound the third term in Equation ","element":"span"},{"href":"#id-189","text":"(87) 
","element":"a"},{"text":"with:","element":"span"}],[{"id":"id-197","style":{"width":"69%"},"width":1108,"height":74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/54-1.png","element":"img"}],[{"text":"Applying the upper bound Equation ","element":"span"},{"href":"#id-195","text":"(98)","element":"a"},{"text":", Equation ","element":"span"},{"href":"#id-196","text":"(100) ","element":"a"},{"text":"and Equation ","element":"span"},{"href":"#id-197","text":"(101) ","element":"a"},{"text":"to Equation ","element":"span"},{"href":"#id-184","text":"(86)","element":"a"},{"text":", we","element":"span"}],[{"text":"have:","element":"span"}],[{"style":{"width":"46%"},"width":736,"height":89,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/54-2.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"h","element":"span"},{"text":"=1","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"style":{"height":29.36},"width":1540.11,"height":73.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/54-3.png","element":"img","alt":"��SAH2T1ι +�βSAH2T1ι +�β2SAH2T1ι + SAH114 T141 ι34 + S32 AH3�N0 log(T1)ι�.","inline":true}],[{"style":{"width":"1%"},"width":28,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/54-4.png","element":"img"}],[{"id":"id-34","text":"F ","element":"span"},{"text":"P","element":"span"},{"text":"ROOF OF ","element":"span"},{"text":"T","element":"span"},{"text":"HEOREM ","element":"span"},{"href":"#id-32","text":"4.2","element":"a"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"Because of (e) in Lemma ","element":"span"},{"href":"#id-26","referenceIndex":246,"text":"D.1, ","element":"a"},{"text":"we have: ","element":"span"},{"style":{"height":22.4},"width":146.13,"height":56,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/54-5.png","element":"img","alt":"ˆT ≥�","inline":true}],[{"style":{"width":"60%"},"width":954,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/54-6.png","element":"img"}],[{"text":"The last inequality is because ","element":"span"},{"style":{"height":17.9},"width":272.4,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/54-7.png","element":"img","alt":" y1h(s, a) ≥ MH","inline":true,"padRight":true},{"text":"according to (c) in Lemma ","element":"span"},{"href":"#id-26","referenceIndex":246,"text":"D.1. ","element":"a"},{"text":"Using Jensen’s","element":"span"}],[{"text":"inequality, we have:","element":"span"}],[{"id":"id-33","style":{"width":"100%"},"width":1586,"height":439,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/54-8.png","element":"img"}],[{"text":"Because ","element":"span"},{"style":{"height":15.19},"width":231.95,"height":37.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/54-9.png","element":"img","alt":" usyn = TRUE","inline":true},{"text":", in each round, there exists at least one triple ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a, h","element":"span"},{"text":") ","element":"span"},{"text":"such that the triggering","element":"span"}],[{"text":"condition is met on it, the total number of rounds is at most the total times of triggering conditions","element":"span"}],[{"text":"met for ","element":"span"},{"style":{"height":16},"width":362.19,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/54-10.png","element":"img","alt":" ∀(s, a, h) ∈ (S, A, 
H)","inline":true},{"text":". Next, we count how many times the triggering condition is met for each","element":"span"}],[{"text":"triple ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a, h","element":"span"},{"text":")","element":"span"},{"text":". If the triggering condition for ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a, h","element":"span"},{"text":") ","element":"span"},{"text":"is met at round ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":", the increase in visits to","element":"span"}],[{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a, h","element":"span"},{"text":") ","element":"span"},{"text":"is between ","element":"span"},{"style":{"height":17.9},"width":370.82,"height":44.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/54-11.png","element":"img","alt":" ckh(s, a) and Mckh(s, a)","inline":true},{"text":". We now bound how many times the triggering condition","element":"span"}],[{"text":"can be met in one stage for each ","element":"span"},{"style":{"height":16},"width":349.55,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/54-12.png","element":"img","alt":" (s, a, h) ∈ (S, A, H).","inline":true}],[{"text":"1. In the first stage of ","element":"span"},{"style":{"height":17.9},"width":349.56,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/54-13.png","element":"img","alt":" (s, a, h), ckh(s, a) = 1","inline":true},{"text":". Then the triggering condition for ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a, h","element":"span"},{"text":") ","element":"span"},{"text":"is met at most ","element":"span"},{"style":{"fontStyle":"italic"},"text":"MH ","element":"span"},{"text":"times.","element":"span"}],[{"text":"2. 
In stage ","element":"span"},{"style":{"height":19.92},"width":1253.84,"height":49.79,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/54-14.png","element":"img","alt":" t (2 ≤ t ≤ Th(s, a)) of (s, a, h), when ˜nkh(s, a) ≤ (1− 1H )yt−1h (s, a) for round","inline":true}],[{"style":{"width":"42%"},"width":675,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/54-15.png","element":"img"}],[{"text":"Assume that in this case the corresponding triggering condition is met ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"times, at rounds ","element":"span"},{"style":{"height":15.59},"width":307.65,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/54-16.png","element":"img","alt":"k1 < k2 < ... < kp","inline":true},{"text":". For any ","element":"span"},{"style":{"height":16},"width":104.57,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/54-17.png","element":"img","alt":" i ∈ [p]","inline":true},{"text":", since ","element":"span"},{"style":{"height":13.2},"width":230.68,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/54-18.png","element":"img","alt":" ki ≥ ki−1 + 1","inline":true},{"text":", and ","element":"span"},{"style":{"height":13.19},"width":31.74,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/54-19.png","element":"img","alt":" ki","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.19},"width":72.93,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/54-20.png","element":"img","alt":" ki−1","inline":true,"padRight":true},{"text":"are in the same stage, we have ","element":"span"},{"style":{"height":20.82},"width":236.01,"height":52.05,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/54-21.png","element":"img","alt":" ˆnki−1+1h ≤ ˜nkih 
","inline":true,"padRight":true},{"text":". Especially, we know ","element":"span"},{"style":{"height":21.73},"width":612.69,"height":54.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/54-22.png","element":"img","alt":" ˆnkp−1+1h ≤ ˜nkph ≤ (1 − 1H )yt−1h (s, a).","inline":true,"padRight":true},{"text":"For any ","element":"span"},{"style":{"height":16},"width":110.57,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/54-23.png","element":"img","alt":" i ∈ [p]","inline":true},{"text":", since the triggering condition is met at the round ","element":"span"},{"style":{"height":13.19},"width":31.74,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/54-24.png","element":"img","alt":" ki","inline":true},{"text":", the increase of the","element":"span"}],[{"style":{"width":"91%"},"width":1443,"height":874,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/55-0.png","element":"img"}],[{"text":"3. 
In stage ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"style":{"height":19.92},"width":1230.56,"height":49.79,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/55-1.png","element":"img","alt":" (2 ≤ t ≤ T_h(s, a)) of (s, a, h), when ñ_h^k(s, a) > (1 − 1/H) y_h^{t−1}(s, a) for round","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":", we have ","element":"span"},{"style":{"fontStyle":"italic"},"text":"c","element":"span"},{"style":{"height":19.92},"width":735.83,"height":49.79,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/55-2.png","element":"img","alt":"_h^k(s, a) = ⌊n_h^k(s, a)/(MH)⌋ = ⌊y_h^{t−1}(s, a)/(MH)⌋.","inline":true}],[{"text":"Assume the triggering condition for ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a, h","element":"span"},{"text":") ","element":"span"},{"text":"is met ","element":"span"},{"style":{"fontStyle":"italic"},"text":"q ","element":"span"},{"text":"times in this case. For ","element":"span"},{"style":{"height":13.2},"width":87.52,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/55-3.png","element":"img","alt":" t ≥ 2","inline":true},{"text":", there exists a positive integer ","element":"span"},{"style":{"height":19.23},"width":1080.54,"height":48.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/55-4.png","element":"img","alt":" r such that rMH ≤ y_h^{t−1}(s, a) < (r + 1)MH, and then c_h^k(s, a) = r.","inline":true,"padRight":true},{"text":"Each time the triggering condition for ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a, h","element":"span"},{"text":") ","element":"span"},{"text":"is met, the number of visits increases by at least ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r","element":"span"},{"text":". 
After it has been met ","element":"span"},{"style":{"height":14},"width":88.02,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/55-5.png","element":"img","alt":" q − 1","inline":true,"padRight":true},{"text":"times, we have ","element":"span"},{"style":{"height":19.92},"width":627.57,"height":49.79,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/55-6.png","element":"img","alt":" r(q − 1) ≤ (2/H) y_h^{t−1}(s, a) < 2(r + 1)M","inline":true},{"text":". Here, the first inequality holds because ","element":"span"},{"style":{"height":19.23},"width":339.63,"height":48.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/55-7.png","element":"img","alt":" y_h^t / y_h^{t−1} ≤ 1 + 2/H","inline":true},{"text":", and the second holds because ","element":"span"},{"style":{"height":19.23},"width":415.08,"height":48.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/55-8.png","element":"img","alt":"y_h^{t−1}(s, a) < (r + 1)MH","inline":true},{"text":". 
Therefore, we know ","element":"span"},{"style":{"height":14},"width":213.91,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/55-9.png","element":"img","alt":" q ≤ 4M + 1.","inline":true}],[{"text":"Combining the three cases, the total number of times the triggering condition is met for a given triple ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a, h","element":"span"},{"text":") ","element":"span"},{"text":"is at","element":"span"}],[{"text":"most:","element":"span"}],[{"style":{"width":"50%"},"width":801,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/55-10.png","element":"img"}],[{"text":"Therefore, combining this with Equation ","element":"span"},{"href":"#id-33","text":"(102)","element":"a"},{"text":", we have:","element":"span"}],[{"style":{"width":"69%"},"width":1105,"height":485,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/55-11.png","element":"img"}],[{"text":"The last equality holds because ","element":"span"},{"style":{"height":18.83},"width":822.35,"height":47.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/55-12.png","element":"img","alt":"T̂ = MT and log(1 + x) ≥ x/2 for any 0 < x ≤ 1","inline":true},{"text":". 
The last inequality","element":"span"}],[{"text":"also holds for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"= 1","element":"span"},{"text":".","element":"span"}],[{"text":"For ","element":"span"},{"style":{"height":15.59},"width":248.33,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/55-13.png","element":"img","alt":" usyn = FALSE","inline":true},{"text":", all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"agents meet the triggering condition in each round, so the number of rounds","element":"span"}],[{"text":"is at least:","element":"span"}],[{"style":{"width":"79%"},"width":1254,"height":151,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.18795/images/55-14.png","element":"img"}]]}],"_version":"3.3.4"},"paperNode":"$28:props:children:props:children:0:props:product"}]]