36:[["$","audio",null,{"id":"tts"}],["$","$L3b",null,{"paperID":"2405.19697","publisher":"arxiv","paperJSON":{"title":"Bilevel reinforcement learning via the development of hyper-gradient without lower-level convexity","paperID":"2405.19697","avgLineHeight":11.92,"imgScale":4,"sections":[{"heading":"Abstract","paragraphs":[[{"text":"Bilevel reinforcement learning (RL), which features intertwined two-level problems, has attracted growing interest recently. The inherent non-convexity of the lower-level RL problem is, however, an impediment to developing bilevel optimization methods. By employing the fixed point equation associated with the regularized RL, we characterize the hyper-gradient via fully first-order information, thus circumventing the assumption of lower-level convexity. This, remarkably, distinguishes our development of hyper-gradient from the general AID-based bilevel frameworks since we take advantage of the specific structure of RL problems. ","element":"span"},{"text":"Moreover, we design both model-based and model-free bilevel reinforcement learning algorithms, facilitated by access to the fully first-order hyper-gradient. Both algorithms enjoy the convergence rate ","element":"span"},{"style":{"height":19.2},"width":134.7,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/0-0.png","element":"img","alt":" O(ϵ−1)","inline":true},{"text":". To extend the applicability, a stochastic version of the model-free algorithm is proposed, along with results on its iteration and sample complexity. 
In addition, numerical experiments demonstrate that the hyper-gradient indeed serves as an integration of exploitation and exploration.","element":"span"}]]},{"heading":"1 INTRODUCTION","paragraphs":[[{"text":"Bilevel optimization, which aims to solve problems with a hierarchical structure, has achieved success in a wide range of machine learning applications, e.g., hyper-parameter optimization (","element":"span"},{"href":"#id-0","referenceIndex":22,"text":"Franceschi et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-0","referenceIndex":22,"text":"2017","element":"a"},{"text":", ","element":"span"},{"href":"#id-1","referenceIndex":23,"text":"2018","element":"a"},{"text":"; ","element":"span"},{"href":"#id-2","referenceIndex":99,"text":"Ye et al.","element":"a"},{"text":",","element":"span"}],[{"text":"Proceedings of the 28","element":"span"},{"style":{"height":6.8},"width":27.5,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/0-1.png","element":"img","alt":"th ","inline":true,"padRight":true},{"text":"International Conference on Artificial Intelligence and Statistics (AISTATS) 2025, Mai Khao, Thailand. PMLR: Volume 258. 
Copyright 2025 by the author(s).","element":"span"}],[{"href":"#id-2","referenceIndex":99,"text":"2023","element":"a"},{"text":"), meta-learning (","element":"span"},{"href":"#id-3","referenceIndex":7,"text":"Bertinetto et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-3","referenceIndex":7,"text":"2019","element":"a"},{"text":"), computer vision (","element":"span"},{"href":"#id-4","referenceIndex":57,"text":"Liu et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-4","referenceIndex":57,"text":"2021a","element":"a"},{"text":"), neural architecture search (","element":"span"},{"href":"#id-5","referenceIndex":56,"text":"Liu et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-5","referenceIndex":56,"text":"2019","element":"a"},{"text":"; ","element":"span"},{"href":"#id-6","referenceIndex":92,"text":"Wang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-6","referenceIndex":92,"text":"2022","element":"a"},{"text":"), adversarial training (","element":"span"},{"href":"#id-7","referenceIndex":91,"text":"Wang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-7","referenceIndex":91,"text":"2021","element":"a"},{"text":"), reinforcement learning (","element":"span"},{"href":"#id-8","referenceIndex":10,"text":"Chakraborty et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-8","referenceIndex":10,"text":"2024","element":"a"},{"text":"; ","element":"span"},{"href":"#id-9","referenceIndex":80,"text":"Shen et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-9","referenceIndex":80,"text":"2024","element":"a"},{"text":"; ","element":"span"},{"href":"#id-10","referenceIndex":89,"text":"Thoma ","element":"a"},{"href":"#id-10","referenceIndex":89,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-10","referenceIndex":89,"text":"2024","element":"a"},{"text":"), data poisoning (","element":"span"},{"href":"#id-11","referenceIndex":63,"text":"Liu et 
al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-11","referenceIndex":63,"text":"2024b","element":"a"},{"text":"). To address the bilevel optimization problems, a line of research has emerged recently (","element":"span"},{"href":"#id-12","referenceIndex":25,"text":"Ghadimi and Wang","element":"a"},{"text":", ","element":"span"},{"href":"#id-12","referenceIndex":25,"text":"2018","element":"a"},{"text":"; ","element":"span"},{"href":"#id-13","referenceIndex":62,"text":"Liu et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-13","referenceIndex":62,"text":"2020","element":"a"},{"text":", ","element":"span"},{"href":"#id-14","referenceIndex":58,"text":"2021b","element":"a"},{"text":"; ","element":"span"},{"href":"#id-15","referenceIndex":41,"text":"Ji et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-15","referenceIndex":41,"text":"2022","element":"a"},{"text":"; ","element":"span"},{"href":"#id-16","referenceIndex":36,"text":"Hu et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-16","referenceIndex":36,"text":"2023","element":"a"},{"text":"; ","element":"span"},{"href":"#id-17","referenceIndex":46,"text":"Kwon et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-17","referenceIndex":46,"text":"2023","element":"a"},{"text":"; ","element":"span"},{"href":"#id-18","referenceIndex":79,"text":"Shen and Chen","element":"a"},{"text":", ","element":"span"},{"href":"#id-18","referenceIndex":79,"text":"2023","element":"a"},{"text":"; ","element":"span"},{"href":"#id-19","referenceIndex":33,"text":"Hao et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-19","referenceIndex":33,"text":"2024","element":"a"},{"text":"; ","element":"span"},{"href":"#id-20","referenceIndex":98,"text":"Yao et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-20","referenceIndex":98,"text":"2024","element":"a"},{"text":"). Generally, the upper-level (resp. 
the lower-level) problem optimizes the decision taken by the leader (resp. the follower), which exhibits potential for handling complicated decision-making processes such as Markov decision processes (MDPs).","element":"span"}],[{"text":"Reinforcement learning (RL) (","element":"span"},{"href":"#id-21","referenceIndex":87,"text":"Sutton and Barto","element":"a"},{"text":", ","element":"span"},{"href":"#id-21","referenceIndex":87,"text":"2018","element":"a"},{"text":"; ","element":"span"},{"href":"#id-22","referenceIndex":88,"text":"Szepesvári","element":"a"},{"text":", ","element":"span"},{"href":"#id-22","referenceIndex":88,"text":"2022","element":"a"},{"text":") serves as an effective way of learning to make sequential decisions in MDPs, and has seen plenty of applications (","element":"span"},{"href":"#id-23","referenceIndex":82,"text":"Silver et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-23","referenceIndex":82,"text":"2017","element":"a"},{"text":"; ","element":"span"},{"href":"#id-24","referenceIndex":6,"text":"Berner et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-24","referenceIndex":6,"text":"2019","element":"a"},{"text":"; ","element":"span"},{"href":"#id-25","referenceIndex":69,"text":"Mirhoseini et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-25","referenceIndex":69,"text":"2021","element":"a"},{"text":"; ","element":"span"},{"href":"#id-26","referenceIndex":68,"text":"Miki et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-26","referenceIndex":68,"text":"2022","element":"a"},{"text":"; ","element":"span"},{"href":"#id-27","referenceIndex":86,"text":"Sun ","element":"a"},{"href":"#id-27","referenceIndex":86,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-27","referenceIndex":86,"text":"2024","element":"a"},{"text":"). The central task of RL is to find the optimal policy that maximizes the expected cumulative rewards in an MDP. 
Bilevel RL enriches the framework of RL by considering a two-level problem: the follower solves a standard RL problem within an environment parameterized by the decision variable chosen by the leader; meanwhile, the leader optimizes the decision variable based on the response policy from the lower level. Recently, bilevel RL has gained increasing attention in practical applications, including RL from human feedback (","element":"span"},{"href":"#id-28","referenceIndex":16,"text":"Christiano et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-28","referenceIndex":16,"text":"2017","element":"a"},{"text":"), inverse RL (","element":"span"},{"href":"#id-29","referenceIndex":8,"text":"Brown et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-29","referenceIndex":8,"text":"2019","element":"a"},{"text":"), and reward shaping (","element":"span"},{"href":"#id-30","referenceIndex":38,"text":"Hu et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-30","referenceIndex":38,"text":"2020","element":"a"},{"text":").","element":"span"}],[{"text":"In this paper, we focus on the bilevel reinforcement learning problem:","element":"span"}],[{"id":"id-41","style":{"width":"77%"},"width":732,"height":132,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/0-2.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":10.8},"width":30,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/0-3.png","element":"img","alt":" Π","inline":true,"padRight":true},{"text":"is the policy set of interest, the upper-level function ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"is defined on ","element":"span"},{"style":{"height":10.8},"width":130.07,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/0-4.png","element":"img","alt":" Rn × Π","inline":true},{"text":", the univariate function 
","element":"span"},{"style":{"height":10},"width":26,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/0-5.png","element":"img","alt":" ϕ","inline":true,"padRight":true},{"text":"is called the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"hyper-objective","element":"span"},{"text":", and the gradient","element":"span"}],[{"id":"id-50","text":"Table 1: Comparison among bilevel reinforcement learning algorithms.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"85%"},"width":1672,"height":469,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/1-0.png","element":"img"}],[{"text":"of ","element":"span"},{"style":{"height":15.6},"width":76.74,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/1-1.png","element":"img","alt":" ϕ(x)","inline":true,"padRight":true},{"text":"is referred to as the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"hyper-gradient ","element":"span"},{"text":"(","element":"span"},{"href":"#id-31","referenceIndex":73,"text":"Pedregosa","element":"a"},{"text":", ","element":"span"},{"href":"#id-31","referenceIndex":73,"text":"2016","element":"a"},{"text":"; ","element":"span"},{"href":"#id-32","referenceIndex":26,"text":"Grazzi et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-32","referenceIndex":26,"text":"2020","element":"a"},{"text":"; ","element":"span"},{"href":"#id-33","referenceIndex":95,"text":"Yang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-33","referenceIndex":95,"text":"2023","element":"a"},{"text":"; ","element":"span"},{"href":"#id-34","referenceIndex":11,"text":"Chen et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-34","referenceIndex":11,"text":"2024","element":"a"},{"text":") if it exists. 
Approximate implicit differentiation (AID) based methods, which rely on the hyper-gradient, have flourished recently (","element":"span"},{"href":"#id-12","referenceIndex":25,"text":"Ghadimi and ","element":"a"},{"href":"#id-12","referenceIndex":25,"text":"Wang","element":"a"},{"text":", ","element":"span"},{"href":"#id-12","referenceIndex":25,"text":"2018","element":"a"},{"text":"; ","element":"span"},{"href":"#id-35","referenceIndex":42,"text":"Ji et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-35","referenceIndex":42,"text":"2021","element":"a"},{"text":"; ","element":"span"},{"href":"#id-36","referenceIndex":3,"text":"Arbel and Mairal","element":"a"},{"text":", ","element":"span"},{"href":"#id-36","referenceIndex":3,"text":"2022","element":"a"},{"text":"; ","element":"span"},{"href":"#id-37","referenceIndex":19,"text":"Dagréou et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-37","referenceIndex":19,"text":"2022","element":"a"},{"text":"; ","element":"span"},{"href":"#id-38","referenceIndex":59,"text":"Liu et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-38","referenceIndex":59,"text":"2023","element":"a"},{"text":"; ","element":"span"},{"href":"#id-39","referenceIndex":37,"text":"Hu et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-39","referenceIndex":37,"text":"2024","element":"a"},{"text":"; ","element":"span"},{"href":"#id-40","referenceIndex":97,"text":"Yang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-40","referenceIndex":97,"text":"2025","element":"a"},{"text":"). 
Specifically, in each ","element":"span"},{"style":{"fontStyle":"italic"},"text":"outer iteration ","element":"span"},{"text":"of the AID-based method, one implements an inexact hyper-gradient descent step ","element":"span"},{"style":{"height":15.6},"width":391.78,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/1-2.png","element":"img","alt":" xk+1 = xk − β ∇̃ϕ(xk),","inline":true,"padRight":true},{"text":"where the estimator ","element":"span"},{"style":{"height":16},"width":131.15,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/1-3.png","element":"img","alt":"∇̃ϕ(xk)","inline":true,"padRight":true},{"text":"of the hyper-gradient is obtained with the help of a few ","element":"span"},{"style":{"fontStyle":"italic"},"text":"inner iterations","element":"span"},{"text":". We consider the extension of AID-based methods to the bilevel reinforcement learning problem (","element":"span"},{"href":"#id-41","text":"1","element":"a"},{"text":"). 
Note that AID-based methods depend on the lower-level strong convexity (","element":"span"},{"href":"#id-42","referenceIndex":44,"text":"Khanduri et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-42","referenceIndex":44,"text":"2021","element":"a"},{"text":"; ","element":"span"},{"href":"#id-43","referenceIndex":40,"text":"Huang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-43","referenceIndex":40,"text":"2022","element":"a"},{"text":"; ","element":"span"},{"href":"#id-44","referenceIndex":53,"text":"Li et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-44","referenceIndex":53,"text":"2022","element":"a"},{"text":") or uniform Polyak–Łojasiewicz (PL) condition (","element":"span"},{"href":"#id-45","referenceIndex":39,"text":"Huang","element":"a"},{"text":", ","element":"span"},{"href":"#id-45","referenceIndex":39,"text":"2023","element":"a"},{"text":"; ","element":"span"},{"href":"#id-8","referenceIndex":10,"text":"Chakraborty et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-8","referenceIndex":10,"text":"2024","element":"a"},{"text":") to ensure the existence of the hyper-gradient. 
However, the lower-level problem in (","element":"span"},{"href":"#id-41","text":"1","element":"a"},{"text":")—always an RL problem—is inherently non-convex even with strongly-convex regularization (","element":"span"},{"href":"#id-46","referenceIndex":1,"text":"Agarwal et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-46","referenceIndex":1,"text":"2020","element":"a"},{"text":"; ","element":"span"},{"href":"#id-47","referenceIndex":47,"text":"Lan","element":"a"},{"text":", ","element":"span"},{"href":"#id-47","referenceIndex":47,"text":"2023","element":"a"},{"text":"), and only the non-uniform PL condition has been established (","element":"span"},{"href":"#id-48","referenceIndex":67,"text":"Mei et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-48","referenceIndex":67,"text":"2020","element":"a"},{"text":"), which renders ","element":"span"},{"style":{"height":14},"width":59.21,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/1-4.png","element":"img","alt":" ∇ϕ","inline":true,"padRight":true},{"text":"ambiguous, as stated in (","element":"span"},{"href":"#id-9","referenceIndex":80,"text":"Shen et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-9","referenceIndex":80,"text":"2024","element":"a"},{"text":").","element":"span"}],[{"text":"Recently, ","element":"span"},{"href":"#id-49","referenceIndex":12,"text":"Chen et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-49","referenceIndex":12,"text":"2022a","element":"a"},{"text":") assumed the convexity of the hyper-objective and proposed a deterministic algorithm for the bilevel RL problem. ","element":"span"},{"href":"#id-8","referenceIndex":10,"text":"Chakraborty et al. 
","element":"a"},{"text":"(","element":"span"},{"href":"#id-8","referenceIndex":10,"text":"2024","element":"a"},{"text":") presented an AID-based framework, assuming that the lower-level problem satisfies the uniform PL condition and the Hessian non-singularity condition. ","element":"span"},{"href":"#id-9","referenceIndex":80,"text":"Shen et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-9","referenceIndex":80,"text":"2024","element":"a"},{"text":") proposed a penalty-based bilevel RL algorithm to bypass the requirement of lower-level convexity by constructing two penalty functions. The convergence rate relies on the penalty parameter ","element":"span"},{"style":{"height":11.6},"width":105.74,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/1-5.png","element":"img","alt":" λ that","inline":true,"padRight":true},{"text":"is at least the order of ","element":"span"},{"style":{"height":13.78},"width":82.43,"height":34.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/1-6.png","element":"img","alt":" ϵ−0.5","inline":true},{"text":". ","element":"span"},{"href":"#id-10","referenceIndex":89,"text":"Thoma et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-10","referenceIndex":89,"text":"2024","element":"a"},{"text":") designed a stochastic bilevel RL method, achieving the convergence rate ","element":"span"},{"style":{"height":17.38},"width":123.6,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/1-7.png","element":"img","alt":" O(ϵ−2)","inline":true},{"text":". 
In this paper, we develop model-based and model-free bilevel RL algorithms and extend the model-free algorithm to stochastic settings, all of which come with provable guarantees and exhibit an enhanced convergence rate, without the lower-level convexity assumption; see Table ","element":"span"},{"href":"#id-50","text":"1 ","element":"a"},{"text":"for a detailed comparison with ","element":"span"},{"text":"the existing works, where the notation ","element":"span"},{"style":{"height":15.6},"width":74.1,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/1-8.png","element":"img","alt":"Õ(·)","inline":true,"padRight":true},{"text":"hides logarithmic terms of ","element":"span"},{"style":{"height":13.38},"width":69.96,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/1-9.png","element":"img","alt":" ϵ−1.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Contributions ","element":"span"},{"text":"The main contributions are summarized as follows.","element":"span"}],[{"text":"Firstly, we characterize the hyper-gradient of the bilevel RL problem via fully first-order information and unveil its properties by investigating the fixed point equation associated with the entropy-regularized RL problem (","element":"span"},{"href":"#id-51","referenceIndex":72,"text":"Nachum et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-51","referenceIndex":72,"text":"2017","element":"a"},{"text":"; ","element":"span"},{"href":"#id-52","referenceIndex":24,"text":"Geist et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-52","referenceIndex":24,"text":"2019","element":"a"},{"text":"; ","element":"span"},{"href":"#id-53","referenceIndex":96,"text":"Yang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-53","referenceIndex":96,"text":"2019","element":"a"},{"text":"), which extends the ideas in (","element":"span"},{"href":"#id-54","referenceIndex":17,"text":"Christianson","element":"a"},{"text":", 
","element":"span"},{"href":"#id-54","referenceIndex":17,"text":"1994","element":"a"},{"text":"; ","element":"span"},{"href":"#id-32","referenceIndex":26,"text":"Grazzi et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-32","referenceIndex":26,"text":"2020","element":"a"},{"text":", ","element":"span"},{"href":"#id-55","referenceIndex":27,"text":"2021","element":"a"},{"text":", ","element":"span"},{"href":"#id-56","referenceIndex":28,"text":"2023","element":"a"},{"text":").","element":"span"}],[{"text":"Secondly, understanding the hyper-gradient enables us to construct its estimators, upon which we devise a model-based bilevel RL algorithm, M-SoBiRL, together with a model-free version, SoBiRL. For broader applicability, we extend SoBiRL to stochastic settings by designing a sampling scheme to estimate the hyper-gradient and a framework, Stoc-SoBiRL, aided by a momentum technique to accommodate the stochastic hyper-gradient estimator. Notably, the implementation requires only first-order oracles, which circumvents complicated second-order queries in general AID-based bilevel methods.","element":"span"}],[{"text":"Finally, we offer an analysis to illustrate the efficiency of amortizing the hyper-gradient approximation through outer iterations in M-SoBiRL, i.e., it enjoys the convergence rate ","element":"span"},{"style":{"height":17.39},"width":123.34,"height":43.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/1-10.png","element":"img","alt":" O(ϵ−1)","inline":true,"padRight":true},{"text":"with the inner iteration number ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(1) ","element":"span"},{"text":"independent of the solution accuracy ","element":"span"},{"style":{"height":0},"width":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/1-11.png","element":"img","alt":" 
ϵ","inline":true},{"text":". In the model-free scenario, an enhanced convergence property is also established. Moreover, investigating the statistical properties of Stoc-SoBiRL, we show its iteration complexity ","element":"span"},{"style":{"height":17.79},"width":148.77,"height":44.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/1-12.png","element":"img","alt":"Õ(ϵ−1.5)","inline":true,"padRight":true},{"text":"and sample complexity ","element":"span"},{"style":{"height":17.79},"width":148.77,"height":44.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/1-13.png","element":"img","alt":"Õ(ϵ−3.5)","inline":true},{"text":"; refer to Table ","element":"span"},{"href":"#id-50","text":"1 ","element":"a"},{"text":"for a detailed comparison. To the best of our knowledge, this is the first sample complexity result established for bilevel RL problems. In addition, a synthetic experiment verifies the convergence of M-SoBiRL, and the favorable performance of the proposed SoBiRL is validated on Atari games and MuJoCo simulations, which suggests that the hyper-gradient is an aggregation of exploitation and exploration.","element":"span"}]]},{"heading":"2 RELATED WORK","paragraphs":[[{"text":"An introduction to related bilevel optimization methods can be found in Appendix ","element":"span"},{"href":"#id-57","text":"A","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Entropy-regularized Reinforcement Learning ","element":"span"},{"text":"Entropy regularization is commonly considered in the RL community. Specifically, the goal of entropy-regularized RL is to maximize the expected reward augmented with the policy entropy, thereby boosting both task success and behavior stochasticity. 
It facilitates exploration and robustness (","element":"span"},{"href":"#id-58","referenceIndex":104,"text":"Ziebart","element":"a"},{"text":", ","element":"span"},{"href":"#id-58","referenceIndex":104,"text":"2010","element":"a"},{"text":"; ","element":"span"},{"href":"#id-59","referenceIndex":30,"text":"Haarnoja et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-59","referenceIndex":30,"text":"2017","element":"a"},{"text":", ","element":"span"},{"href":"#id-60","referenceIndex":31,"text":"2018a","element":"a"},{"text":"; ","element":"span"},{"href":"#id-61","referenceIndex":48,"text":"Lan","element":"a"},{"text":", ","element":"span"},{"href":"#id-61","referenceIndex":48,"text":"2021","element":"a"},{"text":"), smooths the optimization landscape (","element":"span"},{"href":"#id-62","referenceIndex":2,"text":"Ahmed et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-62","referenceIndex":2,"text":"2019","element":"a"},{"text":"), and improves convergence (","element":"span"},{"href":"#id-63","referenceIndex":70,"text":"Mnih et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-63","referenceIndex":70,"text":"2016","element":"a"},{"text":"; ","element":"span"},{"href":"#id-64","referenceIndex":9,"text":"Cen et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-64","referenceIndex":9,"text":"2022","element":"a"},{"text":"; ","element":"span"},{"href":"#id-65","referenceIndex":100,"text":"Zhan et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-65","referenceIndex":100,"text":"2023","element":"a"},{"text":"; ","element":"span"},{"href":"#id-47","referenceIndex":47,"text":"Lan","element":"a"},{"text":", ","element":"span"},{"href":"#id-47","referenceIndex":47,"text":"2023","element":"a"},{"text":"; ","element":"span"},{"href":"#id-33","referenceIndex":95,"text":"Yang et al.","element":"a"},{"text":", 
","element":"span"},{"href":"#id-33","referenceIndex":95,"text":"2023","element":"a"},{"text":"; ","element":"span"},{"href":"#id-66","referenceIndex":52,"text":"Li ","element":"a"},{"href":"#id-66","referenceIndex":52,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-66","referenceIndex":52,"text":"2024b","element":"a"},{"text":"). Moreover, the policy optimality condition under the entropy-regularized setting is equivalent to the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"softmax temporal value consistency ","element":"span"},{"text":"in (","element":"span"},{"href":"#id-51","referenceIndex":72,"text":"Nachum ","element":"a"},{"href":"#id-51","referenceIndex":72,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-51","referenceIndex":72,"text":"2017","element":"a"},{"text":"). Therefore, the term ","element":"span"},{"style":{"fontStyle":"italic"},"text":"soft ","element":"span"},{"text":"is prefixed to quantities in this scenario, also resonating with the concept in (","element":"span"},{"href":"#id-21","referenceIndex":87,"text":"Sutton and Barto","element":"a"},{"text":", ","element":"span"},{"href":"#id-21","referenceIndex":87,"text":"2018","element":"a"},{"text":"), where “soft” means that the policy ensures positive probabilities across all state-action pairs.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Bilevel Reinforcement Learning ","element":"span"},{"text":"Many reinforcement learning applications have a bilevel structure, e.g., reward shaping (","element":"span"},{"href":"#id-67","referenceIndex":84,"text":"Sorg et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-67","referenceIndex":84,"text":"2010","element":"a"},{"text":"; ","element":"span"},{"href":"#id-68","referenceIndex":102,"text":"Zheng ","element":"a"},{"href":"#id-68","referenceIndex":102,"text":"et al.","element":"a"},{"text":", 
","element":"span"},{"href":"#id-68","referenceIndex":102,"text":"2018","element":"a"},{"text":"; ","element":"span"},{"href":"#id-30","referenceIndex":38,"text":"Hu et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-30","referenceIndex":38,"text":"2020","element":"a"},{"text":"; ","element":"span"},{"href":"#id-69","referenceIndex":20,"text":"Devidze et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-69","referenceIndex":20,"text":"2022","element":"a"},{"text":"; ","element":"span"},{"href":"#id-70","referenceIndex":29,"text":"Gupta ","element":"a"},{"href":"#id-70","referenceIndex":29,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-70","referenceIndex":29,"text":"2023","element":"a"},{"text":"), offline reward correction (","element":"span"},{"href":"#id-71","referenceIndex":54,"text":"Li et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-71","referenceIndex":54,"text":"2023","element":"a"},{"text":"), preference-based RL (","element":"span"},{"href":"#id-28","referenceIndex":16,"text":"Christiano et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-28","referenceIndex":16,"text":"2017","element":"a"},{"text":"; ","element":"span"},{"href":"#id-72","referenceIndex":50,"text":"Lee et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-72","referenceIndex":50,"text":"2021","element":"a"},{"text":"; ","element":"span"},{"href":"#id-73","referenceIndex":74,"text":"Saha et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-73","referenceIndex":74,"text":"2023","element":"a"},{"text":"; ","element":"span"},{"href":"#id-9","referenceIndex":80,"text":"Shen et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-9","referenceIndex":80,"text":"2024","element":"a"},{"text":"), apprenticeship learning (","element":"span"},{"href":"#id-74","referenceIndex":4,"text":"Arora and Doshi","element":"a"},{"text":", 
","element":"span"},{"href":"#id-74","referenceIndex":4,"text":"2021","element":"a"},{"text":"), Stackelberg games (","element":"span"},{"href":"#id-75","referenceIndex":21,"text":"Fiez et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-75","referenceIndex":21,"text":"2020","element":"a"},{"text":"; ","element":"span"},{"href":"#id-76","referenceIndex":83,"text":"Song et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-76","referenceIndex":83,"text":"2023","element":"a"},{"text":"; ","element":"span"},{"href":"#id-9","referenceIndex":80,"text":"Shen et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-9","referenceIndex":80,"text":"2024","element":"a"},{"text":"), etc. In terms of provable bilevel reinforcement learning frameworks, ","element":"span"},{"href":"#id-8","referenceIndex":10,"text":"Chakraborty et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-8","referenceIndex":10,"text":"2024","element":"a"},{"text":") proposed a policy alignment algorithm demonstrating performance improvements, ","element":"span"},{"href":"#id-9","referenceIndex":80,"text":"Shen et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-9","referenceIndex":80,"text":"2024","element":"a"},{"text":") designed a penalty-based method, and ","element":"span"},{"href":"#id-10","referenceIndex":89,"text":"Thoma et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-10","referenceIndex":89,"text":"2024","element":"a"},{"text":") investigated the problem in the stochastic setting. 
However, research on the hyper-gradient in bilevel RL remains limited.","element":"span"}]]},{"heading":"3 PROBLEM FORMULATION","paragraphs":[[{"text":"In this section, we introduce the setting of entropy-regularized Markov decision processes (","element":"span"},{"href":"#id-77","referenceIndex":93,"text":"Williams and ","element":"a"},{"href":"#id-77","referenceIndex":93,"text":"Peng","element":"a"},{"text":", ","element":"span"},{"href":"#id-77","referenceIndex":93,"text":"1991","element":"a"},{"text":"; ","element":"span"},{"href":"#id-58","referenceIndex":104,"text":"Ziebart","element":"a"},{"text":", ","element":"span"},{"href":"#id-58","referenceIndex":104,"text":"2010","element":"a"},{"text":"; ","element":"span"},{"href":"#id-51","referenceIndex":72,"text":"Nachum et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-51","referenceIndex":72,"text":"2017","element":"a"},{"text":"; ","element":"span"},{"href":"#id-59","referenceIndex":30,"text":"Haarnoja et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-59","referenceIndex":30,"text":"2017","element":"a"},{"text":"), and the formulation of bilevel reinforcement learning. Moreover, two specific instances of bilevel RL are introduced. 
The notation is listed in Appendix ","element":"span"},{"href":"#id-78","text":"B","element":"a"},{"text":".","element":"span"}],[{"id":"id-80","style":{"fontWeight":"bold"},"text":"3.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Entropy-regularized MDPs","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Discounted Infinite-horizon MDPs ","element":"span"},{"text":"An MDP is characterized by a tuple ","element":"span"},{"style":{"height":16},"width":394.83,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/2-0.png","element":"img","alt":" Mτ = (S, A, P, r, γ, τ)","inline":true,"padRight":true},{"text":"with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"S ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"text":"serving as the state space and the action space, respectively. ","element":"span"},{"text":"In this paper, we restrict the focus to the tabular setting, where both ","element":"span"},{"style":{"fontStyle":"italic"},"text":"S ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"text":"are finite, i.e., ","element":"span"},{"style":{"height":16},"width":188.25,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/2-1.png","element":"img","alt":" |S| < +∞","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":192.95,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/2-2.png","element":"img","alt":" |A| < +∞","inline":true},{"text":". 
Furthermore, ","element":"span"},{"style":{"height":14.98},"width":273.36,"height":37.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/2-3.png","element":"img","alt":" P ∈ R|S||A|×|S|","inline":true,"padRight":true},{"text":"is the transition matrix with ","element":"span"},{"style":{"height":16},"width":299.67,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/2-4.png","element":"img","alt":"Psas′ = P(s′|s, a)","inline":true,"padRight":true},{"text":"representing the transition probability from state ","element":"span"},{"style":{"height":10},"width":209.8,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/2-5.png","element":"img","alt":" s to state s′ ","inline":true,"padRight":true},{"text":"under action ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a","element":"span"},{"text":". The vector ","element":"span"},{"style":{"height":14.99},"width":182.96,"height":37.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/2-6.png","element":"img","alt":"r ∈ R|S||A|","inline":true,"padRight":true},{"text":"specifies the reward ","element":"span"},{"style":{"height":9.19},"width":50.02,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/2-7.png","element":"img","alt":" rsa","inline":true,"padRight":true},{"text":"received when action ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"is carried out at state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":". 
Note that ","element":"span"},{"style":{"height":16},"width":168.55,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/2-8.png","element":"img","alt":" γ ∈ (0, 1)","inline":true,"padRight":true},{"text":"is the discount factor, adjusting the importance of immediate versus future reward, and the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"temperature ","element":"span"},{"style":{"height":14.4},"width":285.72,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/2-9.png","element":"img","alt":"parameter τ ≥ 0","inline":true,"padRight":true},{"text":"balances regularization and reward.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Soft Value Function and Q-function ","element":"span"},{"text":"A policy ","element":"span"},{"style":{"height":14.98},"width":191.8,"height":37.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/2-10.png","element":"img","alt":"π ∈ R|S||A|","inline":true,"padRight":true},{"text":"provides an action selection rule, that is, for any ","element":"span"},{"style":{"height":16},"width":462.26,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/2-11.png","element":"img","alt":" s ∈ S, a ∈ A, πsa = π(a|s)","inline":true,"padRight":true},{"text":"is the probability of performing action ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"at state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":", and we denote ","element":"span"},{"style":{"height":16.58},"width":161.3,"height":41.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/2-12.png","element":"img","alt":" πs ∈ R|A|","inline":true,"padRight":true},{"text":"to be the corresponding distribution over ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"text":"at state 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":", which means ","element":"span"},{"style":{"height":9.19},"width":37.72,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/2-13.png","element":"img","alt":" πs","inline":true,"padRight":true},{"text":"belongs to the probability simplex ","element":"span"},{"style":{"height":21.2},"width":772.95,"height":52.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/2-14.png","element":"img","alt":"∆(A) := {x ∈ R|A| : xi ≥ 0, �|A|i=1 xi = 1}","inline":true},{"text":". Collect- ","element":"span"},{"text":"ing ","element":"span"},{"style":{"height":9.19},"width":37.71,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/2-15.png","element":"img","alt":" πs","inline":true,"padRight":true},{"text":"over ","element":"span"},{"style":{"height":11.6},"width":96.6,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/2-16.png","element":"img","alt":" s ∈ S","inline":true,"padRight":true},{"text":"defines ","element":"span"},{"style":{"height":18.19},"width":480.21,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/2-17.png","element":"img","alt":" ∆|S|(A) := {π = (πs)s∈S ∈","inline":true},{"style":{"height":18.19},"width":550.74,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/2-18.png","element":"img","alt":"R|S||A| : πs ∈ ∆(A) for s ∈ S}","inline":true,"padRight":true},{"text":"as the feasible set of policies. 
For simplicity, the above two sets are abbreviated as ","element":"span"},{"style":{"height":11.6},"width":33,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/2-19.png","element":"img","alt":" ∆","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14.19},"width":73.92,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/2-20.png","element":"img","alt":" ∆|S|","inline":true},{"text":", respectively. A policy ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/2-21.png","element":"img","alt":" π","inline":true,"padRight":true},{"text":"induces a ","element":"span"},{"style":{"height":14.98},"width":235.76,"height":37.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/2-22.png","element":"img","alt":" P π ∈ R|S|×|S|","inline":true},{"text":", measuring the transition probability from one state to another, i.e., ","element":"span"},{"style":{"height":16.74},"width":404.85,"height":41.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/2-23.png","element":"img","alt":" P πss′ = ∑a πsaPsas′. 
To","inline":true,"padRight":true},{"text":"promote stochasticity and discourage premature convergence to sub-optimal policies (","element":"span"},{"href":"#id-62","referenceIndex":2,"text":"Ahmed et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-62","referenceIndex":2,"text":"2019","element":"a"},{"text":"), it is common to resort to the entropy function ","element":"span"},{"style":{"height":16},"width":76.03,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/2-24.png","element":"img","alt":" h(·),","inline":true}],[{"style":{"width":"85%"},"width":799,"height":92,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/2-25.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":17.54},"width":462.61,"height":43.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/2-26.png","element":"img","alt":" Rm+ := {z ∈ Rm : zi > 0}","inline":true},{"text":". Consequently, the ","element":"span"},{"style":{"height":17.39},"width":511.06,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/2-27.png","element":"img","alt":"soft value function V π ∈ R|S|","inline":true,"padRight":true},{"text":"represents the expected discounted reward augmented with policy entropy, i.e.,","element":"span"}],[{"style":{"width":"83%"},"width":787,"height":110,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/2-28.png","element":"img"}],[{"text":"for ","element":"span"},{"style":{"height":11.6},"width":93.41,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/2-29.png","element":"img","alt":" s ∈ S","inline":true},{"text":", where the expectation is taken over the trajectory ","element":"span"},{"style":{"height":16},"width":658.19,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/2-30.png","element":"img","alt":" (s0, a0 ∼ π(·|s0), s1 ∼ P(·|s0, a0), . . .)","inline":true},{"text":". 
Given the initial state distribution ","element":"span"},{"style":{"height":22.25},"width":284.41,"height":55.62,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/2-31.png","element":"img","alt":" ρ ∈ R|S|+ ∩ ∆ (S)","inline":true},{"text":", the objective of RL is to find the optimal policy ","element":"span"},{"style":{"height":14.98},"width":156.5,"height":37.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/2-32.png","element":"img","alt":" π ∈ ∆|S|","inline":true,"padRight":true},{"text":"that solves the problem:","element":"span"}],[{"id":"id-79","style":{"width":"76%"},"width":722,"height":63,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/2-33.png","element":"img"}],[{"text":"The soft Q-function ","element":"span"},{"style":{"height":17.38},"width":215.18,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/2-34.png","element":"img","alt":" Qπ ∈ R|S||A| ","inline":true,"padRight":true},{"text":"associated with policy ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/2-35.png","element":"img","alt":"π","inline":true},{"text":", couples with ","element":"span"},{"style":{"height":10.8},"width":51.1,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/2-36.png","element":"img","alt":" V π ","inline":true,"padRight":true},{"text":"in the following fashion,","element":"span"}],[{"style":{"width":"99%"},"width":939,"height":104,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/2-37.png","element":"img"}],[{"text":"It is worth noting that the optimization problem (","element":"span"},{"href":"#id-79","text":"2","element":"a"},{"text":") admits a unique 
","element":"span"},{"style":{"height":14.8},"width":391.73,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/3-0.png","element":"img","alt":" optimal soft policy π∗","inline":true,"padRight":true},{"text":"independent of ","element":"span"},{"style":{"height":16},"width":48.4,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/3-1.png","element":"img","alt":" ρ (","inline":true},{"href":"#id-51","referenceIndex":72,"text":"Nachum et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-51","referenceIndex":72,"text":"2017","element":"a"},{"text":"). The corresponding ","element":"span"},{"style":{"fontStyle":"italic"},"text":"optimal soft value function ","element":"span"},{"text":"(resp. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"optimal soft Q-function","element":"span"},{"text":") is denoted by ","element":"span"},{"style":{"height":18.6},"width":488.78,"height":46.49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/3-2.png","element":"img","alt":" V ∗ := V π∗ (resp. Q∗ := Qπ∗","inline":true},{"text":"). If necessary, we add a subscript to clarify the environment in which the policy is evaluated, e.g., ","element":"span"},{"style":{"height":17.18},"width":271.16,"height":42.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/3-3.png","element":"img","alt":" V πMτ and QπMτ .","inline":true}],[{"style":{"fontWeight":"bold"},"text":"3.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Bilevel RL Formulation","element":"span"}],[{"text":"The single-level reinforcement learning task in Section ","element":"span"},{"href":"#id-80","text":"3.1 ","element":"a"},{"text":"trains an agent with fixed rewards. 
In contrast, this work considers the scenario where the reward function is parameterized by a decision variable, a task dubbed “policy alignment” in (","element":"span"},{"href":"#id-8","referenceIndex":10,"text":"Chakraborty et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-8","referenceIndex":10,"text":"2024","element":"a"},{"text":"). Generally, we parameterize the reward ","element":"span"},{"style":{"height":18.19},"width":235.82,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/3-4.png","element":"img","alt":" r(x) ∈ R|S||A|","inline":true,"padRight":true},{"text":"with ","element":"span"},{"style":{"height":11.6},"width":124.3,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/3-5.png","element":"img","alt":" x ∈ Rn","inline":true},{"text":", which results in a parameterized MDP ","element":"span"},{"style":{"height":16},"width":519.54,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/3-6.png","element":"img","alt":"Mτ(x) = (S, A, P, r(x), γ, τ)","inline":true},{"text":". 
","element":"span"},{"text":"In parallel with Section ","element":"span"},{"href":"#id-80","text":"3.1","element":"a"},{"text":", given ","element":"span"},{"style":{"height":16},"width":122.35,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/3-7.png","element":"img","alt":" Mτ(x)","inline":true},{"text":", we define the soft value function ","element":"span"},{"style":{"height":16},"width":108.31,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/3-8.png","element":"img","alt":" V π(x)","inline":true},{"text":", the soft Q-function ","element":"span"},{"style":{"height":16},"width":107.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/3-9.png","element":"img","alt":" Qπ(x)","inline":true},{"text":", the objective ","element":"span"},{"style":{"height":19.94},"width":182.3,"height":49.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/3-10.png","element":"img","alt":"V πMτ (x) (ρ)","inline":true},{"text":", the optimal soft policy ","element":"span"},{"style":{"height":16},"width":97.05,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/3-11.png","element":"img","alt":" π∗(x)","inline":true},{"text":", the opti- ","element":"span"},{"text":"mal soft value-function ","element":"span"},{"style":{"height":16},"width":105.01,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/3-12.png","element":"img","alt":" V ∗(x)","inline":true},{"text":", and the optimal soft Q-function ","element":"span"},{"style":{"height":16},"width":104.1,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/3-13.png","element":"img","alt":" Q∗(x)","inline":true},{"text":", associated with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":".","element":"span"}],[{"text":"Consequently, we formulate the bilevel reinforcement learning 
problem:","element":"span"}],[{"id":"id-81","style":{"width":"83%"},"width":786,"height":135,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/3-14.png","element":"img"}],[{"text":"where the upper-level ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"is defined on ","element":"span"},{"style":{"height":14.19},"width":226,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/3-15.png","element":"img","alt":" Rn × R|S||A|.","inline":true}],[{"id":"id-94","style":{"fontWeight":"bold"},"text":"3.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Applications of Bilevel RL","element":"span"}],[{"text":"We introduce two examples unified in the formulation (","element":"span"},{"href":"#id-81","text":"4","element":"a"},{"text":"). To make a distinction between environments at two levels, we denote the upper-level MDP by ","element":"span"},{"style":{"height":17.63},"width":425.6,"height":44.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/3-16.png","element":"img","alt":"¯M¯τ = {S, A, ¯P, ¯r, ¯γ, ¯τ}","inline":true,"padRight":true},{"text":"with an initial state distribution ","element":"span"},{"style":{"height":12.8},"width":23.66,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/3-17.png","element":"img","alt":" ¯ρ","inline":true},{"text":", and the lower-level MDP by ","element":"span"},{"style":{"height":16},"width":122.31,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/3-18.png","element":"img","alt":" Mτ(x)","inline":true,"padRight":true},{"text":"with ","element":"span"},{"style":{"height":10},"width":21,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/3-19.png","element":"img","alt":" ρ","inline":true},{"text":". 
More applications can be found in Appendix ","element":"span"},{"href":"#id-82","text":"K","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Reward Shaping ","element":"span"},{"text":"In reinforcement learning, the reward function acts as the guiding signal to motivate agents to achieve specified goals. However, in many cases, the rewards are sparse, which impedes policy learning, or are partially incorrect, which leads to inaccurate policies. To this end, from the perspective of bilevel RL, it is advisable to shape an auxiliary reward function ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":") ","element":"span"},{"text":"at the lower level for efficient agent training, while maintaining the original environment at the upper level to align with the initial task evaluation (","element":"span"},{"href":"#id-30","referenceIndex":38,"text":"Hu ","element":"a"},{"href":"#id-30","referenceIndex":38,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-30","referenceIndex":38,"text":"2020","element":"a"},{"text":"), which is established as","element":"span"}],[{"style":{"width":"98%"},"width":926,"height":83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/3-20.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Reinforcement Learning from Human Feedback (RLHF) ","element":"span"},{"text":"The target of RLHF is to learn the intrinsic reward function that incorporates expert knowledge from simple labels containing only human preferences. 
Drawing from the original framework (","element":"span"},{"href":"#id-28","referenceIndex":16,"text":"Christiano et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-28","referenceIndex":16,"text":"2017","element":"a"},{"text":"), ","element":"span"},{"href":"#id-8","referenceIndex":10,"text":"Chakraborty et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-8","referenceIndex":10,"text":"2024","element":"a"},{"text":") and ","element":"span"},{"href":"#id-9","referenceIndex":80,"text":"Shen et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-9","referenceIndex":80,"text":"2024","element":"a"},{"text":") have formulated it in a bilevel form, which optimizes a policy under ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":") ","element":"span"},{"text":"at the lower level, and adjusts ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"to align the preference predicted by the reward model ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":") ","element":"span"},{"text":"with the true labels at the upper level.","element":"span"}],[{"style":{"width":"73%"},"width":687,"height":136,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/3-21.png","element":"img"}],[{"text":"where each trajectory ","element":"span"},{"style":{"height":19.91},"width":331.69,"height":49.77,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/3-22.png","element":"img","alt":" di = {(sih, aih)}H−1h=0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", 
","element":"span"},{"text":"2","element":"span"},{"text":") is ","element":"span"},{"text":"sampled from the distribution ","element":"span"},{"style":{"height":16},"width":193.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/3-23.png","element":"img","alt":" ρ (d; π∗(x))","inline":true,"padRight":true},{"text":"generated by the policy ","element":"span"},{"style":{"height":16},"width":96.73,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/3-24.png","element":"img","alt":" π∗(x)","inline":true,"padRight":true},{"text":"in the upper-level ","element":"span"},{"style":{"height":16.83},"width":154.05,"height":42.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/3-25.png","element":"img","alt":"¯M¯τ, i.e.,","inline":true}],[{"style":{"width":"92%"},"width":872,"height":117,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/3-26.png","element":"img"}],[{"text":"and the preference label ","element":"span"},{"style":{"height":16},"width":169.68,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/3-27.png","element":"img","alt":" y ∈ {0, 1}","inline":true},{"text":", indicating preference for ","element":"span"},{"style":{"height":13.19},"width":170.6,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/3-28.png","element":"img","alt":" d1 over d2","inline":true},{"text":", obeys human feedback distribution ","element":"span"},{"style":{"height":16},"width":369.02,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/3-29.png","element":"img","alt":"y ∼ Dhuman(y|d1, d2)","inline":true},{"text":". 
Moreover, ","element":"span"},{"style":{"height":13.19},"width":30.89,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/3-30.png","element":"img","alt":" lh","inline":true,"padRight":true},{"text":"is the binary cross-entropy loss, ","element":"span"},{"style":{"height":16},"width":697.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/3-31.png","element":"img","alt":" lh (d1, d2, y; x) = −y log P (d1 ≻ d2; x) −","inline":true},{"style":{"height":16},"width":425.47,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/3-32.png","element":"img","alt":"(1 − y) log P (d2 ≻ d1; x)","inline":true},{"text":", with the preference probability ","element":"span"},{"style":{"height":16},"width":240.12,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/3-33.png","element":"img","alt":" P (d1 ≻ d2; x)","inline":true,"padRight":true},{"text":"built by the Bradley–Terry model; see (","element":"span"},{"href":"#id-83","text":"101","element":"a"},{"text":") in the appendix.","element":"span"}]]},{"heading":"4 MODEL-BASED SOFT BILEVEL REINFORCEMENT LEARNING","paragraphs":[[{"text":"Addressing the bilevel RL problem (","element":"span"},{"href":"#id-81","text":"4","element":"a"},{"text":") is challenging from two aspects: 1) analyzing the properties of ","element":"span"},{"style":{"height":15.6},"width":106.62,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/3-34.png","element":"img","alt":" π∗(x),","inline":true,"padRight":true},{"text":"which has not been well studied but plays a crucial role in AID-based methods; 2) characterizing the hyper-gradient ","element":"span"},{"style":{"height":16},"width":111.54,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/3-35.png","element":"img","alt":" ∇ϕ(x)","inline":true,"padRight":true},{"text":"which remains unclear. 
To this end, we take advantage of the specific structure of entropy-regularized RL to investigate ","element":"span"},{"style":{"height":15.6},"width":95.44,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/3-36.png","element":"img","alt":" π∗(x)","inline":true,"padRight":true},{"text":"and identify ","element":"span"},{"style":{"height":15.6},"width":121.1,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/3-37.png","element":"img","alt":" ∇ϕ(x).","inline":true,"padRight":true},{"text":"As a result, we propose a model-based algorithm to solve the bilevel RL problem (","element":"span"},{"href":"#id-81","text":"4","element":"a"},{"text":").","element":"span"}],[{"text":"Recall that the optimal soft quantities adhere to the following softmax temporal value consistency conditions (","element":"span"},{"href":"#id-51","referenceIndex":72,"text":"Nachum et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-51","referenceIndex":72,"text":"2017","element":"a"},{"text":"): ","element":"span"},{"style":{"height":16},"width":277.94,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/3-38.png","element":"img","alt":" ∀(s, a) ∈ S × A,","inline":true}],[{"id":"id-84","style":{"width":"98%"},"width":927,"height":286,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/3-39.png","element":"img"}],[{"text":"By differentiating (","element":"span"},{"href":"#id-84","text":"5","element":"a"},{"text":") and incorporating (","element":"span"},{"href":"#id-79","text":"3","element":"a"},{"text":"), we assemble ","element":"span"},{"style":{"height":16},"width":183.94,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/3-40.png","element":"img","alt":" {∇π∗sa(x)}","inline":true,"padRight":true},{"text":"in matrix form and 
obtain","element":"span"}],[{"id":"id-85","style":{"width":"95%"},"width":900,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/3-41.png","element":"img"}],[{"text":"with ","element":"span"},{"style":{"height":17.79},"width":849.04,"height":44.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-0.png","element":"img","alt":" U ∈ R|S||A|×|S| defined by Usas′ := 1 − γPsas′ for","inline":true},{"style":{"height":6.8},"width":105.05,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-1.png","element":"img","alt":"s = s′","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14.4},"width":289.09,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-2.png","element":"img","alt":" Usas′ := −γPsas′","inline":true},{"text":", otherwise. Hence, differentiating ","element":"span"},{"style":{"height":16},"width":96.8,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-3.png","element":"img","alt":" π∗(x)","inline":true,"padRight":true},{"text":"boils down to considering the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"implicit ","element":"span"},{"style":{"height":16},"width":398.54,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-4.png","element":"img","alt":"differentiation ∇V ∗(x)","inline":true},{"text":". 
In view of (","element":"span"},{"href":"#id-84","text":"6","element":"a"},{"text":"), we notice that ","element":"span"},{"style":{"height":16},"width":104.7,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-5.png","element":"img","alt":"V ∗(x)","inline":true,"padRight":true},{"text":"is a fixed point related to the mapping ","element":"span"},{"style":{"height":14},"width":37.07,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-6.png","element":"img","alt":" φ,","inline":true}],[{"style":{"width":"99%"},"width":937,"height":164,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-7.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16},"width":94.35,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-8.png","element":"img","alt":" log(·)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":103.76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-9.png","element":"img","alt":" exp(·)","inline":true,"padRight":true},{"text":"are element-wise operations, and ","element":"span"},{"style":{"height":14.98},"width":148.3,"height":37.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-10.png","element":"img","alt":" 1 ∈ R|S|","inline":true,"padRight":true},{"text":"is an all-ones vector. 
The structure of ","element":"span"},{"style":{"height":16},"width":97.6,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-11.png","element":"img","alt":"φ(·, ·)","inline":true},{"text":", in consequence, can facilitate the characterization of ","element":"span"},{"style":{"height":15.6},"width":136.59,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-12.png","element":"img","alt":" ∇V ∗(x)","inline":true},{"text":", as outlined in the following proposition.","element":"span"}],[{"id":"id-95","style":{"fontWeight":"bold"},"text":"Proposition 4.1. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"For any ","element":"span"},{"style":{"height":16},"width":265.04,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-13.png","element":"img","alt":" x ∈ Rn, φ(x, ·)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is a contraction mapping, i.e., ","element":"span"},{"style":{"height":16},"width":419.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-14.png","element":"img","alt":" ∥∇vφ(x, v)∥∞ = γ < 1","inline":true},{"style":{"fontStyle":"italic"},"text":", and the matrix ","element":"span"},{"style":{"height":16},"width":240.29,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-15.png","element":"img","alt":" I − ∇vφ(x, v)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is invertible. 
Consequently, ","element":"span"},{"style":{"height":16},"width":104.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-16.png","element":"img","alt":"V ∗(x)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is the unique fixed point of ","element":"span"},{"style":{"height":16},"width":109.3,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-17.png","element":"img","alt":" φ(x, ·)","inline":true},{"style":{"fontStyle":"italic"},"text":", with a well-defined derivative ","element":"span"},{"style":{"height":16},"width":307.08,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-18.png","element":"img","alt":" ∇V ∗(x), given by","inline":true}],[{"style":{"width":"91%"},"width":856,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-19.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Additionally, ","element":"span"},{"style":{"height":16},"width":261.04,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-20.png","element":"img","alt":" ∇vφ (x, V ∗(x))","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"coincides with the ","element":"span"},{"style":{"height":10.4},"width":22,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-21.png","element":"img","alt":" γ","inline":true},{"style":{"fontStyle":"italic"},"text":"-scaled transition matrix induced by the optimal soft policy ","element":"span"},{"style":{"height":18.59},"width":665.05,"height":46.48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-22.png","element":"img","alt":" π∗(x), i.e., ∇vφ (x, V ∗(x)) = γP π∗(x).","inline":true}],[{"text":"Combining the above proposition with (","element":"span"},{"href":"#id-85","text":"8","element":"a"},{"text":"), we identify the hyper-gradient 
","element":"span"},{"style":{"height":16},"width":111.51,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-23.png","element":"img","alt":" ∇ϕ(x)","inline":true,"padRight":true},{"text":"to the problem (","element":"span"},{"href":"#id-81","text":"4","element":"a"},{"text":"); its expression is delayed in (","element":"span"},{"href":"#id-86","text":"24","element":"a"},{"text":") of Appendix ","element":"span"},{"href":"#id-87","text":"G","element":"a"},{"text":". In this manner, we unveil the hyper-gradient in the context of bilevel RL, by harnessing the softmax temporal value consistency and derivatives of a fixed-point equation.","element":"span"}],[{"text":"In light of the characterization of ","element":"span"},{"style":{"height":14},"width":59.21,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-24.png","element":"img","alt":" ∇ϕ","inline":true},{"text":", a model-based soft bilevel reinforcement learning algorithm, called M-SoBiRL, is proposed in Algorithm ","element":"span"},{"href":"#id-88","text":"2","element":"a"},{"text":". In summary, the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":"-th outer iteration performs an inexact hyper-gradient descent step on ","element":"span"},{"style":{"height":9.19},"width":39.78,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-25.png","element":"img","alt":" xk","inline":true},{"text":", with the aid of auxiliary iterates ","element":"span"},{"style":{"height":16},"width":269.05,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-26.png","element":"img","alt":" (πk, Qk, Vk, wk)","inline":true},{"text":". 
Specifically, given ","element":"span"},{"style":{"height":10.79},"width":80.86,"height":26.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-27.png","element":"img","alt":" xk+1","inline":true},{"text":", the inner iterations aim to (approximately) solve the lower-level entropy-regularized MDP problem. To this end, the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"soft policy iteration ","element":"span"},{"text":"studied by ","element":"span"},{"href":"#id-60","referenceIndex":31,"text":"Haarnoja et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-60","referenceIndex":31,"text":"2018a","element":"a"},{"text":"); ","element":"span"},{"href":"#id-64","referenceIndex":9,"text":"Cen et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-64","referenceIndex":9,"text":"2022","element":"a"},{"text":") inspires us to define the following soft Bellman optimality operator ","element":"span"},{"style":{"height":17.28},"width":151.26,"height":43.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-28.png","element":"img","alt":" TMτ (x) :","inline":true},{"style":{"height":14.59},"width":291.16,"height":36.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-29.png","element":"img","alt":"R|S||A| → R|S||A| ","inline":true,"padRight":true},{"text":"associated with ","element":"span"},{"style":{"height":14},"width":132.96,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-30.png","element":"img","alt":" x ∈ Rn,","inline":true}],[{"style":{"width":"98%"},"width":922,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-31.png","element":"img"}],[{"text":"where the expectation is taken with respect to ","element":"span"},{"style":{"height":6.8},"width":76.12,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-32.png","element":"img","alt":" s′ 
∼","inline":true},{"style":{"height":16},"width":164.35,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-33.png","element":"img","alt":"P(· | s, a)","inline":true},{"text":". Applying this operator iteratively converges to the optimal soft Q-value function at a linear rate (see Appendix ","element":"span"},{"href":"#id-89","text":"G.4","element":"a"},{"text":"). Therefore, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"text":"inner iterations are invoked in lines 7-9 of Algorithm ","element":"span"},{"href":"#id-88","text":"2 ","element":"a"},{"text":"to estimate ","element":"span"},{"style":{"height":16},"width":170.7,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-34.png","element":"img","alt":" Q∗ (xk+1)","inline":true},{"text":", and the warm-start strategy is adopted in line 10 of Algorithm ","element":"span"},{"href":"#id-88","text":"2 ","element":"a"},{"text":"to initialize the next outer iteration with historical information, which amortizes the computation of ","element":"span"},{"style":{"height":16},"width":104.41,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-35.png","element":"img","alt":" Q∗(x)","inline":true,"padRight":true},{"text":"across the outer iterations. 
Additionally, in order to recover ","element":"span"},{"style":{"height":10.79},"width":80.8,"height":26.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-36.png","element":"img","alt":" πk+1","inline":true,"padRight":true},{"text":"from ","element":"span"},{"style":{"height":14.79},"width":89.59,"height":36.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-37.png","element":"img","alt":" Qk+1","inline":true},{"text":", we take into account the consistency condition (","element":"span"},{"href":"#id-84","text":"5","element":"a"},{"text":") which involves the softmax mapping as follows,","element":"span"}],[{"style":{"width":"88%"},"width":827,"height":167,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-38.png","element":"img"}],[{"text":"It follows from (","element":"span"},{"href":"#id-84","text":"5","element":"a"},{"text":") that ","element":"span"},{"style":{"height":16},"width":525.28,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-39.png","element":"img","alt":" πk+1 = softmax (Qk+1/τ). 
As","inline":true,"padRight":true},{"text":"for the approximation of the hyper-gradient (","element":"span"},{"href":"#id-86","text":"24","element":"a"},{"text":"), which requires an inverse-matrix-vector product, we denote","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"A","element":"span"},{"style":{"height":16},"width":894.4,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-40.png","element":"img","alt":"k = (I − γP πk)⊤ , bk = U ⊤ diag (πk) ∇πf(xk, πk),","inline":true}],[{"text":"and employ ","element":"span"},{"style":{"height":9.19},"width":45.52,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-41.png","element":"img","alt":" wk","inline":true,"padRight":true},{"text":"to track the quantity ","element":"span"},{"style":{"height":19.23},"width":106.78,"height":48.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-42.png","element":"img","alt":" A−1k bk","inline":true,"padRight":true},{"text":"following a similar ","element":"span"},{"text":"principle of amortization. Concretely, ","element":"span"},{"style":{"height":9.19},"width":45.52,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-43.png","element":"img","alt":" wk","inline":true,"padRight":true},{"text":"is regarded as the (approximate) solution of the least squares problem, ","element":"span"},{"style":{"height":21.11},"width":431.84,"height":52.78,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-44.png","element":"img","alt":"minw∈R|S| 12 ∥Akw − bk∥2","inline":true},{"text":", and is updated by one gradient ","element":"span"},{"text":"descent step (line 4 of Algorithm ","element":"span"},{"href":"#id-88","text":"2","element":"a"},{"text":") in each outer iteration. 
Furthermore, elements of ","element":"span"},{"style":{"height":14},"width":48.5,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-45.png","element":"img","alt":" Qk","inline":true,"padRight":true},{"text":"are assembled to estimate the soft value function:","element":"span"}],[{"style":{"width":"97%"},"width":912,"height":73,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-46.png","element":"img"}],[{"text":"which complies with the form of (","element":"span"},{"href":"#id-84","text":"7","element":"a"},{"text":"). Finally, collecting the iterates ","element":"span"},{"style":{"height":16},"width":260.33,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-47.png","element":"img","alt":" (xk, πk, Vk, wk)","inline":true,"padRight":true},{"text":"yields the model-based hyper-gradient estimator ","element":"span"},{"style":{"height":14.4},"width":143.23,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-48.png","element":"img","alt":"�∇ϕ, i.e.,","inline":true}],[{"style":{"width":"97%"},"width":914,"height":159,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-49.png","element":"img"}]]},{"heading":"5 MODEL-FREE SOFT BILEVEL REINFORCEMENT LEARNING","paragraphs":[[{"text":"In many real-world applications, agents lack access to the accurate model ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P ","element":"span"},{"text":"of the environment, underscoring the need for efficient model-free algorithms (","element":"span"},{"href":"#id-90","referenceIndex":75,"text":"Schulman ","element":"a"},{"href":"#id-90","referenceIndex":75,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-90","referenceIndex":75,"text":"2015","element":"a"},{"text":", ","element":"span"},{"href":"#id-91","referenceIndex":76,"text":"2016","element":"a"},{"text":"; 
","element":"span"},{"href":"#id-92","referenceIndex":55,"text":"Lillicrap et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-92","referenceIndex":55,"text":"2016","element":"a"},{"text":"; ","element":"span"},{"href":"#id-63","referenceIndex":70,"text":"Mnih et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-63","referenceIndex":70,"text":"2016","element":"a"},{"text":"; ","element":"span"},{"href":"#id-93","referenceIndex":77,"text":"Schulman et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-93","referenceIndex":77,"text":"2017","element":"a"},{"text":"; ","element":"span"},{"href":"#id-60","referenceIndex":31,"text":"Haarnoja et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-60","referenceIndex":31,"text":"2018a","element":"a"},{"text":"). When ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P ","element":"span"},{"text":"is unknown, two obstacles appear from the hyper-gradient computation (","element":"span"},{"href":"#id-86","text":"24","element":"a"},{"text":"). Specifically, it explicitly relies on the black-box transition matrix ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P ","element":"span"},{"text":"to compute ","element":"span"},{"style":{"height":16},"width":223.68,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-50.png","element":"img","alt":"∇V ∗(x), i.e.,","inline":true}],[{"id":"id-96","style":{"width":"93%"},"width":878,"height":81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-51.png","element":"img"}],[{"text":"and the computation involves complicated large-scale matrix multiplications. 
To circumvent both obstacles, we demonstrate how to express the hyper-gradient as an expectation, which allows us to estimate ","element":"span"},{"style":{"height":16},"width":111.37,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-52.png","element":"img","alt":" ∇ϕ(x)","inline":true,"padRight":true},{"text":"via sampling fully first-order information. Following these developments, we offer a model-free soft bilevel reinforcement learning algorithm.","element":"span"}],[{"text":"In the model-free scenario, we concentrate on the upper-level function","element":"span"}],[{"id":"id-97","style":{"width":"90%"},"width":852,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/4-53.png","element":"img"}],[{"text":"where each trajectory ","element":"span"},{"style":{"height":19.91},"width":334.41,"height":49.77,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/5-0.png","element":"img","alt":" di = {�sih, aih�}H−1h=0","inline":true,"padRight":true},{"text":"is independently ","element":"span"},{"text":"sampled from the trajectory distribution ","element":"span"},{"style":{"height":16},"width":121.34,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/5-1.png","element":"img","alt":" ρ (d; π)","inline":true,"padRight":true},{"text":"generated by the policy ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/5-2.png","element":"img","alt":" π","inline":true,"padRight":true},{"text":"in the upper-level ","element":"span"},{"style":{"height":16.83},"width":156.14,"height":42.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/5-3.png","element":"img","alt":"¯M¯τ, and","inline":true},{"style":{"height":15.6},"width":94.2,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/5-4.png","element":"img","alt":"l(·; 
x)","inline":true,"padRight":true},{"text":"associated with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"is a function of trajectories. Notice that it incorporates the applications in Section ","element":"span"},{"href":"#id-94","text":"3.3","element":"a"},{"text":", with ","element":"span"},{"style":{"height":24.97},"width":639.34,"height":62.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/5-5.png","element":"img","alt":" I = 1, H = 1, l = −Qπ∗(x)¯M¯τ �s10, a10�","inline":true},{"text":"for reward shaping, and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"I ","element":"span"},{"text":"= 2","element":"span"},{"text":", finite ","element":"span"},{"style":{"height":16.79},"width":456.41,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/5-6.png","element":"img","alt":" H, l = Ey [lh (d1, d2, y; x)]","inline":true,"padRight":true},{"text":"for RLHF.","element":"span"}],[{"text":"We extend Proposition ","element":"span"},{"href":"#id-95","text":"4.1 ","element":"a"},{"text":"to absorb the transition matrix ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P ","element":"span"},{"text":"in (","element":"span"},{"href":"#id-96","text":"10","element":"a"},{"text":") into an expectation, with the result that the implicit differentiations ","element":"span"},{"style":{"height":16},"width":137.95,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/5-7.png","element":"img","alt":" ∇V ∗(x)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":137.35,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/5-8.png","element":"img","alt":" ∇Q∗(x)","inline":true,"padRight":true},{"text":"can be estimated by sampling the reward gradient under the policy 
","element":"span"},{"style":{"height":16},"width":107.23,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/5-9.png","element":"img","alt":" π∗(x).","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Proposition 5.1. ","element":"span"},{"style":{"height":14.8},"width":607.49,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/5-10.png","element":"img","alt":" For any x ∈ Rn, s ∈ S, and a ∈ A,","inline":true}],[{"style":{"width":"98%"},"width":924,"height":242,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/5-11.png","element":"img"}],[{"text":"Nevertheless, it is still intractable to construct the ","element":"span"},{"style":{"height":16},"width":121.96,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/5-12.png","element":"img","alt":"|S| × n","inline":true,"padRight":true},{"text":"matrix ","element":"span"},{"style":{"height":16},"width":138.22,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/5-13.png","element":"img","alt":" ∇V ∗(x)","inline":true,"padRight":true},{"text":"based on the above element-wise calculation, not to mention the large-scale matrix multiplications in (","element":"span"},{"href":"#id-86","text":"24","element":"a"},{"text":"). To bypass these matrix computations, we resort to the “log probability trick” (","element":"span"},{"href":"#id-21","referenceIndex":87,"text":"Sutton and Barto","element":"a"},{"text":", ","element":"span"},{"href":"#id-21","referenceIndex":87,"text":"2018","element":"a"},{"text":") and the consistency condition (","element":"span"},{"href":"#id-84","text":"5","element":"a"},{"text":"). 
The subsequent proposition confirms that we can evaluate the hyper-gradient by interacting with the environment and collecting fully first-order information.","element":"span"}],[{"id":"id-117","style":{"fontWeight":"bold"},"text":"Proposition ","element":"span"},{"style":{"fontWeight":"bold"},"text":"5.2 ","element":"span"},{"text":"(Hyper-gradient)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The ","element":"span"},{"style":{"fontStyle":"italic"},"text":"hyper-gradient ","element":"span"},{"style":{"height":16},"width":111.54,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/5-14.png","element":"img","alt":" ∇ϕ(x)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"of the bilevel reinforcement learning problem ","element":"span"},{"text":"(","element":"span"},{"href":"#id-81","text":"4","element":"a"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"with upper-level function ","element":"span"},{"text":"(","element":"span"},{"href":"#id-97","text":"11","element":"a"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"can be computed by","element":"span"}],[{"id":"id-98","style":{"width":"94%"},"width":890,"height":243,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/5-15.png","element":"img"}],[{"text":"Essentially, the first term (","element":"span"},{"text":"12","element":"span"},{"text":") in the hyper-gradient contributes to decreasing the upper-level function, and the second term (","element":"span"},{"href":"#id-98","text":"13","element":"a"},{"text":") aggregates the gradient information transmitted from the lower level. 
Assembling these two directions constructs the hyper-gradient, and updating the upper-level variable ","element":"span"},{"style":{"height":15.6},"width":355.28,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/5-16.png","element":"img","alt":" x along −∇ϕ(x) will","inline":true,"padRight":true},{"text":"lead to a first-order stationary point (see Theorem ","element":"span"},{"href":"#id-99","text":"7.6","element":"a"},{"text":").","element":"span"}],[{"text":"In line with the above propositions, to facilitate the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":"-th (inexact) hyper-gradient descent step, the construction of a hyper-gradient estimator ","element":"span"},{"style":{"height":16},"width":191.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/5-17.png","element":"img","alt":"�∇ϕ(xk, πk)","inline":true,"padRight":true},{"text":"is divided into two steps: 1) evaluating the implicit ","element":"span"}],[{"text":"differentiations based on ","element":"span"},{"style":{"height":10},"width":53.32,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/5-18.png","element":"img","alt":" πk,","inline":true}],[{"style":{"width":"99%"},"width":939,"height":243,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/5-19.png","element":"img"}],[{"text":"2) absorbing these components into the upper-level sampling process induced by ","element":"span"},{"style":{"height":10},"width":53.32,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/5-20.png","element":"img","alt":" πk,","inline":true}],[{"id":"id-218","style":{"width":"97%"},"width":912,"height":251,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/5-21.png","element":"img"}],[{"text":"Consequently, we propose a model-free soft bilevel reinforcement learning algorithm, called SoBiRL, which is outlined in Algorithm 
","element":"span"},{"href":"#id-100","text":"1","element":"a"},{"text":". In the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":"-th outer iteration, we search for an approximate optimal soft policy ","element":"span"},{"style":{"height":9.19},"width":39.72,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/5-22.png","element":"img","alt":" πk","inline":true,"padRight":true},{"text":"satisfying ","element":"span"},{"style":{"height":20.4},"width":337.73,"height":50.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/5-23.png","element":"img","alt":" ∥πk − π∗(xk)∥22 ≤ ϵ","inline":true},{"text":", which can be achieved ","element":"span"},{"text":"by executing ","element":"span"},{"style":{"height":19.2},"width":192.82,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/5-24.png","element":"img","alt":" O�log ϵ−1�","inline":true},{"text":"iterations of the policy mirror descent algorithm (","element":"span"},{"href":"#id-47","referenceIndex":47,"text":"Lan","element":"a"},{"text":", ","element":"span"},{"href":"#id-47","referenceIndex":47,"text":"2023","element":"a"},{"text":"; ","element":"span"},{"href":"#id-65","referenceIndex":100,"text":"Zhan et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-65","referenceIndex":100,"text":"2023","element":"a"},{"text":"). 
Then, we utilize ","element":"span"},{"style":{"height":16},"width":190.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/5-25.png","element":"img","alt":"�∇ϕ(xk, πk)","inline":true},{"text":", whose estimation only involves first-order oracles, for an inexact hyper-gradient descent step.","element":"span"}],[{"id":"id-100","style":{"width":"100%"},"width":941,"height":609,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/5-26.png","element":"img"}]]},{"heading":"6 THE STOCHASTIC EXTENSION","paragraphs":[[{"text":"The preceding sections establish bilevel methods through the lens of deterministic optimization. However, in practical implementations of reinforcement learning, quantities are typically approximated via sampling data, e.g., ","element":"span"},{"style":{"height":15.6},"width":189.56,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/5-27.png","element":"img","alt":"�∇ϕ(xk, πk)","inline":true,"padRight":true},{"text":"defined in an expected form. As a result, the sampling process introduces stochasticity to the algorithm. Therefore, to enrich the developed framework SoBiRL both practically and theoretically, it is essential to extend it to stochastic settings. In detail, we approach this in two steps. (1) Designing a scheme to estimate the hyper-gradient ","element":"span"},{"style":{"height":14},"width":59.21,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/5-28.png","element":"img","alt":" ∇ϕ","inline":true,"padRight":true},{"text":"by sampling trajectories and then characterizing the resulting ","element":"span"},{"text":"stochasticity, i.e., the bias and variance. 
(2) Replacing the hyper-gradient in SoBiRL with its stochastic counterpart and incorporating a momentum technique to accommodate the stochasticity, leading to our stochastic variant, Stoc-SoBiRL.","element":"span"}],[{"text":"In the stochastic scenario, we still focus on the problem formulation (","element":"span"},{"href":"#id-81","text":"4","element":"a"},{"text":") with the upper-level function (","element":"span"},{"href":"#id-97","text":"11","element":"a"},{"text":") as discussed in Section ","element":"span"},{"text":"5","element":"span"},{"text":". Denote the trajectory tuple ","element":"span"},{"style":{"height":16},"width":339.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/6-0.png","element":"img","alt":"(d1, d2, . . . , dI) as d","inline":true,"padRight":true},{"text":"for simplicity. In the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":"-th outer iteration, to estimate ","element":"span"},{"style":{"height":14},"width":59.21,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/6-1.png","element":"img","alt":" ∇ϕ","inline":true},{"text":", an expectation with respect to the random variable ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"d","element":"span"},{"text":", we can generate ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"independent tuples ","element":"span"},{"style":{"height":16.39},"width":426.67,"height":40.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/6-2.png","element":"img","alt":" dm := (dm1 , dm2 , . . . , dmI )","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"m ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"2","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . . 
, M","element":"span"},{"text":", ","element":"span"},{"text":"evaluate corresponding quantities along each ","element":"span"},{"style":{"height":10.8},"width":52.3,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/6-3.png","element":"img","alt":" dm","inline":true,"padRight":true},{"text":"and average them. To this end, it is necessary to tackle the terms ","element":"span"},{"style":{"height":14.18},"width":80.72,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/6-4.png","element":"img","alt":" ∇Q∗","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":11.38},"width":81.32,"height":28.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/6-5.png","element":"img","alt":" ∇V ∗","inline":true,"padRight":true},{"text":"in (","element":"span"},{"href":"#id-98","text":"13","element":"a"},{"text":"). Assuming access to a generative model (","element":"span"},{"href":"#id-101","referenceIndex":43,"text":"Kearns and Singh","element":"a"},{"text":", ","element":"span"},{"href":"#id-101","referenceIndex":43,"text":"1998","element":"a"},{"text":"; ","element":"span"},{"href":"#id-102","referenceIndex":51,"text":"Li ","element":"a"},{"href":"#id-102","referenceIndex":51,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-102","referenceIndex":51,"text":"2024a","element":"a"},{"text":"), from any initial state-action pair ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a","element":"span"},{"text":")","element":"span"},{"text":", we sample ","element":"span"},{"style":{"fontStyle":"italic"},"text":"J ","element":"span"},{"text":"independent trajectories of length ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"by implementing 
","element":"span"},{"style":{"height":9.19},"width":39.72,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/6-6.png","element":"img","alt":" πk","inline":true},{"text":", i.e., for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"2","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . . , J","element":"span"},{"text":",","element":"span"}],[{"style":{"width":"94%"},"width":887,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/6-7.png","element":"img"}],[{"text":"Collecting all these random variables yields ","element":"span"},{"style":{"height":14},"width":105.94,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/6-8.png","element":"img","alt":" ξk :=","inline":true},{"style":{"height":19.67},"width":679.23,"height":49.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/6-9.png","element":"img","alt":"{ξjk (s, a) , j = 1, . . . 
, J, s ∈ S, a ∈ A}","inline":true},{"text":", and the ","element":"span"},{"text":"estimator for ","element":"span"},{"style":{"height":14.19},"width":80.71,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/6-10.png","element":"img","alt":" ∇Q∗ ","inline":true,"padRight":true},{"text":"can be constructed as follows.","element":"span"}],[{"style":{"width":"76%"},"width":721,"height":123,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/6-11.png","element":"img"}],[{"text":"The random variable ","element":"span"},{"style":{"height":14},"width":37.25,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/6-12.png","element":"img","alt":" ζk","inline":true,"padRight":true},{"text":"and the associated estimator ","element":"span"},{"style":{"height":14.18},"width":96.63,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/6-13.png","element":"img","alt":"¯∇V ζk","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"height":11.38},"width":81.3,"height":28.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/6-14.png","element":"img","alt":" ∇V ∗","inline":true,"padRight":true},{"text":"are constructed similarly. Consequently, denoting ","element":"span"},{"style":{"height":17.9},"width":412.45,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/6-15.png","element":"img","alt":" Dk := (d1k, d2k, . . . 
, dMk )","inline":true},{"text":", we obtain ","element":"span"},{"text":"the stochastic hyper-gradient:","element":"span"}],[{"style":{"width":"99%"},"width":933,"height":257,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/6-16.png","element":"img"}],[{"text":"which is abbreviated as ","element":"span"},{"style":{"height":16.83},"width":73.96,"height":42.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/6-17.png","element":"img","alt":"¯∇ϕk","inline":true,"padRight":true},{"text":"in the following discussion. The sampling scheme is outlined in Algorithm ","element":"span"},{"href":"#id-103","text":"3","element":"a"},{"text":".","element":"span"}],[{"text":"Momentum techniques are known to be beneficial for reducing variance and accelerating algorithms (","element":"span"},{"href":"#id-104","referenceIndex":18,"text":"Cutkosky ","element":"a"},{"href":"#id-104","referenceIndex":18,"text":"and Orabona","element":"a"},{"text":", ","element":"span"},{"href":"#id-104","referenceIndex":18,"text":"2019","element":"a"},{"text":"). ","element":"span"},{"text":"To this end, we maintain a momentum-instructed ","element":"span"},{"style":{"height":13.59},"width":191.94,"height":33.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/6-18.png","element":"img","alt":" hk in the k","inline":true},{"text":"-th outer iteration,","element":"span"}],[{"style":{"width":"99%"},"width":932,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/6-19.png","element":"img"}],[{"text":"which tracks ","element":"span"},{"style":{"height":16.83},"width":73.96,"height":42.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/6-20.png","element":"img","alt":"¯∇ϕk","inline":true,"padRight":true},{"text":"via current and historical hyper-gradient estimates. 
Using ","element":"span"},{"style":{"height":13.19},"width":39.96,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/6-21.png","element":"img","alt":" hk","inline":true,"padRight":true},{"text":"to update the upper-level ","element":"span"},{"style":{"height":9.19},"width":39.78,"height":22.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/6-22.png","element":"img","alt":"xk","inline":true},{"text":", we obtain the stochastic Algorithm ","element":"span"},{"href":"#id-105","text":"4","element":"a"},{"text":", Stoc-SoBiRL.","element":"span"}]]},{"heading":"7 THEORETICAL ANALYSIS","paragraphs":[[{"text":"In this section, we prove global convergence and give the iteration and sample complexity for the proposed ","element":"span"},{"text":"algorithms. ","element":"span"},{"text":"The analysis requires only regularity conditions on the first-order information, specifically boundedness and Lipschitz continuity. This distinguishes our analysis from general AID-based methods, which require additional smoothness assumptions on second-order derivatives (","element":"span"},{"href":"#id-12","referenceIndex":25,"text":"Ghadimi and Wang","element":"a"},{"text":", ","element":"span"},{"href":"#id-12","referenceIndex":25,"text":"2018","element":"a"},{"text":"; ","element":"span"},{"href":"#id-106","referenceIndex":14,"text":"Chen et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-106","referenceIndex":14,"text":"2021","element":"a"},{"text":"; ","element":"span"},{"href":"#id-42","referenceIndex":44,"text":"Khanduri et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-42","referenceIndex":44,"text":"2021","element":"a"},{"text":"; ","element":"span"},{"href":"#id-37","referenceIndex":19,"text":"Dagréou et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-37","referenceIndex":19,"text":"2022","element":"a"},{"text":"; ","element":"span"},{"href":"#id-15","referenceIndex":41,"text":"Ji 
et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-15","referenceIndex":41,"text":"2022","element":"a"},{"text":"). The properties of ","element":"span"},{"style":{"height":16},"width":235.26,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/6-23.png","element":"img","alt":" φ, π∗(x), and","inline":true},{"style":{"height":16},"width":105,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/6-24.png","element":"img","alt":"V ∗(x)","inline":true,"padRight":true},{"text":"are outlined in Appendix ","element":"span"},{"href":"#id-107","text":"F","element":"a"},{"text":". Moreover, the analysis for M-SoBiRL differs from that for SoBiRL, so we present their properties and detailed proofs separately, in Appendix ","element":"span"},{"href":"#id-87","text":"G ","element":"a"},{"text":"and Appendix ","element":"span"},{"href":"#id-108","text":"H","element":"a"},{"text":", respectively. The analysis for Stoc-SoBiRL is deferred to Appendix ","element":"span"},{"href":"#id-109","text":"I","element":"a"},{"text":".","element":"span"}],[{"id":"id-110","style":{"fontWeight":"bold"},"text":"Assumption 7.1. ","element":"span"},{"text":"In the model-based scenario (","element":"span"},{"href":"#id-81","text":"4","element":"a"},{"text":"), ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"is continuously differentiable. 
The gradient ","element":"span"},{"style":{"height":16},"width":153.45,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/6-25.png","element":"img","alt":" ∇f(x, π)","inline":true,"padRight":true},{"text":"is ","element":"span"},{"style":{"height":15.59},"width":44.12,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/6-26.png","element":"img","alt":" Lf","inline":true},{"text":"-Lipschitz continuous, i.e., for any ","element":"span"},{"style":{"height":12},"width":156.42,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/6-27.png","element":"img","alt":" x1, x2 ∈","inline":true},{"style":{"height":18.97},"width":936.01,"height":47.42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/6-28.png","element":"img","alt":"Rn, π1, π2 ∈ ∆|S|, ∥∇f(x1, π1) − ∇f(x2, π2)∥2 ≤","inline":true},{"style":{"height":16.79},"width":525.87,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/6-29.png","element":"img","alt":"Lf (∥x1 − x2∥2 + ∥π1 − π2∥∞)","inline":true},{"text":". ","element":"span"},{"text":"Additionally, ","element":"span"},{"text":"the boundedness condition holds, ","element":"span"},{"style":{"height":16.79},"width":423.71,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/6-30.png","element":"img","alt":" ∥∇πf(x, π∗(x))∥2 ≤Cfπ.","inline":true,"padRight":true},{"id":"id-111","style":{"fontWeight":"bold"},"text":"Assumption 7.2. ","element":"span"},{"text":"In the model-free scenario with upper-level function (","element":"span"},{"href":"#id-97","text":"11","element":"a"},{"text":"), denote ","element":"span"},{"style":{"height":16},"width":335.29,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/6-31.png","element":"img","alt":" d = (d1, d2, . . . , dI)","inline":true},{"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"l","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"d","element":"span"},{"text":"; ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":") ","element":"span"},{"text":"parameterized by ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"is a function of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"I ","element":"span"},{"text":"trajectories, each consisting of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H ","element":"span"},{"text":"steps. It is bounded, i.e., ","element":"span"},{"style":{"height":16},"width":229.77,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/6-32.png","element":"img","alt":"|l (d; x)| ≤ Cl","inline":true},{"text":", and is continuously differentiable with respect to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"l","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"d","element":"span"},{"text":"; ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":") ","element":"span"},{"text":"is ","element":"span"},{"style":{"height":13.19},"width":37.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/6-33.png","element":"img","alt":" Ll","inline":true},{"text":"-Lipschitz continuous, i.e., for any ","element":"span"},{"style":{"height":14},"width":215.14,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/6-34.png","element":"img","alt":" x1, x2 ∈ Rn","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16.78},"width":466.34,"height":41.95,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/6-35.png","element":"img","alt":" d, ∥l(d; x1) − l(d; x2)∥2 ≤","inline":true},{"style":{"height":16},"width":254.76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/6-36.png","element":"img","alt":"Ll∥x1 − x2∥2.","inline":true,"padRight":true},{"text":"Moreover, ","element":"span"},{"style":{"height":16},"width":142.69,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/6-37.png","element":"img","alt":" ∇l(d; x)","inline":true,"padRight":true},{"text":"is ","element":"span"},{"style":{"height":13.19},"width":53.36,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/6-38.png","element":"img","alt":" Ll1","inline":true},{"text":"-Lipschitz continuous, ","element":"span"},{"text":"i.e., ","element":"span"},{"text":"for ","element":"span"},{"text":"any ","element":"span"},{"style":{"height":14},"width":259.67,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/6-39.png","element":"img","alt":"x1, x2 ∈ 
Rn","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"d","element":"span"},{"text":", ","element":"span"},{"style":{"height":16.78},"width":725.98,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/6-40.png","element":"img","alt":"∥∇l(d; x1) − ∇l(d; x2)∥2 ≤ Ll1 ∥x1−x2∥2.","inline":true}],[{"text":"Assumptions ","element":"span"},{"href":"#id-110","text":"7.1 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-111","text":"7.2 ","element":"a"},{"text":"specify the requirements for the upper-level problems in model-based and model-free scenarios, respectively. Generally, the boundedness and Lipschitz continuity assumptions related to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"are standard in bilevel optimization (","element":"span"},{"href":"#id-12","referenceIndex":25,"text":"Ghadimi and Wang","element":"a"},{"text":", ","element":"span"},{"href":"#id-12","referenceIndex":25,"text":"2018","element":"a"},{"text":"; ","element":"span"},{"href":"#id-35","referenceIndex":42,"text":"Ji et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-35","referenceIndex":42,"text":"2021","element":"a"},{"text":"; ","element":"span"},{"href":"#id-36","referenceIndex":3,"text":"Arbel and Mairal","element":"a"},{"text":", ","element":"span"},{"href":"#id-36","referenceIndex":3,"text":"2022","element":"a"},{"text":"; ","element":"span"},{"href":"#id-112","referenceIndex":13,"text":"Chen ","element":"a"},{"href":"#id-112","referenceIndex":13,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-112","referenceIndex":13,"text":"2022b","element":"a"},{"text":"; ","element":"span"},{"href":"#id-44","referenceIndex":53,"text":"Li et al.","element":"a"},{"text":", 
","element":"span"},{"href":"#id-44","referenceIndex":53,"text":"2022","element":"a"},{"text":").","element":"span"}],[{"id":"id-115","style":{"fontWeight":"bold"},"text":"Assumption 7.3. ","element":"span"},{"text":"The reward is bounded by ","element":"span"},{"style":{"height":14.4},"width":131.92,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/6-41.png","element":"img","alt":" Cr, i.e.,","inline":true},{"style":{"height":16},"width":224.82,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/6-42.png","element":"img","alt":"|rsa(x)| ≤ Cr","inline":true},{"text":", and the gradient is bounded by ","element":"span"},{"style":{"height":13.6},"width":148.23,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/6-43.png","element":"img","alt":" Crx, i.e.,","inline":true},{"style":{"height":16.78},"width":312.27,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/6-44.png","element":"img","alt":"∥∇rsa(x)∥2 ≤ Crx","inline":true},{"text":". 
Additionally, ","element":"span"},{"style":{"height":15.6},"width":363.67,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/6-45.png","element":"img","alt":" rsa(x) is Lr-Lipschitz","inline":true,"padRight":true},{"text":"smooth, i.e., for any ","element":"span"},{"style":{"height":14},"width":202.27,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/6-46.png","element":"img","alt":" x1, x2 ∈ Rn","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":253.69,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/6-47.png","element":"img","alt":" (s, a) ∈ S × A","inline":true},{"text":", ","element":"span"},{"style":{"height":16.78},"width":724.02,"height":41.95,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/6-48.png","element":"img","alt":"∥∇rsa(x1) − ∇rsa(x2)∥2 ≤ Lr ∥x1 − x2∥2.","inline":true}],[{"text":"This assumption on the parameterized reward function is common in RL (","element":"span"},{"href":"#id-113","referenceIndex":101,"text":"Zhang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-113","referenceIndex":101,"text":"2020","element":"a"},{"text":"; ","element":"span"},{"href":"#id-114","referenceIndex":49,"text":"Lan et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-114","referenceIndex":49,"text":"2022","element":"a"},{"text":"; ","element":"span"},{"href":"#id-8","referenceIndex":10,"text":"Chakraborty et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-8","referenceIndex":10,"text":"2024","element":"a"},{"text":"; ","element":"span"},{"href":"#id-9","referenceIndex":80,"text":"Shen et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-9","referenceIndex":80,"text":"2024","element":"a"},{"text":").","element":"span"}],[{"text":"In the model-based scenario, the following theorem reveals that the proposed algorithm M-SoBiRL benefits from amortizing 
the hyper-gradient approximation, in that it attains the ","element":"span"},{"style":{"height":19.2},"width":134.7,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/6-49.png","element":"img","alt":" O(ϵ−1)","inline":true},{"text":"convergence rate, and the inner iteration number ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"text":"can be chosen as a constant independent of the accuracy ","element":"span"},{"style":{"height":2},"width":27.18,"height":5,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/6-50.png","element":"img","alt":" ϵ.","inline":true}],[{"id":"id-150","style":{"fontWeight":"bold"},"text":"Theorem 7.4 ","element":"span"},{"text":"(Model-based)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under Assumptions ","element":"span"},{"href":"#id-110","style":{"fontStyle":"italic"},"text":"7.1 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"href":"#id-115","style":{"fontStyle":"italic"},"text":"7.3","element":"a"},{"style":{"fontStyle":"italic"},"text":", in M-SoBiRL, we can choose constant step","element":"span"}],[{"style":{"width":"96%"},"width":1884,"height":532,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/7-0.png","element":"img"}],[{"id":"id-127","text":"Figure 1: Comparison of algorithms on the Atari game BeamRider, evaluated by the ground-truth reward. Each ","element":"figcaption","subtype":"caption"},{"text":"bilevel algorithm collects a total of ","element":"figcaption","subtype":"caption"},{"text":"3000 ","element":"figcaption","subtype":"caption"},{"text":"trajectory pairs. 
The running average over ","element":"figcaption","subtype":"caption"},{"text":"15 ","element":"figcaption","subtype":"caption"},{"text":"consecutive episodes is plotted, and results are averaged over ","element":"figcaption","subtype":"caption"},{"text":"5 ","element":"figcaption","subtype":"caption"},{"text":"seeds.","element":"figcaption","subtype":"caption"}],[{"style":{"fontStyle":"italic"},"text":"sizes ","element":"span"},{"style":{"height":14.8},"width":62.36,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/7-1.png","element":"img","alt":" β, η","inline":true},{"style":{"fontStyle":"italic"},"text":", and the inner iteration number ","element":"span"},{"style":{"height":16},"width":177.42,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/7-2.png","element":"img","alt":" N ∼ O(1)","inline":true},{"style":{"fontStyle":"italic"},"text":". Then the iterates ","element":"span"},{"style":{"height":16},"width":208.15,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/7-3.png","element":"img","alt":" {xk} satisfy","inline":true}],[{"style":{"width":"55%"},"width":525,"height":118,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/7-4.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"The detailed parameter settings are deferred to Theorem ","element":"span"},{"href":"#id-116","style":{"fontStyle":"italic"},"text":"G.12","element":"a"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"text":"Together with Proposition ","element":"span"},{"href":"#id-117","text":"5.2","element":"a"},{"text":", the next proposition establishes the Lipschitz continuity of the hyper-gradient.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Proposition 7.5. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under Assumption ","element":"span"},{"href":"#id-115","style":{"fontStyle":"italic"},"text":"7.3","element":"a"},{"style":{"fontStyle":"italic"},"text":", we have ","element":"span"},{"style":{"height":16.79},"width":936.43,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/7-5.png","element":"img","alt":"∥∇ϕ(x1) − ∇ϕ(x2)∥2 ≤ Lϕ ∥x1 − x2∥2 for any x1, x2 ∈","inline":true},{"style":{"height":15.99},"width":209.6,"height":39.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/7-6.png","element":"img","alt":"Rn, with Lϕ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"specified in Proposition ","element":"span"},{"href":"#id-118","style":{"fontStyle":"italic"},"text":"H.6","element":"a"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"id":"id-99","style":{"fontWeight":"bold"},"text":"Theorem 7.6 ","element":"span"},{"text":"(Model-free)","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under Assumptions ","element":"span"},{"href":"#id-111","style":{"fontStyle":"italic"},"text":"7.2 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"href":"#id-115","style":{"fontStyle":"italic"},"text":"7.3 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and given the accuracy ","element":"span"},{"style":{"height":0},"width":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/7-7.png","element":"img","alt":" ϵ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"in Algorithm ","element":"span"},{"href":"#id-100","style":{"fontStyle":"italic"},"text":"1","element":"a"},{"style":{"fontStyle":"italic"},"text":", we can set the constant ","element":"span"},{"style":{"height":22.66},"width":148.53,"height":56.66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/7-8.png","element":"img","alt":" β < 1/(2Lϕ)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":", and then the iterates","element":"span"}],[{"style":{"width":"99%"},"width":932,"height":213,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/7-9.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"height":14.18},"width":39.74,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/7-10.png","element":"img","alt":" ϕ∗","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is the minimum of ","element":"span"},{"style":{"height":16},"width":78.33,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/7-11.png","element":"img","alt":" ϕ(x)","inline":true},{"style":{"fontStyle":"italic"},"text":", and ","element":"span"},{"style":{"height":18.71},"width":130.78,"height":46.78,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/7-12.png","element":"img","alt":" Lϕ, 
L�ϕ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"are specified in Propositions ","element":"span"},{"href":"#id-118","style":{"fontStyle":"italic"},"text":"H.6 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"href":"#id-119","style":{"fontStyle":"italic"},"text":"H.7","element":"a"},{"style":{"fontStyle":"italic"},"text":", respectively.","element":"span"}],[{"text":"Combined with Table ","element":"span"},{"href":"#id-50","text":"1","element":"a"},{"text":", the convergence property in Theorem ","element":"span"},{"href":"#id-99","text":"7.6 ","element":"a"},{"text":"implies that SoBiRL realizes better iteration complexity than PBRL (","element":"span"},{"href":"#id-18","referenceIndex":79,"text":"Shen and Chen","element":"a"},{"text":", ","element":"span"},{"href":"#id-18","referenceIndex":79,"text":"2023","element":"a"},{"text":") with the same computation cost at each outer and inner iteration. Moreover, SoBiRL attains the convergence rate of the same order as PARL (","element":"span"},{"href":"#id-8","referenceIndex":10,"text":"Chakraborty et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-8","referenceIndex":10,"text":"2024","element":"a"},{"text":"), but only employs first-order oracles and gets rid of the convexity assumption on the lower-level problem.","element":"span"}],[{"text":"In the context of stochastic bilevel RL, the distribution of samples ","element":"span"},{"style":{"height":15.6},"width":202.33,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/7-13.png","element":"img","alt":" (Dk, ξk, ζk)","inline":true,"padRight":true},{"text":"relies on the variable ","element":"span"},{"style":{"height":14},"width":164.92,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/7-14.png","element":"img","alt":" πk, which","inline":true,"padRight":true},{"text":"differs from related work employing momentum techniques 
(","element":"span"},{"href":"#id-104","referenceIndex":18,"text":"Cutkosky and Orabona","element":"a"},{"text":", ","element":"span"},{"href":"#id-104","referenceIndex":18,"text":"2019","element":"a"},{"text":"; ","element":"span"},{"href":"#id-42","referenceIndex":44,"text":"Khanduri et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-42","referenceIndex":44,"text":"2021","element":"a"},{"text":"; ","element":"span"},{"href":"#id-45","referenceIndex":39,"text":"Huang","element":"a"},{"text":", ","element":"span"},{"href":"#id-45","referenceIndex":39,"text":"2023","element":"a"},{"text":"). To circumvent this misalignment in Stoc-SoBiRL, we give the statistical properties (bias and variance of the stochastic hyper-gradient) in Appendix ","element":"span"},{"href":"#id-120","text":"I.2 ","element":"a"},{"text":"and the convergence analysis in Appendix ","element":"span"},{"href":"#id-121","text":"I.3","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Theorem 7.7 ","element":"span"},{"text":"(Stochastic)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under Assumptions ","element":"span"},{"href":"#id-111","style":{"fontStyle":"italic"},"text":"7.2 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"href":"#id-115","style":{"fontStyle":"italic"},"text":"7.3","element":"a"},{"style":{"fontStyle":"italic"},"text":", we can choose appropriate sampling configurations ","element":"span"},{"style":{"height":18.18},"width":712.5,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/7-15.png","element":"img","alt":" M ∼ O(K4/3), J ∼ O(1), T ∼ O(log K)","inline":true},{"style":{"fontStyle":"italic"},"text":". 
Then the iterates ","element":"span"},{"style":{"height":16},"width":82.3,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/7-16.png","element":"img","alt":" {xk}","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"generated by Algorithm ","element":"span"},{"href":"#id-105","style":{"fontStyle":"italic"},"text":"4 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"satisfy","element":"span"}],[{"style":{"width":"92%"},"width":870,"height":119,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/7-17.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"The detailed parameter settings are given in Theorem ","element":"span"},{"href":"#id-122","style":{"fontStyle":"italic"},"text":"I.7","element":"a"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"text":"Thanks to momentum acceleration, the average squared norm of the hyper-gradient falls below ","element":"span"},{"style":{"height":0},"width":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/7-18.png","element":"img","alt":" ϵ","inline":true,"padRight":true},{"text":"within ","element":"span"},{"style":{"height":17.78},"width":148.95,"height":44.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/7-19.png","element":"img","alt":"Õ(ϵ−1.5)","inline":true,"padRight":true},{"text":"outer iterations. 
Therefore, the number of samples required by the upper-level leader is of the order ","element":"span"},{"style":{"height":17.79},"width":148.94,"height":44.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/7-20.png","element":"img","alt":"O(ϵ−3.5)","inline":true},{"text":", a cost that can be alleviated in practice, as the leader can leverage vectorized environments (","element":"span"},{"href":"#id-123","referenceIndex":66,"text":"Makoviy","element":"a"},{"href":"#id-123","referenceIndex":66,"text":"chuk et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-123","referenceIndex":66,"text":"2021","element":"a"},{"text":") within each outer iteration; see Appendix ","element":"span"},{"href":"#id-124","text":"L ","element":"a"},{"text":"for further discussion.","element":"span"}]]},{"heading":"8 EXPERIMENTS","paragraphs":[[{"text":"In this section, we conduct RLHF experiments on Atari games and MuJoCo environments to validate the efficiency of SoBiRL, and one synthetic experiment to verify the convergence of M-SoBiRL. The experimental details and parameter settings are provided in Appendix ","element":"span"},{"href":"#id-125","text":"J","element":"a"},{"text":". The code is available at ","element":"span"},{"href":"https://github.com/UCAS-YanYang/SoBiRL","text":"https://github.com/UCAS-YanYang/SoBiRL","element":"a"},{"text":".","element":"span"}],[{"text":"In RLHF experiments, we compare SoBiRL with the bilevel algorithms DRLHF (","element":"span"},{"href":"#id-28","referenceIndex":16,"text":"Christiano et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-28","referenceIndex":16,"text":"2017","element":"a"},{"text":"), PBRL (","element":"span"},{"href":"#id-9","referenceIndex":80,"text":"Shen et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-9","referenceIndex":80,"text":"2024","element":"a"},{"text":"), HPGD (","element":"span"},
{"href":"#id-10","referenceIndex":89,"text":"Thoma et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-10","referenceIndex":89,"text":"2024","element":"a"},{"text":"), and a baseline algorithm SAC (","element":"span"},{"href":"#id-60","referenceIndex":31,"text":"Haarnoja et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-60","referenceIndex":31,"text":"2018a","element":"a"},{"text":",","element":"span"},{"href":"#id-126","referenceIndex":32,"text":"b","element":"a"},{"text":"). All the bilevel solvers harness deep neural","element":"span"}],[{"id":"id-130","text":"Table 2: Comparison of algorithms on the Atari games and MuJoCo simulations, evaluated by the ground-truth ","element":"figcaption","subtype":"caption"},{"text":"reward. Each bilevel algorithm collects a total of 3000 trajectory pairs. The results are recorded after appropriate timesteps and averaged over 5 seeds.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"101%"},"width":1978,"height":285,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/8-0.png","element":"img"}],[{"text":"networks to predict rewards, while the baseline SAC receives ground-truth rewards for training. The results are reported in Figures ","element":"span"},{"href":"#id-127","text":"1","element":"a"},{"text":", ","element":"span"},{"href":"#id-128","text":"2","element":"a"},{"text":", ","element":"span"},{"href":"#id-129","text":"3","element":"a"},{"text":", and Table ","element":"span"},{"href":"#id-130","text":"2","element":"a"},{"text":", showing that even in the face of unknown rewards, the performance of SoBiRL is on a par with the baseline SAC. Additionally, SoBiRL achieves a higher episodic return than the compared methods within the same timesteps. 
Another interesting observation concerns preference prediction. On BeamRider, for example, the reward model of DRLHF predicts the ground-truth preferences with an accuracy of approximately ","element":"span"},{"text":"85%","element":"span"},{"text":", while the accuracy hovers around ","element":"span"},{"text":"54% ","element":"span"},{"text":"with the reward model of SoBiRL. This stems from the different update rules. Specifically, DRLHF alternates between learning a policy given the parameterized reward ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":") ","element":"span"},{"text":"and minimizing ","element":"span"},{"style":{"height":16},"width":87.82,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/8-1.png","element":"img","alt":" E [lh]","inline":true,"padRight":true},{"text":"based on the collected trajectories, which decouples the task into two separate phases. This aligns preferences but overlooks the exploration for rewards that promote a better policy. In contrast, SoBiRL couples the two levels via the hyper-gradient ","element":"span"},{"style":{"height":14.4},"width":218.56,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/8-2.png","element":"img","alt":" ∇ϕ, with the","inline":true,"padRight":true},{"text":"first term (","element":"span"},{"text":"12","element":"span"},{"text":") exploiting the preference information to align the reward predictor and the second term (","element":"span"},{"href":"#id-98","text":"13","element":"a"},{"text":") exploring the implicit reward to unearth a more favorable policy. 
Therefore, SoBiRL produces superior results in reward prediction, although it exhibits lower accuracy in alignment of trajectory preference.","element":"span"}],[{"text":"Regarding M-SoBiRL, the results of a synthetic experiment are shown in Figure ","element":"span"},{"href":"#id-131","text":"4","element":"a"},{"text":". The curves exhibit the benign convergence property of the proposed algorithm.","element":"span"}]]},{"heading":"Acknowledgments","paragraphs":[[{"text":"Bin Gao was supported by the Young Elite Scientist Sponsorship Program by CAST. Ya-xiang Yuan was supported by the National Natural Science Foundation of China (grant No. 12288201). The authors are grateful to the Program and Area Chairs and Reviewers for their detailed and valuable comments and suggestions.","element":"span"}]]},{"heading":"References","paragraphs":[[{"id":"id-46","text":"Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, ","element":"span"},{"text":"G. (2020). Optimality and approximation with policy","element":"span"}],[{"text":"gradient methods in Markov decision processes. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Annual Conference on Learning Theory","element":"span"},{"text":", pages 64–66. PMLR.","element":"span"}],[{"id":"id-62","text":"Ahmed, Z., Le Roux, N., Norouzi, M., and Schuurmans, ","element":"span"},{"text":"D. (2019). Understanding the impact of entropy on policy optimization. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 151–160. PMLR.","element":"span"}],[{"id":"id-36","text":"Arbel, M. and Mairal, J. (2022). Amortized implicit ","element":"span"},{"text":"differentiation for stochastic bilevel optimization. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Learning Representations","element":"span"},{"text":".","element":"span"}],[{"id":"id-74","text":"Arora, S. and Doshi, P. (2021). 
A survey of inverse ","element":"span"},{"text":"reinforcement learning: ","element":"span"},{"text":"challenges, methods and progress. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Artificial Intelligence","element":"span"},{"text":", 297:103500.","element":"span"}],[{"id":"id-214","text":"Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, ","element":"span"},{"text":"M. (2013). The arcade learning environment: an evaluation platform for general agents. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of Artificial Intelligence Research","element":"span"},{"text":", 47:253–279.","element":"span"}],[{"id":"id-24","text":"Berner, C., Brockman, G., Chan, B., Cheung, V., ","element":"span"},{"text":"Dębiak, P., Dennison, C., Farhi, D., Fischer, Q., Hashme, S., Hesse, C., et al. (2019). Dota 2 with large scale deep reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1912.06680","element":"span"},{"text":".","element":"span"}],[{"id":"id-3","text":"Bertinetto, L., Henriques, J., Torr, P., and Vedaldi, A. ","element":"span"},{"text":"(2019). Meta-learning with differentiable closed-form solvers. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Learning Representations","element":"span"},{"text":".","element":"span"}],[{"id":"id-29","text":"Brown, D., Goo, W., Nagarajan, P., and Niekum, S. ","element":"span"},{"text":"(2019). Extrapolating beyond suboptimal demonstrations via inverse reinforcement learning from observations. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 783–792. PMLR.","element":"span"}],[{"id":"id-64","text":"Cen, S., Cheng, C., Chen, Y., Wei, Y., and Chi, Y. ","element":"span"},{"text":"(2022). Fast global convergence of natural policy gradient methods with entropy regularization. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Operations Research","element":"span"},{"text":", 70(4):2563–2578.","element":"span"}],[{"id":"id-8","text":"Chakraborty, S., Bedi, A., Koppel, A., Wang, H., ","element":"span"},{"text":"Manocha, D., Wang, M., and Huang, F. (2024). PARL: a unified framework for policy alignment in","element":"span"}],[{"text":"reinforcement learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Learning Representations","element":"span"},{"text":".","element":"span"}],[{"id":"id-34","text":"Chen, L., Xu, J., and Zhang, J. (2024). On finding small ","element":"span"},{"text":"hyper-gradients in bilevel optimization: hardness results and improved analysis. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Annual Conference on Learning Theory","element":"span"},{"text":", pages 947–980. PMLR.","element":"span"}],[{"id":"id-49","text":"Chen, S., Yang, D., Li, J., Wang, S., Yang, Z., and ","element":"span"},{"text":"Wang, Z. (2022a). Adaptive model design for Markov decision process. ","element":"span"},{"text":"In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 3679–3700. PMLR.","element":"span"}],[{"id":"id-112","text":"Chen, T., Sun, Y., Xiao, Q., and Yin, W. (2022b). A ","element":"span"},{"text":"single-timescale method for stochastic bilevel optimization. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Artificial Intelligence and Statistics","element":"span"},{"text":", pages 2466–2488. PMLR.","element":"span"}],[{"id":"id-106","text":"Chen, T., Sun, Y., and Yin, W. (2021). Closing the gap: ","element":"span"},{"text":"tighter analysis of alternating stochastic gradient methods for bilevel problems. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":".","element":"span"}],[{"id":"id-224","text":"Chen, Z., Zhou, Y., Chen, R.-R., and Zou, S. (2022c). ","element":"span"},{"text":"Sample and communication-efficient decentralized actor-critic algorithms with finite-time analysis. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 3794–3834. PMLR.","element":"span"}],[{"id":"id-28","text":"Christiano, P. F., Leike, J., Brown, T., Martic, M., ","element":"span"},{"text":"Legg, S., and Amodei, D. (2017). Deep reinforcement learning from human preferences. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", volume 30.","element":"span"}],[{"id":"id-54","text":"Christianson, B. (1994). Reverse accumulation and ","element":"span"},{"text":"attractive fixed points. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Optimization Methods and Software","element":"span"},{"text":", 3(4):311–326.","element":"span"}],[{"id":"id-104","text":"Cutkosky, A. and Orabona, F. (2019). Momentum-","element":"span"},{"text":"based variance reduction in non-convex SGD. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", volume 32.","element":"span"}],[{"id":"id-37","text":"Dagréou, M., Ablin, P., Vaiter, S., and Moreau, T. ","element":"span"},{"text":"(2022). A framework for bilevel optimization that enables stochastic and global variance reduction algorithms. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", volume 35, pages 26698–26710.","element":"span"}],[{"id":"id-69","text":"Devidze, R., Kamalaruban, P., and Singla, A. 
(2022). ","element":"span"},{"text":"Exploration-guided reward shaping for reinforcement learning under sparse rewards. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", volume 35, pages 5829–5842.","element":"span"}],[{"id":"id-75","text":"Fiez, T., Chasnov, B., and Ratliff, L. (2020). Implicit ","element":"span"},{"text":"learning dynamics in Stackelberg games: equilibria characterization, convergence analysis, and empirical study. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 3133–3144. PMLR.","element":"span"}],[{"id":"id-0","text":"Franceschi, L., Donini, M., Frasconi, P., and Pontil, M. ","element":"span"},{"text":"(2017). Forward and reverse gradient-based hyperparameter ","element":"span"}],[{"text":"optimization. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 1165–1173. PMLR.","element":"span"}],[{"id":"id-1","text":"Franceschi, L., Frasconi, P., Salzo, S., Grazzi, R., and ","element":"span"},{"text":"Pontil, M. (2018). Bilevel programming for hyper-parameter optimization and meta-learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 1568–1577. PMLR.","element":"span"}],[{"id":"id-52","text":"Geist, M., Scherrer, B., and Pietquin, O. (2019). A the","element":"span"},{"text":"ory of regularized Markov decision processes. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 2160–2169. PMLR.","element":"span"}],[{"id":"id-12","text":"Ghadimi, S. and Wang, M. (2018). Approximation ","element":"span"},{"text":"methods for bilevel programming. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1802.02246","element":"span"},{"text":".","element":"span"}],[{"id":"id-32","text":"Grazzi, R., Franceschi, L., Pontil, M., and Salzo, S. ","element":"span"},{"text":"(2020). On the iteration complexity of hypergradient computation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 3748–3758. PMLR.","element":"span"}],[{"id":"id-55","text":"Grazzi, R., Pontil, M., and Salzo, S. (2021). Conver","element":"span"},{"text":"gence properties of stochastic hypergradients. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Artificial Intelligence and Statistics","element":"span"},{"text":", pages 3826–3834. PMLR.","element":"span"}],[{"id":"id-56","text":"Grazzi, R., Pontil, M., and Salzo, S. (2023). Bilevel ","element":"span"},{"text":"optimization with a lower-level contraction: optimal sample complexity without warm-start. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of Machine Learning Research","element":"span"},{"text":", 24(167):1–37.","element":"span"}],[{"id":"id-70","text":"Gupta, D., Chandak, Y., Jordan, S., Thomas, P. S., ","element":"span"},{"text":"and C da Silva, B. (2023). Behavior alignment via reward function optimization. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", volume 36.","element":"span"}],[{"id":"id-59","text":"Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. ","element":"span"},{"text":"(2017). Reinforcement learning with deep energy-based policies. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 1352–1361. PMLR.","element":"span"}],[{"id":"id-60","text":"Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. 
","element":"span"},{"text":"(2018a). Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 1861–1870. PMLR.","element":"span"}],[{"id":"id-126","text":"Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., ","element":"span"},{"text":"Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P., et al. (2018b). Soft actor-critic algorithms and applications. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1812.05905","element":"span"},{"text":".","element":"span"}],[{"id":"id-19","text":"Hao, J., Gong, X., and Liu, M. (2024). Bilevel optimiza- ","element":"span"},{"text":"tion under unbounded smoothness: a new algorithm and convergence analysis. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Learning Representations","element":"span"},{"text":".","element":"span"}],[{"id":"id-226","text":"He, Y., Hu, J., Huang, X., Lu, S., Wang, B., and ","element":"span"},{"text":"Yuan, K. (2024). Distributed bilevel optimization with communication compression. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":".","element":"span"}],[{"id":"id-132","text":"Hong, M., Wai, H.-T., Wang, Z., and Yang, Z. (2023). ","element":"span"},{"text":"A two-timescale stochastic algorithm framework for bilevel optimization: complexity analysis and application to actor-critic. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"SIAM Journal on Optimization","element":"span"},{"text":", 33(1):147–180.","element":"span"}],[{"id":"id-16","text":"Hu, X., Xiao, N., Liu, X., and Toh, K.-C. (2023). An im- ","element":"span"},{"text":"proved unconstrained approach for bilevel optimization. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"SIAM Journal on Optimization","element":"span"},{"text":", 33(4):2801– 2829.","element":"span"}],[{"id":"id-39","text":"Hu, Y., Wang, J., Xie, Y., Krause, A., and Kuhn, D. ","element":"span"},{"text":"(2024). Contextual stochastic bilevel optimization. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", volume 36.","element":"span"}],[{"id":"id-30","text":"Hu, Y., Wang, W., Jia, H., Wang, Y., Chen, Y., Hao, ","element":"span"},{"text":"J., Wu, F., and Fan, C. (2020). Learning to utilize shaping rewards: a new approach of reward shaping. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", volume 33, pages 15931–15941.","element":"span"}],[{"id":"id-45","text":"Huang, F. (2023). On momentum-based gradient meth- ","element":"span"},{"text":"ods for bilevel optimization with nonconvex lower-level. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2303.03944","element":"span"},{"text":".","element":"span"}],[{"id":"id-43","text":"Huang, F., Li, J., Gao, S., and Huang, H. (2022). En- ","element":"span"},{"text":"hanced bilevel optimization via Bregman distance. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", volume 35, pages 28928–28939.","element":"span"}],[{"id":"id-15","text":"Ji, K., Liu, M., Liang, Y., and Ying, L. (2022). Will ","element":"span"},{"text":"bilevel optimizers benefit from loops. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", volume 35, pages 3011–3023.","element":"span"}],[{"id":"id-35","text":"Ji, K., Yang, J., and Liang, Y. (2021). 
Bilevel opti","element":"span"},{"text":"mization: convergence analysis and enhanced design. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 4882–4892. PMLR.","element":"span"}],[{"id":"id-101","text":"Kearns, M. and Singh, S. (1998). Finite-sample conver","element":"span"},{"text":"gence rates for Q-learning and indirect algorithms. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", volume 11.","element":"span"}],[{"id":"id-42","text":"Khanduri, P., Zeng, S., Hong, M., Wai, H.-T., Wang, ","element":"span"},{"text":"Z., and Yang, Z. (2021). ","element":"span"},{"text":"A near-optimal algorithm for stochastic bilevel optimization via double-momentum. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", volume 34, pages 30271–30283.","element":"span"}],[{"id":"id-225","text":"Kong, B., Zhu, S., Lu, S., Huang, X., and Yuan, ","element":"span"},{"text":"K. (2024). Decentralized bilevel optimization over graphs: loopless algorithmic update and transient iteration complexity. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2402.03167","element":"span"},{"text":".","element":"span"}],[{"id":"id-17","text":"Kwon, J., Kwon, D., Wright, S., and Nowak, R. D. ","element":"span"},{"text":"(2023). ","element":"span"},{"text":"A fully first-order method for stochastic bilevel optimization. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 18083–18113. PMLR.","element":"span"}],[{"id":"id-47","text":"Lan, G. (2023). 
Policy mirror descent for reinforcement ","element":"span"},{"text":"learning: linear convergence, new sampling complex-","element":"span"}],[{"text":"ity, and generalized problem classes. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Mathematical Programming","element":"span"},{"text":", 198(1):1059–1106.","element":"span"}],[{"id":"id-61","text":"Lan, Q. (2021). Variational quantum soft actor-critic. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2112.11921","element":"span"},{"text":".","element":"span"}],[{"id":"id-114","text":"Lan, Q., Tosatto, S., Farrahi, H., and Mahmood, R. ","element":"span"},{"text":"(2022). Model-free policy learning with reward gradients. ","element":"span"},{"text":"In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Artificial Intelligence and Statistics","element":"span"},{"text":", pages 4217–4234. PMLR.","element":"span"}],[{"id":"id-72","text":"Lee, K., Smith, L., and Abbeel, P. (2021). PEBBLE: ","element":"span"},{"text":"feedback-efficient interactive reinforcement learning via relabeling experience and unsupervised pretraining. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":".","element":"span"}],[{"id":"id-102","text":"Li, G., Wei, Y., Chi, Y., and Chen, Y. (2024a). Break- ","element":"span"},{"text":"ing the sample size barrier in model-based reinforcement learning with a generative model. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Operations Research","element":"span"},{"text":", 72(1):203–221.","element":"span"}],[{"id":"id-66","text":"Li, H., Yu, H.-F., Ying, L., and Dhillon, I. S. (2024b). ","element":"span"},{"text":"Accelerating primal-dual methods for regularized Markov decision processes. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"SIAM Journal on Optimization","element":"span"},{"text":", 34(1):764–789.","element":"span"}],[{"id":"id-44","text":"Li, J., Gu, B., and Huang, H. (2022). A fully single loop ","element":"span"},{"text":"algorithm for bilevel optimization without Hessian inverse. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Annual AAAI Conference on Artificial Intelligence","element":"span"},{"text":", volume 36, pages 7426–7434.","element":"span"}],[{"id":"id-71","text":"Li, J., Hu, X., Xu, H., Liu, J., Zhan, X., Jia, Q.-S., and ","element":"span"},{"text":"Zhang, Y.-Q. (2023). Mind the gap: offline policy optimization for imperfect rewards. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Learning Representations","element":"span"},{"text":".","element":"span"}],[{"id":"id-92","text":"Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., ","element":"span"},{"text":"Erez, T., Tassa, Y., Silver, D., and Wierstra, D. (2016). Continuous control with deep reinforcement learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Learning Representations","element":"span"},{"text":".","element":"span"}],[{"id":"id-5","text":"Liu, H., Simonyan, K., and Yang, Y. (2019). Darts: ","element":"span"},{"text":"differentiable architecture search. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Learning Representations","element":"span"},{"text":".","element":"span"}],[{"id":"id-4","text":"Liu, R., Gao, J., Zhang, J., Meng, D., and Lin, Z. ","element":"span"},{"text":"(2021a). Investigating bi-level optimization for learning and vision from a unified perspective: a survey and beyond. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE Transactions on Pattern Analysis and Machine Intelligence","element":"span"},{"text":", 44(12):10045–10067.","element":"span"}],[{"id":"id-14","text":"Liu, R., Liu, X., Yuan, X., Zeng, S., and Zhang, ","element":"span"},{"text":"J. (2021b). ","element":"span"},{"text":"A value-function-based interior-point method for non-convex bi-level optimization. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 6882–6892. PMLR.","element":"span"}],[{"id":"id-38","text":"Liu, R., Liu, Y., Yao, W., Zeng, S., and Zhang, J. ","element":"span"},{"text":"(2023). Averaged method of multipliers for bi-level optimization without lower-level strong convexity. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":",","element":"span"}],[{"text":"volume 202 of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of Machine Learning Research","element":"span"},{"text":", pages 21839–21866. PMLR.","element":"span"}],[{"id":"id-133","text":"Liu, R., Liu, Y., Zeng, S., and Zhang, J. (2021c). To- ","element":"span"},{"text":"wards gradient-based bilevel optimization with non-convex followers and beyond. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", volume 34, pages 8662–8675.","element":"span"}],[{"id":"id-135","text":"Liu, R., Liu, Z., Yao, W., Zeng, S., and Zhang, J. ","element":"span"},{"text":"(2024a). ","element":"span"},{"text":"Moreau envelope for nonconvex bi-level optimization: a single-loop and Hessian-free solution strategy. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":".","element":"span"}],[{"id":"id-13","text":"Liu, R., Mu, P., Yuan, X., Zeng, S., and Zhang, J. ","element":"span"},{"text":"(2020). A generic first-order algorithmic framework for bi-level programming beyond lower-level singleton. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 6305–6315. PMLR.","element":"span"}],[{"id":"id-11","text":"Liu, S., Wang, Y., and Gao, X.-S. (2024b). Game- ","element":"span"},{"text":"theoretic unlearnable example generator. ","element":"span"},{"text":"In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Annual AAAI Conference on Artificial Intelligence","element":"span"},{"text":", volume 38, pages 21349–21358.","element":"span"}],[{"id":"id-219","text":"Loshchilov, I. and Hutter, F. (2018). Decoupled weight ","element":"span"},{"text":"decay regularization. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Learning Representations","element":"span"},{"text":".","element":"span"}],[{"id":"id-134","text":"Lu, S. (2023). SLM: a smoothed first-order Lagrangian ","element":"span"},{"text":"method for structured constrained nonconvex optimization. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", volume 36.","element":"span"}],[{"id":"id-123","text":"Makoviychuk, V., Wawrzyniak, L., Guo, Y., Lu, M., ","element":"span"},{"text":"Storey, K., Macklin, M., Hoeller, D., Rudin, N., Allshire, A., Handa, A., et al. (2021). Isaac Gym: high performance GPU based physics simulation for robot learning. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)","element":"span"},{"text":".","element":"span"}],[{"id":"id-48","text":"Mei, J., Xiao, C., Szepesvari, C., and Schuurmans, D. ","element":"span"},{"text":"(2020). On the global convergence rates of softmax policy gradient methods. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 6820–6829. PMLR.","element":"span"}],[{"id":"id-26","text":"Miki, T., Lee, J., Hwangbo, J., Wellhausen, L., Koltun, ","element":"span"},{"text":"V., and Hutter, M. (2022). Learning robust perceptive locomotion for quadrupedal robots in the wild. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Science Robotics","element":"span"},{"text":", 7(62):eabk2822.","element":"span"}],[{"id":"id-25","text":"Mirhoseini, A., Goldie, A., Yazgan, M., Jiang, J. W., ","element":"span"},{"text":"Songhori, E., Wang, S., Lee, Y.-J., Johnson, E., Pathak, O., Nazi, A., et al. (2021). A graph placement methodology for fast chip design. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Nature","element":"span"},{"text":", 594(7862):207–212.","element":"span"}],[{"id":"id-63","text":"Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lilli- ","element":"span"},{"text":"crap, T., Harley, T., Silver, D., and Kavukcuoglu, K. (2016). ","element":"span"},{"text":"Asynchronous methods for deep reinforcement learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 1928–1937. PMLR.","element":"span"}],[{"id":"id-220","text":"Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Ve- ","element":"span"},{"text":"ness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). 
Human-level control through deep reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Nature","element":"span"},{"text":", 518(7540):529–533.","element":"span"}],[{"id":"id-51","text":"Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. ","element":"span"},{"text":"(2017). Bridging the gap between value and policy based reinforcement learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", volume 30.","element":"span"}],[{"id":"id-31","text":"Pedregosa, F. (2016). Hyperparameter optimization ","element":"span"},{"text":"with approximate gradient. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 737–746. PMLR.","element":"span"}],[{"id":"id-73","text":"Saha, A., Pacchiano, A., and Lee, J. (2023). Duel","element":"span"},{"text":"ing RL: reinforcement learning with trajectory preferences. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Artificial Intelligence and Statistics","element":"span"},{"text":", pages 6263–6289. PMLR.","element":"span"}],[{"id":"id-90","text":"Schulman, J., Levine, S., Abbeel, P., Jordan, M., and ","element":"span"},{"text":"Moritz, P. (2015). Trust region policy optimization. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 1889–1897. PMLR.","element":"span"}],[{"id":"id-91","text":"Schulman, J., Moritz, P., Levine, S., Jordan, M., and ","element":"span"},{"text":"Abbeel, P. (2016). High-dimensional continuous control using generalized advantage estimation. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Learning Representations","element":"span"},{"text":".","element":"span"}],[{"id":"id-93","text":"Schulman, J., Wolski, F., Dhariwal, P., Radford, A., ","element":"span"},{"text":"and Klimov, O. (2017). Proximal policy optimization algorithms. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1707.06347","element":"span"},{"text":".","element":"span"}],[{"id":"id-136","text":"Seneta, E. (2006). ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Non-negative Matrices and Markov Chains","element":"span"},{"text":". Springer Science & Business Media.","element":"span"}],[{"id":"id-18","text":"Shen, H. and Chen, T. (2023). On penalty-based bilevel ","element":"span"},{"text":"gradient descent method. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":". JMLR.","element":"span"}],[{"id":"id-9","text":"Shen, H., Yang, Z., and Chen, T. (2024). ","element":"span"},{"text":"Principled penalty-based methods for bilevel reinforcement learning and RLHF. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":".","element":"span"}],[{"id":"id-223","text":"Shen, H., Zhang, K., Hong, M., and Chen, T. (2023). ","element":"span"},{"text":"Towards understanding asynchronous advantage actor-critic: convergence and linear speedup. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE Transactions on Signal Processing","element":"span"},{"text":", 71:2579–2594.","element":"span"}],[{"id":"id-23","text":"Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, ","element":"span"},{"text":"I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. (2017). Mastering the game of Go without human knowledge. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Nature","element":"span"},{"text":", 550(7676):354–359.","element":"span"}],[{"id":"id-76","text":"Song, Z., Lee, J. D., and Yang, Z. (2023). Can we find ","element":"span"},{"text":"Nash equilibria at a linear rate in Markov games? In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Learning Representations","element":"span"},{"text":".","element":"span"}],[{"id":"id-67","text":"Sorg, J., Lewis, R. L., and Singh, S. (2010). Reward ","element":"span"},{"text":"design via online gradient ascent. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", volume 23.","element":"span"}],[{"id":"id-189","text":"Sriperumbudur, B. K., Fukumizu, K., Gretton, A., ","element":"span"},{"text":"Schölkopf, B., and Lanckriet, G. R. (2009). On integral probability metrics, ","element":"span"},{"style":{"height":10},"width":26,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/12-0.png","element":"img","alt":" ϕ","inline":true},{"text":"-divergences and binary classification. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:0901.2698","element":"span"},{"text":".","element":"span"}],[{"id":"id-27","text":"Sun, J.-M., Yang, J., Mo, K., Lai, Y.-K., Guibas, L., ","element":"span"},{"text":"and Gao, L. (2024). Haisor: human-aware indoor scene optimization via deep reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ACM Transactions on Graphics","element":"span"},{"text":", 43(2):1–17.","element":"span"}],[{"id":"id-21","text":"Sutton, R. S. and Barto, A. G. (2018). ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Reinforcement Learning: an Introduction","element":"span"},{"text":". MIT Press.","element":"span"}],[{"id":"id-22","text":"Szepesvári, C. (2022). 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Algorithms for Reinforcement Learning","element":"span"},{"text":". Springer Nature.","element":"span"}],[{"id":"id-10","text":"Thoma, V., Pásztor, B., Krause, A., Ramponi, G., ","element":"span"},{"text":"and Hu, Y. (2024). Contextual bilevel reinforcement learning for incentive alignment. ","element":"span"},{"text":"In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":".","element":"span"}],[{"id":"id-215","text":"Todorov, E., Erez, T., and Tassa, Y. (2012). Mujoco: ","element":"span"},{"text":"a physics engine for model-based control. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"2012 IEEE/RSJ International Conference on Intelligent Robots and Systems","element":"span"},{"text":", pages 5026–5033. IEEE.","element":"span"}],[{"id":"id-7","text":"Wang, J., Chen, H., Jiang, R., Li, X., and Li, Z. (2021). ","element":"span"},{"text":"Fast algorithms for Stackelberg prediction game with least squares loss. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 10708–10716. PMLR.","element":"span"}],[{"id":"id-6","text":"Wang, X., Guo, W., Su, J., Yang, X., and Yan, J. ","element":"span"},{"text":"(2022). Zarts: on zero-order optimization for neural architecture search. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", volume 35, pages 12868– 12880.","element":"span"}],[{"id":"id-77","text":"Williams, R. J. and Peng, J. (1991). Function opti- ","element":"span"},{"text":"mization using connectionist reinforcement learning algorithms. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Connection Science","element":"span"},{"text":", 3(3):241–268.","element":"span"}],[{"id":"id-222","text":"Wu, J., Chen, S., Wang, M., Wang, H., and Xu, ","element":"span"},{"text":"H. (2024). ","element":"span"},{"text":"Contractual reinforcement learning: pulling arms with invisible hands. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2407.01458","element":"span"},{"text":".","element":"span"}],[{"id":"id-33","text":"Yang, H., Luo, L., Li, C. J., Jordan, M., and Fazel, M. ","element":"span"},{"text":"(2023). Accelerating inexact hypergradient descent for bilevel optimization. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"OPT 2023: Optimization for Machine Learning","element":"span"},{"text":".","element":"span"}],[{"id":"id-53","text":"Yang, W., Li, X., and Zhang, Z. (2019). A regularized ","element":"span"},{"text":"approach to sparse optimal policy in reinforcement learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", volume 32.","element":"span"}],[{"id":"id-40","text":"Yang, Y., Gao, B., and Yuan, Y.-x. (2025). LancBiO: ","element":"span"},{"text":"dynamic Lanczos-aided bilevel optimization via Krylov subspace. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Learning Representations","element":"span"},{"text":".","element":"span"}],[{"id":"id-20","text":"Yao, W., Yu, C., Zeng, S., and Zhang, J. (2024). Con- ","element":"span"},{"text":"strained bi-level optimization: proximal Lagrangian value function approach and Hessian-free algorithm.","element":"span"}],[{"text":"In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Learning Representations","element":"span"},{"text":".","element":"span"}],[{"id":"id-2","text":"Ye, J. J., Yuan, X., Zeng, S., and Zhang, J. (2023). 
","element":"span"},{"text":"Difference of convex algorithms for bilevel programs with applications in hyperparameter selection. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Mathematical Programming","element":"span"},{"text":", 198(2):1583–1616.","element":"span"}],[{"id":"id-65","text":"Zhan, W., Cen, S., Huang, B., Chen, Y., Lee, J. D., and ","element":"span"},{"text":"Chi, Y. (2023). Policy mirror descent for regularized reinforcement learning: a generalized framework with linear convergence. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"SIAM Journal on Optimization","element":"span"},{"text":", 33(2):1061–1091.","element":"span"}],[{"id":"id-113","text":"Zhang, K., Koppel, A., Zhu, H., and Basar, T. (2020). ","element":"span"},{"text":"Global convergence of policy gradient methods to (almost) locally optimal policies. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"SIAM Journal on Control and Optimization","element":"span"},{"text":", 58(6):3586–3612.","element":"span"}],[{"id":"id-68","text":"Zheng, Z., Oh, J., and Singh, S. (2018). On learning ","element":"span"},{"text":"intrinsic rewards for policy gradient methods. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", volume 31.","element":"span"}],[{"id":"id-227","text":"Zhu, S., Kong, B., Lu, S., Huang, X., and Yuan, K. ","element":"span"},{"text":"(2024). SPARKLE: a unified single-loop primal-dual framework for decentralized bilevel optimization. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", volume 37, pages 62912–62987.","element":"span"}],[{"id":"id-58","text":"Ziebart, B. D. (2010). ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Modeling purposeful adaptive behavior with the principle of maximum causal entropy","element":"span"},{"text":". 
Carnegie Mellon University.","element":"span"}]]},{"heading":"Bilevel Reinforcement Learning via the Development of Hyper-gradient without Lower-Level Convexity: Supplementary Materials","paragraphs":[[{"id":"id-57","style":{"fontWeight":"bold"},"text":"A ","element":"span"},{"style":{"fontWeight":"bold"},"text":"RELATED WORK IN BILEVEL OPTIMIZATION","element":"span"}],[{"text":"By the implicit function theorem, the approximate implicit differentiation (AID) based method regards the optimal lower-level solution as a function of the upper-level variable, which yields the hyper-gradient to instruct the upper-level update. It implements alternating (inexact) gradient descent steps between two levels (","element":"span"},{"href":"#id-12","referenceIndex":25,"text":"Ghadimi ","element":"a"},{"href":"#id-12","referenceIndex":25,"text":"and Wang","element":"a"},{"text":", ","element":"span"},{"href":"#id-12","referenceIndex":25,"text":"2018","element":"a"},{"text":"; ","element":"span"},{"href":"#id-35","referenceIndex":42,"text":"Ji et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-35","referenceIndex":42,"text":"2021","element":"a"},{"text":"; ","element":"span"},{"href":"#id-112","referenceIndex":13,"text":"Chen et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-112","referenceIndex":13,"text":"2022b","element":"a"},{"text":"; ","element":"span"},{"href":"#id-37","referenceIndex":19,"text":"Dagréou et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-37","referenceIndex":19,"text":"2022","element":"a"},{"text":"; ","element":"span"},{"href":"#id-44","referenceIndex":53,"text":"Li et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-44","referenceIndex":53,"text":"2022","element":"a"},{"text":"; ","element":"span"},{"href":"#id-132","referenceIndex":35,"text":"Hong et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-132","referenceIndex":35,"text":"2023","element":"a"},{"text":"). 
In addition, generalizing the lower-level strong convexity, the studies (","element":"span"},{"href":"#id-32","referenceIndex":26,"text":"Grazzi et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-32","referenceIndex":26,"text":"2020","element":"a"},{"text":", ","element":"span"},{"href":"#id-55","referenceIndex":27,"text":"2021","element":"a"},{"text":", ","element":"span"},{"href":"#id-56","referenceIndex":28,"text":"2023","element":"a"},{"text":") focused on the bilevel problem where the lower-level problem is formulated as a fixed-point equation. Moreover, a line of research has been dedicated to the bilevel problem with a nonconvex lower-level objective (","element":"span"},{"href":"#id-133","referenceIndex":60,"text":"Liu et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-133","referenceIndex":60,"text":"2021c","element":"a"},{"text":"; ","element":"span"},{"href":"#id-18","referenceIndex":79,"text":"Shen and Chen","element":"a"},{"text":", ","element":"span"},{"href":"#id-18","referenceIndex":79,"text":"2023","element":"a"},{"text":"; ","element":"span"},{"href":"#id-45","referenceIndex":39,"text":"Huang","element":"a"},{"text":", ","element":"span"},{"href":"#id-45","referenceIndex":39,"text":"2023","element":"a"},{"text":"; ","element":"span"},{"href":"#id-134","referenceIndex":65,"text":"Lu","element":"a"},{"text":", ","element":"span"},{"href":"#id-134","referenceIndex":65,"text":"2023","element":"a"},{"text":"; ","element":"span"},{"href":"#id-135","referenceIndex":61,"text":"Liu et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-135","referenceIndex":61,"text":"2024a","element":"a"},{"text":"). Specifically, ","element":"span"},{"href":"#id-18","referenceIndex":79,"text":"Shen and Chen ","element":"a"},{"text":"(","element":"span"},{"href":"#id-18","referenceIndex":79,"text":"2023","element":"a"},{"text":") proposed a penalty-based method to bypass the requirement of lower-level convexity. 
In (","element":"span"},{"href":"#id-45","referenceIndex":39,"text":"Huang","element":"a"},{"text":", ","element":"span"},{"href":"#id-45","referenceIndex":39,"text":"2023","element":"a"},{"text":"), a momentum-based approach was designed to solve the bilevel problem with a lower-level function satisfying the PL condition. ","element":"span"},{"href":"#id-135","referenceIndex":61,"text":"Liu et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-135","referenceIndex":61,"text":"2024a","element":"a"},{"text":") utilized the Moreau envelope based reformulation to provide a single-loop algorithm for nonconvex-nonconvex bilevel optimization. In stochastic settings, bilevel algorithms have been developed in (","element":"span"},{"href":"#id-12","referenceIndex":25,"text":"Ghadimi and Wang","element":"a"},{"text":", ","element":"span"},{"href":"#id-12","referenceIndex":25,"text":"2018","element":"a"},{"text":"; ","element":"span"},{"href":"#id-35","referenceIndex":42,"text":"Ji et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-35","referenceIndex":42,"text":"2021","element":"a"},{"text":"; ","element":"span"},{"href":"#id-42","referenceIndex":44,"text":"Khanduri et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-42","referenceIndex":44,"text":"2021","element":"a"},{"text":"; ","element":"span"},{"href":"#id-112","referenceIndex":13,"text":"Chen et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-112","referenceIndex":13,"text":"2022b","element":"a"},{"text":").","element":"span"}],[{"id":"id-78","style":{"fontWeight":"bold"},"text":"B ","element":"span"},{"style":{"fontWeight":"bold"},"text":"VECTOR AND MATRIX NOTATION","element":"span"}],[{"text":"Throughout this paper, we adhere to the following conventions of vector and matrix 
notation.","element":"span"}],[{"style":{"width":"99%"},"width":1941,"height":142,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/13-0.png","element":"img"}],[{"text":"Additionally, for a vector ","element":"span"},{"style":{"height":14.98},"width":138.28,"height":37.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/13-1.png","element":"img","alt":" v ∈ R|S|","inline":true},{"text":", we use the notation of entry ","element":"span"},{"style":{"height":16},"width":160.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/13-2.png","element":"img","alt":" vs = v(s)","inline":true,"padRight":true},{"text":"interchangeably for convenience of exposition.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"style":{"height":17.39},"width":679.6,"height":43.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/13-3.png","element":"img","alt":" P ∈ R|S||A|×|S|, r ∈ R|S||A|, π ∈ R|S||A|","inline":true},{"text":": the transition matrix, the reward function, and the policy.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"style":{"height":14.98},"width":235.69,"height":37.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/13-4.png","element":"img","alt":" P π ∈ R|S|×|S|","inline":true},{"text":": the transition matrix induced by ","element":"span"},{"style":{"height":6.8},"width":35.15,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/13-5.png","element":"img","alt":" π.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"style":{"height":17.38},"width":413.28,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/13-6.png","element":"img","alt":" V π ∈ R|S|, Qπ ∈ R|S||A|","inline":true},{"text":": the soft value functions associated with 
","element":"span"},{"style":{"height":6.8},"width":35.15,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/13-7.png","element":"img","alt":" π.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"style":{"height":18.18},"width":809.8,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/13-8.png","element":"img","alt":" V ∗(x) ∈ R|S|, Q∗(x) ∈ R|S||A|, π∗(x) ∈ R|S||A|","inline":true},{"text":": the optimal soft value functions and the optimal soft policy given ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":")","element":"span"},{"text":".","element":"span"}],[{"style":{"height":16},"width":248.06,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/13-9.png","element":"img","alt":"• ∇ϕ(x) ∈ Rn","inline":true},{"text":": the hyper-gradient.","element":"span"}],[{"style":{"height":18.18},"width":753.77,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/13-10.png","element":"img","alt":"• ∇xφ(x, v) ∈ R|S|×n, ∇vφ(x, v) ∈ R|S|×|S|","inline":true},{"text":": the partial derivatives of ","element":"span"},{"style":{"height":16},"width":129.3,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/13-11.png","element":"img","alt":" φ(x, v).","inline":true}],[{"style":{"height":18.18},"width":672.4,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/13-12.png","element":"img","alt":"• ∇xf(x, π) ∈ Rn, ∇πf(x, π) ∈ R|S||A|","inline":true},{"text":": the partial derivatives of ","element":"span"},{"style":{"height":16},"width":130.43,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/13-13.png","element":"img","alt":" f(x, 
π).","inline":true}],[{"style":{"height":18.19},"width":353.36,"height":45.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/13-14.png","element":"img","alt":"• ∇r(x) ∈ R|S||A|×n","inline":true},{"text":": the derivative of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":")","element":"span"},{"text":".","element":"span"}],[{"style":{"height":18.18},"width":1073.39,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/13-15.png","element":"img","alt":"• ∇V ∗(x) ∈ R|S|×n, ∇Q∗(x) ∈ R|S||A|×n, ∇π∗(x) ∈ R|S||A|×n","inline":true},{"text":": the implicit differentiations.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"C ","element":"span"},{"style":{"fontWeight":"bold"},"text":"PRELIMINARIES ON NUMERICAL LINEAR ALGEBRA","element":"span"}],[{"text":"We first revisit some basics for non-negative matrices and Markov chains; see (","element":"span"},{"href":"#id-136","referenceIndex":78,"text":"Seneta","element":"a"},{"text":", ","element":"span"},{"href":"#id-136","referenceIndex":78,"text":"2006","element":"a"},{"text":"). ","element":"span"},{"id":"id-139","style":{"fontWeight":"bold"},"text":"Proposition C.1. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Given a matrix ","element":"span"},{"style":{"height":16.78},"width":561.7,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/14-0.png","element":"img","alt":" B ∈ Rm×m satisfying ∥B∥∞ ≤ γ","inline":true},{"style":{"fontStyle":"italic"},"text":", the matrix ","element":"span"},{"style":{"height":10.8},"width":99.08,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/14-1.png","element":"img","alt":" I − B","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is invertible, and the moduli of all its eigenvalues lie in the interval ","element":"span"},{"style":{"height":16},"width":234.81,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/14-2.png","element":"img","alt":" [1 − γ, 1 + γ].","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Denote the entries of ","element":"span"},{"style":{"height":18.55},"width":696.55,"height":46.37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/14-3.png","element":"img","alt":" B and I − B by {b_ij}_{i,j=1}^m and {l_ij}_{i,j=1}^m","inline":true},{"text":", respectively. 
Set ","element":"span"},{"style":{"height":11.6},"width":100.95,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/14-4.png","element":"img","alt":" λ ∈ C","inline":true,"padRight":true},{"text":"as an eigenvalue of ","element":"span"},{"style":{"height":10.8},"width":100.35,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/14-5.png","element":"img","alt":"I − B","inline":true},{"text":", and consider the Gershgorin circle theorem, for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"2","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . . , m","element":"span"},{"text":",","element":"span"}],[{"id":"id-137","style":{"width":"59%"},"width":1161,"height":116,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/14-6.png","element":"img"}],[{"text":"Incorporating ","element":"span"},{"href":"#id-137","style":{"height":17.19},"width":609.33,"height":42.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/14-7.png","element":"img","alt":" lii = 1 − bii and lij = −bij into (15","inline":true},{"text":"), we obtain","element":"span"}],[{"style":{"width":"25%"},"width":488,"height":117,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/14-8.png","element":"img"}],[{"text":"The absolute value inequality leads to","element":"span"}],[{"style":{"width":"36%"},"width":347,"height":117,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/14-9.png","element":"img"}],[{"text":"which completes the proof since","element":"span"}],[{"id":"id-138","style":{"height":16.79},"width":808.5,"height":41.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/14-10.png","element":"img","alt":"Definition C.2. 
A matrix P ∈ Rm×m = (pij)","inline":true,"padRight":true},{"text":"is defined to be a transition matrix if all entries are non-negative","element":"span"}],[{"text":"and each row sums to ","element":"span"},{"text":"1","element":"span"},{"text":", i.e., for any ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"2","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . . , m","element":"span"},{"text":",","element":"span"}],[{"id":"id-140","style":{"fontWeight":"bold"},"text":"Proposition C.3. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A transition matrix ","element":"span"},{"style":{"height":16.57},"width":322.88,"height":41.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/14-11.png","element":"img","alt":" P ∈ Rm×m = (pij)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"always has an eigenvalue of ","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":", while all the remaining","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"eigenvalues have absolute values less than or equal to ","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Definition C.4. 
","element":"span"},{"text":"A matrix ","element":"span"},{"style":{"height":16.79},"width":335.91,"height":41.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/14-12.png","element":"img","alt":" X = (xij) ∈ Rm×m","inline":true,"padRight":true},{"text":"is called non-negative if all the entries are equal to or greater than zero, i.e.,","element":"span"}],[{"style":{"height":15.99},"width":451.47,"height":39.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/14-13.png","element":"img","alt":"xij ≥ 0, for 1 ≤ i, j ≤ m.","inline":true,"padRight":true},{"style":{"fontWeight":"bold"},"text":"Definition C.5. ","element":"span"},{"text":"A matrix ","element":"span"},{"style":{"height":12.58},"width":191.08,"height":31.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/14-14.png","element":"img","alt":" A ∈ Rm×m","inline":true,"padRight":true},{"text":"is an ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M","element":"span"},{"text":"-matrix if it can be expressed in the form ","element":"span"},{"style":{"height":14.8},"width":211.54,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/14-15.png","element":"img","alt":" A = µI − B","inline":true},{"text":", where ","element":"span"},{"style":{"height":16.79},"width":162.02,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/14-16.png","element":"img","alt":"B = (bij)","inline":true,"padRight":true},{"text":"is non-negative, and ","element":"span"},{"style":{"height":14},"width":97.14,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/14-17.png","element":"img","alt":" µ ≥ 0","inline":true,"padRight":true},{"text":"is greater than or equal to the modulus of every eigenvalue of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"B","element":"span"},{"text":". 
","element":"span"},{"id":"id-145","style":{"height":16.79},"width":1945.46,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/14-18.png","element":"img","alt":"Proposition C.6. Given a constant γ ∈ (0, 1) and a transition matrix P ∈ Rm×m = (pij), the matrix","inline":true},{"style":{"height":15.6},"width":1945.05,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/14-19.png","element":"img","alt":"A = I − γP is an M-matrix with the modulus of every eigenvalue in [1 − γ, 1 + γ]. Additionally, the inverse of A is","inline":true}],[{"style":{"width":"14%"},"width":280,"height":111,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/14-20.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"which is a non-negative matrix whose diagonal entries are all greater than or equal to ","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"Combined with Definition ","element":"span"},{"href":"#id-138","text":"C.2 ","element":"a"},{"text":"which reveals ","element":"span"},{"style":{"height":16.78},"width":177.24,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/14-21.png","element":"img","alt":" ∥P∥∞ = 1","inline":true,"padRight":true},{"text":"and Proposition ","element":"span"},{"href":"#id-139","text":"C.1","element":"a"},{"text":", it follows that any ","element":"span"},{"style":{"height":11.6},"width":198.74,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/14-22.png","element":"img","alt":" λ ∈ C as an","inline":true,"padRight":true},{"text":"eigenvalue of ","element":"span"},{"style":{"height":16},"width":495.29,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/14-23.png","element":"img","alt":" A satisfies |λ| ∈ [1 − γ, 1 + γ]","inline":true},{"text":", yielding that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"text":"is invertible. 
Since ","element":"span"},{"style":{"height":16},"width":254.68,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/14-24.png","element":"img","alt":" ρ(γP) = γ < 1","inline":true,"padRight":true},{"text":"by Proposition ","element":"span"},{"href":"#id-140","text":"C.3","element":"a"},{"text":", the infinite series ","element":"span"},{"style":{"height":18},"width":183.08,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/14-25.png","element":"img","alt":"∑∞t=0 γtP t","inline":true,"padRight":true},{"text":"converges and coincides with the inverse of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A","element":"span"},{"text":", that is,","element":"span"}],[{"id":"id-141","style":{"width":"57%"},"width":1114,"height":110,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/14-26.png","element":"img"}],[{"text":"Coupled with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P ","element":"span"},{"text":"being non-negative and ","element":"span"},{"href":"#id-141","style":{"height":19.61},"width":270.19,"height":49.03,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/14-27.png","element":"img","alt":" (γP)0 = I, (16)","inline":true,"padRight":true},{"text":"implies that ","element":"span"},{"style":{"height":13.38},"width":70.79,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/14-28.png","element":"img","alt":" A−1 ","inline":true,"padRight":true},{"text":"is non-negative with all diagonal entries greater than or equal to ","element":"span"},{"text":"1","element":"span"},{"text":".","element":"span"}],[{"text":"The next proposition serves as a technical tool in the later analysis.","element":"span"}],[{"id":"id-149","style":{"height":22.25},"width":880.78,"height":55.62,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/15-0.png","element":"img","alt":"Proposition C.7. 
For any π1, π2 ∈ R|S||A|+ ∩ ∆|S|","inline":true},{"style":{"fontStyle":"italic"},"text":", the vector norms satisfy","element":"span"}],[{"style":{"width":"30%"},"width":587,"height":102,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/15-1.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"It suffices to show for any ","element":"span"},{"style":{"height":16},"width":614.6,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/15-2.png","element":"img","alt":" 0 < a, b < 1, |a − b| ≤ |log a − log b|","inline":true},{"text":". Without loss of generality, suppose that ","element":"span"},{"style":{"height":13.2},"width":237.49,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/15-3.png","element":"img","alt":"0 < b ≤ a < 1","inline":true},{"text":". Consider the function ","element":"span"},{"style":{"height":16},"width":299.56,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/15-4.png","element":"img","alt":" G (z) := z − log z","inline":true},{"text":", with the derivative","element":"span"}],[{"style":{"width":"31%"},"width":608,"height":82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/15-5.png","element":"img"}],[{"text":"which means ","element":"span"},{"style":{"fontStyle":"italic"},"text":"G","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"z","element":"span"},{"text":") ","element":"span"},{"text":"is monotonically decreasing on ","element":"span"},{"text":"(0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1)","element":"span"},{"text":". 
It leads to ","element":"span"},{"style":{"height":14},"width":345.19,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/15-6.png","element":"img","alt":" a − log a ≤ b − log b","inline":true},{"text":", equivalently, ","element":"span"},{"style":{"height":13.2},"width":130.03,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/15-7.png","element":"img","alt":" a − b ≤","inline":true},{"style":{"height":14},"width":214.1,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/15-8.png","element":"img","alt":"log a − log b.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"D ","element":"span"},{"style":{"fontWeight":"bold"},"text":"PROOF IN SECTION ","element":"span"},{"style":{"fontWeight":"bold"},"text":"4","element":"span"}],[{"text":"The structure of ","element":"span"},{"style":{"height":16},"width":97.41,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/15-9.png","element":"img","alt":" φ(·, ·)","inline":true,"padRight":true},{"text":"facilitates the characterization of ","element":"span"},{"style":{"height":16},"width":137.9,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/15-10.png","element":"img","alt":" ∇V ∗(x)","inline":true},{"text":", as outlined in the following proposition. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proposition D.1. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"For any ","element":"span"},{"style":{"height":16},"width":258.55,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/15-11.png","element":"img","alt":" x ∈ Rn, φ(x, ·)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is a contraction map, i.e., ","element":"span"},{"style":{"height":16},"width":395.07,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/15-12.png","element":"img","alt":" ∥∇vφ(x, v)∥∞ = γ < 1","inline":true},{"style":{"fontStyle":"italic"},"text":", and the matrix ","element":"span"},{"style":{"height":16},"width":240.29,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/15-13.png","element":"img","alt":"I − ∇vφ(x, v)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is invertible. Consequently, ","element":"span"},{"style":{"height":15.6},"width":103.38,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/15-14.png","element":"img","alt":" V ∗(x)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is the unique fixed point of ","element":"span"},{"style":{"height":15.6},"width":107.81,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/15-15.png","element":"img","alt":" φ(x, ·)","inline":true},{"style":{"fontStyle":"italic"},"text":", with a well-defined derivative","element":"span"}],[{"id":"id-142","style":{"width":"100%"},"width":1948,"height":237,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/15-16.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"Consider the mapping ","element":"span"},{"style":{"height":17.38},"width":560.58,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/15-17.png","element":"img","alt":" F : Rn × R|S| → R|S| defined by","inline":true}],[{"style":{"width":"19%"},"width":382,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/15-18.png","element":"img"}],[{"text":"Then the partial derivative of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x, v","element":"span"},{"text":") ","element":"span"},{"text":"with respect to ","element":"span"},{"style":{"height":16},"width":378.68,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/15-19.png","element":"img","alt":" v reads I − ∇vφ(x, v)","inline":true},{"text":". Computing","element":"span"}],[{"id":"id-143","style":{"width":"84%"},"width":1649,"height":104,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/15-20.png","element":"img"}],[{"text":"leads to","element":"span"}],[{"text":"which means that, for any ","element":"span"},{"style":{"height":15.6},"width":252.21,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/15-21.png","element":"img","alt":" x ∈ Rn, φ(x, ·)","inline":true,"padRight":true},{"text":"is a contraction mapping, admitting a unique ","element":"span"},{"style":{"height":15.6},"width":572.35,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/15-22.png","element":"img","alt":" V ∗(x) such that F(x, V ∗(x)) = 0.","inline":true,"padRight":true},{"text":"Additionally, by applying Proposition ","element":"span"},{"href":"#id-139","text":"C.1","element":"a"},{"text":", we can derive that 
","element":"span"},{"style":{"height":16},"width":176.04,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/15-23.png","element":"img","alt":" ∇vF(x, v)","inline":true,"padRight":true},{"text":"is invertible for any ","element":"span"},{"style":{"height":18.18},"width":322.3,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/15-24.png","element":"img","alt":" (x, v) ∈ Rn × R|S|.","inline":true,"padRight":true},{"text":"Consequently, the implicit function theorem implies that there exists a differentiable function ","element":"span"},{"style":{"height":14.58},"width":266.61,"height":36.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/15-25.png","element":"img","alt":" V ∗ : Rn → R|S|","inline":true,"padRight":true},{"text":"such that, for any ","element":"span"},{"style":{"height":17.38},"width":364.46,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/15-26.png","element":"img","alt":" v ∈ R|S| and x ∈ Rn,","inline":true}],[{"style":{"width":"35%"},"width":699,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/15-27.png","element":"img"}],[{"text":"with the derivative ","element":"span"},{"style":{"height":16},"width":137.9,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/15-28.png","element":"img","alt":" ∇V ∗(x)","inline":true,"padRight":true},{"text":"satisfying","element":"span"}],[{"text":"Rearranging this yields the result (","element":"span"},{"href":"#id-142","text":"17","element":"a"},{"text":"). 
On the other hand, evaluate (","element":"span"},{"href":"#id-143","text":"19","element":"a"},{"text":") at ","element":"span"},{"style":{"height":16},"width":186.68,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/15-29.png","element":"img","alt":" (x, V ∗(x)),","inline":true}],[{"id":"id-144","style":{"width":"84%"},"width":1640,"height":432,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/15-30.png","element":"img"}],[{"text":"where (","element":"span"},{"href":"#id-144","text":"20","element":"a"},{"text":") comes from (","element":"span"},{"href":"#id-79","text":"3","element":"a"},{"text":") and (","element":"span"},{"href":"#id-84","text":"6","element":"a"},{"text":"), and (","element":"span"},{"href":"#id-144","text":"21","element":"a"},{"text":") follows from (","element":"span"},{"href":"#id-84","text":"5","element":"a"},{"text":"). Consequently, revisiting the definition of ","element":"span"},{"style":{"height":14.6},"width":109.94,"height":36.49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/16-0.png","element":"img","alt":"P π∗(x) ","inline":true,"padRight":true},{"text":"yields the conclusion ","element":"span"},{"style":{"height":18.6},"width":459.9,"height":46.49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/16-1.png","element":"img","alt":" ∇vφ (x, V ∗(x)) = γP π∗(x).","inline":true}],[{"style":{"fontWeight":"bold"},"text":"E ","element":"span"},{"style":{"fontWeight":"bold"},"text":"PROOFS IN SECTION ","element":"span"},{"style":{"fontWeight":"bold"},"text":"5","element":"span"}],[{"text":"In this section, we provide proofs of the propositions in Section ","element":"span"},{"text":"5","element":"span"},{"text":". 
The following proposition indicates that the implicit derivatives ","element":"span"},{"style":{"height":16},"width":138.12,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/16-2.png","element":"img","alt":" ∇V ∗(x)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":137.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/16-3.png","element":"img","alt":" ∇Q∗(x)","inline":true,"padRight":true},{"text":"can be estimated by sampling the reward gradient under the policy ","element":"span"},{"style":{"height":16},"width":107.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/16-4.png","element":"img","alt":" π∗(x).","inline":true}],[{"id":"id-147","style":{"fontWeight":"bold"},"text":"Proposition E.1. ","element":"span"},{"style":{"height":14.8},"width":611.1,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/16-5.png","element":"img","alt":" For any x ∈ Rn, s ∈ S, and a ∈ A,","inline":true}],[{"style":{"width":"56%"},"width":526,"height":257,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/16-6.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Compute the partial derivative","element":"span"}],[{"style":{"width":"55%"},"width":524,"height":100,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/16-7.png","element":"img"}],[{"text":"and substitute ","element":"span"},{"style":{"height":16},"width":309.71,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/16-8.png","element":"img","alt":" v = V ∗(x) into it,","inline":true}],[{"text":"where we adopt a derivation similar to (","element":"span"},{"href":"#id-144","text":"21","element":"a"},{"text":"). 
Applying Proposition ","element":"span"},{"href":"#id-145","text":"C.6 ","element":"a"},{"text":"to (","element":"span"},{"href":"#id-142","text":"18","element":"a"},{"text":"), we obtain","element":"span"}],[{"id":"id-146","style":{"width":"68%"},"width":1338,"height":111,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/16-9.png","element":"img"}],[{"text":"Employ the subscript ","element":"span"},{"style":{"height":11.6},"width":93.4,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/16-10.png","element":"img","alt":" s ∈ S","inline":true,"padRight":true},{"text":"to denote the corresponding row of a matrix. It follows from (","element":"span"},{"href":"#id-142","text":"17","element":"a"},{"text":"), (","element":"span"},{"href":"#id-146","text":"22","element":"a"},{"text":") and (","element":"span"},{"href":"#id-146","text":"23","element":"a"},{"text":") that","element":"span"}],[{"style":{"width":"58%"},"width":1148,"height":593,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/16-11.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16},"width":552,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/16-12.png","element":"img","alt":" P (st = s′, at = a|s0 = s, π∗ (x))","inline":true,"padRight":true},{"text":"is the probability of taking action ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"at state ","element":"span"},{"style":{"height":6.8},"width":32.68,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/16-13.png","element":"img","alt":" s′","inline":true,"padRight":true},{"text":"at time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", given that the process starts from state 
","element":"span"},{"style":{"height":9.19},"width":108.69,"height":22.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/16-14.png","element":"img","alt":" s0 = s","inline":true,"padRight":true},{"text":"in the MDP ","element":"span"},{"style":{"height":16},"width":122.04,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/16-15.png","element":"img","alt":" Mτ(x)","inline":true},{"text":". Differentiating both sides of","element":"span"}],[{"style":{"width":"32%"},"width":627,"height":86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/16-16.png","element":"img"}],[{"text":"with respect to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":", we have","element":"span"}],[{"style":{"width":"57%"},"width":537,"height":225,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/16-17.png","element":"img"}],[{"text":"Subsequently, the next proposition characterizes the hyper-gradient via fully first-order information.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Proposition E.2. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"The hyper-gradient ","element":"span"},{"style":{"height":15.6},"width":109.92,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/17-0.png","element":"img","alt":" ∇ϕ(x)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"of the bilevel reinforcement learning problem ","element":"span"},{"text":"(","element":"span"},{"href":"#id-81","text":"4","element":"a"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"with the upper-level function ","element":"span"},{"text":"(","element":"span"},{"href":"#id-97","text":"11","element":"a"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"can be computed by","element":"span"}],[{"style":{"width":"76%"},"width":1492,"height":180,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/17-1.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Recall that","element":"span"}],[{"style":{"width":"99%"},"width":939,"height":201,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/17-2.png","element":"img"}],[{"text":"It follows that","element":"span"}],[{"text":"where the first equality comes from the chain rule, the second equality is obtained by the “log-probability trick” in REINFORCE (","element":"span"},{"href":"#id-21","referenceIndex":87,"text":"Sutton and Barto","element":"a"},{"text":", ","element":"span"},{"href":"#id-21","referenceIndex":87,"text":"2018","element":"a"},{"text":"), and the last equality takes advantage of (","element":"span"},{"href":"#id-84","text":"5","element":"a"},{"text":").","element":"span"}],[{"id":"id-107","style":{"fontWeight":"bold"},"text":"F ","element":"span"},{"style":{"fontWeight":"bold"},"text":"PROPERTIES OF 
","element":"span"},{"style":{"height":19.6},"width":119.44,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/17-3.png","element":"img","alt":" V ∗(x)","inline":true,"padRight":true},{"style":{"fontWeight":"bold"},"text":"AND ","element":"span"},{"style":{"height":19.6},"width":110.02,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/17-4.png","element":"img","alt":" π∗(x)","inline":true}],[{"text":"We begin by establishing the Lipschitz properties of ","element":"span"},{"style":{"height":15.6},"width":103.42,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/17-5.png","element":"img","alt":" V ∗(x)","inline":true},{"text":", based on the results of Propositions ","element":"span"},{"href":"#id-147","text":"E.1 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-148","text":"H.5","element":"a"},{"text":". The proof of Proposition ","element":"span"},{"href":"#id-148","text":"H.5 ","element":"a"},{"text":"is deferred to Appendix ","element":"span"},{"href":"#id-108","text":"H","element":"a"},{"text":", since it needs more analytical tools, and this section primarily concentrates on the properties of ","element":"span"},{"style":{"height":16},"width":291.1,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/17-6.png","element":"img","alt":" V ∗(x) and π∗(x)","inline":true},{"text":". It is worth stressing that all results in this section","element":"span"}],[{"text":"only rely on Assumption ","element":"span"},{"href":"#id-115","text":"7.3","element":"a"},{"text":".","element":"span"}],[{"href":"#id-115","style":{"height":16},"width":1534.26,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/17-7.png","element":"img","alt":"Lemma F.1. 
Under Assumption 7.3, V ∗(x) is LV -Lipschitz continuous with LV =","inline":true}],[{"style":{"width":"66%"},"width":1303,"height":84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/17-8.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Combining the characterization from Proposition ","element":"span"},{"href":"#id-147","text":"E.1","element":"a"}],[{"style":{"width":"48%"},"width":939,"height":121,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/17-9.png","element":"img"}],[{"text":"and the boundedness of ","element":"span"},{"style":{"height":16},"width":106.57,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/17-10.png","element":"img","alt":" ∇r(x)","inline":true,"padRight":true},{"text":"revealed by Assumption ","element":"span"},{"href":"#id-115","text":"7.3 ","element":"a"},{"text":"yields","element":"span"}],[{"style":{"width":"29%"},"width":566,"height":92,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/17-11.png","element":"img"}],[{"text":"Put differently, the ","element":"span"},{"text":"2","element":"span"},{"text":"-norm of each row vector in ","element":"span"},{"style":{"height":16},"width":137.9,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/18-0.png","element":"img","alt":" ∇V ∗(x)","inline":true,"padRight":true},{"text":"does not exceed ","element":"span"},{"style":{"height":22.03},"width":58.79,"height":55.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/18-1.png","element":"img","alt":"Crx/(1−γ) ","inline":true,"padRight":true},{"text":", which means","element":"span"}],[{"style":{"width":"99%"},"width":1941,"height":205,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/18-2.png","element":"img"}],[{"text":"The Lipschitz continuity of 
","element":"span"},{"style":{"height":16},"width":96.73,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/18-3.png","element":"img","alt":" π∗(x)","inline":true,"padRight":true},{"text":"is connected to that of the value functions by the consistency condition (","element":"span"},{"href":"#id-84","text":"5","element":"a"},{"text":"). ","element":"span"},{"href":"#id-115","style":{"height":16},"width":1794.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/18-4.png","element":"img","alt":"Proposition F.2. Under Assumption 7.3, π∗(x) is Lπ-Lipschitz continuous, i.e., for any x1, x2 ∈ Rn,","inline":true}],[{"style":{"width":"78%"},"width":1537,"height":150,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/18-5.png","element":"img"}],[{"style":{"height":14.8},"width":492.23,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/18-6.png","element":"img","alt":"Proof. For any s ∈ S, a ∈ A","inline":true},{"text":", from the results of Proposition ","element":"span"},{"href":"#id-147","text":"E.1","element":"a"},{"text":",","element":"span"}],[{"style":{"width":"44%"},"width":868,"height":93,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/18-7.png","element":"img"}],[{"text":"and the consistency condition (","element":"span"},{"href":"#id-84","text":"5","element":"a"},{"text":"),","element":"span"}],[{"style":{"width":"29%"},"width":280,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/18-8.png","element":"img"}],[{"text":"it follows that","element":"span"}],[{"style":{"width":"71%"},"width":669,"height":157,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/18-9.png","element":"img"}],[{"text":"The shape of ","element":"span"},{"style":{"height":17.38},"width":324.2,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/18-10.png","element":"img","alt":" π ∈ 
R|S||A| implies","inline":true}],[{"style":{"width":"47%"},"width":442,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/18-11.png","element":"img"}],[{"text":"Applying Proposition ","element":"span"},{"href":"#id-149","text":"C.7 ","element":"a"},{"text":"yields the conclusion.","element":"span"}],[{"text":"We derive the Lipschitz constant ","element":"span"},{"style":{"height":13.19},"width":68.8,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/18-12.png","element":"img","alt":" LV 1","inline":true,"padRight":true},{"text":"in Proposition ","element":"span"},{"href":"#id-148","text":"H.5 ","element":"a"},{"text":"focusing on the smoothness of each ","element":"span"},{"style":{"height":16},"width":104.43,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/18-13.png","element":"img","alt":" V ∗s (x)","inline":true},{"text":", which lays a ","element":"span"},{"text":"foundation for the subsequent proposition. The following Lipschitz constant ","element":"span"},{"style":{"height":17.78},"width":68.81,"height":44.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/18-14.png","element":"img","alt":" LMV 1 ","inline":true,"padRight":true},{"text":"is denoted with the superscript ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"to emphasize that we consider the continuity of the whole matrix ","element":"span"},{"style":{"height":16},"width":234.8,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/18-15.png","element":"img","alt":" ∇V ∗(x) here.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Proposition F.3. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under Assumption ","element":"span"},{"href":"#id-115","style":{"fontStyle":"italic"},"text":"7.3","element":"a"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"style":{"height":17.77},"width":230.01,"height":44.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/18-16.png","element":"img","alt":" V ∗(x) is LMV 1","inline":true},{"style":{"fontStyle":"italic"},"text":"-Lipschitz smooth, i.e., for any ","element":"span"},{"style":{"height":14},"width":210.2,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/18-17.png","element":"img","alt":" x1, x2 ∈ Rn,","inline":true}],[{"style":{"width":"69%"},"width":1346,"height":177,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/18-18.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Proposition ","element":"span"},{"href":"#id-148","text":"H.5 ","element":"a"},{"text":"implies that for any ","element":"span"},{"style":{"height":14},"width":105.51,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/18-19.png","element":"img","alt":" s ∈ S,","inline":true}],[{"style":{"width":"100%"},"width":940,"height":238,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/18-20.png","element":"img"}],[{"text":"In this way, we can choose","element":"span"}],[{"style":{"width":"49%"},"width":462,"height":121,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/18-21.png","element":"img"}],[{"id":"id-88","style":{"width":"99%"},"width":1944,"height":707,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/19-0.png","element":"img"}],[{"id":"id-87","style":{"fontWeight":"bold"},"text":"G ","element":"span"},{"style":{"fontWeight":"bold"},"text":"ANALYSIS OF M-SOBIRL","element":"span"}],[{"text":"Motivated by the hyper-gradient,","element":"span"}],[{"text":"we propose 
the model-based algorithm, M-SoBiRL; see Algorithm ","element":"span"},{"href":"#id-88","text":"2","element":"a"},{"text":". In practice, the special structure of RL facilitates the computation. Specifically, the matrix ","element":"span"},{"style":{"height":16},"width":331.43,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/19-1.png","element":"img","alt":" Ak = (I − γP πk)⊤","inline":true,"padRight":true},{"text":"is usually sparse in RL, and the calculation of ","element":"span"},{"style":{"height":16},"width":515.23,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/19-2.png","element":"img","alt":" bk = U ⊤diag(πk)∇πf(xk, πk)","inline":true,"padRight":true},{"text":"also involves a sparse constant matrix ","element":"span"},{"style":{"fontStyle":"italic"},"text":"U ","element":"span"},{"id":"id-86","text":"and a diagonal matrix ","element":"span"},{"style":{"height":16},"width":147.32,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/19-3.png","element":"img","alt":"diag(πk)","inline":true},{"text":". Therefore, the multiplications ","element":"span"},{"style":{"height":16.12},"width":287.51,"height":40.29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/19-4.png","element":"img","alt":" A⊤k Ak and A⊤k bk","inline":true,"padRight":true},{"text":"for updating ","element":"span"},{"style":{"height":9.19},"width":45.53,"height":22.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/19-5.png","element":"img","alt":" wk","inline":true,"padRight":true},{"text":"in line 4 can be accelerated by sparse ","element":"span"},{"text":"multiplication oracles, the cost of which is denoted as ","element":"span"},{"style":{"height":11.59},"width":102.46,"height":28.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/19-6.png","element":"img","alt":" csparse","inline":true},{"text":". 
Additionally, evaluating ","element":"span"},{"style":{"height":18.03},"width":59.21,"height":45.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/19-7.png","element":"img","alt":"ˆ∇ϕ","inline":true,"padRight":true},{"text":"and updating ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"in line 5 cost ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"|S||A|","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":")","element":"span"},{"text":". Moreover, the other lines in Algorithm ","element":"span"},{"href":"#id-88","text":"2 ","element":"a"},{"text":"cost a total of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"|S||A|","element":"span"},{"text":") ","element":"span"},{"text":"since they are element-wise operations on ","element":"span"},{"style":{"height":14},"width":153.45,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/19-8.png","element":"img","alt":" Qk or Vk","inline":true},{"text":". 
In summary, the computation cost of each outer iteration is ","element":"span"},{"style":{"height":16.79},"width":355.27,"height":41.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/19-9.png","element":"img","alt":" O(|S||A|n + csparse).","inline":true}],[{"text":"Next, we focus on the convergence analysis for M-SoBiRL, starting with a short proof sketch for guidance.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"G.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof Sketch of Theorem ","element":"span"},{"href":"#id-150","style":{"fontWeight":"bold"},"text":"7.4","element":"a"}],[{"style":{"width":"99%"},"width":939,"height":154,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/19-10.png","element":"img"}],[{"text":"The proof is structured into four main steps.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Step 1: Preliminary Properties ","element":"span"},{"text":"Appendix ","element":"span"},{"href":"#id-151","text":"G.2 ","element":"a"},{"text":"studies the Lipschitz and boundedness properties of the quantities related to Algorithm ","element":"span"},{"href":"#id-88","text":"2","element":"a"},{"text":", laying a foundation for the following analysis.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Step 2: Upper-bounding the Residual ","element":"span"},{"style":{"height":17.31},"width":200.82,"height":43.28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/19-11.png","element":"img","alt":" ∥wk − w∗k∥2","inline":true,"padRight":true},{"text":"Appendix ","element":"span"},{"href":"#id-152","text":"G.3 ","element":"a"},{"text":"bounds the term ","element":"span"},{"style":{"height":17.31},"width":213.71,"height":43.28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/19-12.png","element":"img","alt":" ∥wk − w∗k∥2,","inline":true}],[{"text":"where the 
coefficient","element":"span"},{"style":{"height":28.8},"width":243.12,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/19-13.png","element":"img","alt":"(1 − η(1−γ)²/2)","inline":true},{"text":"implies its descent property, and the constants ","element":"span"},{"style":{"height":14},"width":137.4,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/19-14.png","element":"img","alt":" Cσπ, Lw","inline":true,"padRight":true},{"text":"will be specified later.","element":"span"}],[{"style":{"width":"90%"},"width":1759,"height":226,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/20-0.png","element":"img"}],[{"text":"with the contraction factor ","element":"span"},{"style":{"height":16.98},"width":49.84,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/20-1.png","element":"img","alt":" γN ","inline":true,"padRight":true},{"text":"responsible for the convergence property.","element":"span"}],[{"style":{"width":"74%"},"width":1446,"height":171,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/20-2.png","element":"img"}],[{"text":"where the coefficients ","element":"span"},{"style":{"height":14},"width":40.43,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/20-3.png","element":"img","alt":" ζw","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":15.59},"width":42.43,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/20-4.png","element":"img","alt":" ζQ","inline":true,"padRight":true},{"text":"are to be determined later, Appendix ","element":"span"},{"href":"#id-153","text":"G.5 ","element":"a"},{"text":"evaluates ","element":"span"},{"style":{"height":14.79},"width":181.54,"height":36.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/20-5.png","element":"img","alt":" Lk+1 − Lk","inline":true,"padRight":true},{"text":"to 
reveal the decreasing property of ","element":"span"},{"style":{"height":13.19},"width":44.49,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/20-6.png","element":"img","alt":" Lk","inline":true},{"text":". As a result, it proves that M-SoBiRL enjoys the convergence rate ","element":"span"},{"style":{"height":17.38},"width":123.6,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/20-7.png","element":"img","alt":" O(ϵ−1)","inline":true,"padRight":true},{"text":"with the inner iteration number ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(1) ","element":"span"},{"text":"independent of the solution accuracy ","element":"span"},{"style":{"height":2},"width":27.18,"height":5,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/20-8.png","element":"img","alt":" ϵ.","inline":true}],[{"id":"id-151","style":{"fontWeight":"bold"},"text":"G.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Lipschitz Properties of Quantities Related to Algorithm ","element":"span"},{"href":"#id-88","style":{"fontWeight":"bold"},"text":"2","element":"a"}],[{"text":"In this subsection, we study the Lipschitz and boundedness properties of the quantities related to Algorithm ","element":"span"},{"href":"#id-88","text":"2","element":"a"},{"text":", setting the groundwork for convergence analysis.","element":"span"}],[{"id":"id-161","style":{"width":"77%"},"width":1500,"height":162,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/20-9.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"The element of ","element":"span"},{"style":{"height":16.2},"width":181.34,"height":40.49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/20-10.png","element":"img","alt":" P π1 − P π2 ","inline":true,"padRight":true},{"text":"in position ","element":"span"},{"style":{"height":19.2},"width":506.11,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/20-11.png","element":"img","alt":" (s, s′) is Σa (π1sa − π2sa) Psas′","inline":true},{"text":". Consider the ","element":"span"},{"text":"2","element":"span"},{"text":"-norm of the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":"-th row in ","element":"span"},{"style":{"height":19.4},"width":196.3,"height":48.49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/20-12.png","element":"img","alt":" P π1 − P π2,","inline":true}],[{"style":{"width":"99%"},"width":1943,"height":675,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/20-13.png","element":"img"}],[{"id":"id-156","style":{"height":14.99},"width":525.79,"height":37.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/20-14.png","element":"img","alt":"Lemma G.2. U ∈ R|S||A|×|S| ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"has full column rank, and","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"Define, for any ","element":"span"},{"style":{"height":15.2},"width":371.27,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/20-15.png","element":"img","alt":" a ∈ A, Ua = I − γPa","inline":true},{"text":", where ","element":"span"},{"style":{"height":16.57},"width":231.44,"height":41.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/20-16.png","element":"img","alt":" Pa ∈ R|S|×|S|","inline":true,"padRight":true},{"text":"with ","element":"span"},{"style":{"height":16},"width":283.07,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/20-17.png","element":"img","alt":" Pa(s, s′) = Psas′","inline":true},{"text":". By the triangle inequality, ","element":"span"},{"style":{"height":16.78},"width":399.01,"height":41.95,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/20-18.png","element":"img","alt":" ∥Ua∥∞ ∈ [1 − γ, 1 + γ]","inline":true},{"text":". Subsequently, the property of the matrix norms, ","element":"span"},{"style":{"height":19.2},"width":390.71,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/20-19.png","element":"img","alt":" ∥A∥2 ≤ √(∥A∥1 ∥A∥∞)","inline":true},{"text":", produces","element":"span"}],[{"id":"id-154","style":{"width":"67%"},"width":1324,"height":68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/20-20.png","element":"img"}],[{"text":"Applying Proposition ","element":"span"},{"href":"#id-139","text":"C.1 ","element":"a"},{"text":"to ","element":"span"},{"style":{"height":13.19},"width":44.21,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/20-21.png","element":"img","alt":" Ua","inline":true,"padRight":true},{"text":"shows that the magnitude of every eigenvalue of 
","element":"span"},{"style":{"height":13.19},"width":44.21,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/20-22.png","element":"img","alt":" Ua","inline":true,"padRight":true},{"text":"belongs to ","element":"span"},{"style":{"height":16},"width":223.46,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/20-23.png","element":"img","alt":" [1 − γ, 1 + γ]","inline":true},{"text":". The property that ","element":"span"},{"style":{"height":16.78},"width":102.34,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/20-24.png","element":"img","alt":" ∥Ua∥2","inline":true,"padRight":true},{"text":"is no smaller than the magnitude of any of its eigenvalues leads to","element":"span"}],[{"id":"id-155","style":{"width":"56%"},"width":1098,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/20-25.png","element":"img"}],[{"text":"The definition of ","element":"span"},{"style":{"height":13.19},"width":44.21,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/21-0.png","element":"img","alt":" Ua","inline":true,"padRight":true},{"text":"reveals that the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"|S| ","element":"span"},{"text":"rows of the matrix ","element":"span"},{"style":{"height":13.19},"width":44.22,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/21-1.png","element":"img","alt":" Ua","inline":true,"padRight":true},{"text":"are distributed in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"U ","element":"span"},{"text":"with a spacing of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"|A|","element":"span"},{"text":". 
In this way, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"U ","element":"span"},{"text":"has full column rank since ","element":"span"},{"style":{"height":13.19},"width":44.21,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/21-2.png","element":"img","alt":" Ua","inline":true,"padRight":true},{"text":"is invertible. Consider the singular value of ","element":"span"},{"style":{"height":14.99},"width":259.02,"height":37.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/21-3.png","element":"img","alt":" U ∈ R|S||A|×|S|","inline":true},{"text":". Specifically,","element":"span"}],[{"style":{"width":"48%"},"width":948,"height":121,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/21-4.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16},"width":107.84,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/21-5.png","element":"img","alt":" Ua(s′)","inline":true,"padRight":true},{"text":"denotes the ","element":"span"},{"style":{"height":13.59},"width":260.1,"height":33.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/21-6.png","element":"img","alt":" s′-th row of Ua","inline":true},{"text":". 
Combining (","element":"span"},{"href":"#id-154","text":"25","element":"a"},{"text":") with (","element":"span"},{"href":"#id-155","text":"26","element":"a"},{"text":"), it follows that","element":"span"}],[{"style":{"width":"99%"},"width":1941,"height":279,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/21-7.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Under Assumptions ","element":"span"},{"href":"#id-110","style":{"fontStyle":"italic"},"text":"7.1 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"href":"#id-115","style":{"fontStyle":"italic"},"text":"7.3","element":"a"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"style":{"height":16},"width":187.12,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/21-8.png","element":"img","alt":" ϑ (x) is Lϑ","inline":true},{"style":{"fontStyle":"italic"},"text":"-Lipschitz continuous, i.e., for any ","element":"span"},{"style":{"height":14},"width":210.2,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/21-9.png","element":"img","alt":" x1, x2 ∈ Rn,","inline":true}],[{"style":{"width":"65%"},"width":1282,"height":136,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/21-10.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"Given the boundedness of ","element":"span"},{"style":{"height":16},"width":247.79,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/21-11.png","element":"img","alt":" ∇πf(x, π∗(x))","inline":true,"padRight":true},{"text":"from Assumption ","element":"span"},{"href":"#id-110","text":"7.1 ","element":"a"},{"text":"and the boundedness of ","element":"span"},{"style":{"height":16.78},"width":87.4,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/21-12.png","element":"img","alt":" ∥U∥2","inline":true,"padRight":true},{"text":"derived in Lemma ","element":"span"},{"href":"#id-156","text":"G.2","element":"a"},{"text":", it follows that","element":"span"}],[{"style":{"width":"80%"},"width":1571,"height":467,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/21-13.png","element":"img"}],[{"text":"Consequently, we concentrate on the Lipschitz property of the hyper-objective ","element":"span"},{"style":{"height":16},"width":78.12,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/21-14.png","element":"img","alt":" ϕ(x)","inline":true},{"text":", which plays a vital role in analyzing the convergence properties of Algorithm ","element":"span"},{"href":"#id-88","text":"2","element":"a"},{"text":". The superscript ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"is appended to the Lipschitz constant, ","element":"span"},{"style":{"height":20.3},"width":59.12,"height":50.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/21-15.png","element":"img","alt":"LMϕ ","inline":true,"padRight":true},{"text":"to make a distinction from the constant used in the model-free scenario.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Proposition G.4. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under Assumptions ","element":"span"},{"href":"#id-110","style":{"fontStyle":"italic"},"text":"7.1 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"href":"#id-115","style":{"fontStyle":"italic"},"text":"7.3","element":"a"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"style":{"height":20.3},"width":223.86,"height":50.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/21-16.png","element":"img","alt":" ∇ϕ(x) is LMϕ ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"-Lipschitz continuous, i.e., for any ","element":"span"},{"style":{"height":14},"width":210.19,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/21-17.png","element":"img","alt":" x1, x2 ∈ Rn,","inline":true}],[{"style":{"width":"35%"},"width":684,"height":53,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/21-18.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"height":28.85},"width":385.08,"height":72.12,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/21-19.png","element":"img","alt":" LMϕ = O�|S|3/2|A|3/2(1−γ)3 �","inline":true},{"style":{"fontStyle":"italic"},"text":"is specified in ","element":"span"},{"text":"(","element":"span"},{"href":"#id-157","text":"31","element":"a"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"Revisit the expression of ","element":"span"},{"style":{"height":16},"width":111.22,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/21-20.png","element":"img","alt":" ∇ϕ(x)","inline":true,"padRight":true},{"text":"and incorporate the notation ","element":"span"},{"style":{"height":16},"width":88.34,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/21-21.png","element":"img","alt":" ϑ(x),","inline":true}],[{"id":"id-158","style":{"width":"88%"},"width":1724,"height":148,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/21-22.png","element":"img"}],[{"text":"Consider the first term in (","element":"span"},{"href":"#id-158","text":"27","element":"a"},{"text":"),","element":"span"}],[{"id":"id-159","style":{"width":"33%"},"width":318,"height":162,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/21-23.png","element":"img"}],[{"text":"Subsequent analysis involves the Lipschitz property of the second term,","element":"span"}],[{"id":"id-160","style":{"width":"99%"},"width":1943,"height":1006,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/22-0.png","element":"img"}],[{"text":"Collecting (","element":"span"},{"href":"#id-159","text":"28","element":"a"},{"text":"), (","element":"span"},{"href":"#id-160","text":"29","element":"a"},{"text":") and (","element":"span"},{"href":"#id-160","text":"30","element":"a"},{"text":") concludes that ","element":"span"},{"style":{"height":20.3},"width":223.19,"height":50.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/22-1.png","element":"img","alt":" ∇ϕ(x) is LMϕ ","inline":true,"padRight":true},{"text":"-Lipschitz continuous with","element":"span"}],[{"id":"id-157","style":{"width":"81%"},"width":1589,"height":306,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/22-2.png","element":"img"}],[{"text":"where 
(","element":"span"},{"href":"#id-157","text":"31","element":"a"},{"text":") comes from the expressions of ","element":"span"},{"style":{"height":17.78},"width":293.88,"height":44.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/22-3.png","element":"img","alt":" Lπ, Lϑ and LMV 1.","inline":true}],[{"id":"id-152","style":{"fontWeight":"bold"},"text":"G.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Convergence Property of ","element":"span"},{"style":{"height":9.19},"width":45.53,"height":22.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/22-4.png","element":"img","alt":" wk","inline":true}],[{"text":"In Algorithm ","element":"span"},{"href":"#id-88","text":"2","element":"a"},{"text":", the goal of ","element":"span"},{"style":{"height":9.19},"width":45.53,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/22-5.png","element":"img","alt":" wk","inline":true,"padRight":true},{"text":"is to track ","element":"span"},{"style":{"height":20.13},"width":265.36,"height":50.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/22-6.png","element":"img","alt":" w∗k = (A∗k)−1 b∗k ","inline":true,"padRight":true},{"text":"through outer iterations. This section delves into the ","element":"span"},{"text":"descent property of ","element":"span"},{"style":{"height":17.31},"width":200.82,"height":43.28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/22-7.png","element":"img","alt":" ∥wk − w∗k∥2","inline":true},{"text":". 
Initially, we prove that the exact quantities ","element":"span"},{"style":{"height":17.31},"width":92.56,"height":43.28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/22-8.png","element":"img","alt":" ∥b∗k∥2","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":17.31},"width":103.99,"height":43.28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/22-9.png","element":"img","alt":" ∥w∗k∥2","inline":true,"padRight":true},{"text":"are uniformly ","element":"span"},{"text":"bounded.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma G.5. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under Assumptions ","element":"span"},{"href":"#id-110","style":{"fontStyle":"italic"},"text":"7.1 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"href":"#id-115","style":{"fontStyle":"italic"},"text":"7.3","element":"a"},{"style":{"fontStyle":"italic"},"text":", it holds that","element":"span"}],[{"style":{"width":"26%"},"width":520,"height":170,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/22-10.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Taking into account ","element":"span"},{"style":{"height":19.22},"width":1079.94,"height":48.05,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/22-11.png","element":"img","alt":" ∥U∥2 ≤�|S| |A| (1 + γ), π∗ ∈ ∆|S|, and ∇πf(x, π∗(x)) ≤ Cfπ","inline":true},{"text":", we can obtain","element":"span"}],[{"style":{"width":"36%"},"width":719,"height":178,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/22-12.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16.51},"width":212.5,"height":41.28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/23-0.png","element":"img","alt":" π∗k = π∗(xk)","inline":true},{"text":". 
Additionally, denote the eigenvalue of ","element":"span"},{"style":{"height":16.11},"width":46.89,"height":40.28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/23-1.png","element":"img","alt":" A∗k","inline":true,"padRight":true},{"text":"with the smallest modulus by ","element":"span"},{"style":{"height":16.51},"width":172.13,"height":41.28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/23-2.png","element":"img","alt":" λmin (A∗k)","inline":true},{"text":". Proposition ","element":"span"},{"href":"#id-145","text":"C.6 ","element":"a"},{"text":"reveals ","element":"span"},{"style":{"height":16.51},"width":315.39,"height":41.28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/23-3.png","element":"img","alt":" λmin (A∗k) ≥ 1 − γ","inline":true},{"text":". In this way, ","element":"span"},{"style":{"height":17.31},"width":409.3,"height":43.27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/23-4.png","element":"img","alt":" ∥A∗k∥2 ≥ λmin (A∗k) and","inline":true}],[{"style":{"width":"26%"},"width":512,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/23-5.png","element":"img"}],[{"id":"id-162","text":"The proof is completed by","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"Drawing on Lemma ","element":"span"},{"href":"#id-161","text":"G.1 ","element":"a"},{"text":"and the boundedness of ","element":"span"},{"style":{"height":16.11},"width":127.71,"height":40.28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/23-6.png","element":"img","alt":" Ak, A∗k,","inline":true}],[{"style":{"width":"86%"},"width":1686,"height":536,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/23-7.png","element":"img"}],[{"text":"Moreover, the Lipschitz and boundedness properties related to ","element":"span"},{"style":{"height":14},"width":56.21,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/23-8.png","element":"img","alt":" ∇f","inline":true,"padRight":true},{"text":"in Assumption ","element":"span"},{"href":"#id-110","text":"7.1 ","element":"a"},{"text":"reveal that","element":"span"}],[{"style":{"width":"84%"},"width":1651,"height":333,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/23-9.png","element":"img"}],[{"text":"The following lemma illustrates the descent property of ","element":"span"},{"style":{"height":20.93},"width":200.82,"height":52.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/23-10.png","element":"img","alt":" ∥wk − w∗k∥22","inline":true}],[{"id":"id-175","style":{"fontWeight":"bold"},"text":"Lemma G.7. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under Assumption ","element":"span"},{"href":"#id-110","style":{"fontStyle":"italic"},"text":"7.1 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"href":"#id-115","style":{"fontStyle":"italic"},"text":"7.3","element":"a"},{"style":{"fontStyle":"italic"},"text":", the iterates ","element":"span"},{"style":{"height":16},"width":239.19,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/23-11.png","element":"img","alt":" {(xk, πk, wk)}","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"generated by Algorithm ","element":"span"},{"href":"#id-88","style":{"fontStyle":"italic"},"text":"2 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"satisfy","element":"span"}],[{"style":{"width":"79%"},"width":1556,"height":256,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/23-12.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"with","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"By the identity ","element":"span"},{"style":{"height":20.4},"width":562.72,"height":50.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/24-0.png","element":"img","alt":" ∥a − b∥22 = ∥a∥2 + ∥b∥2 − 2 ⟨a, b⟩","inline":true},{"text":", we establish","element":"span"}],[{"id":"id-163","style":{"width":"99%"},"width":1944,"height":178,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/24-1.png","element":"img"}],[{"text":"which is the update direction of ","element":"span"},{"style":{"height":9.19},"width":87.05,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/24-2.png","element":"img","alt":" wk−1","inline":true},{"text":", and correspondingly, use the following notation for reference,","element":"span"}],[{"style":{"width":"99%"},"width":1944,"height":218,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/24-3.png","element":"img"}],[{"text":"The identity","element":"span"}],[{"id":"id-167","style":{"width":"52%"},"width":497,"height":128,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/24-4.png","element":"img"}],[{"text":"and the observation that ","element":"span"},{"style":{"height":15.71},"width":211.75,"height":39.28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/24-5.png","element":"img","alt":" σ∗k = 0 yield","inline":true}],[{"text":"comes from the results of Lemma ","element":"span"},{"href":"#id-162","text":"G.6","element":"a"},{"text":", and the last inequality is obtained by ","element":"span"},{"style":{"height":19.74},"width":618.84,"height":49.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/24-6.png","element":"img","alt":" ∥πk − π∗k∥2 ≤�|S| |A| ∥πk − π∗k∥∞.","inline":true}],[{"text":"Subsequently, we bound the third term in (","element":"span"},{"href":"#id-163","text":"32","element":"a"},{"text":") in a similar 
way,","element":"span"}],[{"style":{"width":"77%"},"width":1500,"height":363,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/24-7.png","element":"img"}],[{"text":"By Young’s inequality,","element":"span"}],[{"style":{"width":"78%"},"width":738,"height":220,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/24-8.png","element":"img"}],[{"id":"id-164","style":{"width":"99%"},"width":1946,"height":221,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/25-0.png","element":"img"}],[{"text":"Substituting the inequality (","element":"span"},{"href":"#id-164","text":"35","element":"a"},{"text":") into (","element":"span"},{"href":"#id-163","text":"32","element":"a"},{"text":") yields","element":"span"}],[{"id":"id-165","style":{"width":"76%"},"width":715,"height":73,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/25-1.png","element":"img"}],[{"text":"Then, we bound the term","element":"span"},{"style":{"height":20.99},"width":258.93,"height":52.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/25-2.png","element":"img","alt":"��w∗k − w∗k−1��2,","inline":true}],[{"text":"with ","element":"span"},{"style":{"height":28.8},"width":730.93,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/25-3.png","element":"img","alt":" Lw = 11−γ�Lϑ + |A|1−γ�|S| (1 + γ)CfπLπ�","inline":true},{"text":". Apply Young’s inequality to the inner product term in (","element":"span"},{"href":"#id-163","text":"32","element":"a"},{"text":"),","element":"span"}],[{"id":"id-166","style":{"width":"81%"},"width":1582,"height":415,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/25-4.png","element":"img"}],[{"text":"where we take ","element":"span"},{"style":{"height":23.64},"width":211.97,"height":59.1,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/25-5.png","element":"img","alt":" ρ2 = η(1−γ)22","inline":true,"padRight":true},{"text":". 
Assembling (","element":"span"},{"href":"#id-165","text":"36","element":"a"},{"text":"), (","element":"span"},{"href":"#id-166","text":"37","element":"a"},{"text":") and (","element":"span"},{"href":"#id-166","text":"38","element":"a"},{"text":"), we can estimate ","element":"span"},{"href":"#id-163","style":{"height":20.93},"width":344.31,"height":52.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/25-6.png","element":"img","alt":" ∥wk − w∗k∥22 in (32),","inline":true}],[{"style":{"width":"79%"},"width":1556,"height":940,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/25-7.png","element":"img"}],[{"text":"where the last inequality is obtained by incorporating","element":"span"}],[{"style":{"width":"99%"},"width":938,"height":195,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/26-0.png","element":"img"}],[{"text":"implied by (","element":"span"},{"href":"#id-167","text":"34","element":"a"},{"text":") and (","element":"span"},{"href":"#id-166","text":"37","element":"a"},{"text":"), respectively.","element":"span"}],[{"id":"id-89","style":{"fontWeight":"bold"},"text":"G.4 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Convergence Properties of ","element":"span"},{"style":{"height":14.4},"width":194.55,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/26-1.png","element":"img","alt":" πk and Qk","inline":true}],[{"text":"The softmax mapping plays a significant role in the update rules of ","element":"span"},{"style":{"height":9.19},"width":39.71,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/26-2.png","element":"img","alt":" πk","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14},"width":48.5,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/26-3.png","element":"img","alt":" Qk","inline":true},{"text":", so we begin this section by introducing some 
properties of it, following from (","element":"span"},{"href":"#id-64","referenceIndex":9,"text":"Cen et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-64","referenceIndex":9,"text":"2022","element":"a"},{"text":"). Given a ","element":"span"},{"style":{"height":14.99},"width":181.87,"height":37.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/26-4.png","element":"img","alt":" θ ∈ R|S||A|","inline":true},{"text":", for any ","element":"span"},{"style":{"height":11.6},"width":93.41,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/26-5.png","element":"img","alt":" s ∈ S","inline":true},{"text":", use ","element":"span"},{"style":{"height":16.58},"width":157.29,"height":41.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/26-6.png","element":"img","alt":"θs ∈ R|A| ","inline":true,"padRight":true},{"text":"to denote a vector with ","element":"span"},{"style":{"height":16},"width":355.86,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/26-7.png","element":"img","alt":" θs(a) = θsa = θ(s, a)","inline":true},{"text":". 
Recall the softmax mapping:","element":"span"}],[{"style":{"width":"42%"},"width":827,"height":167,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/26-8.png","element":"img"}],[{"text":"Consider a typical component, where ","element":"span"},{"style":{"height":18.97},"width":161.86,"height":47.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/26-9.png","element":"img","alt":" πq ∈ R|A| ","inline":true,"padRight":true},{"text":"is parameterized by ","element":"span"},{"style":{"height":17.39},"width":154.22,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/26-10.png","element":"img","alt":" q ∈ R|A|,","inline":true}],[{"id":"id-171","style":{"width":"22%"},"width":436,"height":98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/26-11.png","element":"img"}],[{"text":"In this way,","element":"span"}],[{"text":"where ","element":"span"},{"style":{"height":10},"width":31.79,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/26-12.png","element":"img","alt":" qc","inline":true,"padRight":true},{"text":"is a certain convex combination of ","element":"span"},{"style":{"height":14.4},"width":250.48,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/26-13.png","element":"img","alt":" q1 and q2, and","inline":true}],[{"id":"id-173","style":{"width":"40%"},"width":792,"height":91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/26-14.png","element":"img"}],[{"text":"It follows that","element":"span"}],[{"text":"Algorithm ","element":"span"},{"href":"#id-88","text":"2","element":"a"},{"text":" adopts ","element":"span"},{"style":{"height":14},"width":48.51,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/26-15.png","element":"img","alt":" Qk","inline":true,"padRight":true},{"text":"to approximate the optimal soft Q-value function of 
","element":"span"},{"style":{"height":9.19},"width":39.72,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/26-16.png","element":"img","alt":" πk","inline":true,"padRight":true},{"text":"dynamically, i.e., the environment ","element":"span"},{"style":{"height":16},"width":141.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/26-17.png","element":"img","alt":" Mτ(xk)","inline":true,"padRight":true},{"text":"based on which the value function is evaluated varies through the outer iterations. To this end, we will analyze the difference of value functions, incurred by the change of the environment parameterized by ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":". Recall that ","element":"span"},{"style":{"height":16.51},"width":745.87,"height":41.28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/26-18.png","element":"img","alt":" π∗k = π∗ (xk), Q∗k = Q∗ (xk), V ∗k = V ∗ (xk).","inline":true}],[{"id":"id-170","style":{"fontWeight":"bold"},"text":"Lemma G.8. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under Assumption ","element":"span"},{"href":"#id-111","style":{"fontStyle":"italic"},"text":"7.2","element":"a"},{"style":{"fontStyle":"italic"},"text":", given a policy ","element":"span"},{"style":{"height":14.4},"width":400.81,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/26-19.png","element":"img","alt":" π, for any x1, x2 ∈ Rn,","inline":true}],[{"style":{"width":"59%"},"width":1157,"height":92,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/26-20.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"By definitions of the soft value functions, for any ","element":"span"},{"style":{"height":14.8},"width":297.84,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/26-21.png","element":"img","alt":" s ∈ S and a ∈ A,","inline":true}],[{"id":"id-168","style":{"width":"87%"},"width":1710,"height":102,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/26-22.png","element":"img"}],[{"text":"and","element":"span"}],[{"style":{"width":"67%"},"width":637,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/27-0.png","element":"img"}],[{"text":"Substituting (","element":"span"},{"href":"#id-168","text":"41","element":"a"},{"text":") into (","element":"span"},{"href":"#id-169","text":"42","element":"a"},{"text":") yields","element":"span"}],[{"style":{"width":"57%"},"width":539,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/27-1.png","element":"img"}],[{"text":"which leads to the conclusion.","element":"span"}],[{"text":"We present the properties of the operator ","element":"span"},{"style":{"height":19.86},"width":446.2,"height":49.66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/27-2.png","element":"img","alt":" TMτ (x) : R|S||A| �→ R|S||A|","inline":true},{"text":", following from (","element":"span"},{"href":"#id-60","referenceIndex":31,"text":"Haarnoja et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-60","referenceIndex":31,"text":"2018a","element":"a"},{"text":"; ","element":"span"},{"href":"#id-64","referenceIndex":9,"text":"Cen ","element":"a"},{"href":"#id-64","referenceIndex":9,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-64","referenceIndex":9,"text":"2022","element":"a"},{"text":").","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Proposition G.9. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"The soft Bellman optimality operator ","element":"span"},{"style":{"height":19.87},"width":463.83,"height":49.66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/27-3.png","element":"img","alt":" TMτ (x) : R|S||A| �→ R|S||A|","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"associated with ","element":"span"},{"style":{"height":11.6},"width":128.87,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/27-4.png","element":"img","alt":" x ∈ Rn","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"satisfies the properties below.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The optimal soft Q-function ","element":"span"},{"style":{"height":16},"width":104.1,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/27-5.png","element":"img","alt":" Q∗(x)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is a fixed point of ","element":"span"},{"style":{"height":17.28},"width":215.55,"height":43.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/27-6.png","element":"img","alt":" TMτ (x), i.e.,","inline":true}],[{"id":"id-169","style":{"width":"59%"},"width":1168,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/27-7.png","element":"img"}],[{"style":{"height":17.28},"width":275.5,"height":43.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/27-8.png","element":"img","alt":"• TMτ (x) is a γ","inline":true},{"style":{"fontStyle":"italic"},"text":"-contraction in the ","element":"span"},{"style":{"height":5.2},"width":48.6,"height":13,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/27-9.png","element":"img","alt":" ℓ∞","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"norm, i.e., for any 
","element":"span"},{"style":{"height":17.38},"width":278.53,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/27-10.png","element":"img","alt":" Q1, Q2 ∈ R|S||A|","inline":true},{"style":{"fontStyle":"italic"},"text":", it holds that ","element":"span"},{"style":{"height":20.81},"width":1385.94,"height":52.03,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/27-11.png","element":"img","alt":"��TMτ (x)(Q1) − TMτ (x)(Q2)��∞ ≤ γ��Q1 − Q2��∞ . (44)","inline":true}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"style":{"fontStyle":"italic"},"text":"With any initial ","element":"span"},{"style":{"height":17.28},"width":353.24,"height":43.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/27-12.png","element":"img","alt":" Q0, applying TMτ (x)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"repeatedly converges to ","element":"span"},{"style":{"height":16},"width":104.1,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/27-13.png","element":"img","alt":" Q∗(x)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"linearly, i.e., for any ","element":"span"},{"style":{"height":14},"width":125.85,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/27-14.png","element":"img","alt":" N ∈ N,","inline":true},{"style":{"height":30.29},"width":856.23,"height":75.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/27-15.png","element":"img","alt":"��T NMτ (x)(Q0) − Q∗ (x)��∞ ≤ γN ��Q0 − Q∗ (x)��∞ .","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"Substitute one of the consistency conditions from (","element":"span"},{"href":"#id-51","referenceIndex":72,"text":"Nachum et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-51","referenceIndex":72,"text":"2017","element":"a"},{"text":"),","element":"span"}],[{"style":{"width":"31%"},"width":610,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/27-16.png","element":"img"}],[{"text":"into the definition of ","element":"span"},{"style":{"height":17.28},"width":133.01,"height":43.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/27-17.png","element":"img","alt":" TMτ (x),","inline":true}],[{"style":{"width":"60%"},"width":571,"height":216,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/27-18.png","element":"img"}],[{"text":"Additionally, for any ","element":"span"},{"style":{"height":17.38},"width":292,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/27-19.png","element":"img","alt":" Q1, Q2 ∈ R|S||A|,","inline":true}],[{"style":{"width":"100%"},"width":941,"height":780,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/27-20.png","element":"img"}],[{"text":"Combining the contraction property of ","element":"span"},{"href":"#id-170","style":{"height":17.28},"width":513.4,"height":43.19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/28-0.png","element":"img","alt":" TMτ (x) and Lemma G.8 yield","inline":true}],[{"style":{"width":"88%"},"width":1715,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/28-1.png","element":"img"}],[{"text":"Revisiting the update rule for ","element":"span"},{"style":{"height":14},"width":53.85,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/28-2.png","element":"img","alt":" Vk,","inline":true}],[{"text":"we apply the inequality 
(","element":"span"},{"href":"#id-171","text":"39","element":"a"},{"text":") and the consistency condition (","element":"span"},{"href":"#id-84","text":"7","element":"a"},{"text":"),","element":"span"}],[{"style":{"width":"52%"},"width":1026,"height":163,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/28-3.png","element":"img"}],[{"text":"In this way, we bound the error term,","element":"span"}],[{"id":"id-176","style":{"width":"22%"},"width":208,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/28-4.png","element":"img"}],[{"id":"id-153","style":{"fontWeight":"bold"},"text":"G.5 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Convergence Analysis of M-SoBiRL","element":"span"}],[{"text":"To begin with, some lemmas are provided for measuring the quality of the hyper-gradient estimator ","element":"span"},{"style":{"height":14},"width":59.21,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/28-5.png","element":"img","alt":"�∇ϕ","inline":true},{"text":". Following this, we proceed to the convergence analysis of Algorithm ","element":"span"},{"href":"#id-88","text":"2","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma G.10. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under Assumptions ","element":"span"},{"href":"#id-110","style":{"fontStyle":"italic"},"text":"7.1 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"href":"#id-115","style":{"fontStyle":"italic"},"text":"7.3","element":"a"},{"style":{"fontStyle":"italic"},"text":", the following inequalities hold in Algorithm ","element":"span"},{"href":"#id-88","style":{"fontStyle":"italic"},"text":"2","element":"a"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"width":"85%"},"width":1672,"height":238,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/28-6.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Drawing on the Lipschitz and boundedness properties of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"in Assumption ","element":"span"},{"href":"#id-110","text":"7.1","element":"a"},{"text":", one can establish","element":"span"}],[{"style":{"width":"75%"},"width":1467,"height":222,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/28-7.png","element":"img"}],[{"text":"Compute the partial derivative","element":"span"}],[{"style":{"width":"55%"},"width":524,"height":100,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/28-8.png","element":"img"}],[{"text":"and substitute ","element":"span"},{"style":{"height":16},"width":309.71,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/28-9.png","element":"img","alt":" v = V ∗(x) into it,","inline":true}],[{"text":"Denote the auxiliary policy generated by softmax parameterization associated with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"V ","element":"span"},{"text":", i.e., for any 
","element":"span"},{"style":{"height":15.6},"width":253.4,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/28-10.png","element":"img","alt":" (s, a) ∈ S × A,","inline":true}],[{"style":{"width":"71%"},"width":1383,"height":270,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/28-11.png","element":"img"}],[{"text":"Applying Lemma ","element":"span"},{"href":"#id-172","text":"H.2 ","element":"a"},{"text":"with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", I ","element":"span"},{"text":"= 1","element":"span"},{"text":", we obtain","element":"span"}],[{"style":{"width":"54%"},"width":513,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/29-0.png","element":"img"}],[{"text":"Collecting all rows of ","element":"span"},{"style":{"height":16},"width":320.14,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/29-1.png","element":"img","alt":" ∇xφ(x, v) leads to","inline":true}],[{"style":{"width":"37%"},"width":350,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/29-2.png","element":"img"}],[{"text":"and","element":"span"}],[{"style":{"width":"30%"},"width":285,"height":312,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/29-3.png","element":"img"}],[{"text":"where the last inequality results from (","element":"span"},{"href":"#id-173","text":"40","element":"a"},{"text":"). 
It follows that","element":"span"}],[{"text":"In the sequel, we analyze the error introduced by the estimator ","element":"span"},{"style":{"height":16},"width":250.21,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/29-4.png","element":"img","alt":"�∇ϕ (x, π, V, w)","inline":true,"padRight":true},{"text":"in approximating the true hyper-gradient ","element":"span"},{"style":{"height":16},"width":111.23,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/29-5.png","element":"img","alt":" ∇ϕ(x)","inline":true},{"text":", as stated in the following lemma.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma G.11. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under Assumptions ","element":"span"},{"href":"#id-110","style":{"fontStyle":"italic"},"text":"7.1 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"href":"#id-115","style":{"fontStyle":"italic"},"text":"7.3","element":"a"},{"style":{"fontStyle":"italic"},"text":", the hyper-gradient estimator constructed in Algorithm ","element":"span"},{"href":"#id-88","style":{"fontStyle":"italic"},"text":"2","element":"a"},{"style":{"fontStyle":"italic"},"text":",","element":"span"}],[{"style":{"width":"81%"},"width":1581,"height":82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/29-6.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"satisfies","element":"span"}],[{"style":{"width":"77%"},"width":728,"height":180,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/29-7.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"with","element":"span"}],[{"style":{"width":"34%"},"width":325,"height":58,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/29-8.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"Recall the counterpart true hyper-gradient,","element":"span"}],[{"style":{"width":"73%"},"width":694,"height":50,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/29-9.png","element":"img"}],[{"style":{"width":"99%"},"width":1943,"height":746,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/30-0.png","element":"img"}],[{"text":"Consequently, we arrive at the convergence analysis. Denote ","element":"span"},{"style":{"height":22.4},"width":775.44,"height":56.01,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/30-1.png","element":"img","alt":" δkQ := ∥Qk − Q∗k∥2∞, δkπ = ∥log πk − log π∗k∥2∞","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":20.93},"width":308.54,"height":52.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/30-2.png","element":"img","alt":"δkw := ∥wk − w∗k∥22","inline":true},{"text":". With this notation, we revisit the results established in the previous subsections.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"text":"The definition of ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"softmax ","element":"span"},{"text":"and the estimation (","element":"span"},{"href":"#id-173","text":"40","element":"a"},{"text":") reveal that","element":"span"}],[{"id":"id-174","style":{"width":"53%"},"width":1051,"height":83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/30-3.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"text":"Apply (","element":"span"},{"href":"#id-174","text":"47","element":"a"},{"text":") to the result of Lemma ","element":"span"},{"href":"#id-175","text":"G.7","element":"a"},{"text":",","element":"span"}],[{"style":{"width":"67%"},"width":630,"height":256,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/30-4.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"• 
","element":"span"},{"text":"Substitute the update rule of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"into (","element":"span"},{"href":"#id-176","text":"45","element":"a"},{"text":"),","element":"span"}],[{"style":{"width":"35%"},"width":333,"height":120,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/30-5.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"• ","element":"span"},{"text":"Incorporate (","element":"span"},{"href":"#id-176","text":"46","element":"a"},{"text":"),","element":"span"}],[{"id":"id-116","style":{"fontWeight":"bold"},"text":"Theorem G.12. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under Assumptions ","element":"span"},{"href":"#id-110","style":{"fontStyle":"italic"},"text":"7.1 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"href":"#id-115","style":{"fontStyle":"italic"},"text":"7.3","element":"a"},{"style":{"fontStyle":"italic"},"text":", in Algorithm ","element":"span"},{"href":"#id-88","style":{"fontStyle":"italic"},"text":"2","element":"a"},{"style":{"fontStyle":"italic"},"text":", we can choose constant step sizes ","element":"span"},{"style":{"height":14.8},"width":62.36,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/30-6.png","element":"img","alt":" β, η","inline":true},{"style":{"fontStyle":"italic"},"text":", and the inner iteration number ","element":"span"},{"style":{"height":16},"width":173.75,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/30-7.png","element":"img","alt":" N ∼ O(1)","inline":true},{"style":{"fontStyle":"italic"},"text":". 
Then the iterates ","element":"span"},{"style":{"height":16},"width":208.15,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/30-8.png","element":"img","alt":" {xk} satisfy","inline":true}],[{"id":"id-181","style":{"width":"26%"},"width":514,"height":119,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/30-9.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"The detailed parameter setting is given in ","element":"span"},{"text":"(","element":"span"},{"href":"#id-177","text":"56","element":"a"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"We consider the merit function","element":"span"}],[{"text":"where the coefficients ","element":"span"},{"style":{"height":15.59},"width":42.43,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/31-0.png","element":"img","alt":" ζQ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14},"width":40.43,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/31-1.png","element":"img","alt":" ζw","inline":true,"padRight":true},{"text":"are to be determined later. 
By the Lipschitz property of ","element":"span"},{"style":{"height":16},"width":78.17,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/31-2.png","element":"img","alt":" ϕ(x)","inline":true,"padRight":true},{"text":"and the update rule of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"in Algorithm ","element":"span"},{"href":"#id-88","text":"2","element":"a"},{"text":",","element":"span"}],[{"id":"id-179","style":{"width":"88%"},"width":1729,"height":350,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/31-3.png","element":"img"}],[{"text":"Subsequently, it follows that","element":"span"}],[{"id":"id-178","style":{"width":"82%"},"width":773,"height":1035,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/31-4.png","element":"img"}],[{"text":"with","element":"span"}],[{"id":"id-180","style":{"width":"55%"},"width":518,"height":663,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/31-5.png","element":"img"}],[{"text":"where (","element":"span"},{"href":"#id-178","text":"52","element":"a"},{"text":") follows from (","element":"span"},{"href":"#id-179","text":"51","element":"a"},{"text":") and (","element":"span"},{"href":"#id-180","text":"53","element":"a"},{"text":") is obtained by substituting (","element":"span"},{"href":"#id-174","text":"49","element":"a"},{"text":"), (","element":"span"},{"href":"#id-174","text":"48","element":"a"},{"text":") and (","element":"span"},{"href":"#id-181","text":"50","element":"a"},{"text":"). 
One sufficient condition","element":"span"}],[{"id":"id-182","style":{"width":"99%"},"width":1944,"height":424,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/32-0.png","element":"img"}],[{"text":"Set the step size ","element":"span"},{"style":{"height":14.8},"width":317.36,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/32-1.png","element":"img","alt":" β = ρη with ρ > 0","inline":true},{"text":". More precisely, we present the parameter configuration to guarantee (","element":"span"},{"href":"#id-182","text":"55","element":"a"},{"text":").","element":"span"}],[{"id":"id-177","style":{"width":"86%"},"width":1690,"height":392,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/32-2.png","element":"img"}],[{"text":"which means ","element":"span"},{"style":{"height":16},"width":173.75,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/32-3.png","element":"img","alt":" N ∼ O(1)","inline":true,"padRight":true},{"text":"and the step sizes can be chosen as constants. In this way, (","element":"span"},{"href":"#id-180","text":"54","element":"a"},{"text":") implies","element":"span"}],[{"style":{"width":"26%"},"width":517,"height":83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/32-4.png","element":"img"}],[{"text":"Summing and telescoping it, we have","element":"span"}],[{"style":{"width":"55%"},"width":518,"height":118,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/32-5.png","element":"img"}],[{"id":"id-108","style":{"fontWeight":"bold"},"text":"H ","element":"span"},{"style":{"fontWeight":"bold"},"text":"ANALYSIS OF SOBIRL","element":"span"}],[{"text":"In this section, we prove the convergence of Algorithm ","element":"span"},{"href":"#id-100","text":"1","element":"a"},{"text":", SoBiRL. We outline the proof sketch of Theorem ","element":"span"},{"href":"#id-99","text":"7.6 ","element":"a"},{"text":"here. 
First, we characterize the distributional drift induced by two different policies ","element":"span"},{"href":"#id-183","style":{"height":16.59},"width":485.48,"height":41.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/32-6.png","element":"img","alt":" π1, π2 in Lemma H.1. Then,","inline":true,"padRight":true},{"text":"the Lipschitz property of the hyper-gradient is established in Proposition ","element":"span"},{"href":"#id-118","text":"H.6","element":"a"},{"text":", and the quality of the hyper-gradient estimator is measured in Proposition ","element":"span"},{"href":"#id-119","text":"H.7","element":"a"},{"text":". Based on these two results, we arrive at Theorem ","element":"span"},{"href":"#id-184","text":"H.8","element":"a"},{"text":", the convergence analysis of SoBiRL.","element":"span"}],[{"text":"The next lemma, generalized from Lemma 1 in (","element":"span"},{"href":"#id-8","referenceIndex":10,"text":"Chakraborty et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-8","referenceIndex":10,"text":"2024","element":"a"},{"text":"), considers the distribution ","element":"span"},{"style":{"height":16},"width":360.77,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/32-7.png","element":"img","alt":"ρ ((d1, d2, . . . , dI) ; π)","inline":true},{"text":", with each trajectory ","element":"span"},{"style":{"height":19.91},"width":344.75,"height":49.77,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/32-8.png","element":"img","alt":" di = {(sih, aih)}H−1h=0","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"2","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . . 
, I","element":"span"},{"text":") sampled from the trajectory ","element":"span"},{"text":"distribution ","element":"span"},{"style":{"height":16},"width":207.11,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/32-9.png","element":"img","alt":" ρ (d; π), i.e.,","inline":true}],[{"id":"id-190","style":{"width":"56%"},"width":1107,"height":51,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/32-10.png","element":"img"}],[{"id":"id-183","style":{"fontWeight":"bold"},"text":"Lemma H.1. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Denote the total variation between distributions by ","element":"span"},{"style":{"height":16},"width":137.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/32-11.png","element":"img","alt":" DF (·, ·)","inline":true},{"style":{"fontStyle":"italic"},"text":". For any trajectory tuple ","element":"span"},{"style":{"height":16},"width":265.53,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/32-12.png","element":"img","alt":" (d1, d2, . . . 
, dI),","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"with each trajectory holding a finite horizon ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"style":{"fontStyle":"italic"},"text":", we have","element":"span"}],[{"style":{"width":"83%"},"width":1635,"height":82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/32-13.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"In particular, for any ","element":"span"},{"style":{"height":17.38},"width":757.43,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/32-14.png","element":"img","alt":" x1, x2 ∈ Rn, take π1 = π∗(x1), π2 = π∗(x2),","inline":true}],[{"style":{"width":"76%"},"width":1492,"height":82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/32-15.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"First, we focus on the case ","element":"span"},{"style":{"fontStyle":"italic"},"text":"I ","element":"span"},{"text":"= 1","element":"span"},{"text":", i.e., the one-trajectory distribution ","element":"span"},{"style":{"height":16},"width":121.34,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/33-0.png","element":"img","alt":" ρ (d; π)","inline":true},{"text":". Note that the total variation is the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F","element":"span"},{"text":"-divergence induced by ","element":"span"},{"style":{"height":19.37},"width":282.76,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/33-1.png","element":"img","alt":" F(x) = 1/2 |x − 1|","inline":true},{"text":". 
The triangle inequality of the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F","element":"span"},{"text":"-divergence leads to a ","element":"span"},{"text":"decomposition:","element":"span"}],[{"style":{"width":"83%"},"width":1627,"height":50,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/33-2.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"is a mixed policy that executes ","element":"span"},{"style":{"height":13.39},"width":40.15,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/33-3.png","element":"img","alt":" π2","inline":true,"padRight":true},{"text":"at the first timestep of any trajectory and follows ","element":"span"},{"style":{"height":13.39},"width":40.15,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/33-4.png","element":"img","alt":" π1","inline":true,"padRight":true},{"text":"for subsequent timesteps. 
In this way, by definition of ","element":"span"},{"style":{"height":14.4},"width":181.28,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/33-5.png","element":"img","alt":" DF and p,","inline":true}],[{"style":{"fontStyle":"italic"},"text":"P ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"d","element":"span"},{"text":"; ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"id":"id-186","text":")","element":"span"}],[{"id":"id-185","style":{"width":"99%"},"width":938,"height":233,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/33-6.png","element":"img"}],[{"text":"From (","element":"span"},{"href":"#id-185","text":"59","element":"a"},{"text":"), we see that different trajectories share the same ","element":"span"},{"style":{"height":15.6},"width":72.42,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/33-7.png","element":"img","alt":" F(·)","inline":true,"padRight":true},{"text":"term if their initial state-action pairs ","element":"span"},{"style":{"height":16},"width":124.71,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/33-8.png","element":"img","alt":" (s0, a0)","inline":true,"padRight":true},{"text":"are identical, which implies","element":"span"}],[{"id":"id-188","style":{"width":"63%"},"width":1246,"height":426,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/33-9.png","element":"img"}],[{"text":"Then we consider the second term in the triangle inequality decomposition (","element":"span"},{"href":"#id-186","text":"58","element":"a"},{"text":").","element":"span"}],[{"id":"id-187","style":{"width":"87%"},"width":1700,"height":645,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/33-10.png","element":"img"}],[{"text":"where 
","element":"span"},{"style":{"height":13.38},"width":89.88,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/33-11.png","element":"img","alt":" dH−1 ","inline":true,"padRight":true},{"text":"denotes the trajectory with ","element":"span"},{"style":{"height":10.8},"width":104.74,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/33-12.png","element":"img","alt":" H − 1","inline":true,"padRight":true},{"text":"horizon, obtained by cutting ","element":"span"},{"style":{"height":13.59},"width":162.92,"height":33.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/33-13.png","element":"img","alt":" s0 and a0","inline":true},{"text":". One can derive (","element":"span"},{"href":"#id-187","text":"61","element":"a"},{"text":") since ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"aligns with ","element":"span"},{"style":{"height":13.39},"width":40.15,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/33-14.png","element":"img","alt":" π2","inline":true,"padRight":true},{"text":"at the first timestep, and with ","element":"span"},{"style":{"height":13.39},"width":40.15,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/33-15.png","element":"img","alt":" π1","inline":true,"padRight":true},{"text":"at the following steps, and (","element":"span"},{"href":"#id-187","text":"62","element":"a"},{"text":") follows from constructing a new mixed policy ","element":"span"},{"style":{"height":10},"width":34.04,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/33-16.png","element":"img","alt":" p′","inline":true,"padRight":true},{"text":"in a similar way and applying the triangle inequality. 
A subsequent result is that we reduce the original divergence between ","element":"span"},{"style":{"height":13.78},"width":172.56,"height":34.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/33-17.png","element":"img","alt":" π1 and π2 ","inline":true,"padRight":true},{"text":"measured on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"text":"-length trajectories to divergence measured on one-length trajectories in (","element":"span"},{"href":"#id-188","text":"60","element":"a"},{"text":") and ","element":"span"},{"style":{"height":15.6},"width":126.55,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/33-18.png","element":"img","alt":" (H −1)","inline":true},{"text":"-length trajectories in (","element":"span"},{"href":"#id-187","text":"62","element":"a"},{"text":"). Repeating the process on the ","element":"span"},{"style":{"height":15.6},"width":246.33,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/33-19.png","element":"img","alt":" (H −1)-length","inline":true}],[{"text":"trajectories in (","element":"span"},{"href":"#id-187","text":"62","element":"a"},{"text":") yields","element":"span"}],[{"text":"where ","element":"span"},{"style":{"height":10},"width":39.6,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/34-0.png","element":"img","alt":" ρh","inline":true,"padRight":true},{"text":"is a state distribution related to the timestep ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h ","element":"span"},{"text":"and","element":"span"}],[{"style":{"width":"26%"},"width":517,"height":83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/34-1.png","element":"img"}],[{"text":"Note that ","element":"span"},{"style":{"height":16},"width":360.77,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/34-2.png","element":"img","alt":" ρ ((d1, d2, . . . 
, dI) ; π)","inline":true,"padRight":true},{"text":"is the product measure of ","element":"span"},{"style":{"height":16},"width":931.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/34-3.png","element":"img","alt":" ρ (di; π) , (i = 1, 2, . . . , I), i.e., ρ ((d1, d2, . . . , dI) ; π) =","inline":true},{"style":{"height":16},"width":609.66,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/34-4.png","element":"img","alt":"ρ (d1; π) × ρ (d2; π) × · · · × ρ (dI; π)","inline":true},{"text":". By the total variation inequality for the product measure,","element":"span"}],[{"style":{"width":"86%"},"width":1693,"height":295,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/34-5.png","element":"img"}],[{"text":"A byproduct of Lemma ","element":"span"},{"href":"#id-183","text":"H.1 ","element":"a"},{"text":"is the lemma presented below, which measures the difference between the expectations of the same function evaluated on different distributions.","element":"span"}],[{"id":"id-172","style":{"fontWeight":"bold"},"text":"Lemma H.2. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"If the vector-valued function ","element":"span"},{"style":{"height":16},"width":418.43,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/34-6.png","element":"img","alt":" z (d1, d2, . . . , dI; x) ∈ Rn ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is bounded by ","element":"span"},{"style":{"height":16.78},"width":551.98,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/34-7.png","element":"img","alt":" C, i.e., ∥z (d1, d2, . . . 
, dI; x)∥2 ≤","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"C","element":"span"},{"style":{"fontStyle":"italic"},"text":", then for any trajectory tuple ","element":"span"},{"style":{"height":16},"width":254.03,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/34-8.png","element":"img","alt":" (d1, d2, . . . , dI)","inline":true},{"style":{"fontStyle":"italic"},"text":", with each trajectory holding a finite horizon ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"style":{"fontStyle":"italic"},"text":", and policies ","element":"span"},{"style":{"height":14.4},"width":253.38,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/34-9.png","element":"img","alt":"π1, π2, it holds","inline":true}],[{"style":{"width":"83%"},"width":1617,"height":54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/34-10.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"We resort to the property of the Integral Probability Metric (IPM) (","element":"span"},{"href":"#id-189","referenceIndex":85,"text":"Sriperumbudur et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-189","referenceIndex":85,"text":"2009","element":"a"},{"text":"),","element":"span"}],[{"id":"id-191","style":{"width":"99%"},"width":1946,"height":538,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/34-11.png","element":"img"}],[{"text":"Combining (","element":"span"},{"href":"#id-190","text":"57","element":"a"},{"text":") and (","element":"span"},{"href":"#id-191","text":"64","element":"a"},{"text":") completes the proof.","element":"span"}],[{"text":"Drawing from the distributional drift investigated by Lemma ","element":"span"},{"href":"#id-172","text":"H.2","element":"a"},{"text":", we can bound the difference between expectations, ","element":"span"},{"id":"id-195","text":"where both functions and distributions are associated with the upper-level variable ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma H.3. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under Assumption ","element":"span"},{"href":"#id-115","style":{"fontStyle":"italic"},"text":"7.3","element":"a"},{"style":{"fontStyle":"italic"},"text":", if ","element":"span"},{"style":{"height":16},"width":418.43,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/35-0.png","element":"img","alt":" z (d1, d2, . . . , dI; x) ∈ Rn ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is bounded by ","element":"span"},{"style":{"height":16.78},"width":557.88,"height":41.95,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/35-1.png","element":"img","alt":" C, i.e., ∥z (d1, d2, . . . 
, dI; x)∥2 ≤","inline":true},{"style":{"height":14.4},"width":220.07,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/35-2.png","element":"img","alt":"C, and is Lz","inline":true},{"style":{"fontStyle":"italic"},"text":"-Lipschitz continuous with respect to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"style":{"fontStyle":"italic"},"text":", then the function","element":"span"}],[{"style":{"width":"35%"},"width":699,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/35-3.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"is ","element":"span"},{"style":{"height":13.19},"width":50.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/35-4.png","element":"img","alt":" LZ","inline":true},{"style":{"fontStyle":"italic"},"text":"-Lipschitz continuous with ","element":"span"},{"style":{"height":19.2},"width":430.96,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/35-5.png","element":"img","alt":" LZ = Lz + CHILπ�|A|","inline":true},{"style":{"fontStyle":"italic"},"text":", i.e., for any ","element":"span"},{"style":{"height":14},"width":210.2,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/35-6.png","element":"img","alt":" x1, x2 ∈ Rn,","inline":true}],[{"style":{"width":"31%"},"width":620,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/35-7.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"To begin with, we decompose ","element":"span"},{"style":{"height":16},"width":252.61,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/35-8.png","element":"img","alt":" Z(x1) − Z(x2)","inline":true,"padRight":true},{"text":"into two terms,","element":"span"}],[{"style":{"width":"86%"},"width":1679,"height":113,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/35-9.png","element":"img"}],[{"text":"The first term can be bounded by the Lipschitz continuity of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"z ","element":"span"},{"text":"since the two expectations share the same distribution ","element":"span"},{"style":{"height":16},"width":189.21,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/35-10.png","element":"img","alt":"ρ(d; π∗(x1)","inline":true},{"text":". To control the second term, applying Lemma ","element":"span"},{"href":"#id-172","text":"H.2 ","element":"a"},{"text":"and considering the Lipschitz continuity of ","element":"span"},{"style":{"height":16},"width":96.75,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/35-11.png","element":"img","alt":" π∗(x)","inline":true,"padRight":true},{"text":"lead to","element":"span"}],[{"style":{"width":"92%"},"width":1805,"height":54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/35-12.png","element":"img"}],[{"text":"It completes the proof by","element":"span"}],[{"text":"The following proposition measures the quality of the implicit differentiation estimators. A similar result is introduced as an assumption in (","element":"span"},{"href":"#id-9","referenceIndex":80,"text":"Shen et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-9","referenceIndex":80,"text":"2024","element":"a"},{"text":") (Lemma 18, condition (c)). 
We claim that it is satisfied in our setting.","element":"span"}],[{"id":"id-194","style":{"fontWeight":"bold"},"text":"Proposition H.4. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under Assumption ","element":"span"},{"href":"#id-115","style":{"fontStyle":"italic"},"text":"7.3","element":"a"},{"style":{"fontStyle":"italic"},"text":", for any upper-level variable ","element":"span"},{"style":{"height":11.6},"width":120.26,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/35-13.png","element":"img","alt":" x ∈ Rn ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and policies ","element":"span"},{"style":{"height":16.59},"width":265.34,"height":41.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/35-14.png","element":"img","alt":" π1, π2, we have","inline":true}],[{"style":{"width":"61%"},"width":1200,"height":232,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/35-15.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Specifically, taking ","element":"span"},{"style":{"height":17.39},"width":514.04,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/35-16.png","element":"img","alt":" π1 = π∗(x) and π2 = π yields","inline":true}],[{"style":{"width":"57%"},"width":544,"height":272,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/35-17.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"Recalling the expression of ","element":"span"},{"style":{"height":16},"width":176.26,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/35-18.png","element":"img","alt":"�∇Vs (x, π)","inline":true}],[{"style":{"width":"100%"},"width":941,"height":270,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/35-19.png","element":"img"}],[{"text":"and the recursive rule","element":"span"}],[{"id":"id-192","style":{"width":"55%"},"width":522,"height":198,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/35-20.png","element":"img"}],[{"text":"Subtracting (","element":"span"},{"href":"#id-192","text":"65","element":"a"},{"text":") by (","element":"span"},{"href":"#id-192","text":"66","element":"a"},{"text":") yields","element":"span"}],[{"text":"The first term can be bounded by applying Lemma ","element":"span"},{"href":"#id-172","text":"H.2 ","element":"a"},{"text":"with ","element":"span"},{"style":{"height":16.78},"width":555.67,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/36-0.png","element":"img","alt":" I = 1, H = 1, ∥∇rsa(x)∥2 ≤ Crx","inline":true},{"text":", and the second term can be decomposed into","element":"span"}],[{"id":"id-193","style":{"width":"76%"},"width":1494,"height":248,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/36-1.png","element":"img"}],[{"text":"where the term (","element":"span"},{"href":"#id-193","text":"68","element":"a"},{"text":") can be bounded by applying Lemma ","element":"span"},{"href":"#id-172","text":"H.2 ","element":"a"},{"text":"with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"I ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", H ","element":"span"},{"text":"= 2 
","element":"span"},{"text":"and","element":"span"},{"style":{"height":30.29},"width":365.76,"height":75.73,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/36-2.png","element":"img","alt":"��∇Vs′(x, π)��2 ≤ Crx1−γ","inline":true,"padRight":true},{"text":".","element":"span"}],[{"style":{"width":"68%"},"width":1337,"height":88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/36-3.png","element":"img"}],[{"text":"and collecting the observations above into (","element":"span"},{"href":"#id-193","text":"67","element":"a"},{"text":") lead to","element":"span"}],[{"style":{"width":"57%"},"width":536,"height":54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/36-4.png","element":"img"}],[{"text":"which means","element":"span"}],[{"style":{"width":"99%"},"width":938,"height":282,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/36-5.png","element":"img"}],[{"text":"we achieve the similar property of ","element":"span"},{"style":{"height":16},"width":177.98,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/36-6.png","element":"img","alt":"�∇Q (x, π).","inline":true}],[{"text":"A byproduct of Proposition ","element":"span"},{"href":"#id-194","text":"H.4 ","element":"a"},{"text":"is the Lipschitz smoothness of the implicit differentiations, ","element":"span"},{"style":{"height":16},"width":309.53,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/36-7.png","element":"img","alt":" V ∗(x) and Q∗(x).","inline":true}],[{"id":"id-148","style":{"fontWeight":"bold"},"text":"Proposition H.5. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under Assumption ","element":"span"},{"href":"#id-115","style":{"fontStyle":"italic"},"text":"7.3","element":"a"},{"style":{"fontStyle":"italic"},"text":", given ","element":"span"},{"style":{"height":14},"width":196.5,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/36-8.png","element":"img","alt":" x1, x2 ∈ Rn","inline":true},{"style":{"fontStyle":"italic"},"text":", we have for any ","element":"span"},{"style":{"height":14.8},"width":234.57,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/36-9.png","element":"img","alt":" s ∈ S, a ∈ A,","inline":true}],[{"style":{"width":"39%"},"width":777,"height":103,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/36-10.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"height":13.19},"width":143.81,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/36-11.png","element":"img","alt":" LV 1 > 0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is the smoothness constant for the value functions,","element":"span"}],[{"style":{"width":"31%"},"width":619,"height":108,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/36-12.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"Proposition ","element":"span"},{"href":"#id-147","text":"E.1 ","element":"a"},{"text":"and the definition of of the implicit differentiation reveal that","element":"span"}],[{"style":{"width":"63%"},"width":1240,"height":121,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/36-13.png","element":"img"}],[{"text":"Through this equality, the following decomposition is derived,","element":"span"}],[{"style":{"width":"57%"},"width":1127,"height":161,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/36-14.png","element":"img"}],[{"text":"Apply Proposition ","element":"span"},{"href":"#id-194","text":"H.4 ","element":"a"},{"text":"and Lipschitz continuity of ","element":"span"},{"style":{"height":14.19},"width":53.46,"height":35.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/37-0.png","element":"img","alt":" π∗,","inline":true}],[{"style":{"width":"99%"},"width":939,"height":648,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/37-1.png","element":"img"}],[{"text":"Consequently,","element":"span"}],[{"style":{"width":"85%"},"width":800,"height":108,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/37-2.png","element":"img"}],[{"text":"In a similar fashion, we can also conclude","element":"span"}],[{"text":"In this way, it establishes the Lipschitz smoothness of the hyper-objective ","element":"span"},{"style":{"height":16},"width":78.33,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/37-3.png","element":"img","alt":" ϕ(x)","inline":true},{"text":", which plays a key role in the convergence analysis.","element":"span"}],[{"id":"id-118","style":{"width":"99%"},"width":1941,"height":138,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/37-4.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"where 
","element":"span"},{"style":{"height":15.59},"width":121.41,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/37-5.png","element":"img","alt":" Lϕ > 0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is defined by","element":"span"}],[{"style":{"width":"72%"},"width":678,"height":97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/37-6.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Given","element":"span"}],[{"text":"it is deduced from Lemma ","element":"span"},{"href":"#id-195","text":"H.3 ","element":"a"},{"text":"that the first term in (","element":"span"},{"href":"#id-196","text":"69","element":"a"},{"text":") is","element":"span"},{"style":{"height":28.8},"width":390.66,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/37-7.png","element":"img","alt":"�Ll1 + LlHILπ�|A|�","inline":true},{"text":"-Lipschitz continuous since","element":"span"}],[{"style":{"height":16},"width":436.75,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/37-8.png","element":"img","alt":"∇l (d1, d2, . . . , dI; x) is Ll","inline":true},{"text":"-boundedness and ","element":"span"},{"style":{"height":13.19},"width":53.36,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/37-9.png","element":"img","alt":" Ll1","inline":true},{"text":"-Lipschitz continuous, revealed by Assumption ","element":"span"},{"href":"#id-111","text":"7.2","element":"a"},{"text":". Subsequently, we consider the second term in (","element":"span"},{"href":"#id-196","text":"69","element":"a"},{"text":"). 
Specifically, assembling the ","element":"span"},{"style":{"height":13.19},"width":37.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/37-10.png","element":"img","alt":" Ll","inline":true},{"text":"-Lipschitz continuity of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"l","element":"span"},{"text":", the ","element":"span"},{"style":{"height":13.19},"width":68.8,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/37-11.png","element":"img","alt":" LV 1","inline":true},{"text":"-Lipschitz continuity of ","element":"span"},{"style":{"height":15.14},"width":320.37,"height":37.85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/37-12.png","element":"img","alt":" ∇V ∗s , and the LV 1","inline":true},{"text":"-Lipschitz continuity of ","element":"span"},{"style":{"height":14.92},"width":96.75,"height":37.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/37-13.png","element":"img","alt":" ∇Q∗sa","inline":true},{"text":", along with the boundedness of these functions, we ","element":"span"},{"text":"conclude the function defined by","element":"span"}],[{"style":{"width":"67%"},"width":1305,"height":120,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/37-14.png","element":"img"}],[{"text":"is bounded by","element":"span"}],[{"id":"id-196","style":{"width":"30%"},"width":288,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/37-15.png","element":"img"}],[{"text":"and ","element":"span"},{"style":{"height":13.19},"width":42.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/38-0.png","element":"img","alt":" Lz","inline":true},{"text":"-Lipschitz continuous with","element":"span"}],[{"text":"Applying Lemma ","element":"span"},{"href":"#id-195","text":"H.3 ","element":"a"},{"text":"on the second term in 
(","element":"span"},{"href":"#id-196","text":"69","element":"a"},{"text":") implies it is","element":"span"},{"style":{"height":28.8},"width":467.12,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/38-1.png","element":"img","alt":"�Lz + 2H2I2ClCrxLπ1−γ �|A|�","inline":true},{"text":"-Lipschitz continuous.","element":"span"}],[{"text":"Combining the Lipschitz constants of the two terms in (","element":"span"},{"href":"#id-196","text":"69","element":"a"},{"text":") provides the Lipschitz constant of ","element":"span"},{"style":{"height":16},"width":197,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/38-2.png","element":"img","alt":" ∇ϕ(x), i.e.,","inline":true}],[{"style":{"width":"86%"},"width":1681,"height":353,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/38-3.png","element":"img"}],[{"text":"Next, we analyze the error brought by the estimator ","element":"span"},{"style":{"height":16},"width":159.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/38-4.png","element":"img","alt":"�∇ϕ (x, π)","inline":true,"padRight":true},{"text":"to approximate the true hyper-gradient ","element":"span"},{"style":{"height":16},"width":156.71,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/38-5.png","element":"img","alt":" ∇ϕ(x) in","inline":true}],[{"text":"the following proposition. ","element":"span"},{"style":{"height":16},"width":170.22,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/38-6.png","element":"img","alt":"�∇ϕ (x, π).","inline":true}],[{"id":"id-119","href":"#id-111","style":{"height":16.99},"width":1944.72,"height":42.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/38-7.png","element":"img","alt":"Proposition H.7. 
Under Assumptions 7.2 and 7.3, for any upper-level variable x ∈ Rn and policies π1, π2, we","inline":true}],[{"style":{"width":"69%"},"width":1355,"height":91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/38-8.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"with ","element":"span"},{"style":{"height":31.35},"width":861.14,"height":78.38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/38-9.png","element":"img","alt":" L�ϕ = HILl�|A| +2τ −1ClCrxHI√|A|1−γ �1+γ1−γ + HI�","inline":true},{"style":{"fontStyle":"italic"},"text":". Specifically, taking ","element":"span"},{"style":{"height":17.39},"width":514.04,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/38-10.png","element":"img","alt":" π1 = π∗(x) and π2 = π yields","inline":true},{"style":{"height":30.29},"width":726.67,"height":75.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/38-11.png","element":"img","alt":"��∇ϕ(x) − �∇ϕ(x, π)��2 ≤ L�ϕ ∥π∗(x) − π∥2 .","inline":true}],[{"style":{"width":"93%"},"width":1814,"height":128,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/38-12.png","element":"img"}],[{"text":"Denote ","element":"span"},{"style":{"fontStyle":"italic"},"text":"z ","element":"span"},{"style":{"height":16},"width":108.57,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/38-13.png","element":"img","alt":" (d1, d2","inline":true},{"style":{"fontStyle":"italic"},"text":", . . . , d","element":"span"},{"style":{"height":16},"width":317.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/38-14.png","element":"img","alt":"I; x, π) := l (d1, d2","inline":true},{"style":{"fontStyle":"italic"},"text":", . . . 
, d","element":"span"},{"style":{"height":28.8},"width":506.48,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/38-15.png","element":"img","alt":"I; x)��i�h �∇�Qsihaih − Vsih","inline":true}],[{"text":"boundedness and Lipschitz continuity of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"l ","element":"span"},{"text":"in Assumption ","element":"span"},{"href":"#id-111","text":"7.2","element":"a"},{"text":", and of ","element":"span"},{"style":{"height":14},"width":158.04,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/38-16.png","element":"img","alt":"�∇Q, �∇V","inline":true,"padRight":true},{"text":"in Proposition ","element":"span"},{"href":"#id-194","text":"H.4","element":"a"},{"text":", we can establish the boundedness,","element":"span"}],[{"id":"id-197","style":{"width":"67%"},"width":1320,"height":93,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/38-17.png","element":"img"}],[{"text":"and the Lipschitz continuity of ","element":"span"},{"style":{"height":16},"width":449.07,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/38-18.png","element":"img","alt":" z (d1, d2, . . . , dI; x, π), i.e.,","inline":true}],[{"text":"where we bound (","element":"span"},{"href":"#id-197","text":"72","element":"a"},{"text":") by the Lipschitz continuity (","element":"span"},{"href":"#id-197","text":"71","element":"a"},{"text":"), and bound (","element":"span"},{"href":"#id-197","text":"73","element":"a"},{"text":") by applying Lemma ","element":"span"},{"href":"#id-172","text":"H.2 ","element":"a"},{"text":"with (","element":"span"},{"href":"#id-197","text":"70","element":"a"},{"text":"). 
Given","element":"span"}],[{"style":{"width":"74%"},"width":1457,"height":189,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/38-19.png","element":"img"}],[{"style":{"width":"99%"},"width":1943,"height":480,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/39-0.png","element":"img"}],[{"text":"With all the lemmas and propositions now proven, we turn our attention to presenting the convergence theorem for the proposed algorithm, SoBiRL.","element":"span"}],[{"id":"id-184","style":{"fontWeight":"bold"},"text":"Theorem H.8. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under Assumptions ","element":"span"},{"href":"#id-111","style":{"fontStyle":"italic"},"text":"7.2 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"href":"#id-115","style":{"fontStyle":"italic"},"text":"7.3 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and given the accuracy in Algorithm ","element":"span"},{"href":"#id-100","style":{"fontStyle":"italic"},"text":"1","element":"a"},{"style":{"fontStyle":"italic"},"text":", we can set the constant step size ","element":"span"},{"style":{"height":22.66},"width":137.31,"height":56.66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/39-1.png","element":"img","alt":" β < 12Lϕ ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":", then the iterates ","element":"span"},{"style":{"height":16},"width":208.14,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/39-2.png","element":"img","alt":" {xk} satisfy","inline":true}],[{"style":{"width":"48%"},"width":952,"height":137,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/39-3.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"height":14.18},"width":39.74,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/39-4.png","element":"img","alt":" ϕ∗ 
","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is the minimum of the hyper-objective ","element":"span"},{"style":{"height":16},"width":169.77,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/39-5.png","element":"img","alt":" ϕ(x), and","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"In this proof, all the notation ","element":"span"},{"style":{"height":16},"width":51,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/39-6.png","element":"img","alt":" ∥·∥","inline":true,"padRight":true},{"text":"without subscripts denotes the 2-norm for simplicity. Based on the ","element":"span"},{"style":{"height":15.59},"width":46.12,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/39-7.png","element":"img","alt":"Lϕ","inline":true},{"text":"-Lipschitz smoothness of ","element":"span"},{"style":{"height":16},"width":78.08,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/39-8.png","element":"img","alt":" ϕ(x)","inline":true,"padRight":true},{"text":"revealed by Proposition ","element":"span"},{"href":"#id-118","text":"H.6","element":"a"},{"text":", a gradient descent step results in the decrease in the hyper-objective ","element":"span"},{"style":{"height":14},"width":37.07,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/39-9.png","element":"img","alt":" φ:","inline":true}],[{"id":"id-198","style":{"width":"90%"},"width":1760,"height":857,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/39-10.png","element":"img"}],[{"text":"where the updating rule ","element":"span"},{"style":{"height":16},"width":454.66,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/39-11.png","element":"img","alt":" xk+1 = xk − β �∇ϕ (xk, πk)","inline":true,"padRight":true},{"text":"yields 
(","element":"span"},{"href":"#id-198","text":"74","element":"a"},{"text":"), and (","element":"span"},{"href":"#id-198","text":"75","element":"a"},{"text":") ","element":"span"},{"href":"#id-119","text":"come","element":"a"},{"text":"s from the inequalities ","element":"span"},{"style":{"height":16},"width":152.86,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/39-12.png","element":"img","alt":" |⟨a, b⟩| ≤","inline":true},{"style":{"height":7.6},"width":16,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/39-13.png","element":"img","alt":"1","inline":true},{"style":{"height":7.6},"width":16,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/39-14.png","element":"img","alt":"2","inline":true}],[{"style":{"height":28.8},"width":250.13,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/39-15.png","element":"img","alt":"�∥a∥2 + ∥b∥2�","inline":true},{"text":"and ","element":"span"},{"style":{"height":21.11},"width":425.6,"height":52.78,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/39-16.png","element":"img","alt":"12 ∥a + b∥2 ≤ ∥a∥2 + ∥b∥2","inline":true},{"text":". 
Drawing from Proposition ","element":"span"},{"href":"#id-119","text":"H.7 ","element":"a"},{"text":"and the approximate lower-level ","element":"span"},{"text":"solution ","element":"span"},{"style":{"height":9.19},"width":39.72,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/39-17.png","element":"img","alt":" πk","inline":true,"padRight":true},{"text":"satisfying ","element":"span"},{"style":{"height":19.61},"width":337.73,"height":49.03,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/39-18.png","element":"img","alt":" ∥π∗(xk) − πk∥2 ≤ ϵ","inline":true},{"text":", we can incorporate","element":"span"}],[{"style":{"width":"48%"},"width":939,"height":82,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/39-19.png","element":"img"}],[{"text":"into (","element":"span"},{"href":"#id-198","text":"76","element":"a"},{"text":") and obtain the estimate","element":"span"}],[{"style":{"width":"58%"},"width":546,"height":97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/40-0.png","element":"img"}],[{"text":"Telescoping over the index from ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"= 1 ","element":"span"},{"text":"to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"K","element":"span"},{"text":", we find that","element":"span"}],[{"text":"where we define ","element":"span"},{"style":{"height":14.19},"width":39.75,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/40-1.png","element":"img","alt":" ϕ∗","inline":true,"padRight":true},{"text":"to be the minimum of the hyper-objective. 
By setting the step size ","element":"span"},{"style":{"height":22.66},"width":137.36,"height":56.66,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/40-2.png","element":"img","alt":" β < 1/(2Lϕ)","inline":true,"padRight":true},{"text":"and dividing both","element":"span"}],[{"text":"sides by ","element":"span"},{"style":{"height":28.8},"width":259.71,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/40-3.png","element":"img","alt":" K(β/2 − β^2 Lϕ)","inline":true},{"text":", we arrive at the conclusion.","element":"span"}],[{"text":"The proof of Theorem ","element":"span"},{"href":"#id-184","text":"H.8 ","element":"a"},{"text":"reveals that ","element":"span"},{"style":{"height":21.77},"width":465.28,"height":54.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/40-4.png","element":"img","alt":"(1/K) Σk ||∇ϕ(xk)||2^2 = O(1/(Kβ))","inline":true},{"text":", and we can substitute ","element":"span"},{"style":{"height":16.79},"width":238.68,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/40-5.png","element":"img","alt":" β = O(1/Lϕ)","inline":true,"padRight":true},{"text":"and","element":"span"}],[{"style":{"height":19.22},"width":259.55,"height":48.05,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/40-6.png","element":"img","alt":"Lϕ = O(√|A|)","inline":true,"padRight":true},{"text":"to obtain the outer iteration complexity as ","element":"span"},{"style":{"height":19.2},"width":227.59,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/40-7.png","element":"img","alt":" O(√|A| ϵ−1).","inline":true}],[{"id":"id-109","style":{"fontWeight":"bold"},"text":"I ","element":"span"},{"style":{"fontWeight":"bold"},"text":"STOCHASTIC MODEL-FREE SOFT BILEVEL RL","element":"span"}],[{"text":"In this section, the details of the stochastic model-free soft bilevel reinforcement learning algorithm, Stoc-SoBiRL, are given. 
Moreover, we provide the statistical properties (bias and variance of the stochastic hyper-gradient) for the sampling scheme Algorithm ","element":"span"},{"href":"#id-103","text":"3 ","element":"a"},{"text":"in Appendix ","element":"span"},{"href":"#id-120","text":"I.2","element":"a"},{"text":", and the convergence result of Stoc-SoBiRL Algorithm ","element":"span"},{"href":"#id-105","text":"4 ","element":"a"},{"text":"in Appendix ","element":"span"},{"href":"#id-121","text":"I.3","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"I.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Details of Stoc-SoBiRL","element":"span"}],[{"text":"The core principle for estimating the hyper-gradient is to sample independent trajectories, evaluate the relevant quantities along each trajectory, and then average these values. Specifically, in the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":"-th outer iteration, to estimate ","element":"span"},{"style":{"height":14},"width":59.21,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/40-8.png","element":"img","alt":" ∇ϕ","inline":true},{"text":", an expectation with respect to the random variable ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"d","element":"span"},{"text":", we can generate ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"independent tuples ","element":"span"},{"style":{"height":16.39},"width":410.22,"height":40.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/40-9.png","element":"img","alt":"dm := (dm1 , dm2 , . . . 
, dmI )","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"m ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"2","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . . , M ","element":"span"},{"text":"and denote ","element":"span"},{"style":{"height":17.38},"width":393.56,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/40-10.png","element":"img","alt":" D := {d1, d2, . . . , dM}","inline":true},{"text":". The next focus is to tackle the ","element":"span"},{"text":"terms ","element":"span"},{"href":"#id-98","style":{"height":16.4},"width":387.65,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/40-11.png","element":"img","alt":" ∇Q∗ and ∇V ∗ in (13)","inline":true},{"text":". Assuming access to a generative model (","element":"span"},{"href":"#id-101","referenceIndex":43,"text":"Kearns and Singh","element":"a"},{"text":", ","element":"span"},{"href":"#id-101","referenceIndex":43,"text":"1998","element":"a"},{"text":"; ","element":"span"},{"href":"#id-102","referenceIndex":51,"text":"Li et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-102","referenceIndex":51,"text":"2024a","element":"a"},{"text":"), from any initial state-action pair ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a","element":"span"},{"text":")","element":"span"},{"text":", we sample ","element":"span"},{"style":{"fontStyle":"italic"},"text":"J ","element":"span"},{"text":"independent trajectories of length ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"by implementing ","element":"span"},{"style":{"height":9.19},"width":39.72,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/40-12.png","element":"img","alt":" πk","inline":true},{"text":", i.e., for 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"j ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"2","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . . , J","element":"span"},{"text":",","element":"span"}],[{"style":{"width":"74%"},"width":1454,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/40-13.png","element":"img"}],[{"text":"Collecting all these random variables yields ","element":"span"},{"style":{"height":19.67},"width":749.29,"height":49.17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/40-14.png","element":"img","alt":" ξk := {ξjk (s, a) , j = 1, . . . , J, s ∈ S, a ∈ A}","inline":true},{"text":", and the estimator for ","element":"span"},{"style":{"height":14.18},"width":80.71,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/40-15.png","element":"img","alt":"∇Q∗ ","inline":true,"padRight":true},{"text":"can be constructed as","element":"span"}],[{"style":{"width":"68%"},"width":1334,"height":112,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/40-16.png","element":"img"}],[{"text":"Similarly, the constructions of the random variable ","element":"span"},{"style":{"height":14},"width":37.26,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/40-17.png","element":"img","alt":" ζk","inline":true,"padRight":true},{"text":"and the associated estimator ","element":"span"},{"style":{"height":14.18},"width":256.75,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/40-18.png","element":"img","alt":"¯∇V ζk for ∇V ∗ ","inline":true,"padRight":true},{"text":"are detailed in Algorithm ","element":"span"},{"href":"#id-103","text":"3","element":"a"},{"text":". 
Consequently, we obtain the stochastic hyper-gradient:","element":"span"}],[{"id":"id-199","style":{"width":"98%"},"width":1911,"height":118,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/40-19.png","element":"img"}],[{"text":"which is abbreviated as ","element":"span"},{"style":{"height":16.83},"width":73.96,"height":42.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/40-20.png","element":"img","alt":"¯∇ϕk","inline":true,"padRight":true},{"text":"in the following discussion.","element":"span"}],[{"text":"Momentum techniques are known to be beneficial for reducing variance and accelerating algorithms (","element":"span"},{"href":"#id-104","referenceIndex":18,"text":"Cutkosky ","element":"a"},{"href":"#id-104","referenceIndex":18,"text":"and Orabona","element":"a"},{"text":", ","element":"span"},{"href":"#id-104","referenceIndex":18,"text":"2019","element":"a"},{"text":"). To this end, we maintain a momentum-instructed ","element":"span"},{"style":{"height":13.59},"width":191.94,"height":33.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/40-21.png","element":"img","alt":" hk in the k","inline":true},{"text":"-th outer iteration,","element":"span"}],[{"style":{"width":"75%"},"width":1471,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/40-22.png","element":"img"}],[{"text":"which tracks ","element":"span"},{"style":{"height":16.83},"width":73.96,"height":42.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/40-23.png","element":"img","alt":"¯∇ϕk","inline":true,"padRight":true},{"text":"via current and historical hyper-gradient estimates. 
Using ","element":"span"},{"style":{"height":13.19},"width":39.96,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/40-24.png","element":"img","alt":" hk","inline":true,"padRight":true},{"text":"to update the upper-level ","element":"span"},{"style":{"height":10.4},"width":111.79,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/40-25.png","element":"img","alt":" xk, we","inline":true,"padRight":true},{"text":"obtain the stochastic Algorithm ","element":"span"},{"href":"#id-105","text":"4","element":"a"},{"text":", Stoc-SoBiRL.","element":"span"}],[{"id":"id-103","style":{"width":"99%"},"width":1947,"height":1765,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/41-0.png","element":"img"}],[{"id":"id-105","style":{"fontWeight":"bold"},"text":"Algorithm 4 ","element":"span"},{"text":"Stochastic model-free soft bilevel reinforcement learning algorithm (Stoc-SoBiRL)","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Input: ","element":"span"},{"text":"iteration number ","element":"span"},{"style":{"fontStyle":"italic"},"text":"K","element":"span"},{"text":", step sizes ","element":"span"},{"style":{"height":17.9},"width":201.4,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/41-1.png","element":"img","alt":" {βk, µk}Kk=1","inline":true},{"text":", initialization ","element":"span"},{"style":{"height":14},"width":155.63,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/41-2.png","element":"img","alt":" x1, π0, h0","inline":true},{"text":", sampling configurations ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M, J, T","element":"span"},{"text":",","element":"span"}],[{"style":{"width":"99%"},"width":1944,"height":429,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/41-3.png","element":"img"}],[{"id":"id-120","style":{"fontWeight":"bold"},"text":"I.2 
","element":"span"},{"style":{"fontWeight":"bold"},"text":"Stochasticity of the Estimator ","element":"span"},{"style":{"height":16.83},"width":59.21,"height":42.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/42-0.png","element":"img","alt":"¯∇ϕ","inline":true}],[{"text":"This part analyzes the bias and variance introduced by the stochastic hyper-gradient ","element":"span"},{"style":{"height":16.83},"width":59.21,"height":42.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/42-1.png","element":"img","alt":"¯∇ϕ","inline":true},{"text":", which serve as a cornerstone for the convergence analysis of the developed Stoc-SoBiRL. To begin with, we show the bias of the estimators ","element":"span"},{"style":{"height":16.98},"width":540.92,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/42-2.png","element":"img","alt":"¯∇Qξ and ¯∇V ζ to �∇Q and �∇V","inline":true,"padRight":true},{"text":", respectively.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma I.1. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under Assumption ","element":"span"},{"href":"#id-115","style":{"fontStyle":"italic"},"text":"7.3","element":"a"},{"style":{"fontStyle":"italic"},"text":", we have that given any ","element":"span"},{"style":{"height":14.8},"width":299.73,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/42-3.png","element":"img","alt":" s ∈ S and a ∈ A,","inline":true}],[{"id":"id-200","style":{"width":"72%"},"width":1409,"height":276,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/42-4.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"Recall the expression of ","element":"span"},{"style":{"height":14},"width":65.21,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/42-5.png","element":"img","alt":"�∇Q","inline":true,"padRight":true},{"text":"in Section ","element":"span"},{"text":"5","element":"span"},{"text":",","element":"span"}],[{"style":{"width":"99%"},"width":938,"height":349,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/42-6.png","element":"img"}],[{"text":"which together with the boundedness of ","element":"span"},{"style":{"height":14.4},"width":187.82,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/42-7.png","element":"img","alt":" ∇r implies","inline":true}],[{"text":"In a similar fashion, we can also derive the boundedness for the bias of ","element":"span"},{"style":{"height":17.78},"width":252.09,"height":44.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/42-8.png","element":"img","alt":"¯∇V ζks (xk, πk).","inline":true}],[{"text":"Now, we can bound the variance of ","element":"span"},{"style":{"height":17.63},"width":387.36,"height":44.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/42-9.png","element":"img","alt":"¯∇ϕ (Dk, ξk, ζk; xk, πk)","inline":true,"padRight":true},{"text":"and the bias between it and the targeted ","element":"span"},{"style":{"height":14},"width":67.95,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/42-10.png","element":"img","alt":"�∇ϕ.","inline":true}],[{"id":"id-212","style":{"fontWeight":"bold"},"text":"Proposition I.2. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under Assumptions ","element":"span"},{"href":"#id-111","style":{"fontStyle":"italic"},"text":"7.2 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"href":"#id-115","style":{"fontStyle":"italic"},"text":"7.3","element":"a"},{"style":{"fontStyle":"italic"},"text":", we have","element":"span"}],[{"id":"id-201","style":{"width":"80%"},"width":1568,"height":98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/42-11.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Additionally, it holds that","element":"span"}],[{"style":{"height":28.8},"width":1270.92,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/42-12.png","element":"img","alt":"EDk,ξk,ζk�� ¯∇ϕ (Dk, ξk, ζk; xk, πk) − EDk,ξk,ζk� ¯∇ϕ (Dk, ξk, ζk; xk, πk)��22","inline":true}],[{"style":{"width":"67%"},"width":1309,"height":140,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/42-13.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"Recaping the formulation of ","element":"span"},{"style":{"height":14},"width":59.21,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/42-14.png","element":"img","alt":"�∇ϕ","inline":true,"padRight":true},{"text":"in Section ","element":"span"},{"text":"5","element":"span"},{"text":",","element":"span"}],[{"text":"and taking into account the expression of ","element":"span"},{"href":"#id-199","style":{"height":17.63},"width":191.73,"height":44.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/42-15.png","element":"img","alt":"¯∇ϕk in (80","inline":true},{"text":"), we can compute","element":"span"}],[{"style":{"width":"98%"},"width":1915,"height":192,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/42-16.png","element":"img"}],[{"text":"Applying boundedness of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"l ","element":"span"},{"text":"and (","element":"span"},{"href":"#id-200","text":"82","element":"a"},{"text":") on the above equation yields (","element":"span"},{"href":"#id-201","text":"83","element":"a"},{"text":"). The next step is to investigate the variance of the stochastic hyper-gradient. Note that ","element":"span"},{"style":{"height":16.83},"width":73.95,"height":42.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/43-0.png","element":"img","alt":"¯∇ϕk","inline":true,"padRight":true},{"text":"is the average of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"i.i.d. 
random variables","element":"span"}],[{"style":{"width":"81%"},"width":1579,"height":103,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/43-1.png","element":"img"}],[{"text":"Considering the boundedness of ","element":"span"},{"style":{"height":16.98},"width":326.76,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/43-2.png","element":"img","alt":" ∇l, ¯∇Qξ and ¯∇V ζ","inline":true},{"text":", we obtain","element":"span"}],[{"style":{"width":"46%"},"width":906,"height":129,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/43-3.png","element":"img"}],[{"text":"This reveals","element":"span"}],[{"text":"The above proposition shows that the bias and variance of the stochastic hyper-gradient can be made sufficiently small by adjusting the sampling configurations ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M, J, T","element":"span"},{"text":".","element":"span"}],[{"id":"id-121","style":{"fontWeight":"bold"},"text":"I.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Convergence Analysis of Stoc-SoBiRL","element":"span"}],[{"text":"In this part, we give a detailed convergence analysis for Stoc-SoBiRL. The Lipschitz properties of ","element":"span"},{"style":{"height":16.98},"width":249.68,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/43-4.png","element":"img","alt":"¯∇Qξ, ¯∇V and","inline":true},{"style":{"height":16.83},"width":59.21,"height":42.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/43-5.png","element":"img","alt":"¯∇ϕ","inline":true,"padRight":true},{"text":"developed in the following lemmas serve as a foundation for the convergence results of Stoc-SoBiRL.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma I.3. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under Assumption ","element":"span"},{"href":"#id-115","style":{"fontStyle":"italic"},"text":"7.3","element":"a"},{"style":{"fontStyle":"italic"},"text":", for arbitrary upper-level variables ","element":"span"},{"style":{"height":16.58},"width":196.51,"height":41.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/43-6.png","element":"img","alt":" x1, x2 ∈ Rn","inline":true},{"style":{"fontStyle":"italic"},"text":", policies ","element":"span"},{"style":{"height":16.58},"width":99.88,"height":41.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/43-7.png","element":"img","alt":" π1, π2","inline":true},{"style":{"fontStyle":"italic"},"text":", and ","element":"span"},{"style":{"height":11.6},"width":93.41,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/43-8.png","element":"img","alt":" s ∈ S","inline":true},{"style":{"fontStyle":"italic"},"text":",","element":"span"}],[{"style":{"height":14.8},"width":265.16,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/43-9.png","element":"img","alt":"a ∈ A, we have","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"Given a fixed ","element":"span"},{"style":{"height":14},"width":20,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/43-10.png","element":"img","alt":" ξ","inline":true},{"text":", by definition of ","element":"span"},{"style":{"height":16.98},"width":80.71,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/43-11.png","element":"img","alt":"¯∇Qξ","inline":true},{"text":", we can derive","element":"span"}],[{"style":{"width":"67%"},"width":1309,"height":123,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/43-12.png","element":"img"}],[{"text":"The lipschitzness of ","element":"span"},{"style":{"height":11.2},"width":51.21,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/43-13.png","element":"img","alt":" ∇r","inline":true,"padRight":true},{"text":"leads to the lipschitzness of ","element":"span"},{"style":{"height":16.98},"width":80.71,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/43-14.png","element":"img","alt":"¯∇Qξ","inline":true},{"text":". The proof for ","element":"span"},{"style":{"height":14.18},"width":82.31,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/43-15.png","element":"img","alt":"¯∇V ζ ","inline":true,"padRight":true},{"text":"follows analogously.","element":"span"}],[{"id":"id-204","style":{"width":"99%"},"width":1945,"height":287,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/43-16.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"By definition (","element":"span"},{"href":"#id-199","text":"80","element":"a"},{"text":"), it holds that","element":"span"}],[{"style":{"width":"2%"},"width":26,"height":5,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/43-17.png","element":"img"}],[{"style":{"height":32.52},"width":821.62,"height":81.3,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/43-18.png","element":"img","alt":"¯∇ϕ�D, ξ, ζ; x1, π1�− ¯∇ϕ�D, ξ, ζ; x2, π2�= 1M","inline":true}],[{"text":"The boundedness and lipschitzness of ","element":"span"},{"style":{"height":16.98},"width":350.51,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/43-19.png","element":"img","alt":" l, ∇l, ¯∇Qξ and ¯∇V ζ ","inline":true,"padRight":true},{"text":"leads to the conclusion.","element":"span"}],[{"text":"Define ","element":"span"},{"style":{"height":19.2},"width":1825.83,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/44-0.png","element":"img","alt":" pk := EDk,ξk,ζk� ¯∇ϕ (Dk, ξk, ζk; xk, πk)�− �∇ϕ (xk, πk) and uk := hk − EDk,ξk,ζk� ¯∇ϕ (Dk, ξk, ζk; xk, πk)�,","inline":true,"padRight":true},{"text":"with ","element":"span"},{"style":{"height":13.19},"width":39.96,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/44-1.png","element":"img","alt":" hk","inline":true,"padRight":true},{"text":"obeying the update rule,","element":"span"}],[{"id":"id-207","style":{"width":"99%"},"width":1946,"height":405,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/44-2.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":20.3},"width":170.37,"height":50.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/44-3.png","element":"img","alt":" bϕ and σ2ϕ ","inline":true,"padRight":true},{"text":"can be sufficiently small by adjusting the sampling configurations 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"M, J, T","element":"span"},{"text":". All the notation ","element":"span"},{"style":{"height":16},"width":51,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/44-4.png","element":"img","alt":" ∥·∥","inline":true,"padRight":true},{"text":"without subscripts denotes the 2-norm for simplicity.","element":"span"}],[{"text":"The next two lemmas characterize the descent properties of ","element":"span"},{"style":{"height":16},"width":270.22,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/44-5.png","element":"img","alt":" ϕ(xk) and ∥uk∥","inline":true,"padRight":true},{"text":"recursively.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma I.5 ","element":"span"},{"text":"(Descent property of the hyper-objective)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under Assumptions ","element":"span"},{"href":"#id-111","style":{"fontStyle":"italic"},"text":"7.2 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"href":"#id-115","style":{"fontStyle":"italic"},"text":"7.3","element":"a"},{"style":{"fontStyle":"italic"},"text":", the iterates ","element":"span"},{"style":{"height":17.9},"width":140.31,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/44-6.png","element":"img","alt":" {xk}Kk=1","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"generated by Algorithm ","element":"span"},{"href":"#id-105","style":{"fontStyle":"italic"},"text":"4 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"satisfy","element":"span"}],[{"id":"id-203","style":{"width":"96%"},"width":1881,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/44-7.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"where 
","element":"span"},{"style":{"height":18.71},"width":184.8,"height":46.78,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/44-8.png","element":"img","alt":" Lϕ and L�ϕ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"are specified in Proposition ","element":"span"},{"href":"#id-118","style":{"fontStyle":"italic"},"text":"H.6 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and Proposition ","element":"span"},{"href":"#id-119","style":{"fontStyle":"italic"},"text":"H.7","element":"a"},{"style":{"fontStyle":"italic"},"text":", respectively, and the expectations are with respect to the stochasticity of the algorithm.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Based on the ","element":"span"},{"style":{"height":15.59},"width":46.12,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/44-9.png","element":"img","alt":" Lϕ","inline":true},{"text":"-Lipschitz smoothness of ","element":"span"},{"style":{"height":16},"width":78.33,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/44-10.png","element":"img","alt":" ϕ(x)","inline":true,"padRight":true},{"text":"revealed by Proposition ","element":"span"},{"href":"#id-118","text":"H.6","element":"a"},{"text":", a stochastic gradient step results in the following estimate of the hyper-objective ","element":"span"},{"style":{"height":14},"width":37.07,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/44-11.png","element":"img","alt":" φ:","inline":true}],[{"id":"id-202","style":{"width":"86%"},"width":1675,"height":283,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/44-12.png","element":"img"}],[{"text":"where the last equality comes from 
","element":"span"},{"style":{"height":21.11},"width":625.59,"height":52.78,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/44-13.png","element":"img","alt":" ⟨a, b⟩ = 12(∥a∥2 + ∥b∥2 − ∥a − b∥2)","inline":true},{"text":". ","element":"span"},{"text":"Taking into account the equation ","element":"span"},{"style":{"height":16},"width":469.59,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/44-14.png","element":"img","alt":"hk − �∇ϕ (xk, πk) − pk = uk","inline":true,"padRight":true},{"text":"and Proposition ","element":"span"},{"href":"#id-119","text":"H.7","element":"a"},{"text":", we obtain","element":"span"}],[{"style":{"width":"76%"},"width":1492,"height":255,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/44-15.png","element":"img"}],[{"text":"Incorporating this inequality into (","element":"span"},{"href":"#id-202","text":"89","element":"a"},{"text":") and taking expectation with respect to the algorithm lead to the result (","element":"span"},{"href":"#id-203","text":"88","element":"a"},{"text":").","element":"span"}],[{"style":{"width":"2%"},"width":28,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/44-16.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Lemma I.6 ","element":"span"},{"text":"(Descent property of the estimation error)","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under Assumptions ","element":"span"},{"href":"#id-111","style":{"fontStyle":"italic"},"text":"7.2 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"href":"#id-115","style":{"fontStyle":"italic"},"text":"7.3","element":"a"},{"style":{"fontStyle":"italic"},"text":", the iterates ","element":"span"},{"style":{"height":17.9},"width":140.31,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/44-17.png","element":"img","alt":" {xk}Kk=1","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"generated by Algorithm ","element":"span"},{"href":"#id-105","style":{"fontStyle":"italic"},"text":"4 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"satisfy","element":"span"}],[{"id":"id-208","style":{"width":"98%"},"width":1914,"height":161,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/44-18.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"height":18.71},"width":263.64,"height":46.78,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/44-19.png","element":"img","alt":" Lϕ, L�ϕ, and LS","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"are specified in Proposition ","element":"span"},{"href":"#id-118","style":{"fontStyle":"italic"},"text":"H.6","element":"a"},{"style":{"fontStyle":"italic"},"text":", Proposition ","element":"span"},{"href":"#id-119","style":{"fontStyle":"italic"},"text":"H.7","element":"a"},{"style":{"fontStyle":"italic"},"text":", and Lemma ","element":"span"},{"href":"#id-204","style":{"fontStyle":"italic"},"text":"I.4","element":"a"},{"style":{"fontStyle":"italic"},"text":"; and the expectations are with respect to the stochasticity of the algorithm.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"By definitions of ","element":"span"},{"style":{"height":14.4},"width":394.96,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/45-0.png","element":"img","alt":" hk, uk and pk, we have","inline":true}],[{"style":{"height":17.63},"width":1828.55,"height":44.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/45-1.png","element":"img","alt":"− (1 − µk+1) ¯∇ϕ (Dk+1, ξk+1, ζk+1; xk, πk) + ¯∇ϕ (Dk+1, ξk+1, ζk+1; xk+1, πk+1) − �∇ϕ (xk+1, πk+1) − pk+1","inline":true},{"style":{"height":28.8},"width":1501.13,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/45-2.png","element":"img","alt":"= (1 − µk+1) uk + µk+1�¯∇ϕ (Dk+1, ξk+1, ζk+1; xk+1, πk+1) − �∇ϕ (xk+1, πk+1) − pk+1�","inline":true}],[{"id":"id-206","style":{"width":"68%"},"width":1339,"height":160,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/45-3.png","element":"img"}],[{"text":"This equality yields the following estimation","element":"span"}],[{"text":"where ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":") ","element":"span"},{"text":"comes from the Lipschitz properties of ","element":"span"},{"style":{"height":14},"width":59.21,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/45-4.png","element":"img","alt":" ∇ϕ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14},"width":59.21,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/45-5.png","element":"img","alt":"�∇ϕ","inline":true,"padRight":true},{"text":"revealed by Proposition ","element":"span"},{"href":"#id-118","text":"H.6 ","element":"a"},{"text":"and Proposition ","element":"span"},{"id":"id-205","href":"#id-119","text":"H.7","element":"a"},{"text":", and Lipschitz property of 
","element":"span"},{"style":{"height":16.83},"width":59.21,"height":42.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/45-6.png","element":"img","alt":"¯∇ϕ","inline":true,"padRight":true},{"text":"revealed by Lemma ","element":"span"},{"href":"#id-204","text":"I.4","element":"a"},{"text":". Incorporating (","element":"span"},{"href":"#id-205","text":"92","element":"a"},{"text":") into (","element":"span"},{"href":"#id-206","text":"91","element":"a"},{"text":") and recalling the variance ","element":"span"},{"style":{"height":11.59},"width":41.77,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/45-7.png","element":"img","alt":" σϕ","inline":true,"padRight":true},{"text":"in (","element":"span"},{"href":"#id-207","text":"87","element":"a"},{"text":"), we complete the proof.","element":"span"}],[{"text":"Now, assembling the descent properties of (","element":"span"},{"href":"#id-203","text":"88","element":"a"},{"text":") and (","element":"span"},{"href":"#id-208","text":"90","element":"a"},{"text":"), we derive the convergence results.","element":"span"}],[{"id":"id-122","style":{"fontWeight":"bold"},"text":"Theorem I.7. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under Assumptions ","element":"span"},{"href":"#id-111","style":{"fontStyle":"italic"},"text":"7.2 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"href":"#id-115","style":{"fontStyle":"italic"},"text":"7.3 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"and given the maximum iteration number ","element":"span"},{"style":{"fontStyle":"italic"},"text":"K","element":"span"},{"style":{"fontStyle":"italic"},"text":", we can choose appropriate sampling configurations ","element":"span"},{"style":{"height":18.19},"width":699.37,"height":45.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/45-8.png","element":"img","alt":" M ∼ O(K4/3), J ∼ O(1), T ∼ O(log K)","inline":true},{"style":{"fontStyle":"italic"},"text":", and set the parameters as follows,","element":"span"}],[{"id":"id-209","style":{"width":"99%"},"width":1945,"height":299,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/45-9.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Then the iterates ","element":"span"},{"style":{"height":16},"width":82.31,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/45-10.png","element":"img","alt":" {xk}","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"generated by Algorithm ","element":"span"},{"href":"#id-105","style":{"fontStyle":"italic"},"text":"4 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"satisfy","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"where the expectation is with respect to the stochasticity of the algorithm.","element":"span"}],[{"style":{"width":"99%"},"width":1933,"height":67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/46-0.png","element":"img"}],[{"text":"leads to the following inequality.","element":"span"}],[{"text":"By the choice 
(","element":"span"},{"href":"#id-209","text":"93","element":"a"},{"text":") of ","element":"span"},{"style":{"height":14.4},"width":173.91,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/46-1.png","element":"img","alt":" βk and µk","inline":true},{"text":", we have for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"2","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . . , K","element":"span"},{"text":",","element":"span"}],[{"id":"id-210","style":{"width":"84%"},"width":1648,"height":217,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/46-2.png","element":"img"}],[{"text":"Substituting these two estimates into (","element":"span"},{"href":"#id-210","text":"95","element":"a"},{"text":") shows","element":"span"}],[{"text":"Next, we combine (","element":"span"},{"href":"#id-210","text":"96","element":"a"},{"text":") with (","element":"span"},{"href":"#id-203","text":"88","element":"a"},{"text":") to get the descent property of ","element":"span"},{"style":{"height":13.19},"width":54.74,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/46-3.png","element":"img","alt":" Sk.","inline":true}],[{"style":{"width":"93%"},"width":1816,"height":310,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/46-4.png","element":"img"}],[{"text":"where the last inequality follows from the choice (","element":"span"},{"href":"#id-209","text":"94","element":"a"},{"text":") of ","element":"span"},{"style":{"height":11.59},"width":41.92,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/46-5.png","element":"img","alt":" nβ","inline":true},{"text":". 
Telescoping from ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"= 1 ","element":"span"},{"text":"to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"K ","element":"span"},{"text":"and dividing by ","element":"span"},{"style":{"fontStyle":"italic"},"text":"K ","element":"span"},{"text":"yields","element":"span"}],[{"id":"id-211","style":{"width":"92%"},"width":1802,"height":123,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/46-6.png","element":"img"}],[{"text":"Rearrange (","element":"span"},{"href":"#id-211","text":"97","element":"a"},{"text":") and employ ","element":"span"},{"style":{"height":14.4},"width":159.54,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/46-7.png","element":"img","alt":" βK ≤ βk,","inline":true}],[{"id":"id-213","style":{"width":"81%"},"width":764,"height":118,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/46-8.png","element":"img"}],[{"text":"Dividing both sides by ","element":"span"},{"style":{"height":16},"width":93.19,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/46-9.png","element":"img","alt":" βK/2","inline":true,"padRight":true},{"text":"implies that","element":"span"}],[{"text":"where ","element":"span"},{"style":{"height":14.19},"width":39.74,"height":35.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/46-10.png","element":"img","alt":" ϕ∗ ","inline":true,"padRight":true},{"text":"is the minimum of ","element":"span"},{"style":{"height":16},"width":77.9,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/46-11.png","element":"img","alt":" ϕ(x)","inline":true},{"text":". 
By Proposition ","element":"span"},{"href":"#id-212","text":"I.2","element":"a"},{"text":", we can choose the appropriate sampling parameters ","element":"span"},{"style":{"height":10.8},"width":85.13,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/46-12.png","element":"img","alt":" M ∼","inline":true},{"href":"#id-207","style":{"height":22.53},"width":1550.94,"height":56.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/46-13.png","element":"img","alt":"O(K4/3), T ∼ O(log K) such that ∥pk∥2 ≤ b2ϕ = K−2/3 and σ2ϕ = K−4/3 in (86) and (87)","inline":true},{"text":". With the parameters ","element":"span"},{"text":"specified in (","element":"span"},{"href":"#id-209","text":"93","element":"a"},{"text":"), we have ","element":"span"},{"style":{"height":23.51},"width":1505.25,"height":58.78,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/47-0.png","element":"img","alt":" σ_ϕ^2 ∑_{k=1}^{K+1} 1/β_k = O(1), ∑_{k=1}^{K+1} β_k ϵ_k = ∑_{k=1}^{K+1} β_k^3 = O(log K), ∑_{k=1}^{K+1} β_k ∥p_k∥^2 = O(1), and","inline":true,"padRight":true},{"text":"therefore the conclusion,","element":"span"}],[{"style":{"width":"92%"},"width":1792,"height":217,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/47-1.png","element":"img"}],[{"text":"The inequality (","element":"span"},{"href":"#id-213","text":"98","element":"a"},{"text":") in the proof of Theorem ","element":"span"},{"href":"#id-122","text":"I.7 ","element":"a"},{"text":"shows ","element":"span"},{"style":{"height":24.35},"width":604.5,"height":60.88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/47-2.png","element":"img","alt":"1K� E[||∇ϕ(xk)||22] = �O(L2�ϕK−2/3)","inline":true},{"text":", which means the outer ","element":"span"},{"text":"iteration complexity is 
","element":"span"},{"style":{"height":17.78},"width":245.83,"height":44.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/47-3.png","element":"img","alt":"�O(|A|1.5ϵ−1.5)","inline":true,"padRight":true},{"text":"by substituting ","element":"span"},{"style":{"height":22.34},"width":270.04,"height":55.85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/47-4.png","element":"img","alt":" L�ϕ = O(�|A|).","inline":true}],[{"id":"id-125","style":{"fontWeight":"bold"},"text":"J ","element":"span"},{"style":{"fontWeight":"bold"},"text":"DETAILS ON EXPERIMENTS","element":"span"}],[{"text":"This section presents the details of the experiment implementation. We conduct the RLHF task on three Atari games from the Arcade Learning Environment (ALE) (","element":"span"},{"href":"#id-214","referenceIndex":5,"text":"Bellemare et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-214","referenceIndex":5,"text":"2013","element":"a"},{"text":") and three Mujoco environments (","element":"span"},{"href":"#id-215","referenceIndex":90,"text":"Todorov et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-215","referenceIndex":90,"text":"2012","element":"a"},{"text":") to empirically validate the performance of the model-free algorithm, SoBiRL. The reward provided by the original environments serves as the ground truth, and for each trajectory pair, preference is assigned to the trajectory with higher accumulated ground-truth reward. 
We compare SoBiRL with the bilevel algorithms DRLHF (","element":"span"},{"href":"#id-28","referenceIndex":16,"text":"Christiano et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-28","referenceIndex":16,"text":"2017","element":"a"},{"text":"), PBRL (","element":"span"},{"href":"#id-9","referenceIndex":80,"text":"Shen et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-9","referenceIndex":80,"text":"2024","element":"a"},{"text":"), HPGD (","element":"span"},{"href":"#id-10","referenceIndex":89,"text":"Thoma et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-10","referenceIndex":89,"text":"2024","element":"a"},{"text":"), and a baseline algorithm SAC (","element":"span"},{"href":"#id-60","referenceIndex":31,"text":"Haarnoja et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-60","referenceIndex":31,"text":"2018a","element":"a"},{"text":",","element":"span"},{"href":"#id-126","referenceIndex":32,"text":"b","element":"a"},{"text":"). All bilevel solvers use deep neural networks to predict rewards, while the baseline SAC receives ground-truth rewards for training. We additionally run a synthetic bilevel RL experiment to verify the convergence rate of the model-based algorithm, M-SoBiRL. All experiments are run on a workstation with two Intel® Xeon® Gold 6330 CPUs (total 2","element":"span"},{"style":{"height":8},"width":31,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/47-5.png","element":"img","alt":"×","inline":true},{"text":"28 cores), 512GB RAM, and one NVIDIA A800 (80GB memory) GPU. 
We have made the code available on ","element":"span"},{"href":"https://github.com/UCAS-YanYang/SoBiRL","text":"https://github.com/UCAS-YanYang/SoBiRL","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"J.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Practical SoBiRL","element":"span"}],[{"text":"SoBiRL adopts deep neural networks to parameterize the policy and the reward model. In this way, we supplement Algorithm ","element":"span"},{"href":"#id-100","text":"1 ","element":"a"},{"text":"with details, resulting in the practical version, Algorithm ","element":"span"},{"href":"#id-216","text":"5","element":"a"},{"text":". The SAC algorithm capable of automating the temperature parameter (","element":"span"},{"href":"#id-126","referenceIndex":32,"text":"Haarnoja et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-126","referenceIndex":32,"text":"2018b","element":"a"},{"text":") is chosen as the lower-level solver for SoBiRL, i.e., for a fixed reward network, ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"SAC ","element":"span"},{"text":"is invoked with several timesteps to update the policy network, the Q-value network and the adaptive temperature parameter (line 2 of Algorithm ","element":"span"},{"href":"#id-216","text":"5","element":"a"},{"text":"). 
Here, to focus more on the implementation of SoBiRL, we omit the details of ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"SAC","element":"span"},{"text":", which fully follows the principles in (","element":"span"},{"href":"#id-126","referenceIndex":32,"text":"Haarnoja et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-126","referenceIndex":32,"text":"2018b","element":"a"},{"text":").","element":"span"}],[{"id":"id-216","style":{"fontWeight":"bold"},"text":"Algorithm 5 ","element":"span"},{"text":"Practical SoBiRL","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Input: ","element":"span"},{"text":"iteration threshold ","element":"span"},{"style":{"fontStyle":"italic"},"text":"K","element":"span"},{"text":", inner iteration timesteps ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N","element":"span"},{"text":", step size ","element":"span"},{"style":{"height":14.4},"width":194.3,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/47-6.png","element":"img","alt":" β, buffer D","inline":true,"padRight":true},{"text":"storing trajectory pairs, initial reward network parameterized by ","element":"span"},{"style":{"height":14.76},"width":34.82,"height":36.9,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/47-7.png","element":"img","alt":" θr1","inline":true},{"text":", initial policy network parameterized by ","element":"span"},{"style":{"height":14.76},"width":38.82,"height":36.9,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/47-8.png","element":"img","alt":" θπ0","inline":true,"padRight":true},{"text":", initial Q-value network","element":"span"}],[{"style":{"width":"99%"},"width":1944,"height":534,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/47-9.png","element":"img"}],[{"text":"Adapting from 
","element":"span"},{"style":{"height":16},"width":197.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/48-0.png","element":"img","alt":"�∇ϕ (xk, πk)","inline":true,"padRight":true},{"text":"in Section ","element":"span"},{"text":"5","element":"span"},{"text":", a practical version of the hyper-gradient estimator ","element":"span"},{"style":{"height":16.79},"width":412.12,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/48-1.png","element":"img","alt":" ∇pϕ (xk, πk, τk) is used,","inline":true}],[{"style":{"width":"99%"},"width":1928,"height":180,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/48-2.png","element":"img"}],[{"text":"The main difference between (","element":"span"},{"href":"#id-217","text":"99","element":"a"},{"text":") and (","element":"span"},{"href":"#id-218","text":"14","element":"a"},{"text":") lies in two aspects: 1) ","element":"span"},{"style":{"height":16.79},"width":270.66,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/48-3.png","element":"img","alt":" ∇pϕ (xk, πk, τk)","inline":true,"padRight":true},{"text":"accommodates the adaptive temperature parameter ","element":"span"},{"style":{"height":9.19},"width":34.42,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/48-4.png","element":"img","alt":" τk","inline":true},{"text":"; 2) the term ","element":"span"},{"style":{"height":28.96},"width":474.54,"height":72.4,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/48-5.png","element":"img","alt":" ∇�rsihaih(xk) − Ea′∼πk(·|sih)","inline":true}],[{"style":{"height":28.8},"width":216.56,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/48-6.png","element":"img","alt":"�rsiha′(xk)��","inline":true},{"text":"is employed as a one-depth truncation of 
","element":"span"},{"href":"#id-218","style":{"height":28.8},"width":692.98,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/48-7.png","element":"img","alt":"�∇�Qsihaih (xk, πk) − Vsih (xk, πk)�in (14","inline":true},{"text":"), inspired by the expressions,","element":"span"}],[{"id":"id-217","style":{"width":"60%"},"width":1179,"height":257,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/48-8.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"J.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Experiment Settings for RLHF","element":"span"}],[{"text":"The results on the Atari games—BeamRider, Seaquest, and SpaceInvaders—are shown in Figures ","element":"span"},{"href":"#id-127","text":"1 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-128","text":"2","element":"a"},{"text":". The results on the Mujoco environments—HalfCheetah, Walker2d, and Hopper—are presented in Figure ","element":"span"},{"href":"#id-129","text":"3","element":"a"},{"text":". The overall performance is summarized in Table ","element":"span"},{"href":"#id-130","text":"2","element":"a"},{"text":".","element":"span"}],[{"style":{"width":"96%"},"width":1885,"height":532,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/48-9.png","element":"img"}],[{"id":"id-128","text":"Figure 2: Comparison on the Atari games—Seaquest and SpaceInvaders—evaluated by the ground-truth reward. ","element":"figcaption","subtype":"caption"},{"text":"Each bilevel algorithm collects a total of ","element":"figcaption","subtype":"caption"},{"text":"3000 ","element":"figcaption","subtype":"caption"},{"text":"trajectory pairs. 
The results are averaged over ","element":"figcaption","subtype":"caption"},{"text":"5 ","element":"figcaption","subtype":"caption"},{"text":"seeds.","element":"figcaption","subtype":"caption"}],[{"text":"As mentioned above, we take ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"SAC ","element":"span"},{"text":"as the lower-level solver for the bilevel algorithms. We set the initial temperature parameter ","element":"span"},{"style":{"height":12.79},"width":107.8,"height":31.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/48-10.png","element":"img","alt":" τ0 = 1","inline":true},{"text":", the inner iteration timesteps ","element":"span"},{"style":{"height":13.38},"width":143.93,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/48-11.png","element":"img","alt":" N = 104 ","inline":true,"padRight":true},{"text":"for Atari games and ","element":"span"},{"style":{"height":13.38},"width":208.67,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/48-12.png","element":"img","alt":" N = 5 × 103 ","inline":true,"padRight":true},{"text":"for Mujoco simulations, and the batch size to ","element":"span"},{"text":"64","element":"span"},{"text":". 
In addition, on Atari games, the actor network, the critic network, and the temperature parameter are updated by the Adam optimizer, with an initial learning rate ","element":"span"},{"style":{"height":13.38},"width":156.74,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/48-13.png","element":"img","alt":" 1 × 10−4","inline":true,"padRight":true},{"text":"for BeamRider and SpaceInvaders, and ","element":"span"},{"style":{"height":13.38},"width":147.68,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/48-14.png","element":"img","alt":" 3 × 10−4 ","inline":true,"padRight":true},{"text":"for Seaquest, linearly decaying to ","element":"span"},{"style":{"height":13.78},"width":247.9,"height":34.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/48-15.png","element":"img","alt":" 0 after 8 × 106 ","inline":true,"padRight":true},{"text":"time steps (although the runs were actually trained for only ","element":"span"},{"style":{"height":13.38},"width":124.09,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/48-16.png","element":"img","alt":" 4 × 106 ","inline":true,"padRight":true},{"text":"timesteps). On all Mujoco simulations, the critic network and the temperature parameter share the same initial learning rate ","element":"span"},{"style":{"height":13.39},"width":150.17,"height":33.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/48-17.png","element":"img","alt":" 1 × 10−3","inline":true},{"text":", while that of the actor network is ","element":"span"},{"style":{"height":13.39},"width":150.17,"height":33.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/48-18.png","element":"img","alt":" 3 × 10−4","inline":true},{"text":". 
They are updated by Adam optimizers, linearly decaying the learning rates to ","element":"span"},{"style":{"height":13.78},"width":245,"height":34.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/48-19.png","element":"img","alt":" 0 after 8 × 106 ","inline":true,"padRight":true},{"text":"time steps (although the runs were actually trained for only ","element":"span"},{"style":{"height":13.38},"width":124.48,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/48-20.png","element":"img","alt":" 4 × 106 ","inline":true,"padRight":true},{"text":"timesteps).","element":"span"}],[{"text":"Bilevel algorithms employ deep neural networks to parameterize the reward model. The wrapped environment returns an ","element":"span"},{"style":{"height":10.8},"width":208.74,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/48-21.png","element":"img","alt":" 84 × 84 × 4","inline":true,"padRight":true},{"text":"tensor as the state information. Therefore, data in the shape of ","element":"span"},{"style":{"height":10.8},"width":208.73,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/48-22.png","element":"img","alt":" 84 × 84 × 4","inline":true,"padRight":true},{"text":"is fed into the reward network. It undergoes four convolutional layers with kernel sizes of ","element":"span"},{"style":{"height":14.8},"width":451.89,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/48-23.png","element":"img","alt":" 7 × 7, 5 × 5, 3 × 3, 3 × 3","inline":true,"padRight":true},{"text":"and stride values of ","element":"span"},{"text":"3","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"2","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1","element":"span"},{"text":", respectively. 
Each convolutional layer holds ","element":"span"},{"text":"16 ","element":"span"},{"text":"filters and incorporates leaky ReLU nonlinearities (","element":"span"},{"style":{"height":10.8},"width":150.08,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/48-24.png","element":"img","alt":"α = 0.01","inline":true},{"text":"). Subsequently, the data passes through a fully connected layer of size ","element":"span"},{"text":"64 ","element":"span"},{"text":"and is then transformed into a scalar. Batch normalization and dropout with a dropout rate of ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"5 ","element":"span"},{"text":"are applied to all convolutional layers to mitigate overfitting. An AdamW (","element":"span"},{"href":"#id-219","referenceIndex":64,"text":"Loshchilov and Hutter","element":"a"},{"text":", ","element":"span"},{"href":"#id-219","referenceIndex":64,"text":"2018","element":"a"},{"text":") optimizer is adopted with the learning rate ","element":"span"},{"style":{"height":17.38},"width":709.5,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/48-25.png","element":"img","alt":" 3 × 10−4, betas (0.9, 0.999), epsilon 10−8 ","inline":true,"padRight":true},{"text":"and weight decay ","element":"span"},{"style":{"height":13.38},"width":93.64,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/48-26.png","element":"img","alt":" 10−2.","inline":true}],[{"style":{"width":"96%"},"width":1884,"height":1108,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/49-0.png","element":"img"}],[{"text":"Figure 3: Comparison of algorithms on the Mujoco simulations—HalfCheetah, Walker2d, and Hopper—evaluated by the ground-truth reward. 
Each bilevel algorithm collects a total of ","element":"figcaption","subtype":"caption"},{"text":"3000 ","element":"figcaption","subtype":"caption"},{"text":"trajectory pairs. The results are averaged over ","element":"figcaption","subtype":"caption"},{"id":"id-129","text":"5 ","element":"figcaption","subtype":"caption"},{"text":"seeds.","element":"figcaption","subtype":"caption"}],[{"text":"Trajectories of ","element":"span"},{"text":"25 ","element":"span"},{"text":"timesteps are collected to construct the comparison pairs. Initially, we warm up the reward model for ","element":"span"},{"text":"500 ","element":"span"},{"text":"epochs with ","element":"span"},{"text":"600 ","element":"span"},{"text":"labeled pairs. During subsequent training, the algorithm collects ","element":"span"},{"text":"6 ","element":"span"},{"text":"new pairs per reward-learning epoch based on the current policy, until the buffer is filled with ","element":"span"},{"text":"3000 ","element":"span"},{"text":"pairs. 
The batch size is set to ","element":"span"},{"text":"32","element":"span"},{"text":".","element":"span"}],[{"text":"For the Atari games, we employ wrappers originating from (","element":"span"},{"href":"#id-220","referenceIndex":71,"text":"Mnih et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-220","referenceIndex":71,"text":"2015","element":"a"},{"text":"): initial ","element":"span"},{"text":"0 ","element":"span"},{"text":"to ","element":"span"},{"text":"30 ","element":"span"},{"text":"no-ops to inject stochasticity, max-pooling pixel values over the last two frames, an episodic life counter, four-frame skipping to accelerate sampling, four-frame stacking to help infer game dynamics, warping the image to size ","element":"span"},{"style":{"height":10.8},"width":128.48,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/49-1.png","element":"img","alt":"84 × 84","inline":true,"padRight":true},{"text":"and clipping rewards to 
","element":"span"},{"style":{"height":16},"width":121.7,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/49-2.png","element":"img","alt":" [−1, 1].","inline":true}],[{"text":"For PBRL, initial tests show that the “value penalty” variant with penalty parameter ","element":"span"},{"text":"2 ","element":"span"},{"text":"performs robustly, so we adopt this configuration for comparison.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"J.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Experiment Settings for the Synthetic Bilevel RL Problem","element":"span"}],[{"text":"To test M-SoBiRL, we build a synthetic bilevel RL problem in the form of (","element":"span"},{"href":"#id-81","text":"4","element":"a"},{"text":") with the upper-level ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"as a quadratic function of ","element":"span"},{"style":{"height":15.6},"width":94.82,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/50-0.png","element":"img","alt":" (x, π)","inline":true,"padRight":true},{"text":"and the dimension of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"= 100","element":"span"},{"text":". The lower-level problem is adapted from RL problems used in (","element":"span"},{"href":"#id-65","referenceIndex":100,"text":"Zhan et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-65","referenceIndex":100,"text":"2023","element":"a"},{"text":") and (","element":"span"},{"href":"#id-66","referenceIndex":52,"text":"Li et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-66","referenceIndex":52,"text":"2024b","element":"a"},{"text":"). 
Specifically, we take ","element":"span"},{"style":{"fontStyle":"italic"},"text":"|S| ","element":"span"},{"text":"= 10","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"|A| ","element":"span"},{"text":"= 5 ","element":"span"},{"text":"and ","element":"span"},{"style":{"height":14.8},"width":252.62,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/50-1.png","element":"img","alt":" γ = 0.5, τ = 1","inline":true},{"text":". The transition probabilities are generated randomly, and for all ","element":"span"},{"style":{"height":16},"width":245.05,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/50-2.png","element":"img","alt":" (s, a) ∈ S × A","inline":true},{"text":", the reward ","element":"span"},{"style":{"height":9.19},"width":50.01,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/50-3.png","element":"img","alt":" rsa","inline":true,"padRight":true},{"text":"is parameterized by a quadratic form of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"with a random perturbation. The step sizes in Algorithm ","element":"span"},{"href":"#id-88","text":"2 ","element":"a"},{"text":"are taken as ","element":"span"},{"style":{"height":16},"width":346.01,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/50-4.png","element":"img","alt":" (β, η) = (0.008, 0.5).","inline":true,"padRight":true},{"text":"Results of the synthetic experiment are shown in Figure ","element":"span"},{"href":"#id-131","text":"4","element":"a"},{"text":". 
The curves exhibit the benign convergence property of the proposed algorithm.","element":"span"}],[{"style":{"width":"96%"},"width":1886,"height":509,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/50-5.png","element":"img"}],[{"id":"id-131","text":"Figure 4: A synthetic bilevel RL problem to verify the model-based algorithm, M-SoBiRL. Metrics are the ","element":"figcaption","subtype":"caption"},{"text":"hyper-gradient norm ","element":"figcaption","subtype":"caption"},{"style":{"height":16.78},"width":166.58,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/50-6.png","element":"img","alt":" ∥∇ϕ(x)∥2","inline":true,"padRight":true},{"text":"and the upper-level loss ","element":"figcaption","subtype":"caption"},{"style":{"height":16},"width":130.43,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/50-7.png","element":"img","alt":" f(x, π).","inline":true}],[{"id":"id-82","style":{"fontWeight":"bold"},"text":"K ","element":"span"},{"style":{"fontWeight":"bold"},"text":"APPLICATIONS OF BILEVEL REINFORCEMENT LEARNING","element":"span"}],[{"text":"In this section, we provide a detailed explanation of how the formulation (","element":"span"},{"href":"#id-81","text":"4","element":"a"},{"text":"), with the upper-level function","element":"span"}],[{"id":"id-221","style":{"width":"68%"},"width":1325,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/50-8.png","element":"img"}],[{"text":"incorporates bilevel RL applications, using additional examples similar to those in Section ","element":"span"},{"href":"#id-94","text":"3.3","element":"a"},{"text":". 
We also explain how the boundedness of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"l ","element":"span"},{"text":"assumed in Section ","element":"span"},{"text":"7 ","element":"span"},{"text":"can be verified individually for each case.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Reward Shaping ","element":"span"},{"text":"From the perspective of bilevel RL, we can shape an auxiliary reward function ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":") ","element":"span"},{"text":"at the lower level for efficient agent training, while maintaining the original environment at the upper level to align with the initial task evaluation (","element":"span"},{"href":"#id-30","referenceIndex":38,"text":"Hu et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-30","referenceIndex":38,"text":"2020","element":"a"},{"text":"), which is established as follows,","element":"span"}],[{"style":{"width":"45%"},"width":890,"height":168,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/50-9.png","element":"img"}],[{"text":"Note that its upper-level function falls into the formulation (","element":"span"},{"href":"#id-221","text":"100","element":"a"},{"text":") with ","element":"span"},{"style":{"height":21.3},"width":630.99,"height":53.25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/50-10.png","element":"img","alt":" I = 1, H = 1, l = −Qπ¯M¯τ�s10, a10�","inline":true},{"text":". 
Therefore, the boundedness and smoothness of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"l ","element":"span"},{"text":"follow from Proposition ","element":"span"},{"href":"#id-148","text":"H.5","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Reinforcement Learning from Human Feedback ","element":"span"},{"text":"The goal of RLHF is to learn the intrinsic reward function, which incorporates expert knowledge, from simple labels that contain only human preferences. It optimizes a policy under ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":") ","element":"span"},{"text":"at the lower level, and adjusts ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"to align the preference predicted by the reward model ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":") ","element":"span"},{"text":"with the true labels at the upper level.","element":"span"}],[{"style":{"width":"99%"},"width":1944,"height":335,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/50-11.png","element":"img"}],[{"text":"and the preference label ","element":"span"},{"style":{"height":16},"width":176.43,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/51-0.png","element":"img","alt":" y ∈ {0, 1}","inline":true},{"text":", indicating preference for ","element":"span"},{"style":{"height":13.19},"width":36.74,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/51-1.png","element":"img","alt":" d1","inline":true,"padRight":true},{"text":"over 
","element":"span"},{"style":{"height":13.19},"width":36.74,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/51-2.png","element":"img","alt":" d2","inline":true},{"text":", follows the human feedback distribution ","element":"span"},{"style":{"height":16},"width":378.35,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/51-3.png","element":"img","alt":"y ∼ Dhuman(y|d1, d2)","inline":true},{"text":". Moreover, ","element":"span"},{"style":{"height":13.19},"width":30.89,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/51-4.png","element":"img","alt":" lh","inline":true,"padRight":true},{"text":"is the binary cross-entropy loss, ","element":"span"},{"style":{"height":16},"width":703.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/51-5.png","element":"img","alt":" lh (d1, d2, y; x) = −y log P (d1 ≻ d2; x) −","inline":true},{"style":{"height":16},"width":425.47,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/51-6.png","element":"img","alt":"(1 − y) log P (d2 ≻ d1; x)","inline":true},{"text":", with the preference probability ","element":"span"},{"style":{"height":16},"width":240.12,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/51-7.png","element":"img","alt":" P (d1 ≻ d2; x)","inline":true,"padRight":true},{"text":"built by the Bradley–Terry model,","element":"span"}],[{"id":"id-83","style":{"width":"79%"},"width":1547,"height":156,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/51-8.png","element":"img"}],[{"text":"The upper-level function of RLHF is an instance of (","element":"span"},{"href":"#id-221","text":"100","element":"a"},{"text":") with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"I ","element":"span"},{"text":"= 2","element":"span"},{"text":", finite 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"text":", and ","element":"span"},{"style":{"height":16.79},"width":395.9,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/51-9.png","element":"img","alt":" l = Ey [lh (d1, d2, y; x)]","inline":true},{"text":". Moreover, the boundedness of the reward ","element":"span"},{"style":{"height":16},"width":224.82,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/51-10.png","element":"img","alt":" |rsa(x)| ≤ Cr","inline":true,"padRight":true},{"text":"guarantees the boundedness of ","element":"span"},{"style":{"height":19.37},"width":417.84,"height":48.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/51-11.png","element":"img","alt":" P, i.e., (1/2) exp(−2HCr) ≤","inline":true},{"style":{"height":16},"width":306.11,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/51-12.png","element":"img","alt":"P(d1 ≻ d2; x) ≤ 1","inline":true},{"text":", and thus the boundedness of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"l","element":"span"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Contract Design ","element":"span"},{"text":"This problem involves designing payment mechanisms that enable a principal to influence the agent’s decision-making process (","element":"span"},{"href":"#id-222","referenceIndex":94,"text":"Wu et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-222","referenceIndex":94,"text":"2024","element":"a"},{"text":"). Through the state-contingent payment ","element":"span"},{"style":{"height":22.25},"width":205.72,"height":55.62,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/51-13.png","element":"img","alt":" x ∈ R|S|×|S|+","inline":true,"padRight":true},{"text":", the principal seeks to incentivize the agent to adopt policies that serve the principal’s interests. 
Specifically, when the agent takes action ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"at state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":", it incurs the cost ","element":"span"},{"style":{"fontStyle":"italic"},"text":"c","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a","element":"span"},{"text":") ","element":"span"},{"text":"at the lower level, while the principal only observes the transition ","element":"span"},{"style":{"height":8.8},"width":118.9,"height":22,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/51-14.png","element":"img","alt":"s → s′","inline":true,"padRight":true},{"text":"and receives the reward ","element":"span"},{"style":{"height":16},"width":128.62,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/51-15.png","element":"img","alt":" R(s, s′)","inline":true},{"text":". To encourage the transition, the principal offers a positive payment ","element":"span"},{"style":{"height":16},"width":120.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/51-16.png","element":"img","alt":"x(s, s′)","inline":true},{"text":", which is thus added to the lower-level objective and subtracted from the upper-level objective.","element":"span"}],[{"style":{"width":"70%"},"width":1382,"height":257,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/51-17.png","element":"img"}],[{"text":"Note that its upper-level function is an instance of (","element":"span"},{"href":"#id-221","text":"100","element":"a"},{"text":") with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"I ","element":"span"},{"text":"= 1","element":"span"},{"text":", finite ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"text":", and 
","element":"span"},{"style":{"height":20.4},"width":435.26,"height":50.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/51-18.png","element":"img","alt":" l = ∑_{t=0}^{H} −R(s_t, s_{t+1}) +","inline":true},{"style":{"height":15.6},"width":176.44,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/51-19.png","element":"img","alt":"x(s_t, s_{t+1})","inline":true},{"text":". It is reasonable to assume that the reward ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R ","element":"span"},{"text":"and the payment ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"are bounded in practice, and thus the function ","element":"span"},{"style":{"fontStyle":"italic"},"text":"l ","element":"span"},{"text":"is bounded.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Efficient Robot Navigation ","element":"span"},{"text":"As formulated in (","element":"span"},{"href":"#id-8","referenceIndex":10,"text":"Chakraborty et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-8","referenceIndex":10,"text":"2024","element":"a"},{"text":"), we consider a maze-world environment of size ","element":"span"},{"style":{"height":10.8},"width":121.59,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/51-20.png","element":"img","alt":" N × N","inline":true},{"text":", and aim to navigate a robot to get close to the destination ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"N, N","element":"span"},{"text":") ","element":"span"},{"text":"efficiently. 
To this end, the designer can introduce a modified goal ","element":"span"},{"style":{"height":14.19},"width":116.36,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/51-21.png","element":"img","alt":" x ∈ R2","inline":true},{"text":", which induces an associated reward ","element":"span"},{"style":{"height":16.58},"width":225.1,"height":41.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/51-22.png","element":"img","alt":" rx ∈ R|S|×|A|","inline":true},{"text":", and then use this goal to train the robot (in the lower level). In this manner, the upper-level objective includes both the received reward and the additional cost of moving the robot from ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"to ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"N, N","element":"span"},{"text":")","element":"span"},{"text":". The task is formulated by the following bilevel problem,","element":"span"}],[{"style":{"width":"60%"},"width":1173,"height":257,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/51-23.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":11.6},"width":101.57,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/51-24.png","element":"img","alt":" ω > 0","inline":true,"padRight":true},{"text":"is the weight balancing the deviation from the original destination ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"N, N","element":"span"},{"text":") ","element":"span"},{"text":"and the accumulative reward measured under the modified goal ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":". 
In this case, the upper-level function matches (","element":"span"},{"href":"#id-221","text":"100","element":"a"},{"text":") with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"I ","element":"span"},{"text":"= 1","element":"span"},{"text":", finite ","element":"span"},{"style":{"height":28.8},"width":940,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/51-25.png","element":"img","alt":" H, and l = −∑_{t=0}^{H} (r_x(s_t, a_t) − ω ∥x − (N, N)∥^2 / H)","inline":true},{"text":", which is bounded if the reward is bounded.","element":"span"}],[{"id":"id-124","style":{"fontWeight":"bold"},"text":"L ","element":"span"},{"style":{"fontWeight":"bold"},"text":"EXTENSION TO ASYNCHRONOUS AND DISTRIBUTED BILEVEL RL SETTINGS","element":"span"}],[{"text":"In Algorithm ","element":"span"},{"href":"#id-100","text":"1","element":"a"},{"text":", the task within the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":"-th outer iteration is divided into two components: one is dedicated to approximately solving the lower-level problem such that ","element":"span"},{"style":{"height":17.9},"width":264.27,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/51-26.png","element":"img","alt":" ∥π∗k − πk∥22 ≤ ϵ","inline":true},{"text":", and the other is used to collect rollouts ","element":"span"},{"text":"via ","element":"span"},{"style":{"height":9.19},"width":39.72,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/51-27.png","element":"img","alt":" πk","inline":true},{"text":". We denote the sample complexities of these two components by ","element":"span"},{"style":{"height":13.59},"width":253.32,"height":33.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/51-28.png","element":"img","alt":" csolver and croll","inline":true},{"text":", respectively. 
Employing the asynchronous strategy, existing work (","element":"span"},{"href":"#id-223","referenceIndex":81,"text":"Shen et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-223","referenceIndex":81,"text":"2023","element":"a"},{"text":") shows that the actor-critic algorithm achieves a sample complexity of ","element":"span"},{"style":{"height":16},"width":152.97,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/51-29.png","element":"img","alt":" csolver/N","inline":true,"padRight":true},{"text":"per worker, where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"text":"is the number of workers. In a similar spirit, we can also take advantage of multiprocessing to collect rollouts amongst ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"text":"workers asynchronously and thus enjoy a complexity of ","element":"span"},{"style":{"height":16},"width":119.32,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19697/images/51-30.png","element":"img","alt":" croll/N","inline":true,"padRight":true},{"text":"per worker. Finally, the rollouts from all workers are aggregated to estimate the hyper-gradient and update the upper-level variable. In ","element":"span"},{"text":"this manner, the error bound for the hyper-gradient estimation in our work still holds, and thus the convergence analysis can be generalized.","element":"span"}],[{"text":"In the distributed setting (e.g., see (","element":"span"},{"href":"#id-224","referenceIndex":15,"text":"Chen et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-224","referenceIndex":15,"text":"2022c","element":"a"},{"text":")), agents are connected via a fully decentralized network and interact with a shared environment. Since each agent can communicate only with its neighbors, estimating the (global) hyper-gradient becomes challenging. 
Therefore, developing a tailored communication strategy and establishing a convergence analysis is an interesting and meaningful topic with broad applications in multi-agent RL problems. To this end, recent advances in decentralized bilevel optimization offer valuable insights (","element":"span"},{"href":"#id-225","referenceIndex":45,"text":"Kong ","element":"a"},{"href":"#id-225","referenceIndex":45,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-225","referenceIndex":45,"text":"2024","element":"a"},{"text":"; ","element":"span"},{"href":"#id-226","referenceIndex":34,"text":"He et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-226","referenceIndex":34,"text":"2024","element":"a"},{"text":"; ","element":"span"},{"href":"#id-227","referenceIndex":103,"text":"Zhu et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-227","referenceIndex":103,"text":"2024","element":"a"},{"text":").","element":"span"}]]}],"_version":"3.3.4"},"paperNode":"$28:props:children:props:children:0:props:product"}]]