39:[["$","audio",null,{"id":"tts"}],["$","$L3e",null,{"paperID":"95035","publisher":"neurips","paperJSON":{"title":"MetaCURL: Non-stationary Concave Utility Reinforcement Learning","paperID":"95035","avgLineHeight":11.02,"imgScale":4,"sections":[{"heading":"Abstract","paragraphs":[[{"text":"$3f","element":"span"}]]},{"heading":"1 Introduction","paragraphs":[[{"text":"We consider the task of learning in an episodic Markov decision process (MDP) with a finite state space ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X","element":"span"},{"text":", a finite action space ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A","element":"span"},{"text":", episodes of length ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N","element":"span"},{"text":", and a probability transition kernel ","element":"span"},{"style":{"height":17.81},"width":242.52,"height":44.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/0-0.png","element":"img","alt":"p := (pn)n∈[N]","inline":true,"padRight":true},{"text":"such that for all ","element":"span"},{"style":{"height":16},"width":522.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/0-1.png","element":"img","alt":" (x, a) ∈ X × A, pn(·|x, a) ∈ SX","inline":true,"padRight":true},{"text":". For any finite set ","element":"span"},{"style":{"fontStyle":"italic"},"text":"B","element":"span"},{"text":", we denote by ","element":"span"},{"style":{"height":14.21},"width":44.52,"height":35.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/0-2.png","element":"img","alt":"SB","inline":true,"padRight":true},{"text":"the simplex induced by this set, and by ","element":"span"},{"style":{"fontStyle":"italic"},"text":"|B| ","element":"span"},{"text":"its cardinality. 
For all ","element":"span"},{"style":{"height":16},"width":496,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/0-3.png","element":"img","alt":" d ∈ N we let [d] := {1, . . . , d}.","inline":true,"padRight":true},{"text":"At each time step ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":", an agent in state ","element":"span"},{"style":{"height":9.6},"width":40.52,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/0-4.png","element":"img","alt":" xn","inline":true,"padRight":true},{"text":"chooses an action ","element":"span"},{"style":{"height":16},"width":239.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/0-5.png","element":"img","alt":" an ∼ πn(·|xn)","inline":true,"padRight":true},{"text":"by means of a policy, and moves to the next state ","element":"span"},{"style":{"height":16},"width":377,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/0-6.png","element":"img","alt":" xn+1 ∼ pn+1(·|xn, an)","inline":true},{"text":", inducing a state-action distribution sequence ","element":"span"},{"style":{"height":17.6},"width":932.52,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/0-7.png","element":"img","alt":"µπ,p := (µπ,pn )n∈[N], where µπ,pn ∈ SX×A for all n ∈ [N].","inline":true}],[{"text":"In many applications of learning in episodic MDPs, an agent aims at finding an optimal policy ","element":"span"},{"style":{"height":7.2},"width":22.48,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/0-8.png","element":"img","alt":" π","inline":true,"padRight":true},{"text":"maximizing/minimizing a concave/convex function ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F ","element":"span"},{"text":"of its state-action distribution, known as the Concave Utility Reinforcement Learning (CURL) 
problem:","element":"span"}],[{"id":"id-35","style":{"width":"60%"},"width":964,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/0-9.png","element":"img"}],[{"text":"CURL extends reinforcement learning (RL) from linear to convex losses. Many machine learning problems can be written in the CURL setting, including: RL, where for a loss function ","element":"span"},{"style":{"height":16},"width":210,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/0-10.png","element":"img","alt":" ℓ, F(µπ,p) =","inline":true},{"style":{"height":16},"width":129,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/0-11.png","element":"img","alt":"⟨ℓ, µπ,p⟩","inline":true},{"text":"; pure RL exploration [","element":"span"},{"href":"#id-0","referenceIndex":28,"text":"28","element":"a"},{"text":"], where ","element":"span"},{"style":{"height":16},"width":456,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/0-12.png","element":"img","alt":" F(µπ,p) = ⟨µπ,p, log(µπ,p)⟩","inline":true},{"text":"; imitation learning [","element":"span"},{"href":"#id-1","referenceIndex":26,"text":"26","element":"a"},{"text":", ","element":"span"},{"href":"#id-2","referenceIndex":35,"text":"35","element":"a"},{"text":"] and apprenticeship learning [","element":"span"},{"href":"#id-3","referenceIndex":55,"text":"55","element":"a"},{"text":", ","element":"span"},{"href":"#id-4","referenceIndex":1,"text":"1","element":"a"},{"text":"], where ","element":"span"},{"style":{"height":16.8},"width":543.52,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/0-13.png","element":"img","alt":" F(µπ,p) = Dg(µπ,p, µ∗), with Dg","inline":true,"padRight":true},{"text":"representing a Bregman","element":"span"}],[{"text":"38th Conference on Neural Information Processing Systems (NeurIPS 2024).","element":"span"}],[{"text":"divergence induced by a function 
","element":"span"},{"style":{"height":14.59},"width":134.52,"height":36.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/1-0.png","element":"img","alt":" g and µ∗ ","inline":true,"padRight":true},{"text":"being a behavior to be imitated; certain instances of mean-field control [","element":"span"},{"href":"#id-5","referenceIndex":7,"text":"7","element":"a"},{"text":"], where ","element":"span"},{"style":{"height":16},"width":423,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/1-1.png","element":"img","alt":" F(µπ,p) = ⟨ℓ(µπ,p), µπ,p⟩","inline":true},{"text":"; mean-field games with potential rewards [","element":"span"},{"href":"#id-6","referenceIndex":34,"text":"34","element":"a"},{"text":"]; among others. The CURL problem alters the additive structure inherent in standard RL, invalidating the classical Bellman equations and requiring the development of new algorithms.","element":"span"}],[{"text":"Most existing works on CURL focus on stationary environments [","element":"span"},{"href":"#id-0","referenceIndex":28,"text":"28","element":"a"},{"text":", ","element":"span"},{"href":"#id-7","referenceIndex":57,"text":"57","element":"a"},{"text":", ","element":"span"},{"href":"#id-8","referenceIndex":58,"text":"58","element":"a"},{"text":", ","element":"span"},{"href":"#id-9","referenceIndex":5,"text":"5","element":"a"},{"text":", ","element":"span"},{"href":"#id-10","referenceIndex":56,"text":"56","element":"a"},{"text":", ","element":"span"},{"href":"#id-11","referenceIndex":25,"text":"25","element":"a"},{"text":", ","element":"span"},{"href":"#id-12","referenceIndex":12,"text":"12","element":"a"},{"text":", ","element":"span"},{"href":"#id-13","referenceIndex":11,"text":"11","element":"a"},{"text":"], where both the objective function ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F ","element":"span"},{"text":"and the probability transition kernel 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"remain the same across episodes. However, in practical scenarios, environments are rarely stationary. The work of [","element":"span"},{"href":"#id-14","referenceIndex":39,"text":"39","element":"a"},{"text":"] is the first to address online CURL with objective functions that can change arbitrarily between episodes, also known as adversarial losses [","element":"span"},{"href":"#id-15","referenceIndex":19,"text":"19","element":"a"},{"text":"]. However, their work assumes stationary probability kernels and presents results in terms of static regret (performance comparable to an optimal policy). In non-stationary scenarios, it is more relevant to minimize dynamic regret—the gap between the learner’s total loss and that of any policy sequence (see Eq. ","element":"span"},{"href":"#id-16","text":"(5) ","element":"a"},{"text":"for formal definition). In this work we address this problem by introducing the first algorithm for CURL handling adversarial objective functions and non-stationary probability transitions, achieving near-optimal dynamic regret.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"High-level idea. ","element":"span"},{"text":"Our approach, called MetaCURL, draws inspiration from the online learning literature. In online learning [","element":"span"},{"href":"#id-17","referenceIndex":9,"text":"9","element":"a"},{"text":"], non-stationarity is often managed by running multiple black-box algorithm instances from various starting points and dynamically selecting the best performer using an \"expert\" algorithm. 
This strategy has demonstrated effectiveness in settings with complete information [","element":"span"},{"href":"#id-18","referenceIndex":29,"text":"29","element":"a"},{"text":", ","element":"span"},{"href":"#id-19","referenceIndex":59,"text":"59","element":"a"},{"text":", ","element":"span"},{"href":"#id-20","referenceIndex":47,"text":"47","element":"a"},{"text":", ","element":"span"},{"href":"#id-21","referenceIndex":33,"text":"33","element":"a"},{"text":"]. With MetaCURL, we extend this concept to decision-making in MDPs. Unlike classical online learning, the main challenge faced is uncertainty. We assume that the probability transition kernel in each episode has a known deterministic structure but is affected by an external noise with unknown distribution, placing us in a setting with only partial information (see Section ","element":"span"},{"text":"2 ","element":"span"},{"text":"for more details). The learner is then unable to observe the agent’s loss under policies other than the one played.","element":"span"}],[{"text":"MetaCURL is a general algorithm that can be applied with any black-box algorithm with low dynamic regret in near-stationary environments. CURL approaches suitable as black-boxes rely on parametric algorithms that would require prior knowledge of the MDP changes to tune their learning rate. MetaCURL also addresses this challenge by simultaneously running multiple learning rates and weighting them in direct proportion to their empirical performance. 
MetaCURL achieves optimal regret of order ","element":"span"},{"style":{"height":19.2},"width":1211,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/1-2.png","element":"img","alt":"˜O�√∆π∗T + min {√∆p∞T, T 2/3(∆p)1/3}�, where ∆p∞ and ∆p represent","inline":true,"padRight":true},{"text":"the frequency and magnitude of changes of the probability transition kernel respectively, and ","element":"span"},{"style":{"height":14.59},"width":67.84,"height":36.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/1-3.png","element":"img","alt":" ∆π∗","inline":true,"padRight":true},{"text":"is the magnitude of changes of the policy sequence we compare ourselves with in dynamic regret (see Eqs. ","element":"span"},{"href":"#id-22","text":"(6) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-23","text":"(7) ","element":"a"},{"text":"for formal definitions). MetaCURL does not require previous knowledge of the degree of non-stationarity of the environment, and can handle adversarial losses. To ensure completeness, we show that Greedy MD-CURL from [","element":"span"},{"href":"#id-14","referenceIndex":39,"text":"39","element":"a"},{"text":"] fulfills the requirements to serve as a black-box algorithm. This is the first dynamic regret analysis for a CURL approach.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Comparisons. ","element":"span"},{"text":"Without literature on non-stationary CURL, we review non-stationary RL approaches. 
Most methods [","element":"span"},{"href":"#id-24","referenceIndex":24,"text":"24","element":"a"},{"text":", ","element":"span"},{"href":"#id-25","referenceIndex":13,"text":"13","element":"a"},{"text":", ","element":"span"},{"href":"#id-26","referenceIndex":45,"text":"45","element":"a"},{"text":", ","element":"span"},{"href":"#id-27","referenceIndex":17,"text":"17","element":"a"},{"text":", ","element":"span"},{"href":"#id-28","referenceIndex":20,"text":"20","element":"a"},{"text":", ","element":"span"},{"href":"#id-29","referenceIndex":40,"text":"40","element":"a"},{"text":", ","element":"span"},{"href":"#id-30","referenceIndex":21,"text":"21","element":"a"},{"text":"] typically rely on prior knowledge of the MDP’s non-stationarity degree, while MetaCURL does not. Let ","element":"span"},{"style":{"height":17.6},"width":61.52,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/1-4.png","element":"img","alt":" ∆l∞","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.6},"width":40,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/1-5.png","element":"img","alt":" ∆l","inline":true,"padRight":true},{"text":"represent the frequency and ","element":"span"},{"text":"magnitude of change in the RL loss function, respectively","element":"span"},{"text":"1","element":"span"},{"text":". Recently, [","element":"span"},{"href":"#id-31","referenceIndex":54,"text":"54","element":"a"},{"text":"] achieved a regret of ","element":"span"},{"style":{"height":19.2},"width":797,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/1-6.png","element":"img","alt":"˜O�min {�(∆p∞ + ∆l∞)T, T 2/3(∆p + ∆l)1/3}�","inline":true},{"text":", a near-optimal result as demonstrated by [","element":"span"},{"href":"#id-29","referenceIndex":40,"text":"40","element":"a"},{"text":"], without requiring prior knowledge of the environment’s variation. 
However, this regret bound is tied to changes in loss functions, making it ineffective against adversarial losses. In contrast, rather than","element":"span"}],[{"id":"id-32","text":"Table 1: Comparisons of our results with the state-of-the-art in non-stationary RL. Here, ","element":"figcaption","subtype":"caption"},{"style":{"height":15.81},"width":135,"height":39.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/1-7.png","element":"img","alt":" ∆p∞, ∆p","inline":true,"padRight":true},{"text":"and ","element":"figcaption","subtype":"caption"},{"style":{"height":14.4},"width":63.52,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/1-8.png","element":"img","alt":" ∆π∗","inline":true,"padRight":true},{"text":"are defined in ","element":"figcaption","subtype":"caption"},{"href":"#id-22","text":"(6) ","element":"a","subtype":"caption"},{"text":"and ","element":"figcaption","subtype":"caption"},{"href":"#id-23","text":"(7)","element":"a","subtype":"caption"},{"text":"; and ","element":"figcaption","subtype":"caption"},{"style":{"height":17.79},"width":61.48,"height":44.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/1-9.png","element":"img","alt":" ∆l∞","inline":true,"padRight":true},{"text":"and ","element":"figcaption","subtype":"caption"},{"style":{"height":13.39},"width":43.2,"height":33.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/1-10.png","element":"img","alt":" ∆l","inline":true,"padRight":true},{"text":"measure the RL loss function variations","element":"figcaption","subtype":"caption"},{"text":"1","element":"span","subtype":"caption"},{"text":". 
We ","element":"figcaption","subtype":"caption"},{"text":"introduce ","element":"figcaption","subtype":"caption"},{"style":{"height":18},"width":710,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/1-11.png","element":"img","alt":" DT (∆∞, ∆) := min {√∆∞T, T 2/3∆1/3}.","inline":true}],[{"style":{"width":"97%"},"width":1546,"height":225,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/1-12.png","element":"img"}],[{"text":"depending on the magnitude of variation of the loss function, MetaCURL’s bound depends on the magnitude of variation of the policy sequence we use for comparison in dynamic regret. This allows it to handle adversarial losses, and to compare against policies with a more favorable bias-variance trade-off, which may not align with the optimal policies for each loss. In addition, we improve this dependency by paying it as","element":"span"},{"style":{"height":16.19},"width":131,"height":40.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/2-0.png","element":"img","alt":"√∆π∗T","inline":true,"padRight":true},{"text":"instead of ","element":"span"},{"style":{"height":18.4},"width":224.52,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/2-1.png","element":"img","alt":" (∆π∗)1/3T 2/3","inline":true},{"text":". We summarize comparisons in Table ","element":"span"},{"href":"#id-32","text":"1.","element":"a"}],[{"style":{"fontWeight":"bold"},"text":"Other related works. ","element":"span"},{"text":"The studies by [","element":"span"},{"href":"#id-33","referenceIndex":43,"text":"43","element":"a"},{"text":", ","element":"span"},{"href":"#id-34","referenceIndex":42,"text":"42","element":"a"},{"text":"] examine the difference between optimizing the objective over infinite trials and the expectation of the objective over a single trial, challenging the traditional CURL formulation in Eq. ","element":"span"},{"href":"#id-35","text":"(1)","element":"a"},{"text":". 
Here, we retain the classic formulation to align with existing CURL research. Other works on RL with nonlinear objective functions include [","element":"span"},{"href":"#id-36","referenceIndex":46,"text":"46","element":"a"},{"text":", ","element":"span"},{"href":"#id-37","referenceIndex":16,"text":"16","element":"a"},{"text":"], which focus on rewards over trajectories rather than individual states. Beyond non-stationary RL, a series of works studies RL with adversarial losses but ","element":"span"},{"style":{"fontStyle":"italic"},"text":"stationary ","element":"span"},{"text":"probability transitions, with results only on static regret [","element":"span"},{"href":"#id-38","referenceIndex":48,"text":"48","element":"a"},{"text":", ","element":"span"},{"href":"#id-39","referenceIndex":30,"text":"30","element":"a"},{"text":", ","element":"span"},{"href":"#id-40","referenceIndex":18,"text":"18","element":"a"},{"text":", ","element":"span"},{"href":"#id-41","referenceIndex":50,"text":"50","element":"a"},{"text":", ","element":"span"},{"href":"#id-42","referenceIndex":32,"text":"32","element":"a"},{"text":", ","element":"span"},{"href":"#id-43","referenceIndex":14,"text":"14","element":"a"},{"text":"]. Another line of research is known as corruption-robust RL. It differs from non-stationary MDPs in that it assumes a ground-truth MDP and measures adversary malice by the degree of ground-truth corruption ","element":"span"},{"href":"#id-44","referenceIndex":31,"text":"[31, ","element":"a"},{"href":"#id-45","referenceIndex":38,"text":"38, ","element":"a"},{"href":"#id-46","referenceIndex":10,"text":"10, ","element":"a"},{"href":"#id-47","referenceIndex":60,"text":"60, ","element":"a"},{"href":"#id-48","referenceIndex":53,"text":"53]","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Contributions. 
","element":"span"},{"text":"We summarize our main contributions below:","element":"span"}],[{"text":"• We introduce MetaCURL, the first algorithm for non-stationary CURL. Under the framework described in Section ","element":"span"},{"text":"2, ","element":"span"},{"text":"MetaCURL achieves the optimal dynamic regret bound of order ","element":"span"},{"style":{"height":20.99},"width":714,"height":52.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/2-2.png","element":"img","alt":"Õ(√(∆π∗ T) + min {√(∆p∞ T), T^(2/3) (∆p)^(1/3)})","inline":true},{"text":", without requiring prior knowledge of the degree of non-stationarity of the MDP. MetaCURL handles fully adversarial losses and improves the dependency of the regret on the total variation of policies. MetaCURL is the first adaptation of Learning with Expert Advice (LEA) to deal with uncertainty in non-stationary MDPs.","element":"span"}],[{"text":"• We also establish the first dynamic regret upper bound for an online CURL algorithm in a nearly stationary environment, which can serve as a black-box routine for MetaCURL.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Notations. 
","element":"span"},{"text":"Let ","element":"span"},{"style":{"height":16},"width":77.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/2-3.png","element":"img","alt":" ∥ · ∥1","inline":true,"padRight":true},{"text":"be the ","element":"span"},{"style":{"height":13.39},"width":39.52,"height":33.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/2-4.png","element":"img","alt":" L1","inline":true,"padRight":true},{"text":"norm, and for all ","element":"span"},{"style":{"height":17.6},"width":240,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/2-5.png","element":"img","alt":" v := (vn)n∈[N]","inline":true},{"text":", such that ","element":"span"},{"style":{"height":16.59},"width":195,"height":41.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/2-6.png","element":"img","alt":" vn ∈ RX×A","inline":true},{"text":"we define ","element":"span"},{"style":{"height":18},"width":467.52,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/2-7.png","element":"img","alt":"∥v∥∞,1 := sup1≤n≤N ∥vn∥1.","inline":true}]]},{"heading":"2 General framework: non-stationary CURL","paragraphs":[[{"text":"When an agent plays a policy ","element":"span"},{"style":{"height":17.79},"width":256,"height":44.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/2-8.png","element":"img","alt":" π := (πn)n∈[N]","inline":true,"padRight":true},{"text":"in an episodic MDP with probability transition ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":", it induces a state-action distribution sequence (also called the occupancy-measure [","element":"span"},{"href":"#id-49","referenceIndex":61,"text":"61","element":"a"},{"text":"]), which we denote by ","element":"span"},{"style":{"height":17.6},"width":333.52,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/2-9.png","element":"img","alt":" µπ,p := 
(µπ,pn )n∈[N]","inline":true},{"text":", with ","element":"span"},{"style":{"height":15.41},"width":233,"height":38.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/2-10.png","element":"img","alt":" µπ,pn ∈ SX×A","inline":true},{"text":". It can be calculated recursively for all ","element":"span"},{"style":{"height":17.41},"width":898,"height":43.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/2-11.png","element":"img","alt":"(x, a) ∈ X and n ∈ [N] by taking µπ,p0 (x, a) = µ0(x, a)","inline":true,"padRight":true},{"text":"fixed, and","element":"span"}],[{"style":{"width":"79%"},"width":1266,"height":95,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/2-12.png","element":"img"}],[{"id":"id-67","style":{"fontWeight":"bold"},"text":"Offline CURL. ","element":"span"},{"text":"The classic CURL optimization problem in Eq. ","element":"span"},{"href":"#id-35","text":"(1) ","element":"a"},{"text":"considers minimizing a function ","element":"span"},{"style":{"height":17.6},"width":318.48,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/2-13.png","element":"img","alt":"F : (SX×A)N → R","inline":true},{"text":", here defined as ","element":"span"},{"style":{"height":20.4},"width":503.48,"height":51,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/2-14.png","element":"img","alt":" F(µ) := Σ_{n=1}^{N} fn(µn) with fn","inline":true,"padRight":true},{"text":"a convex function over ","element":"span"},{"style":{"height":14.59},"width":123.52,"height":36.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/2-15.png","element":"img","alt":" µn with","inline":true,"padRight":true},{"text":"values in ","element":"span"},{"text":"[0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1]","element":"span"},{"text":", across all policies that induce 
","element":"span"},{"style":{"height":14.4},"width":68,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/2-16.png","element":"img","alt":" µπ,p","inline":true},{"text":". Note that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F ","element":"span"},{"text":"is not convex in the policy ","element":"span"},{"style":{"height":7.2},"width":22.52,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/2-17.png","element":"img","alt":" π","inline":true},{"text":". To convexify the problem, we define the set of state-action distributions satisfying the Bellman flow of an MDP with transition kernel ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"as","element":"span"}],[{"style":{"width":"95%"},"width":1521,"height":111,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/2-18.png","element":"img"}],[{"text":"For any ","element":"span"},{"style":{"height":17.6},"width":151.52,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/2-19.png","element":"img","alt":" µ ∈ Mpµ0","inline":true},{"text":", there exists a strategy ","element":"span"},{"style":{"height":14.21},"width":327.48,"height":35.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/2-20.png","element":"img","alt":" π such that µπ,p = µ","inline":true},{"text":". It suffices to take ","element":"span"},{"style":{"height":16},"width":317.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/2-21.png","element":"img","alt":" πn(a|x) ∝ µn(x, a)","inline":true,"padRight":true},{"text":"when the normalization factor is non-zero, and to define it arbitrarily otherwise. 
There is thus an equivalence between the CURL problem (optimization on policies) and a convex optimization problem on state-action distributions satisfying the Bellman flow:","element":"span"}],[{"id":"id-69","style":{"width":"68%"},"width":1087,"height":70,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/2-22.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Online CURL. ","element":"span"},{"text":"In this paper we consider the online CURL problem in a non-stationary setting. We assume a finite-horizon scenario with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"episodes. An oblivious adversary generates a sequence of changing objective functions ","element":"span"},{"style":{"height":18.4},"width":291,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/2-23.png","element":"img","alt":" (F t)t∈[T ], with F t ","inline":true,"padRight":true},{"text":"being fully communicated to the learner only at the end of episode ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":". We assume ","element":"span"},{"style":{"height":12.8},"width":41,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/2-24.png","element":"img","alt":" F t","inline":true,"padRight":true},{"text":"is ","element":"span"},{"style":{"height":13.41},"width":49,"height":33.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/2-25.png","element":"img","alt":" LF","inline":true,"padRight":true},{"text":"-Lipschitz with respect to the ","element":"span"},{"style":{"height":16.61},"width":120.48,"height":41.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/2-26.png","element":"img","alt":" ∥ · ∥∞,1","inline":true,"padRight":true},{"text":"norm for all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":". 
The probability transition kernel is also allowed to evolve over time and is denoted by ","element":"span"},{"style":{"height":16},"width":32.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/3-0.png","element":"img","alt":" pt","inline":true},{"text":"at episode ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":". The learner’s objective is then to calculate a sequence of strategies ","element":"span"},{"style":{"height":18.4},"width":136.48,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/3-1.png","element":"img","alt":" (πt)t∈[T ]","inline":true,"padRight":true},{"text":"minimizing a total loss ","element":"span"},{"style":{"height":20.59},"width":389,"height":51.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/3-2.png","element":"img","alt":" LT := �Tt=1 F t(µπt,pt)","inline":true},{"text":", while dealing with adversarial objective functions ","element":"span"},{"style":{"height":12.8},"width":40.48,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/3-3.png","element":"img","alt":" F t","inline":true,"padRight":true},{"text":"and changing ","element":"span"},{"text":"probability transition kernels ","element":"span"},{"style":{"height":16},"width":32.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/3-4.png","element":"img","alt":" pt","inline":true},{"text":". 
To measure the learner’s performance, we use the notion of dynamic regret (the difference between the learner’s total loss and that of any policy sequence ","element":"span"},{"style":{"height":18.4},"width":189,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/3-5.png","element":"img","alt":" (πt,∗)t∈[T ]):","inline":true}],[{"id":"id-16","style":{"width":"78%"},"width":1250,"height":64,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/3-6.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Non-stationarity measures. ","element":"span"},{"text":"We consider the following two non-stationarity measures ","element":"span"},{"style":{"height":15.79},"width":241.52,"height":39.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/3-7.png","element":"img","alt":" ∆p∞ and ∆p on","inline":true,"padRight":true},{"text":"the probability transition kernels that respectively measure abrupt and smooth variations:","element":"span"}],[{"id":"id-22","style":{"width":"98%"},"width":1564,"height":117,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/3-8.png","element":"img"}],[{"text":"Regarding dynamic regret, we define, for any sequence of policies ","element":"span"},{"style":{"height":18.61},"width":162.48,"height":46.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/3-9.png","element":"img","alt":" (πt,∗)t∈[T ]","inline":true},{"text":", its non-stationarity measure as","element":"span"}],[{"id":"id-23","style":{"width":"87%"},"width":1389,"height":116,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/3-10.png","element":"img"}],[{"text":"Moreover, for any interval ","element":"span"},{"style":{"height":20.21},"width":952.48,"height":50.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/3-11.png","element":"img","alt":" I ⊆ [T], we write ∆pI := Σt∈I ∆pt and ∆π∗I := Σt∈I ∆π∗t .","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Dynamics 
hypothesis. ","element":"span"},{"text":"For each episode ","element":"span"},{"style":{"height":16.61},"width":345.48,"height":41.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/3-12.png","element":"img","alt":" t, let (xt0, at0) ∼ µ0(·)","inline":true},{"text":", and for all time steps ","element":"span"},{"style":{"height":16},"width":138.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/3-13.png","element":"img","alt":" n ∈ [N],","inline":true}],[{"id":"id-50","style":{"width":"62%"},"width":990,"height":54,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/3-14.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":10.61},"width":37.48,"height":26.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/3-15.png","element":"img","alt":" gn","inline":true,"padRight":true},{"text":"represents the deterministic part of the dynamics, and ","element":"span"},{"style":{"height":18.59},"width":149,"height":46.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/3-16.png","element":"img","alt":" (ϵtn)n∈[N] ","inline":true,"padRight":true},{"text":"is a sequence of independent ","element":"span"},{"text":"external noises such that ","element":"span"},{"style":{"height":16.8},"width":342,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/3-17.png","element":"img","alt":" ϵtn ∼ htn(·), where htn ","inline":true,"padRight":true},{"text":"is any centered distribution. 
Note that these dynamics ","element":"span"},{"text":"imply that the probability transition kernel can be written as ","element":"span"},{"style":{"height":21.41},"width":619.52,"height":53.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/3-18.png","element":"img","alt":" ptn+1(x′|x, a) = P(gn(x, a, ϵtn) = x′).","inline":true,"padRight":true},{"text":"Different variants of this problem can be considered, depending on the prior information available about the dynamics in Eq. ","element":"span"},{"href":"#id-50","text":"(8)","element":"a"},{"text":". In this article we consider the case where ","element":"span"},{"style":{"height":10.59},"width":37.48,"height":26.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/3-19.png","element":"img","alt":" gn","inline":true,"padRight":true},{"text":"is fixed and known by the learner, but ","element":"span"},{"style":{"height":16.99},"width":39.52,"height":42.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/3-20.png","element":"img","alt":" htn ","inline":true,"padRight":true},{"text":"is unknown and can change (hence the source of uncertainty and non-stationarity ","element":"span"},{"text":"of the transitions). To the best of our knowledge, there are no black-box algorithms in the literature that achieve sublinear regret for online CURL with adversarial losses without relying on model assumptions. In applying RL methods to CURL, we believe model-optimistic approaches like UCRL (Upper Confidence RL [","element":"span"},{"href":"#id-51","referenceIndex":4,"text":"4","element":"a"},{"text":"]) could be adapted. However, these methods are computationally expensive, as they require solving an additional optimization problem in every episode. 
The black-box algorithm for CURL we consider from [","element":"span"},{"href":"#id-14","referenceIndex":39,"text":"39","element":"a"},{"text":"] provides closed-form solutions, which is more computationally efficient, but requires the same dynamic assumption as in Eq. ","element":"span"},{"href":"#id-50","text":"(8)","element":"a"},{"text":". Another class of RL methods is policy optimization (PO), which directly optimizes the policy and often yields closed-form solutions, leading to faster performance. Recent theoretical work [","element":"span"},{"href":"#id-52","referenceIndex":37,"text":"37","element":"a"},{"text":"] has shown that PO methods can achieve near-optimal regret without model assumptions. However, these methods rely on RL’s Bellman equations, which do not apply to CURL due to its non-linear nature. We believe that the MetaCURL analysis could potentially be extended to the case where ","element":"span"},{"style":{"height":10.61},"width":37.52,"height":26.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/3-21.png","element":"img","alt":" gn","inline":true,"padRight":true},{"text":"is unknown but belongs to a parametric family. We leave this extension for future work.","element":"span"}],[{"text":"This particular dynamic is also motivated by many real-world applications:","element":"span"}],[{"text":"• Controlling a fleet of drones in a known environment, subject to external influences due to weather conditions or human interventions.","element":"span"}],[{"text":"• Addressing data center power management aiming to cut energy expenses while maintaining service quality. Workload fluctuations cause dynamic job queue transitions, and volatile electricity prices lead to varying operational costs. 
The probabilities of task processing by each server are predetermined, but the probabilities of task arrival are uncertain ","element":"span"},{"href":"#id-53","referenceIndex":6,"text":"[6]","element":"a"},{"text":".","element":"span"}],[{"text":"• As renewable energy use increases and energy demand grows, balancing production and consumption becomes harder. Certain devices, like electric vehicle batteries and water heaters, can serve as flexible energy storage options. However, this requires electric grids to establish policies regulating when these devices turn on or off to match a desired consumption profile. These profiles can fluctuate daily due to changes in energy production levels. Despite knowing the devices’ physical dynamics, household consumption habits remain unpredictable and constantly changing ","element":"span"},{"href":"#id-54","referenceIndex":51,"text":"[51, ","element":"a"},{"href":"#id-55","referenceIndex":41,"text":"41]","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Outline. ","element":"span"},{"text":"In this paper, we propose a novel approach to handle non-stationarity in MDPs, being the first to propose a solution to CURL within this context. We begin in Section ","element":"span"},{"text":"3 ","element":"span"},{"text":"by discussing the idea behind our algorithm’s construction and the key challenges within our framework. Section ","element":"span"},{"text":"4 ","element":"span"},{"text":"introduces MetaCURL, while Section ","element":"span"},{"text":"5 ","element":"span"},{"text":"presents the main results of our regret analysis. The proofs’ specifics are provided in the appendix.","element":"span"}]]},{"heading":"3 Main idea","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"A hypothetical learner who achieves optimal regret. ","element":"span"},{"text":"Let ","element":"span"},{"style":{"fontStyle":"italic"},"text":"m > ","element":"span"},{"text":"1","element":"span"},{"text":". 
Assume a hypothetical learner that could compute a sequence of restart times ","element":"span"},{"style":{"height":14.78},"width":533.76,"height":36.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/4-0.png","element":"img","alt":" 1 = t1 < . . . < tm+1 = T + 1","inline":true},{"text":", where for each ","element":"span"},{"style":{"height":16},"width":539,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/4-1.png","element":"img","alt":"i ∈ [m] we let Ii := [ti, ti+1 − 1]","inline":true},{"text":", such that ","element":"span"},{"style":{"height":8.27},"width":136,"height":20.68,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/4-2.png","element":"img","alt":"∆p ∆p","inline":true}],[{"id":"id-56","style":{"width":"50%"},"width":804,"height":54,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/4-3.png","element":"img"}],[{"text":"Consider any algorithm that, when computing ","element":"span"},{"style":{"height":16.8},"width":114,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/4-4.png","element":"img","alt":" (πt)t∈I","inline":true,"padRight":true},{"text":"with learning rate ","element":"span"},{"style":{"height":11.6},"width":20.48,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/4-5.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"for any interval ","element":"span"},{"style":{"height":16},"width":132,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/4-6.png","element":"img","alt":" I ⊆ [T],","inline":true,"padRight":true},{"text":"attains a dynamic regret relative to any sequence of policies ","element":"span"},{"style":{"height":16.8},"width":140,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/4-7.png","element":"img","alt":" (πt,∗)t∈I","inline":true,"padRight":true},{"text":"upper bounded 
by","element":"span"}],[{"id":"id-57","style":{"width":"78%"},"width":1238,"height":54,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/4-8.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":17.81},"width":128,"height":44.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/4-9.png","element":"img","alt":" (cj)j∈[3]","inline":true,"padRight":true},{"text":"are constants that may depend on the MDP parameters, and on the interval size only in logarithmic terms. This kind of regret bound holds for Greedy MD-CURL from [","element":"span"},{"href":"#id-14","referenceIndex":39,"text":"39","element":"a"},{"text":"] as we show in Appendix ","element":"span"},{"text":"G. ","element":"span"},{"text":"Suppose the hypothetical learner could also access ","element":"span"},{"style":{"height":14.4},"width":63.52,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/4-10.png","element":"img","alt":" ∆π∗","inline":true,"padRight":true},{"text":"to calculate the optimal learning rate. 
Hence, running such an algorithm over the entire horizon ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"with the optimal learning rate, the learner would have a dynamic regret upper bounded by","element":"span"}],[{"style":{"width":"59%"},"width":936,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/4-11.png","element":"img"}],[{"text":"Optimizing over ","element":"span"},{"style":{"fontStyle":"italic"},"text":"m","element":"span"},{"text":", the learner would obtain the optimal regret of order ","element":"span"},{"style":{"height":22.19},"width":462,"height":55.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/4-12.png","element":"img","alt":"˜O(√∆π∗T + (∆p)1/3T 2/3).","inline":true,"padRight":true},{"text":"In the case where the MDP is piece-wise stationary, if the learner takes ","element":"span"},{"style":{"height":13.6},"width":33.52,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/4-13.png","element":"img","alt":" Ii","inline":true,"padRight":true},{"text":"such that ","element":"span"},{"style":{"height":19.01},"width":143.52,"height":47.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/4-14.png","element":"img","alt":" ∆pIi = 0","inline":true},{"text":", it ","element":"span"},{"text":"obtains a regret of order ","element":"span"},{"style":{"height":19.2},"width":368.52,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/4-15.png","element":"img","alt":" O(√∆π∗T +√∆p∞T)","inline":true},{"text":", where ","element":"span"},{"style":{"height":15.6},"width":61.52,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/4-16.png","element":"img","alt":" ∆p∞","inline":true,"padRight":true},{"text":"is the number of times the probability ","element":"span"},{"text":"transitions of the MDP change over 
","element":"span"},{"text":"[","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":"]","element":"span"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"A meta algorithm to learn restart times. ","element":"span"},{"text":"In reality, the restart times of Eq. ","element":"span"},{"href":"#id-56","text":"(9)","element":"a"},{"text":", and the optimal learning rate, are unknown to the learner. Hence, we propose to build a meta aggregation algorithm to learn both. Let ","element":"span"},{"style":{"fontStyle":"italic"},"text":"E ","element":"span"},{"text":"represent a parametric black-box algorithm with dynamic regret as in Eq. ","element":"span"},{"href":"#id-57","text":"(10)","element":"a"},{"text":". We introduce a meta algorithm ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"that, takes as input a finite set of learning rates ","element":"span"},{"style":{"height":11.6},"width":26,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/4-17.png","element":"img","alt":" Λ","inline":true},{"text":", and at each episode ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", initializes ","element":"span"},{"style":{"height":16},"width":41,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/4-18.png","element":"img","alt":" |Λ|","inline":true,"padRight":true},{"text":"instances of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"E","element":"span"},{"text":", denoted as ","element":"span"},{"style":{"height":14},"width":63,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/4-19.png","element":"img","alt":" Et,λ","inline":true,"padRight":true},{"text":"for each ","element":"span"},{"style":{"height":12.4},"width":98,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/4-20.png","element":"img","alt":" λ ∈ 
Λ","inline":true},{"text":". Each ","element":"span"},{"style":{"height":14},"width":62.52,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/4-21.png","element":"img","alt":" Et,λ","inline":true,"padRight":true},{"text":"operates independently within the interval ","element":"span"},{"text":"[","element":"span"},{"style":{"fontStyle":"italic"},"text":"t, T","element":"span"},{"text":"]","element":"span"},{"text":". At time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"combines the decisions from the active runs ","element":"span"},{"style":{"height":18.19},"width":229.48,"height":45.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/4-22.png","element":"img","alt":" {Es,λ}s≤t,λ∈Λ","inline":true,"padRight":true},{"text":"by weighted average. The idea is that at time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", some of the outputs of ","element":"span"},{"style":{"height":18.21},"width":230,"height":45.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/4-23.png","element":"img","alt":" {Es,λ}s≤t,λ∈Λ","inline":true,"padRight":true},{"text":"are not based on data prior to ","element":"span"},{"style":{"height":13.01},"width":91.52,"height":32.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/4-24.png","element":"img","alt":" t′ < t","inline":true},{"text":", so if the environment changes at time ","element":"span"},{"style":{"height":12.4},"width":22,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/4-25.png","element":"img","alt":" t′","inline":true},{"text":", these outputs can be given a greater weight by the meta algorithm, enabling it to adapt more quickly to the change. At the same time, we expect a larger weight will be given to the empirically best learning rate. 
Let ","element":"span"},{"style":{"height":16},"width":145,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/4-26.png","element":"img","alt":" M(E, Λ)","inline":true,"padRight":true},{"text":"be the complete algorithm.","element":"span"}],[{"id":"id-103","style":{"fontWeight":"bold"},"text":"Remark 3.1. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The meta-algorithm increases the computational complexity of the parametric black-box algorithm by a factor of ","element":"span"},{"style":{"height":16},"width":122,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/4-27.png","element":"img","alt":" T × |Λ|","inline":true},{"style":{"fontStyle":"italic"},"text":", as it requires updating ","element":"span"},{"style":{"height":16},"width":107.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/4-28.png","element":"img","alt":" t × |Λ|","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"instances at each episode ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"style":{"fontStyle":"italic"},"text":". 
By strategically designing intervals to run the black-box algorithms, previous works on online learning have reduced computational complexity to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"O","element":"span"},{"text":"(log(","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":")) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"[","element":"span"},{"href":"#id-58","referenceIndex":15,"style":{"fontStyle":"italic"},"text":"15","element":"a"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"href":"#id-18","referenceIndex":29,"style":{"fontStyle":"italic"},"text":"29","element":"a"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"href":"#id-59","referenceIndex":27,"style":{"fontStyle":"italic"},"text":"27","element":"a"},{"style":{"fontStyle":"italic"},"text":"]. Extending our analysis to these intervals is straightforward, but it would complicate the presentation of the paper. Thus, we decided to present our results using the naive choice of intervals. Also, in Section ","element":"span"},{"style":{"fontStyle":"italic"},"text":"5, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"we show that a learning rate grid with ","element":"span"},{"style":{"height":16},"width":206.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/4-29.png","element":"img","alt":" |Λ| = log(T)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is sufficient to obtain the optimal regret.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Regret decomposition. 
","element":"span"},{"text":"Denote by ","element":"span"},{"style":{"height":13.39},"width":89.16,"height":33.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/4-30.png","element":"img","alt":" πt,s,λ ","inline":true,"padRight":true},{"text":"the policy output from ","element":"span"},{"style":{"height":17.2},"width":258,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/4-31.png","element":"img","alt":" Es,λ at episode t","inline":true},{"text":", for learning rate ","element":"span"},{"style":{"height":13.6},"width":29,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/4-32.png","element":"img","alt":" λ,","inline":true,"padRight":true},{"text":"for all ","element":"span"},{"style":{"height":16.4},"width":255.48,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/4-33.png","element":"img","alt":" s ≤ t, and by πt ","inline":true,"padRight":true},{"text":"the policy output by the meta algorithm ","element":"span"},{"style":{"height":16},"width":144.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/4-34.png","element":"img","alt":" M(E, Λ)","inline":true,"padRight":true},{"text":"to be played by the learner. 
The regret of ","element":"span"},{"style":{"height":16},"width":144.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/4-35.png","element":"img","alt":" M(E, Λ)","inline":true,"padRight":true},{"text":"can be decomposed as the sum of the regret suffered by the meta algorithm aggregation scheme, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M","element":"span"},{"text":", and the regret from the black-box algorithm, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"E","element":"span"},{"text":", played with any learning rate ","element":"span"},{"style":{"height":12.4},"width":105,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/4-36.png","element":"img","alt":" λ ∈ Λ","inline":true},{"text":". The dynamic regret, defined in Eq. ","element":"span"},{"href":"#id-16","text":"(5)","element":"a"},{"text":", can be decomposed, for any set of intervals ","element":"span"},{"style":{"height":16},"width":885.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/4-37.png","element":"img","alt":"Ii = [ti, ti+1 − 1], with 1 = t1 < . . . 
< tm+1 = T + 1","inline":true},{"text":", and for any learning rate ","element":"span"},{"style":{"height":14},"width":149.52,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/4-38.png","element":"img","alt":" λ ∈ Λ, as","inline":true}],[{"id":"id-79","style":{"width":"99%"},"width":1585,"height":245,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/4-39.png","element":"img"}],[{"text":"The black-box regret on ","element":"span"},{"style":{"height":13.6},"width":33.48,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/5-0.png","element":"img","alt":" Ii","inline":true,"padRight":true},{"text":"is exactly the standard regret for ","element":"span"},{"style":{"height":16},"width":145,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/5-1.png","element":"img","alt":" T = |Ii|","inline":true,"padRight":true},{"text":"with a learning rate of ","element":"span"},{"style":{"height":11.39},"width":20.48,"height":28.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/5-2.png","element":"img","alt":" λ","inline":true},{"text":". Hence, in order to prove low dynamic regret for ","element":"span"},{"style":{"height":16},"width":144.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/5-3.png","element":"img","alt":" M(E, Λ)","inline":true,"padRight":true},{"text":"we have to: show that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"incurs a low dynamic regret in each interval ","element":"span"},{"style":{"height":13.6},"width":32.48,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/5-4.png","element":"img","alt":" Ii","inline":true},{"text":"; find a black-box algorithm ","element":"span"},{"style":{"fontStyle":"italic"},"text":"E ","element":"span"},{"text":"for CURL that has dynamic regret as in Eq. 
","element":"span"},{"href":"#id-57","text":"(10)","element":"a"},{"text":", and build a learning rate grid ","element":"span"},{"style":{"height":11.6},"width":25.52,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/5-5.png","element":"img","alt":" Λ","inline":true,"padRight":true},{"text":"allowing us to perform nearly as well as the optimal learning rate.","element":"span"}]]},{"heading":"4 MetaCURL Algorithm","paragraphs":[[{"text":"We call our meta-algorithm ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"MetaCURL. It is based on sleeping experts, is parameter-free, and achieves optimal regret. Its construction is described below.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"4.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Learning with expert advice","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"General setting. ","element":"span"},{"text":"In Learning with Expert Advice (LEA), a learner makes sequential online predictions ","element":"span"},{"style":{"height":16.61},"width":173,"height":41.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/5-6.png","element":"img","alt":"u1, . . . , uT","inline":true,"padRight":true},{"text":"in a decision space ","element":"span"},{"style":{"fontStyle":"italic"},"text":"U","element":"span"},{"text":", over a series of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"episodes, with the help of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"K ","element":"span"},{"text":"experts [","element":"span"},{"href":"#id-60","referenceIndex":22,"text":"22","element":"a"},{"text":", ","element":"span"},{"href":"#id-61","referenceIndex":36,"text":"36","element":"a"},{"text":", ","element":"span"},{"href":"#id-17","referenceIndex":9,"text":"9","element":"a"},{"text":"]. 
For each round ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", each expert ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"makes a prediction ","element":"span"},{"style":{"height":13.81},"width":59.48,"height":34.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/5-7.png","element":"img","alt":" ut,k","inline":true},{"text":", and the learner combines the experts’ predictions by computing a vector ","element":"span"},{"style":{"height":17.41},"width":451.52,"height":43.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/5-8.png","element":"img","alt":" vt := (vt,1, . . . , vt,K) ∈ SK","inline":true},{"text":", and predicting the convex combination of the experts’ predictions ","element":"span"},{"style":{"height":20.8},"width":336.52,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/5-9.png","element":"img","alt":" ut := ∑Kk=1 vt,kut,k","inline":true},{"text":". The environment then reveals a convex ","element":"span"},{"text":"loss function ","element":"span"},{"style":{"height":13.2},"width":184.48,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/5-10.png","element":"img","alt":" ℓt : U → R","inline":true},{"text":". Each expert suffers a loss ","element":"span"},{"style":{"height":17.39},"width":245,"height":43.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/5-11.png","element":"img","alt":" ℓt,k := ℓt(ut,k)","inline":true},{"text":", and the learner suffers a loss ","element":"span"},{"style":{"height":19.41},"width":190,"height":48.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/5-12.png","element":"img","alt":"ˆℓt := ℓt(ut)","inline":true},{"text":". The learner’s objective is to keep the cumulative regret with respect to each expert as low as possible. 
For each expert ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":", this quantity is defined as Reg","element":"span"},{"style":{"height":22.61},"width":410,"height":56.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/5-13.png","element":"img","alt":"[T ](k) := ∑Tt=1 ˆℓt − ℓt,k.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Sleeping experts. ","element":"span"},{"text":"In our case, each black-box algorithm is an expert that does not produce solutions outside its active interval. This problem can be reduced to the sleeping expert problem [","element":"span"},{"href":"#id-62","referenceIndex":8,"text":"8","element":"a"},{"text":", ","element":"span"},{"href":"#id-63","referenceIndex":23,"text":"23","element":"a"},{"text":"], where experts are not required to provide solutions at every time step. Let ","element":"span"},{"style":{"height":17.6},"width":204,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/5-14.png","element":"img","alt":" It,k ∈ {0, 1}","inline":true,"padRight":true},{"text":"be a signal equal to ","element":"span"},{"text":"1 ","element":"span"},{"text":"if expert ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"is active at episode ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"and ","element":"span"},{"text":"0 ","element":"span"},{"text":"otherwise. The algorithm knows ","element":"span"},{"style":{"height":19.2},"width":171.48,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/5-15.png","element":"img","alt":" (It,k)k∈[K]","inline":true,"padRight":true},{"text":"and assigns a zero weight to sleeping experts (","element":"span"},{"style":{"height":17.2},"width":404.52,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/5-16.png","element":"img","alt":"It,k = 0 implies vt,k = 0","inline":true},{"text":"). 
We would like to have a guarantee with respect to expert ","element":"span"},{"style":{"height":16},"width":123,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/5-17.png","element":"img","alt":" k ∈ [K]","inline":true,"padRight":true},{"text":"but only when it is active. Hence, we now aim to bound a cumulative regret that depends on the signal ","element":"span"},{"style":{"height":13.6},"width":57,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/5-18.png","element":"img","alt":" It,k","inline":true},{"text":": Reg","element":"span"},{"text":"sleep","element":"span"},{"style":{"height":23.81},"width":512,"height":59.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/5-19.png","element":"img","alt":"[T ] (k) := ∑Tt=1 It,k(ˆℓt − ℓt,k)","inline":true},{"text":". There is a generic reduction ","element":"span"},{"text":"from the sleeping expert framework to the general LEA setting ","element":"span"},{"href":"#id-64","referenceIndex":3,"text":"[3, ","element":"a"},{"href":"#id-65","referenceIndex":2,"text":"2] ","element":"a"},{"text":"(see Appendix ","element":"span"},{"href":"#id-66","text":"A.1)","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"4.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Meta-aggregation scheme","element":"span"}],[{"text":"In every episode ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", for every learning rate ","element":"span"},{"style":{"height":12.4},"width":109.48,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/5-20.png","element":"img","alt":" λ ∈ Λ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":12.4},"width":96,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/5-21.png","element":"img","alt":" s ≤ t","inline":true},{"text":", an instance 
","element":"span"},{"style":{"height":14},"width":66,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/5-22.png","element":"img","alt":" Es,λ","inline":true},{"text":"of the black-box algorithm acts as an expert computing a policy ","element":"span"},{"style":{"height":13.81},"width":87,"height":34.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/5-23.png","element":"img","alt":" πt,s,λ","inline":true},{"text":". The meta algorithm aims to aggregate these predictions using a sleeping expert approach based on the expert’s losses. However, within CURL’s framework, the meta algorithm faces two challenges:","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Uncertainty. ","element":"span"},{"text":"At the episode’s end, the learner has full information about the objective function ","element":"span"},{"style":{"height":13.01},"width":94.48,"height":32.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/5-24.png","element":"img","alt":" F t. If","inline":true,"padRight":true},{"text":"the learner also knew ","element":"span"},{"style":{"height":16},"width":32.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/5-25.png","element":"img","alt":" pt","inline":true},{"text":", they could recursively calculate the corresponding state-action distribution ","element":"span"},{"href":"#id-67","style":{"height":19.79},"width":353.48,"height":49.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/5-26.png","element":"img","alt":"µπt,s,λ,pt using Eq. (2)","inline":true,"padRight":true},{"text":"and observe the actual loss of each expert, denoted as ","element":"span"},{"style":{"height":20.19},"width":389.52,"height":50.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/5-27.png","element":"img","alt":" F t(µπt,s,λ,pt). 
However,","inline":true,"padRight":true},{"text":"given that ","element":"span"},{"style":{"height":16},"width":32.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/5-28.png","element":"img","alt":" pt ","inline":true,"padRight":true},{"text":"is unknown to the learner, the true loss remains unobservable. Consequently, the meta-algorithm needs to create an estimator ","element":"span"},{"style":{"height":16},"width":32.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/5-29.png","element":"img","alt":" ˆpt","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"height":16},"width":32.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/5-30.png","element":"img","alt":" pt","inline":true,"padRight":true},{"text":"and utilize it to estimate the losses. We propose a method to compute an estimator ","element":"span"},{"style":{"height":16},"width":32.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/5-31.png","element":"img","alt":" ˆpt ","inline":true,"padRight":true},{"text":"in Subsection ","element":"span"},{"href":"#id-68","text":"4.3.","element":"a"}],[{"style":{"fontWeight":"bold"},"text":"Convexity. ","element":"span"},{"text":"As discussed in Section ","element":"span"},{"text":"2, ","element":"span"},{"text":"the objective functions ","element":"span"},{"style":{"height":12.8},"width":41,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/5-32.png","element":"img","alt":" F t","inline":true},{"text":"are not convex over the space of policies. However, CURL is equivalent to a convex problem over the state-action distributions satisfying the Bellman flow as shown in Eq. ","element":"span"},{"href":"#id-69","text":"(4)","element":"a"},{"text":". 
Therefore, instead of aggregating policies, the meta algorithm aggregates the associated state-action distributions using the probability estimator ","element":"span"},{"style":{"height":16},"width":32.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/5-33.png","element":"img","alt":"ˆpt","inline":true,"padRight":true},{"text":"and the recursive scheme at Eq. ","element":"span"},{"href":"#id-67","text":"(2)","element":"a"},{"text":". We detail MetaCURL in Alg. ","element":"span"},{"href":"#id-70","text":"1 ","element":"a"},{"text":"when employed with the Exponentially Weighted Average forecaster (EWA) as the sleeping expert subroutine (we detail EWA in Appendix ","element":"span"},{"href":"#id-71","text":"A.2)","element":"a"},{"text":".","element":"span"}],[{"id":"id-68","style":{"fontWeight":"bold"},"text":"4.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Building an estimator of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"}],[{"text":"As discussed earlier, applying the learning with experts framework requires estimating the loss of non-played expert policies, which depends on estimating the non-stationary transition probabilities ","element":"span"},{"style":{"height":16},"width":43,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/5-34.png","element":"img","alt":" ˆpt.","inline":true,"padRight":true},{"text":"Standard RL techniques for bounding the ","element":"span"},{"style":{"height":13.41},"width":39.48,"height":33.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/5-35.png","element":"img","alt":" L1","inline":true,"padRight":true},{"text":"norm between the empirical estimator ","element":"span"},{"style":{"height":16},"width":32.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/5-36.png","element":"img","alt":" 
ˆpt","inline":true,"padRight":true},{"text":"and the true","element":"span"}],[{"id":"id-70","style":{"width":"100%"},"width":1594,"height":1492,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/6-0.png","element":"img"}],[{"text":"dynamics ","element":"span"},{"href":"#id-72","referenceIndex":44,"style":{"height":16},"width":157.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/6-1.png","element":"img","alt":" pt [44, 49","inline":true},{"text":"] are not applicable here due to non-stationarity. To address this, we introduce a second layer of sleeping experts for each ","element":"span"},{"style":{"height":16},"width":401.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/6-2.png","element":"img","alt":" (n, x, a) ∈ [N] × X × A","inline":true},{"text":", where each expert provides an empirical estimate of ","element":"span"},{"style":{"height":16},"width":32.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/6-3.png","element":"img","alt":" pt ","inline":true,"padRight":true},{"text":"based on different intervals. We then propose a new loss function in Eq. ","element":"span"},{"href":"#id-73","text":"(12) ","element":"a"},{"text":"and conduct a novel regret analysis in Prop. 
","element":"span"},{"href":"#id-74","text":"5.2 ","element":"a"},{"text":"to achieve the optimal regret rate.","element":"span"}],[{"text":"In each episode ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", the learner calculates independent samples ","element":"span"},{"style":{"height":18.99},"width":304,"height":47.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/6-4.png","element":"img","alt":" xtn,x,a ∼ ptn(·|x, a)","inline":true,"padRight":true},{"text":"utilizing the external ","element":"span"},{"text":"noise sequence ","element":"span"},{"style":{"height":18.4},"width":151.52,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/6-5.png","element":"img","alt":" (εtn)n∈[N]","inline":true,"padRight":true},{"text":"observed (just let ","element":"span"},{"style":{"height":18.99},"width":425,"height":47.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/6-6.png","element":"img","alt":" xtn,x,a = gn−1(x, a, εtn−1)","inline":true},{"text":", see Eq. ","element":"span"},{"href":"#id-50","text":"(8)","element":"a"},{"text":"). Each expert ","element":"span"},{"text":"outputs an empirical estimator of ","element":"span"},{"style":{"height":16.8},"width":154.48,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/6-7.png","element":"img","alt":" ptn(·|x, a)","inline":true,"padRight":true},{"text":"using samples across different intervals. We assume ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"experts, with expert ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"active in interval ","element":"span"},{"text":"[","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, T","element":"span"},{"text":"]","element":"span"},{"text":". 
Expert ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"at episode ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t > s ","element":"span"},{"text":"outputs:","element":"span"}],[{"style":{"width":"70%"},"width":1114,"height":124,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/6-8.png","element":"img"}],[{"text":"We let ","element":"span"},{"style":{"height":16.99},"width":154.48,"height":42.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/6-9.png","element":"img","alt":" ˆptn(·|x, a)","inline":true,"padRight":true},{"text":"be the result of employing sleeping EWA with experts ","element":"span"},{"style":{"height":17.01},"width":171,"height":42.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/6-10.png","element":"img","alt":" ˆpt,sn (·|x, a)","inline":true},{"text":", for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s < t","element":"span"},{"text":". ","element":"span"},{"text":"Typically, in density estimation with EWA, a logarithmic loss ","element":"span"},{"style":{"height":16},"width":124.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/6-11.png","element":"img","alt":" − log(·)","inline":true,"padRight":true},{"text":"is used. 
However, in this case ","element":"span"},{"style":{"height":16},"width":124.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/6-12.png","element":"img","alt":"− log(·)","inline":true,"padRight":true},{"text":"can be unbounded, so we opt here for a smoothed logarithmic loss, given by, for all ","element":"span"},{"style":{"height":14.21},"width":127.48,"height":35.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/6-13.png","element":"img","alt":" q ∈ SX ,","inline":true}],[{"id":"id-73","style":{"width":"96%"},"width":1528,"height":110,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/6-14.png","element":"img"}],[{"text":"The definition of this non-standard loss is further clarified during the regret analysis in Section ","element":"span"},{"text":"5. ","element":"span"},{"text":"This loss function is ","element":"span"},{"text":"1","element":"span"},{"text":"-exp-concave (see Lemma 4 of [","element":"span"},{"href":"#id-75","referenceIndex":52,"text":"52","element":"a"},{"text":"]), hence the cumulative regret of EWA with respect to each expert ","element":"span"},{"style":{"height":16},"width":146,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/6-15.png","element":"img","alt":" s ∈ [T]","inline":true},{"text":", for all episodes ","element":"span"},{"style":{"height":16},"width":187,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/6-16.png","element":"img","alt":" τ ∈ [s, T]","inline":true},{"text":", satisfies Reg","element":"span"},{"text":"sleep","element":"span"},{"style":{"height":21.2},"width":951,"height":53,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/6-17.png","element":"img","alt":"[s,τ](s) = ∑τt=s ℓt(ˆptn(·|x, a)) − ℓt(ˆpt,sn (·|x, a)) ≤ log(T)","inline":true,"padRight":true},{"text":"(for more information on the regret ","element":"span"},{"text":"bounds of EWA with exp-concave losses, see Appendix 
","element":"span"},{"href":"#id-71","text":"A.2)","element":"a"},{"text":". We describe the complete online scheme to compute ","element":"span"},{"style":{"height":16},"width":32.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/7-0.png","element":"img","alt":" ˆpt ","inline":true,"padRight":true},{"text":"in Alg. ","element":"span"},{"href":"#id-76","text":"3 ","element":"a"},{"text":"in Appendix ","element":"span"},{"text":"B.","element":"span"}]]},{"heading":"5 Regret analysis","paragraphs":[[{"text":"This section presents the main result concerning MetaCURL’s regret analysis. Subsection ","element":"span"},{"href":"#id-77","text":"5.1 ","element":"a"},{"text":"shows an upper bound for ","element":"span"},{"style":{"height":12.8},"width":84,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/7-1.png","element":"img","alt":" Rmeta ","inline":true,"padRight":true},{"text":"when MetaCURL is played with EWA and ","element":"span"},{"style":{"height":16},"width":32.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/7-2.png","element":"img","alt":" ˆpt ","inline":true,"padRight":true},{"text":"is computed as in Subsection ","element":"span"},{"href":"#id-68","text":"4.3. ","element":"a"},{"text":"Subsection ","element":"span"},{"href":"#id-78","text":"5.2 ","element":"a"},{"text":"introduces a learning rate grid for MetaCURL when the black-box algorithm meets the dynamic regret criteria in Eq. ","element":"span"},{"href":"#id-57","text":"(10)","element":"a"},{"text":", providing an upper bound for ","element":"span"},{"style":{"height":13.79},"width":140.48,"height":34.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/7-3.png","element":"img","alt":" Rblack-box","inline":true},{"text":". Given the dynamic regret decomposition of Eq. 
","element":"span"},{"href":"#id-79","text":"(11)","element":"a"},{"text":", we see that the combination of these results leads to our main result, the full proof of which can be found in Appendix ","element":"span"},{"text":"F","element":"span"},{"text":":","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Theorem 5.1 ","element":"span"},{"text":"(Main result)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":16},"width":167.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/7-4.png","element":"img","alt":" δ ∈ (0, 1)","inline":true},{"style":{"fontStyle":"italic"},"text":". Playing MetaCURL, with a parametric black-box algorithm ","element":"span"},{"style":{"fontStyle":"italic"},"text":"E ","element":"span"},{"style":{"fontStyle":"italic"},"text":"with dynamic regret as in Eq. ","element":"span"},{"href":"#id-57","text":"(10)","element":"a"},{"style":{"fontStyle":"italic"},"text":", with a learning rate grid ","element":"span"},{"style":{"height":19.39},"width":287,"height":48.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/7-5.png","element":"img","alt":" Λ := {2−j | j =","inline":true},{"style":{"height":19.39},"width":406.52,"height":48.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/7-6.png","element":"img","alt":"0, 1, 2, . . . 
, ⌈log2(T)/2⌉}","inline":true},{"style":{"fontStyle":"italic"},"text":", and with EWA as the sleeping expert subroutine, we obtain, with probability at least ","element":"span"},{"style":{"height":11.6},"width":104,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/7-7.png","element":"img","alt":" 1 − 2δ","inline":true},{"style":{"fontStyle":"italic"},"text":", for any sequence of policies ","element":"span"},{"style":{"height":18.4},"width":173.48,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/7-8.png","element":"img","alt":" (πt,∗)t∈[T ],","inline":true}],[{"style":{"width":"69%"},"width":1106,"height":78,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/7-9.png","element":"img"}],[{"id":"id-77","style":{"fontWeight":"bold"},"text":"5.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Meta-algorithm analysis","element":"span"}],[{"text":"Given the uncertainty in the probability transition, the meta regret term can be decomposed as follows:","element":"span"}],[{"id":"id-92","style":{"width":"97%"},"width":1542,"height":419,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/7-10.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Sleeping LEA regret. ","element":"span"},{"text":"Referring to Thm. 
","element":"span"},{"href":"#id-80","text":"A.1 ","element":"a"},{"text":"in Appendix ","element":"span"},{"text":"A, ","element":"span"},{"text":"using sleeping EWA as the sleeping expert subroutine of MetaCURL, with signals ","element":"span"},{"style":{"height":12.8},"width":130.52,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/7-11.png","element":"img","alt":" It,s = 1","inline":true,"padRight":true},{"text":"for active experts (","element":"span"},{"style":{"height":12.4},"width":85.48,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/7-12.png","element":"img","alt":"s ≤ t","inline":true},{"text":"), experts’ convex losses ","element":"span"},{"style":{"height":20.19},"width":371.48,"height":50.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/7-13.png","element":"img","alt":" ℓt,s,λ := F t(µπt,s,λ,ˆpt)","inline":true},{"text":", and learner loss ","element":"span"},{"style":{"height":19.6},"width":270,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/7-14.png","element":"img","alt":"ˆℓt := F t(µπt,ˆpt)","inline":true},{"text":", yields, for any ","element":"span"},{"style":{"height":16},"width":134,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/7-15.png","element":"img","alt":" m ∈ [T]","inline":true,"padRight":true},{"text":"and for any set of intervals ","element":"span"},{"style":{"height":16},"width":895,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/7-16.png","element":"img","alt":" Ii = [ti, ti+1 − 1], with 1 = t1 < . . . 
< tm+1 = T + 1,","inline":true}],[{"id":"id-81","style":{"width":"92%"},"width":1462,"height":256,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/7-17.png","element":"img"}],[{"style":{"height":16},"width":32.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/7-18.png","element":"img","alt":"ˆpt","inline":true},{"style":{"fontWeight":"bold"},"text":"Estimation regret. ","element":"span"},{"text":"In a scenario without uncertainty in the MDP’s probability transitions, the meta-algorithm’s regret would simply be bounded by Eq. ","element":"span"},{"href":"#id-81","text":"(14)","element":"a"},{"text":", the sleeping expert regret used as a subroutine. However, given the presence of uncertainty, the main challenge in analyzing the meta-regret comes from the regret terms associated with the estimator ","element":"span"},{"style":{"height":16},"width":32.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/7-19.png","element":"img","alt":" ˆpt","inline":true},{"text":". We outline this analysis in Prop. ","element":"span"},{"href":"#id-74","text":"5.2.","element":"a"}],[{"id":"id-74","style":{"fontWeight":"bold"},"text":"Proposition 5.2. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":16},"width":257.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/7-20.png","element":"img","alt":" δ ∈ (0, 1), C :=","inline":true},{"style":{"height":28.8},"width":40,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/7-21.png","element":"img","alt":"�","inline":true,"padRight":true},{"text":"1 ","element":"span"},{"style":{"height":22.59},"width":472.52,"height":56.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/7-22.png","element":"img","alt":"2 log� N|X||A|2|X|Tδ �, and LF","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be the Lipschitz constant of ","element":"span"},{"style":{"height":12.8},"width":41,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/7-23.png","element":"img","alt":"F t","inline":true},{"style":{"fontStyle":"italic"},"text":", with respect to the norm ","element":"span"},{"style":{"height":16.61},"width":356,"height":41.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/7-24.png","element":"img","alt":" ∥ · ∥∞,1, for all t ∈ [T]","inline":true},{"style":{"fontStyle":"italic"},"text":". 
With probability at least ","element":"span"},{"style":{"height":13.6},"width":283.48,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/7-25.png","element":"img","alt":" 1 − δ, MetaCURL","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"obtains","element":"span"}],[{"style":{"width":"94%"},"width":1494,"height":122,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/7-26.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"For any ","element":"span"},{"style":{"height":16},"width":129,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/7-27.png","element":"img","alt":" m ∈ [T]","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and for any set of intervals ","element":"span"},{"style":{"height":16},"width":885.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/7-28.png","element":"img","alt":" Ii = [ti, ti+1 − 1], with 1 = t1 < . . . < tm+1 = T + 1,","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"the same bound is valid for ","element":"span"},{"style":{"height":22},"width":416,"height":55,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/7-29.png","element":"img","alt":"∑mi=1 ∑t∈Ii RˆpIi(πt,ti,λ).","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"The proof idea is based mainly on the formulation of ","element":"span"},{"style":{"height":16},"width":32.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/8-0.png","element":"img","alt":" ˆpt","inline":true},{"text":"described in Subsection ","element":"span"},{"href":"#id-68","text":"4.3. 
","element":"a"},{"text":"We start by using the convexity of ","element":"span"},{"style":{"height":12.8},"width":41,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/8-1.png","element":"img","alt":" F t ","inline":true,"padRight":true},{"text":"to linearize the expression, then we apply Hölder’s inequality and exploit the ","element":"span"},{"style":{"height":13.39},"width":49.48,"height":33.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/8-2.png","element":"img","alt":" LF","inline":true,"padRight":true},{"text":"-Lipschitz property of ","element":"span"},{"style":{"height":12.8},"width":41,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/8-3.png","element":"img","alt":" F t ","inline":true,"padRight":true},{"text":"to establish an upper bound based on the ","element":"span"},{"style":{"height":13.39},"width":39.52,"height":33.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/8-4.png","element":"img","alt":" L1","inline":true,"padRight":true},{"text":"norm difference of the state-action distributions induced by the true probability transition and the estimator. 
Using Lemma ","element":"span"},{"href":"#id-82","text":"C.5 ","element":"a"},{"text":"in Appendix ","element":"span"},{"text":"C, ","element":"span"},{"text":"we then obtain that","element":"span"}],[{"style":{"width":"66%"},"width":1053,"height":123,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/8-5.png","element":"img"}],[{"text":"To use the results from Subsection ","element":"span"},{"href":"#id-68","text":"4.3, ","element":"a"},{"text":"we first regularize ","element":"span"},{"style":{"height":16.8},"width":436.48,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/8-6.png","element":"img","alt":" pt and ˆpt, for each (n, x, a)","inline":true},{"text":", by averaging each with the uniform distribution over ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X","element":"span"},{"text":", that we denote by ","element":"span"},{"style":{"height":16},"width":195.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/8-7.png","element":"img","alt":" p0 := 1/|X|","inline":true},{"text":". As both probabilities are now lower bounded, we can employ Pinsker’s inequality to convert the ","element":"span"},{"style":{"height":13.41},"width":39.48,"height":33.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/8-8.png","element":"img","alt":" L1","inline":true,"padRight":true},{"text":"norm into a KL divergence. 
The sum over ","element":"span"},{"style":{"height":16},"width":109,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/8-9.png","element":"img","alt":" t ∈ [T]","inline":true,"padRight":true},{"text":"of the KL divergence can then be decomposed as follows:","element":"span"}],[{"style":{"width":"94%"},"width":1494,"height":258,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/8-10.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":21.39},"width":181,"height":53.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/8-11.png","element":"img","alt":" ˆpt,tij (·|x, a)","inline":true,"padRight":true},{"text":"is the empirical estimate of ","element":"span"},{"style":{"height":19.2},"width":149.48,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/8-12.png","element":"img","alt":" ptj(·|x, a)","inline":true,"padRight":true},{"text":"calculated with the observed data from ","element":"span"},{"style":{"height":12.61},"width":66.48,"height":31.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/8-13.png","element":"img","alt":" ti to","inline":true},{"style":{"height":10.99},"width":80,"height":27.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/8-14.png","element":"img","alt":"t − 1","inline":true},{"text":", and the expectation is over ","element":"span"},{"style":{"height":19.52},"width":454.32,"height":48.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/8-15.png","element":"img","alt":" ˜xtj,x,a ∼ (ptj(·|x, a) + p0)/2","inline":true},{"text":". The second term is the cumulative ","element":"span"},{"text":"regret of computing ","element":"span"},{"style":{"height":16},"width":32.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/8-16.png","element":"img","alt":" ˆpt","inline":true,"padRight":true},{"text":"using EWA with loss as in Eq. 
","element":"span"},{"href":"#id-73","text":"(12)","element":"a"},{"text":", and is bounded by ","element":"span"},{"style":{"fontStyle":"italic"},"text":"m ","element":"span"},{"text":"log(","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":")","element":"span"},{"text":". We finish and give more details of the proof in Appendix ","element":"span"},{"text":"D.","element":"span"}],[{"text":"Prop. ","element":"span"},{"href":"#id-74","text":"5.2 ","element":"a"},{"text":"together with Eq. ","element":"span"},{"href":"#id-81","text":"(14) ","element":"a"},{"text":"yields the main result of this subsection:","element":"span"}],[{"id":"id-93","style":{"fontWeight":"bold"},"text":"Proposition 5.3 ","element":"span"},{"text":"(Meta regret bound)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"With the same assumptions as Prop. ","element":"span"},{"href":"#id-74","style":{"fontStyle":"italic"},"text":"5.2, ","element":"a"},{"style":{"fontStyle":"italic"},"text":"for any ","element":"span"},{"style":{"height":16},"width":220,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/8-17.png","element":"img","alt":" m ∈ [T], with","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"probability at least ","element":"span"},{"style":{"height":13.6},"width":110.52,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/8-18.png","element":"img","alt":" 1 − 2δ,","inline":true}],[{"style":{"width":"78%"},"width":1244,"height":104,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/8-19.png","element":"img"}],[{"id":"id-78","style":{"fontWeight":"bold"},"text":"5.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Black-box algorithm analysis","element":"span"}],[{"text":"Assuming ","element":"span"},{"style":{"fontStyle":"italic"},"text":"E ","element":"span"},{"text":"is a parametric 
black-box algorithm with dynamic regret satisfying Eq. ","element":"span"},{"href":"#id-57","text":"(10) ","element":"a"},{"text":"for any learning rate ","element":"span"},{"style":{"height":12},"width":96,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/8-20.png","element":"img","alt":" λ > 0","inline":true},{"text":", we only need to address the selection of the ","element":"span"},{"style":{"height":11.6},"width":20.48,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/8-21.png","element":"img","alt":" λ","inline":true},{"text":"s grid and optimize across ","element":"span"},{"style":{"height":11.6},"width":20,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/8-22.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"to achieve our final bound on ","element":"span"},{"style":{"height":21.39},"width":150.52,"height":53.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/8-23.png","element":"img","alt":" Rblack-box[T ] .","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Learning rate grid. ","element":"span"},{"text":"The dynamic regret of Eq. ","element":"span"},{"href":"#id-57","text":"(10) ","element":"a"},{"text":"implies that any two ","element":"span"},{"style":{"height":11.6},"width":20,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/8-24.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"that are a constant factor of each other will guarantee the same upper-bound up to essentially the same constant factor. We therefore choose an exponentially spaced grid","element":"span"}],[{"id":"id-83","style":{"width":"71%"},"width":1140,"height":56,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/8-25.png","element":"img"}],[{"text":"The meta-algorithm aggregation scheme guarantees that the learner performs as well as the best empirical learning rate. 
We thus obtain a bound on ","element":"span"},{"style":{"height":21.39},"width":140.48,"height":53.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/8-26.png","element":"img","alt":" Rblack-box[T ]","inline":true,"padRight":true},{"text":", with its proof in Appendix ","element":"span"},{"text":"E:","element":"span"}],[{"id":"id-94","style":{"fontWeight":"bold"},"text":"Proposition 5.4 ","element":"span"},{"text":"(Black-box regret bound)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Assume MetaCURL is played with a black-box algorithm satisfying dynamic regret as in Eq. ","element":"span"},{"href":"#id-57","text":"(10)","element":"a"},{"style":{"fontStyle":"italic"},"text":", with learning rate grid as in Eq. ","element":"span"},{"href":"#id-83","text":"(15)","element":"a"},{"style":{"fontStyle":"italic"},"text":". Then, for any sequence of policies ","element":"span"},{"style":{"height":18.61},"width":173.48,"height":46.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/8-27.png","element":"img","alt":" (πt,∗)t∈[T ],","inline":true}],[{"style":{"width":"89%"},"width":1416,"height":98,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/8-28.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Greedy MD-CURL. ","element":"span"},{"text":"Greedy MD-CURL, developed by [","element":"span"},{"href":"#id-14","referenceIndex":39,"text":"39","element":"a"},{"text":"], is a computationally efficient policy-optimization algorithm known for achieving sublinear static regret in online CURL with adversarial objective functions within a stationary MDP. In Thm. ","element":"span"},{"href":"#id-84","text":"G.3 ","element":"a"},{"text":"of Appendix ","element":"span"},{"text":"G, ","element":"span"},{"text":"we extend this analysis, showing that Greedy MD-CURL also achieves dynamic regret as in Eq. 
","element":"span"},{"href":"#id-57","text":"(10)","element":"a"},{"text":". To our knowledge, this is the first dynamic regret result for a CURL algorithm. Hence, Greedy MD-CURL can be used as a black-box for MetaCURL. We detail Greedy MD-CURL in Alg. ","element":"span"},{"href":"#id-85","text":"4 ","element":"a"},{"text":"in Appendix ","element":"span"},{"text":"G.","element":"span"}]]},{"heading":"6 Conclusion, discussion, and future work","paragraphs":[[{"text":"In this paper, we present MetaCURL, the first algorithm for dealing with non-stationarity in CURL, a setting covering many problems in the literature that modifies the standard linear RL configuration, making typical RL techniques difficult to use. We also employ a novel approach to deal with non-stationarity in MDPs using the learning with expert advice framework from the online learning literature. The main difficulty in analyzing this method arises from uncertainty about probability transitions. We overcome this problem by employing a second expert scheme, and show that MetaCURL achieves near-optimal regret.","element":"span"}],[{"text":"Compared to the RL literature, our approach is more efficient, deals with adversarial losses, and has a better regret dependency concerning the varying losses, but to do so, we need to simplify the assumptions about the dynamics (all uncertainty comes only from the external noise, that is independent of the agent’s state-action). There seems to be a trade-off in RL: all algorithms dealing with both non-stationarity and full exploration use UCRL-type approaches, and are thus computationally expensive. We thus leave a question for future work: ","element":"span"},{"style":{"fontStyle":"italic"},"text":"How can we effectively manage non-stationarity and adversarial losses using efficient algorithms, all while addressing full exploration?","element":"span"}]]},{"heading":"References","paragraphs":[[{"id":"id-4","text":"[1] ","element":"span"},{"text":"P. Abbeel and A. Y. 
Ng. Apprenticeship learning via inverse reinforcement learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning (ICML)","element":"span"},{"text":", 2004.","element":"span"}],[{"id":"id-65","text":"[2] ","element":"span"},{"text":"D. Adamskiy, W. M. Koolen, A. Chernov, and V. Vovk. A closer look at adaptive regret. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of Machine Learning Research","element":"span"},{"text":", 17(23):1–21, 2016.","element":"span"}],[{"id":"id-64","text":"[3] ","element":"span"},{"text":"D. Adamskiy, M. K. K. Warmuth, and W. M. Koolen. Putting bayes to sleep. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems (NeurIPS)","element":"span"},{"text":", volume 25, 2012.","element":"span"}],[{"id":"id-51","text":"[4] ","element":"span"},{"text":"P. Auer, T. Jaksch, and R. Ortner. Near-optimal regret bounds for reinforcement learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems (NeurIPS)","element":"span"},{"text":", volume 21, 2008.","element":"span"}],[{"id":"id-9","text":"[5] ","element":"span"},{"text":"A. Barakat, I. Fatkhullin, and N. He. Reinforcement learning with general utilities: simpler variance reduction and large state-action space. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning (ICML)","element":"span"},{"text":", pages 1753–1800, 2023.","element":"span"}],[{"id":"id-53","text":"[6] ","element":"span"},{"text":"M. Bayati. Power management policy for heterogeneous data center based on histogram and discrete-time mdp. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Electronic Notes in Theoretical Computer Science","element":"span"},{"text":", 337:5–22, 2018. 
Proceedings of the Ninth International Workshop on the Practical Application of Stochastic Modelling (PASM).","element":"span"}],[{"id":"id-5","text":"[7] ","element":"span"},{"text":"A. Bensoussan, P. Yam, and J. Frehse. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Mean Field Games and Mean Field Type Control Theory","element":"span"},{"text":". SpringerBriefs in Mathematics. Springer, 2013.","element":"span"}],[{"id":"id-62","text":"[8] ","element":"span"},{"text":"A. Blum. Empirical support for winnow and weighted-majority based algorithms: results on a calendar scheduling domain. In A. Prieditis and S. Russell, editors, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Machine Learning Proceedings 1995","element":"span"},{"text":", pages 64–72, 1995.","element":"span"}],[{"id":"id-17","text":"[9] ","element":"span"},{"text":"N. Cesa-Bianchi and G. Lugosi. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Prediction, Learning, and Games","element":"span"},{"text":". Cambridge University Press, 2006.","element":"span"}],[{"id":"id-46","text":"[10] ","element":"span"},{"text":"Y. Chen, S. S. Du, and K. Jamieson. Improved corruption robust algorithms for episodic reinforcement learning. In M. Meila and T. Zhang, editors, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning (ICML)","element":"span"},{"text":", volume 139, pages 1561–1570, 2021.","element":"span"}],[{"id":"id-13","text":"[11] ","element":"span"},{"text":"W. C. Cheung. Exploration-exploitation trade-off in reinforcement learning on online markov decision processes with global concave rewards. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ArXiv","element":"span"},{"text":", abs/1905.06466, 2019.","element":"span"}],[{"id":"id-12","text":"[12] ","element":"span"},{"text":"W. C. Cheung. Regret minimization for reinforcement learning with vectorial feedback and complex objectives. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems (NeurIPS)","element":"span"},{"text":", volume 32, 2019.","element":"span"}],[{"id":"id-25","text":"[13] ","element":"span"},{"text":"W. C. Cheung, D. Simchi-Levi, and R. Zhu. Reinforcement learning for non-stationary Markov decision processes: The blessing of (More) optimism. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning (ICML)","element":"span"},{"text":", volume 119, pages 1843–1854, 2020.","element":"span"}],[{"id":"id-43","text":"[14] ","element":"span"},{"text":"A. Cohen, Y. Efroni, Y. Mansour, and A. Rosenberg. Minimax regret for stochastic shortest path. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems (NeurIPS)","element":"span"},{"text":", volume 34, pages 28350–28361, 2021.","element":"span"}],[{"id":"id-58","text":"[15] ","element":"span"},{"text":"A. Daniely, A. Gonen, and S. Shalev-Shwartz. Strongly adaptive online learning. In F. Bach and D. Blei, editors, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 32nd International Conference on Machine Learning (ICML)","element":"span"},{"text":", volume 37, pages 1405–1411, 2015.","element":"span"}],[{"id":"id-37","text":"[16] ","element":"span"},{"text":"R. De Santi, M. Prajapat, and A. Krause. Global reinforcement learning: Beyond linear and convex rewards via submodular semi-gradient methods. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 41st International Conference on Machine Learning","element":"span"},{"text":", volume 235, pages 10235–10266, 2024.","element":"span"}],[{"id":"id-27","text":"[17] ","element":"span"},{"text":"O. D. Domingues, P. Ménard, M. Pirotta, E. Kaufmann, and M. Valko. A kernel-based approach to non-stationary reinforcement learning in metric spaces. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Artificial Intelligence and Statistics (AISTATS)","element":"span"},{"text":", volume 130, pages 3538–3546, 2021.","element":"span"}],[{"id":"id-40","text":"[18] ","element":"span"},{"text":"Y. Efroni, L. Shani, A. Rosenberg, and S. Mannor. Optimistic policy optimization with bandit feedback. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning (ICML)","element":"span"},{"text":", volume 119, pages 8604–8613, 2020.","element":"span"}],[{"id":"id-15","text":"[19] ","element":"span"},{"text":"E. Even-Dar, S. M. Kakade, and Y. Mansour. Online markov decision processes. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Mathematics of Operations Research","element":"span"},{"text":", 34(3):726–736, 2009.","element":"span"}],[{"id":"id-28","text":"[20] ","element":"span"},{"text":"Y. Fei, Z. Yang, Z. Wang, and Q. Xie. Dynamic regret of policy optimization in non-stationary environments. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems (NeurIPS)","element":"span"},{"text":", volume 33, pages 6743–6754, 2020.","element":"span"}],[{"id":"id-30","text":"[21] ","element":"span"},{"text":"S. Feng, M. Yin, R. Huang, Y.-X. Wang, J. Yang, and Y. Liang. Non-stationary reinforcement learning under general function approximation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning (ICML)","element":"span"},{"text":", volume 202, pages 9976–10007, 2023.","element":"span"}],[{"id":"id-60","text":"[22] ","element":"span"},{"text":"Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of Computer and System Sciences","element":"span"},{"text":", 55(1):119–139, 1997.","element":"span"}],[{"id":"id-63","text":"[23] ","element":"span"},{"text":"Y. Freund, R. E. Schapire, Y. Singer, and M. K. Warmuth. Using and combining predictors that specialize. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Annual ACM Symposium on Theory of Computing (STOC)","element":"span"},{"text":", page 334–343, 1997.","element":"span"}],[{"id":"id-24","text":"[24] ","element":"span"},{"text":"P. Gajane, R. Ortner, and P. Auer. A sliding-window algorithm for markov decision processes with arbitrarily changing rewards and transitions. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ArXiv","element":"span"},{"text":", abs/1805.10066, 2018.","element":"span"}],[{"id":"id-11","text":"[25] ","element":"span"},{"text":"M. Geist, J. Pérolat, M. Laurière, R. Elie, S. Perrin, O. Bachem, R. Munos, and O. Pietquin. Concave utility reinforcement learning: The mean-field game viewpoint. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Autonomous Agents and Multiagent Systems","element":"span"},{"text":", page 489–497, 2022.","element":"span"}],[{"id":"id-1","text":"[26] ","element":"span"},{"text":"S. K. S. Ghasemipour, R. Zemel, and S. Gu. A divergence minimization perspective on imitation learning methods. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the Conference on Robot Learning","element":"span"},{"text":", volume 100, pages 1259–1277, 2020.","element":"span"}],[{"id":"id-59","text":"[27] ","element":"span"},{"text":"A. György, T. Linder, and G. Lugosi. Efficient tracking of large classes of experts. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"2012 IEEE International Symposium on Information Theory Proceedings","element":"span"},{"text":", pages 885–889, 2012.","element":"span"}],[{"id":"id-0","text":"[28] ","element":"span"},{"text":"E. Hazan, S. Kakade, K. Singh, and A. Van Soest. Provably efficient maximum entropy exploration. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning (ICML)","element":"span"},{"text":", volume 97, pages 2681–2691, 2019.","element":"span"}],[{"id":"id-18","text":"[29] ","element":"span"},{"text":"E. Hazan and C. Seshadhri. Adaptive algorithms for online decision problems. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Electronic Colloquium on Computational Complexity (ECCC)","element":"span"},{"text":", 14, 01 2007.","element":"span"}],[{"id":"id-39","text":"[30] ","element":"span"},{"text":"C. Jin, T. Jin, H. Luo, S. Sra, and T. Yu. Learning adversarial Markov decision processes with bandit feedback and unknown transition. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning (ICML)","element":"span"},{"text":", volume 119, pages 4860–4869, 2020.","element":"span"}],[{"id":"id-44","text":"[31] ","element":"span"},{"text":"T. Jin, J. Liu, C. Rouyer, W. Chang, C.-Y. Wei, and H. Luo. No-regret online reinforcement learning with adversarial losses and transitions. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems (NeurIPS)","element":"span"},{"text":", volume 36, pages 38520–38585, 2023.","element":"span"}],[{"id":"id-42","text":"[32] ","element":"span"},{"text":"T. Jin and H. Luo. Simultaneously learning stochastic and adversarial episodic mdps with known transition. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems (NeurIPS)","element":"span"},{"text":", volume 33, pages 16557–16566, 2020.","element":"span"}],[{"id":"id-21","text":"[33] ","element":"span"},{"text":"K.-S. Jun, F. Orabona, S. Wright, and R. Willett. Improved Strongly Adaptive Online Learning using Coin Betting. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Artificial Intelligence and Statistics (AISTATS)","element":"span"},{"text":", volume 54, pages 943–951, 2017.","element":"span"}],[{"id":"id-6","text":"[34] ","element":"span"},{"text":"P. Lavigne and L. Pfeiffer. Generalized conditional gradient and learning in potential mean field games. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Applied Mathematics & Optimization","element":"span"},{"text":", 88(3):89, Oct 2023.","element":"span"}],[{"id":"id-2","text":"[35] ","element":"span"},{"text":"J. W. Lavington, S. Vaswani, and M. Schmidt. Improved policy optimization for online imitation learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of The 1st Conference on Lifelong Learning Agents","element":"span"},{"text":", volume 199, pages 1146–1173, 2022.","element":"span"}],[{"id":"id-61","text":"[36] ","element":"span"},{"text":"N. Littlestone and M. Warmuth. The weighted majority algorithm. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Information and Computation","element":"span"},{"text":", 108(2):212–261, 1994.","element":"span"}],[{"id":"id-52","text":"[37] ","element":"span"},{"text":"H. Luo, C.-Y. Wei, and C.-W. Lee. Policy optimization in adversarial mdps: Improved exploration via dilated bonuses. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems (NeurIPS)","element":"span"},{"text":", volume 34, pages 22931–22942, 2021.","element":"span"}],[{"id":"id-45","text":"[38] ","element":"span"},{"text":"T. Lykouris, M. Simchowitz, A. Slivkins, and W. Sun. Corruption-robust exploration in episodic reinforcement learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Conference on Learning Theory (COLT)","element":"span"},{"text":", volume 134, pages 3242–3245, 2021.","element":"span"}],[{"id":"id-14","text":"[39] ","element":"span"},{"text":"B. Marin Moreno, M. Brégère, P. Gaillard, and N. Oudjane. Efficient model-based concave utility reinforcement learning through greedy mirror descent. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Artificial Intelligence and Statistics (AISTATS)","element":"span"},{"text":", volume 238, pages 2206–2214, 2024.","element":"span"}],[{"id":"id-29","text":"[40] ","element":"span"},{"text":"W. Mao, K. Zhang, R. Zhu, D. Simchi-Levi, and T. Basar. Near-optimal model-free reinforcement learning in non-stationary episodic mdps. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning (ICML)","element":"span"},{"text":", volume 139, pages 7447–7458, 2021.","element":"span"}],[{"id":"id-55","text":"[41] ","element":"span"},{"text":"B. Marin Moreno, M. Brégère, P. Gaillard, and N. Oudjane. (Online) Convex Optimization for Demand-Side Management: Application to Thermostatically Controlled Loads, Jan. 2023.","element":"span"}],[{"id":"id-34","text":"[42] M. Mutti, R. De Santi, P. De Bartolomeis, and M. Restelli. Challenging common assumptions ","element":"span"},{"text":"in convex reinforcement learning. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems (NeurIPS)","element":"span"},{"text":", volume 35, pages 4489–4502, 2022.","element":"span"}],[{"id":"id-33","text":"[43] ","element":"span"},{"text":"M. Mutti, R. De Santi, P. De Bartolomeis, and M. Restelli. Convex reinforcement learning in finite trials. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of Machine Learning Research","element":"span"},{"text":", 24(250):1–42, 2023.","element":"span"}],[{"id":"id-72","text":"[44] ","element":"span"},{"text":"G. Neu, A. György, and C. Szepesvari. The adversarial stochastic shortest path problem with unknown transition probabilities. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Artificial Intelligence and Statistics (AISTATS)","element":"span"},{"text":", pages 805–813. PMLR, 2012.","element":"span"}],[{"id":"id-26","text":"[45] ","element":"span"},{"text":"R. Ortner, P. Gajane, and P. Auer. Variational regret bounds for reinforcement learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of The 35th Uncertainty in Artificial Intelligence Conference","element":"span"},{"text":", volume 115, pages 81–90, 2020.","element":"span"}],[{"id":"id-36","text":"[46] ","element":"span"},{"text":"M. Prajapat, M. Mutny, M. Zeilinger, and A. Krause. Submodular reinforcement learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Learning Representations (ICLR)","element":"span"},{"text":", 2024.","element":"span"}],[{"id":"id-20","text":"[47] A. Raj, P. Gaillard, and C. Saad. Non-stationary online regression, 2020.","element":"span"}],[{"id":"id-38","text":"[48] ","element":"span"},{"text":"A. Rosenberg and Y. Mansour. Online convex optimization in adversarial Markov decision processes. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning (ICML)","element":"span"},{"text":", volume 97, pages 5478–5486, 2019.","element":"span"}],[{"text":"[49] ","element":"span"},{"text":"A. Rosenberg and Y. Mansour. Online convex optimization in adversarial Markov decision processes. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning (ICML)","element":"span"},{"text":", pages 5478–5486, 2019.","element":"span"}],[{"id":"id-41","text":"[50] ","element":"span"},{"text":"A. Rosenberg and Y. Mansour. Stochastic shortest path with adversarially changing costs. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Joint Conference on Artificial Intelligence (IJCAI)","element":"span"},{"text":", pages 2936–2942, 2021.","element":"span"}],[{"id":"id-54","text":"[51] ","element":"span"},{"text":"A. Séguret, C. Wan, and C. Alasseur. A mean field control approach for smart charging with aggregate power demand constraints. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"2021 IEEE PES Innovative Smart Grid Technologies Europe (ISGT Europe)","element":"span"},{"text":", pages 01–05, 2021.","element":"span"}],[{"id":"id-75","text":"[52] ","element":"span"},{"text":"D. van der Hoeven, N. Zhivotovskiy, and N. Cesa-Bianchi. High-probability risk bounds via sequential predictors, 2023.","element":"span"}],[{"id":"id-48","text":"[53] ","element":"span"},{"text":"C.-Y. Wei, C. Dann, and J. Zimmert. ","element":"span"},{"text":"A model selection approach for corruption robust reinforcement learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Algorithmic Learning Theory (ALT)","element":"span"},{"text":", volume 167, pages 1043–1096, 2022.","element":"span"}],[{"id":"id-31","text":"[54] ","element":"span"},{"text":"C.-Y. Wei and H. Luo. 
Non-stationary reinforcement learning without prior knowledge: an optimal black-box approach. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Conference on Learning Theory (COLT)","element":"span"},{"text":", volume 134, pages 4300–4354, 2021.","element":"span"}],[{"id":"id-3","text":"[55] ","element":"span"},{"text":"T. Zahavy, A. Cohen, H. Kaplan, and Y. Mansour. Apprenticeship learning via frank-wolfe. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"AAAI Conference on Artificial Intelligence","element":"span"},{"text":", 2019.","element":"span"}],[{"id":"id-10","text":"[56] ","element":"span"},{"text":"T. Zahavy, B. O'Donoghue, G. Desjardins, and S. Singh. Reward is enough for convex mdps. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems (NeurIPS)","element":"span"},{"text":", volume 34, pages 25746–25759, 2021.","element":"span"}],[{"id":"id-7","text":"[57] ","element":"span"},{"text":"J. Zhang, A. Koppel, A. S. Bedi, C. Szepesvari, and M. Wang. Variational policy gradient method for reinforcement learning with general utilities. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems (NeurIPS)","element":"span"},{"text":", volume 33, pages 4572–4583, 2020.","element":"span"}],[{"id":"id-8","text":"[58] ","element":"span"},{"text":"J. Zhang, C. Ni, Z. Yu, C. Szepesvari, and M. Wang. On the convergence and sample efficiency of variance-reduced policy gradient method. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems (NeurIPS)","element":"span"},{"text":", volume 34, pages 2228–2240, 2021.","element":"span"}],[{"id":"id-19","text":"[59] ","element":"span"},{"text":"L. Zhang, T. Yang, R. Jin, and Z.-H. Zhou. Dynamic regret of strongly adaptive methods. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning (ICML)","element":"span"},{"text":", volume 80, pages 5882–5891, 2018.","element":"span"}],[{"id":"id-47","text":"[60] ","element":"span"},{"text":"X. Zhang, Y. Chen, X. Zhu, and W. Sun. Robust policy gradient against strong data corruption. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning (ICML)","element":"span"},{"text":", volume 139, pages 12391–12401, 2021.","element":"span"}],[{"id":"id-49","text":"[61] ","element":"span"},{"text":"A. Zimin and G. Neu. Online learning in episodic markovian decision processes by relative entropy policy search. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems (NeurIPS)","element":"span"},{"text":", volume 26, pages 1583–1591, 2013.","element":"span"}]]},{"heading":"A Learning with expert advice","paragraphs":[[{"text":"In this section, we take a closer look at the Learning with Expert Advice (LEA) framework. We start by presenting, in Subsection ","element":"span"},{"href":"#id-66","text":"A.1, ","element":"a"},{"text":"a general reduction of the sleeping experts framework to the standard framework. Thus, any LEA algorithm can be used as a sub-routine for MetaCURL. In Section ","element":"span"},{"text":"5 ","element":"span"},{"text":"of the main paper, we show a regret bound for MetaCURL using the Exponentially Weighted Average Forecaster (EWA) algorithm [","element":"span"},{"href":"#id-17","referenceIndex":9,"text":"9","element":"a"},{"text":"], also known as Hedge. In Subsection ","element":"span"},{"href":"#id-71","text":"A.2 ","element":"a"},{"text":"we present the main results of playing EWA with convex and exp-concave losses.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Setting. 
","element":"span"},{"text":"We recall the general setting of learning with expert advice (LEA) as presented in the main paper: a learner makes sequential online predictions ","element":"span"},{"style":{"height":16.59},"width":175.04,"height":41.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/14-0.png","element":"img","alt":" u1, . . . , uT ","inline":true,"padRight":true},{"text":"in a decision space ","element":"span"},{"style":{"fontStyle":"italic"},"text":"U","element":"span"},{"text":", over a series of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"episodes, with the help of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"K ","element":"span"},{"text":"experts. For each round ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", each expert ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"makes a prediction ","element":"span"},{"style":{"height":15.79},"width":140.8,"height":39.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/14-1.png","element":"img","alt":" ut,k, and","inline":true,"padRight":true},{"text":"the learner combines the experts’ predictions by computing a vector ","element":"span"},{"style":{"height":17.39},"width":530.32,"height":43.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/14-2.png","element":"img","alt":" vt := (vt,1, . . . , vt,K) ∈ SK, and","inline":true,"padRight":true},{"text":"predicting ","element":"span"},{"style":{"height":20.4},"width":333.32,"height":51,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/14-3.png","element":"img","alt":" ut := �Kk=1 vt,kut,k","inline":true},{"text":". 
The environment then reveals a convex loss function ","element":"span"},{"style":{"height":13.39},"width":184.64,"height":33.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/14-4.png","element":"img","alt":" ℓt : U → R","inline":true},{"text":". ","element":"span"},{"text":"Each expert suffers a loss ","element":"span"},{"style":{"height":17.38},"width":247.96,"height":43.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/14-5.png","element":"img","alt":" ℓt,k := ℓt(ut,k)","inline":true},{"text":", and the learner suffers a loss ","element":"span"},{"style":{"height":19.01},"width":193.84,"height":47.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/14-6.png","element":"img","alt":"ˆℓt := ℓt(ut)","inline":true},{"text":". The learner’s objective is to keep the cumulative regret with respect to each expert as low as possible. For each expert ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":", this quantity is defined as Reg","element":"span"},{"style":{"height":22.66},"width":410.48,"height":56.64,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/14-7.png","element":"img","alt":"[T](k) := ∑_{t=1}^{T} ˆℓt − ℓt,k.","inline":true}],[{"id":"id-66","style":{"fontWeight":"bold"},"text":"A.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Sleeping experts","element":"span"}],[{"text":"The sleeping expert problem [","element":"span"},{"href":"#id-62","referenceIndex":8,"text":"8","element":"a"},{"text":", ","element":"span"},{"href":"#id-63","referenceIndex":23,"text":"23","element":"a"},{"text":"] is a LEA framework where experts are not required to provide solutions at every time step. 
Let ","element":"span"},{"style":{"height":17.38},"width":217.64,"height":43.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/14-8.png","element":"img","alt":" It,k ∈ {0, 1}","inline":true,"padRight":true},{"text":"define a binary signal that equals ","element":"span"},{"text":"1 ","element":"span"},{"text":"if expert ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"is active at episode ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"and ","element":"span"},{"text":"0 ","element":"span"},{"text":"otherwise. The algorithm knows ","element":"span"},{"style":{"height":19.07},"width":178.6,"height":47.68,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/14-9.png","element":"img","alt":" (It,k)k∈[K]","inline":true,"padRight":true},{"text":"and assigns a zero weight to sleeping experts. We would like to have a guarantee with respect to expert ","element":"span"},{"style":{"height":16},"width":139.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/14-10.png","element":"img","alt":" k ∈ [K]","inline":true,"padRight":true},{"text":"but only when it is active. Hence, we now aim to bound a cumulative regret that depends on the signal ","element":"span"},{"style":{"height":13.38},"width":59.16,"height":33.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/14-11.png","element":"img","alt":"It,k","inline":true},{"text":": Reg","element":"span"},{"text":"sleep","element":"span"},{"style":{"height":23.86},"width":509.8,"height":59.64,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/14-12.png","element":"img","alt":"[T](k) := ∑_{t=1}^{T} It,k(ˆℓt − ℓt,k)","inline":true},{"text":". 
We present a generic reduction from the sleeping expert ","element":"span"},{"text":"framework to the standard LEA framework ","element":"span"},{"href":"#id-64","referenceIndex":3,"text":"[3, ","element":"a"},{"href":"#id-65","referenceIndex":2,"text":"2]","element":"a"},{"text":":","element":"span"}],[{"text":"Let, for all episodes ","element":"span"},{"style":{"height":16},"width":124.08,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/14-13.png","element":"img","alt":" t ∈ [T],","inline":true}],[{"style":{"width":"26%"},"width":414,"height":98,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/14-14.png","element":"img"}],[{"text":"We play a standard LEA algorithm with modified outputs where, at episode ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", expert ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"outputs","element":"span"}],[{"style":{"width":"42%"},"width":673,"height":98,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/14-15.png","element":"img"}],[{"text":"A standard LEA algorithm gives an upper bound on the regret Reg","element":"span"},{"style":{"height":16.26},"width":78.6,"height":40.64,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/14-16.png","element":"img","alt":"T (k)","inline":true,"padRight":true},{"text":"with respect to each expert ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":". 
Using that ","element":"span"},{"style":{"height":20.4},"width":322.4,"height":51,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/14-17.png","element":"img","alt":"∑_{k=1}^{K} ˜ut,k vt,k = ˆut","inline":true},{"text":", we obtain that","element":"span"}],[{"style":{"width":"46%"},"width":745,"height":476,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/14-18.png","element":"img"}],[{"text":"Consequently, the cumulative regret with respect to each expert during the times it is active is upper bounded by the standard regret of playing a LEA algorithm with the modified outputs.","element":"span"}],[{"id":"id-71","style":{"fontWeight":"bold"},"text":"A.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Exponentially Weighted Average forecaster","element":"span"}],[{"text":"The exponentially weighted average forecaster (EWA), also called Hedge, is a LEA algorithm that chooses a weight that decreases exponentially fast with past errors. We present EWA in Alg. ","element":"span"},{"href":"#id-86","text":"2.","element":"a"}],[{"id":"id-86","style":{"width":"100%"},"width":1587,"height":529,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/15-0.png","element":"img"}],[{"text":"We recall two results of playing EWA with convex losses, and with exp-concave losses, used in the main paper:","element":"span"}],[{"id":"id-80","style":{"fontWeight":"bold"},"text":"Theorem A.1 ","element":"span"},{"text":"(EWA with convex losses: Corollary ","element":"span"},{"href":"#id-17","referenceIndex":9,"style":{"height":16.19},"width":353.4,"height":40.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/15-1.png","element":"img","alt":" 2.2 from [9]). 
If the ℓt ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"losses are convex and take value in ","element":"span"},{"text":"[0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1]","element":"span"},{"style":{"fontStyle":"italic"},"text":", then the regret of the learner playing EWA with any ","element":"span"},{"style":{"height":14.4},"width":94.36,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/15-2.png","element":"img","alt":" η > 0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"satisfies, for any ","element":"span"},{"style":{"height":16},"width":140.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/15-3.png","element":"img","alt":" k ∈ [K],","inline":true}],[{"style":{"width":"28%"},"width":456,"height":93,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/15-4.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"In particular, with ","element":"span"},{"style":{"height":19.2},"width":308.84,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/15-5.png","element":"img","alt":" η = √(8 log(K)/T)","inline":true},{"style":{"fontStyle":"italic"},"text":", the upper bound becomes","element":"span"},{"style":{"height":19.2},"width":274.2,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/15-6.png","element":"img","alt":"√((T/2) log(K)).","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Theorem A.2 ","element":"span"},{"text":"(EWA with exp-concave losses: Thm. ","element":"span"},{"href":"#id-17","referenceIndex":9,"style":{"height":16.59},"width":547.6,"height":41.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/15-7.png","element":"img","alt":" 3.2 from [9]). 
If the ℓt losses are η","inline":true},{"style":{"fontStyle":"italic"},"text":"-exp concave, then the regret of the learner playing EWA (with the same value of ","element":"span"},{"style":{"height":10.4},"width":20,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/15-8.png","element":"img","alt":" η","inline":true},{"style":{"fontStyle":"italic"},"text":") satisfies, for any ","element":"span"},{"style":{"height":16},"width":140.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/15-9.png","element":"img","alt":" k ∈ [K],","inline":true}],[{"style":{"width":"21%"},"width":348,"height":94,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/15-10.png","element":"img"}]]},{"heading":"B Algorithm for the online estimation of the probability kernel (ˆpt estimator)","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"Algorithm 3 ","element":"span"},{"text":"Online estimation of the probability kernel (","element":"span"},{"style":{"height":16.18},"width":205.4,"height":40.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/16-0.png","element":"img","alt":"ˆpt estimator)","inline":true}],[{"id":"id-76","style":{"width":"99%"},"width":1585,"height":1379,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/16-1.png","element":"img"}]]},{"heading":"C Auxiliary lemmas","paragraphs":[[{"text":"We start with some auxiliary results. 
For ","element":"span"},{"style":{"height":16},"width":414.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/16-2.png","element":"img","alt":" t ∈ I := [ts +1, te] ⊆ [T]","inline":true},{"text":", we define the average probability distribution for all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"and ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x, a","element":"span"},{"text":") ","element":"span"},{"text":"as","element":"span"}],[{"style":{"width":"37%"},"width":598,"height":119,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/16-3.png","element":"img"}],[{"id":"id-87","style":{"fontWeight":"bold"},"text":"Lemma C.1. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":16.19},"width":66.56,"height":40.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/16-4.png","element":"img","alt":" ˆpt,ts","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be the empirical probability transition kernel computed with data from episodes ","element":"span"},{"style":{"height":16},"width":466.68,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/16-5.png","element":"img","alt":" [ts, t − 1]. 
For any δ ∈ (0, 1)","inline":true},{"style":{"fontStyle":"italic"},"text":", with probability ","element":"span"},{"style":{"height":14},"width":98.84,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/16-6.png","element":"img","alt":" 1 − δ,","inline":true}],[{"style":{"width":"67%"},"width":1077,"height":120,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/16-7.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"simultaneously for all ","element":"span"},{"style":{"height":16},"width":958.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/16-8.png","element":"img","alt":" n ∈ [N], (x, a) ∈ X × A, ts ∈ [T − 1], and t ∈ [ts + 1, T].","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"For a fixed ","element":"span"},{"style":{"height":18.19},"width":736.8,"height":45.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/16-9.png","element":"img","alt":" n ∈ [N], (x, a) ∈ X × A, and θ ∈ {−1, 1}|X|","inline":true},{"text":", we define for all ","element":"span"},{"style":{"height":13.2},"width":98.04,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/16-10.png","element":"img","alt":" s ∈ I,","inline":true}],[{"style":{"width":"39%"},"width":627,"height":88,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/16-11.png","element":"img"}],[{"text":"a Bernoulli random variable with mean value given by ","element":"span"},{"style":{"height":17.58},"width":402,"height":43.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/16-12.png","element":"img","alt":"Σx′∈X θ(x′)psn(x′|x, a).","inline":true}],[{"text":"The sequence of random variables given by","element":"span"},{"style":{"height":20.9},"width":212.64,"height":52.24,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/17-0.png","element":"img","alt":"(Y 
sn,x,a,θ)s∈I","inline":true,"padRight":true},{"text":"is independent, as we assume that the ","element":"span"},{"text":"external noises observed at each episode are all independent. Hence, by Hoeffding’s inequality we get that, for all ","element":"span"},{"style":{"height":14},"width":102.32,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/17-1.png","element":"img","alt":" ξ > 0,","inline":true}],[{"style":{"width":"60%"},"width":958,"height":119,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/17-2.png","element":"img"}],[{"text":"Therefore, we have that","element":"span"}],[{"style":{"width":"83%"},"width":1322,"height":460,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/17-3.png","element":"img"}],[{"text":"By applying a union bound over all ","element":"span"},{"style":{"height":16},"width":411.8,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/17-4.png","element":"img","alt":" n ∈ [N], (x, a) ∈ X × A","inline":true},{"text":", and ","element":"span"},{"style":{"height":20.42},"width":243.88,"height":51.04,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/17-5.png","element":"img","alt":" θ ∈ {−1, 1}|X|","inline":true,"padRight":true},{"text":"and noting that, for any two probability distributions ","element":"span"},{"style":{"height":14.8},"width":164.88,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/17-6.png","element":"img","alt":" p, q ∈ ∆X","inline":true,"padRight":true},{"text":", we have that","element":"span"}],[{"style":{"width":"48%"},"width":762,"height":88,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/17-7.png","element":"img"}],[{"text":"we arrive at the final result.","element":"span"}],[{"style":{"width":"1%"},"width":28,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/17-8.png","element":"img"}],[{"id":"id-88","style":{"fontWeight":"bold"},"text":"Lemma C.2. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":16},"width":1051.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/17-9.png","element":"img","alt":" t ∈ I := [ts + 1, te] ⊆ [T]. For all n ∈ [N], and (x, a) ∈ X × A,","inline":true}],[{"style":{"width":"37%"},"width":601,"height":122,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/17-10.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"For ","element":"span"},{"style":{"height":11.6},"width":83.08,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/17-11.png","element":"img","alt":" t ∈ I","inline":true},{"text":", and for all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"and ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x, a","element":"span"},{"text":") ","element":"span"},{"text":"we have that","element":"span"}],[{"style":{"width":"77%"},"width":1226,"height":693,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/17-12.png","element":"img"}],[{"text":"where recall that we define ","element":"span"},{"style":{"height":20.22},"width":734.32,"height":50.56,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/17-13.png","element":"img","alt":" ∆pj := maxn,s,a ∥pj+1n (·|x, a) − pjn(·|x, a)∥1.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Lemma C.3. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":16.18},"width":66.56,"height":40.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/17-14.png","element":"img","alt":" ˆpt,ts","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be the empirical probability transition kernel computed with data from episodes ","element":"span"},{"style":{"height":16},"width":466.68,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/17-15.png","element":"img","alt":" [ts, t − 1]. For any δ ∈ (0, 1)","inline":true},{"style":{"fontStyle":"italic"},"text":", with probability ","element":"span"},{"style":{"height":14},"width":98.84,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/17-16.png","element":"img","alt":" 1 − δ,","inline":true}],[{"style":{"width":"78%"},"width":1250,"height":126,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/17-17.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"simultaneously for all ","element":"span"},{"style":{"height":16},"width":958.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/17-18.png","element":"img","alt":" n ∈ [N], (x, a) ∈ X × A, ts ∈ [T − 1], and t ∈ [ts + 1, T].","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"The result follows immediately from the triangular inequality and Lemmas ","element":"span"},{"href":"#id-87","text":"C.1 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-88","text":"C.2.","element":"a"}],[{"id":"id-89","style":{"fontWeight":"bold"},"text":"Lemma C.4 ","element":"span"},{"text":"(A version of the inverse of Pinsker’s inequality)","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":10},"width":82.16,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/18-0.png","element":"img","alt":" p′, q′","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be any distributions over","element":"span"}],[{"style":{"width":"69%"},"width":1102,"height":181,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/18-1.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Therefore,","element":"span"}],[{"style":{"width":"27%"},"width":435,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/18-2.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"First, note that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"q ","element":"span"},{"text":"is lower bounded by ","element":"span"},{"style":{"height":22.18},"width":61.24,"height":55.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/18-3.png","element":"img","alt":"12|X|","inline":true},{"text":", hence KL","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"style":{"fontStyle":"italic"},"text":"| ","element":"span"},{"style":{"fontStyle":"italic"},"text":"q","element":"span"},{"text":") ","element":"span"},{"text":"is well defined. Also, by ","element":"span"},{"text":"convexity of the simplex, ","element":"span"},{"style":{"height":14},"width":155.84,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/18-4.png","element":"img","alt":" p, q ∈ SX","inline":true,"padRight":true},{"text":", therefore","element":"span"}],[{"style":{"width":"79%"},"width":1265,"height":893,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/18-5.png","element":"img"}],[{"id":"id-82","style":{"fontWeight":"bold"},"text":"Lemma C.5. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"For any strategy ","element":"span"},{"style":{"height":17.38},"width":234.8,"height":43.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/18-6.png","element":"img","alt":" π ∈ (SA)X×N","inline":true},{"style":{"fontStyle":"italic"},"text":", for any two probability kernels ","element":"span"},{"style":{"height":17.68},"width":233.76,"height":44.2,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/18-7.png","element":"img","alt":" p = (pn)n∈[N]","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":17.68},"width":868.12,"height":44.2,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/18-8.png","element":"img","alt":"q = (qn)n∈[N] such that pn, qn : X × A × X → [0, 1]","inline":true},{"style":{"fontStyle":"italic"},"text":", and for all ","element":"span"},{"style":{"height":16},"width":142.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/18-9.png","element":"img","alt":" n ∈ [N],","inline":true}],[{"style":{"width":"68%"},"width":1082,"height":120,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/18-10.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"From the definition of a state-action distribution sequence induced by a policy ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/18-11.png","element":"img","alt":" π","inline":true,"padRight":true},{"text":"in a MDP with probability kernel ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"in Eq. 
","element":"span"},{"href":"#id-67","text":"(2)","element":"a"},{"text":", we have that for all ","element":"span"},{"style":{"height":16},"width":474.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/18-12.png","element":"img","alt":" (x, a) ∈ X × A and n ∈ [N],","inline":true}],[{"style":{"width":"51%"},"width":820,"height":91,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/18-13.png","element":"img"}],[{"text":"Thus,","element":"span"}],[{"style":{"width":"100%"},"width":1631,"height":1068,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/19-0.png","element":"img"}],[{"id":"id-97","style":{"fontWeight":"bold"},"text":"Lemma C.6. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"For any pair of strategies ","element":"span"},{"style":{"height":17.38},"width":296.48,"height":43.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/19-1.png","element":"img","alt":" π, π′ ∈ (∆A)X×N","inline":true},{"style":{"fontStyle":"italic"},"text":", for any probability kernel ","element":"span"},{"style":{"height":17.68},"width":233.36,"height":44.2,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/19-2.png","element":"img","alt":" p = (pn)n∈[N]","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"such that ","element":"span"},{"style":{"height":16},"width":414.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/19-3.png","element":"img","alt":" pn : X × A × X → [0, 1]","inline":true},{"style":{"fontStyle":"italic"},"text":", and for all ","element":"span"},{"style":{"height":16},"width":142.12,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/19-4.png","element":"img","alt":" n ∈ [N],","inline":true}],[{"style":{"width":"57%"},"width":907,"height":113,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/19-5.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"where 
","element":"span"},{"style":{"height":18.03},"width":900.08,"height":45.08,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/19-6.png","element":"img","alt":" ρπ,pi (x) := �a∈A µπ,pi (x, a) for all x ∈ X and i ∈ [N].","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Using the recursive relation from Eq. ","element":"span"},{"href":"#id-67","text":"(2) ","element":"a"},{"text":"of a state-action distribution induced by a policy ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/19-7.png","element":"img","alt":" π","inline":true,"padRight":true},{"text":"in a MDP with probability transition ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"we have that","element":"span"}],[{"style":{"width":"86%"},"width":1372,"height":545,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/19-8.png","element":"img"}],[{"text":"Since ","element":"span"},{"style":{"height":10},"width":40,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/19-9.png","element":"img","alt":" µ0","inline":true,"padRight":true},{"text":"is fixed for each state-action distribution sequence, by induction we obtain that","element":"span"}],[{"style":{"width":"59%"},"width":942,"height":111,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/19-10.png","element":"img"}],[{"text":"completing the proof.","element":"span"}]]},{"heading":"D Proof of Prop. 5.2: Rˆp[T] regret analysis","paragraphs":[[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"Here, we set an upper bound on the term ","element":"span"},{"style":{"height":23.49},"width":71.4,"height":58.72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/20-0.png","element":"img","alt":" Rˆp[T ] ","inline":true,"padRight":true},{"text":"where we pay for errors in estimating ","element":"span"},{"style":{"height":16.59},"width":137.44,"height":41.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/20-1.png","element":"img","alt":" pt by ˆpt.","inline":true}],[{"style":{"width":"99%"},"width":1578,"height":401,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/20-2.png","element":"img"}],[{"text":"To obtain the first inequality, we use the convexity of ","element":"span"},{"style":{"height":12.99},"width":43.16,"height":32.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/20-3.png","element":"img","alt":" F t","inline":true,"padRight":true},{"text":"for all ","element":"span"},{"style":{"height":16},"width":127.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/20-4.png","element":"img","alt":" t ∈ [T]","inline":true},{"text":", then we use Holder’s inequality and the fact that ","element":"span"},{"style":{"height":15.78},"width":142.8,"height":39.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/20-5.png","element":"img","alt":" F t is LF","inline":true,"padRight":true},{"text":"-Lipschitz, and for the last inequality we use Lemma ","element":"span"},{"href":"#id-82","text":"C.5.","element":"a"}],[{"text":"The difficulty in analyzing the ","element":"span"},{"style":{"height":13.18},"width":43.12,"height":32.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/20-6.png","element":"img","alt":" L1","inline":true,"padRight":true},{"text":"difference between 
","element":"span"},{"style":{"height":16.19},"width":142.76,"height":40.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/20-7.png","element":"img","alt":" pt and ˆpt ","inline":true,"padRight":true},{"text":"arises from the non-stationarity of ","element":"span"},{"style":{"height":16.19},"width":44.08,"height":40.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/20-8.png","element":"img","alt":" pt.","inline":true,"padRight":true},{"text":"To overcome this we want to use the scheme presented in Subsection ","element":"span"},{"href":"#id-68","text":"4.3. ","element":"a"},{"text":"By Cauchy-Schwartz, we get that","element":"span"}],[{"id":"id-91","style":{"width":"96%"},"width":1523,"height":216,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/20-9.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Analysis of the ","element":"span"},{"style":{"height":13.2},"width":43.12,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/20-10.png","element":"img","alt":" L1","inline":true,"padRight":true},{"style":{"fontWeight":"bold"},"text":"norm. ","element":"span"},{"text":"We start by analysing the sum over ","element":"span"},{"style":{"height":16},"width":115.8,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/20-11.png","element":"img","alt":" t ∈ [T]","inline":true,"padRight":true},{"text":"of the ","element":"span"},{"style":{"height":13.2},"width":43.12,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/20-12.png","element":"img","alt":" L1","inline":true,"padRight":true},{"text":"norm in term ","element":"span"},{"style":{"height":16},"width":51.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/20-13.png","element":"img","alt":" (∗)","inline":true},{"text":". 
For each ","element":"span"},{"style":{"height":16},"width":605.8,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/20-14.png","element":"img","alt":" n ∈ [N], j ∈ [n] and (x, a) ∈ X × A,","inline":true}],[{"style":{"width":"82%"},"width":1313,"height":270,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/20-15.png","element":"img"}],[{"text":"where we apply Pinsker’s inequality.","element":"span"}],[{"text":"Consider a sequence of episodes ","element":"span"},{"style":{"height":16},"width":1067.76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/20-16.png","element":"img","alt":" 1 = t1 < t2 < . . . < tm+1 = T + 1, with Ii := [ti, ti+1 − 1], such","inline":true,"padRight":true},{"text":"that ","element":"span"},{"style":{"height":18.74},"width":225.56,"height":46.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/20-17.png","element":"img","alt":" ∆pIi ≤ ∆p/m","inline":true,"padRight":true},{"text":"for all ","element":"span"},{"style":{"height":16},"width":119.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/20-18.png","element":"img","alt":" i ∈ [m]","inline":true},{"text":". 
Decomposing the KL sum over ","element":"span"},{"style":{"height":16},"width":114,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/20-19.png","element":"img","alt":" t ∈ [T]","inline":true,"padRight":true},{"text":"as a sum on the intervals ","element":"span"},{"style":{"height":13.18},"width":32.68,"height":32.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/20-20.png","element":"img","alt":"Ii","inline":true},{"text":", we obtain that","element":"span"}],[{"style":{"width":"100%"},"width":1604,"height":401,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/20-21.png","element":"img"}],[{"text":"where the expectation of the second term is with respect to ","element":"span"},{"style":{"height":28.8},"width":498.36,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/20-22.png","element":"img","alt":" ˜xtj,x,a ∼ (ptj(·|x, a) + 1/|X|)/2.","inline":true}],[{"text":"We analyse each term separately:","element":"span"}],[{"text":"First, note that ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"ii","element":"span"},{"text":") ","element":"span"},{"text":"is the expectation over ","element":"span"},{"style":{"height":19.54},"width":89.84,"height":48.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/20-23.png","element":"img","alt":" ˜xtj,x,a ","inline":true,"padRight":true},{"text":"of the cumulative regret of sleeping EWA on interval ","element":"span"},{"style":{"height":13.2},"width":32.68,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/20-24.png","element":"img","alt":"Ii","inline":true,"padRight":true},{"text":"with respect to the expert ","element":"span"},{"style":{"height":12.4},"width":25.4,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/20-25.png","element":"img","alt":" ti","inline":true,"padRight":true},{"text":"using the loss function 
","element":"span"},{"style":{"height":12.98},"width":28.6,"height":32.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/20-26.png","element":"img","alt":" ℓt","inline":true,"padRight":true},{"text":"defined in Eq. ","element":"span"},{"href":"#id-73","text":"(12)","element":"a"},{"text":". This term is upper bounded by ","element":"span"},{"text":"log(","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":") ","element":"span"},{"text":"(see Subection ","element":"span"},{"href":"#id-68","text":"4.3 ","element":"a"},{"text":"of the main paper). From it we deduce that","element":"span"}],[{"style":{"width":"31%"},"width":498,"height":111,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/20-27.png","element":"img"}],[{"text":"Regarding term ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":")","element":"span"},{"text":", we start by using the inverse of Pinsker’s inequality presented in Lemma ","element":"span"},{"href":"#id-89","text":"C.4,","element":"a"}],[{"style":{"width":"65%"},"width":1034,"height":398,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/21-0.png","element":"img"}],[{"text":"To simplify notations, from now on we let ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C ","element":"span"},{"text":":=","element":"span"}],[{"style":{"width":"23%"},"width":375,"height":23,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/21-1.png","element":"img"}],[{"text":"we obtain that","element":"span"}],[{"style":{"width":"74%"},"width":1175,"height":508,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/21-2.png","element":"img"}],[{"text":"Joining the upper bounds of ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":") ","element":"span"},{"text":"and 
","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"ii","element":"span"},{"text":") ","element":"span"},{"text":"we conclude that","element":"span"}],[{"id":"id-90","style":{"width":"100%"},"width":1586,"height":597,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/21-3.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Back to the analysis of ","element":"span"},{"style":{"height":23.49},"width":71.36,"height":58.72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/21-4.png","element":"img","alt":" Rˆp[T ]","inline":true},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"text":"Using the inequality in Eq. ","element":"span"},{"href":"#id-90","text":"(17) ","element":"a"},{"text":"to bound the ","element":"span"},{"style":{"height":13.18},"width":43.12,"height":32.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/21-5.png","element":"img","alt":" L1","inline":true,"padRight":true},{"text":"norm of ","element":"span"},{"style":{"height":16},"width":51.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/21-6.png","element":"img","alt":" (∗)","inline":true,"padRight":true},{"text":"on ","element":"span"},{"text":"Eq. ","element":"span"},{"href":"#id-91","text":"(16)","element":"a"},{"text":", we obtain that","element":"span"}],[{"style":{"width":"67%"},"width":1075,"height":210,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/21-7.png","element":"img"}],[{"text":"concluding the proof.","element":"span"}],[{"text":"Note that for ","element":"span"},{"style":{"height":21.65},"width":412.28,"height":54.12,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/22-0.png","element":"img","alt":"�mi=1�t∈Ii RˆpIi(πt,ti,λ)","inline":true,"padRight":true},{"text":"(the third term of the meta-regret decomposition of Eq. 
","element":"span"},{"href":"#id-92","text":"(13)","element":"a"},{"text":"), ","element":"span"},{"text":"following the same procedure as above, we obtain that","element":"span"}],[{"style":{"width":"92%"},"width":1460,"height":772,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/22-1.png","element":"img"}],[{"text":"Since the second term is independent of ","element":"span"},{"style":{"height":12.99},"width":68.68,"height":32.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/22-2.png","element":"img","alt":" πt,ti","inline":true},{"text":", the analysis is the same as before and we obtain the same upper bound as for ","element":"span"},{"style":{"height":23.49},"width":152.6,"height":58.72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/22-3.png","element":"img","alt":" Rˆp[T ](πt).","inline":true}]]},{"heading":"E Proof of Prop. 5.4: Rblack-box[T] regret analysis","paragraphs":[[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Assume a Black-box algorithm satisfying the dynamic regret bound of Eq. 
","element":"span"},{"href":"#id-57","text":"(10)","element":"a"},{"text":", i.e., for any interval ","element":"span"},{"style":{"height":16},"width":124.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/22-4.png","element":"img","alt":" I ⊆ [T]","inline":true},{"text":", with respect to any sequence of policies ","element":"span"},{"style":{"height":16.99},"width":143.48,"height":42.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/22-5.png","element":"img","alt":" (πt,∗)t∈I","inline":true},{"text":", and for any learning rate ","element":"span"},{"style":{"height":13.2},"width":33.24,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/22-6.png","element":"img","alt":" λ,","inline":true}],[{"style":{"width":"74%"},"width":1186,"height":87,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/22-7.png","element":"img"}],[{"text":"Consider any sequence of episodes ","element":"span"},{"style":{"height":14.8},"width":672.64,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/22-8.png","element":"img","alt":" 1 = t1 < t2 < . . . < tm+1 = T + 1","inline":true},{"text":", forming intervals ","element":"span"},{"style":{"height":16},"width":539,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/22-9.png","element":"img","alt":"Ii := [ti, ti+1 − 1] for all i ∈ [m]","inline":true},{"text":". 
We can decompose the black-box regret over ","element":"span"},{"text":"[","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":"] ","element":"span"},{"text":"as","element":"span"}],[{"style":{"width":"64%"},"width":1015,"height":352,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/22-10.png","element":"img"}],[{"text":"In principle, we would like to select the optimal ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/22-11.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"that optimizes this dynamic regret. However, as ","element":"span"},{"style":{"height":14.59},"width":67.84,"height":36.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/22-12.png","element":"img","alt":"∆π∗","inline":true,"padRight":true},{"text":"may be unknown in advance, this is not possible. We show here that running MetaCURL with the learning rate set ","element":"span"},{"style":{"height":16.98},"width":684.4,"height":42.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/22-13.png","element":"img","alt":" Λ := {2−j|j = 0, 1, 2, . . . 
, ⌈log2(T)/2⌉}","inline":true,"padRight":true},{"text":"ensures that the optimal empirical learning rate is close to the true optimal one by a factor of ","element":"span"},{"text":"2 ","element":"span"},{"text":"and that the learner always plays as well as the optimal empirical learning rate.","element":"span"}],[{"text":"Denote by ","element":"span"},{"style":{"height":10.98},"width":39.24,"height":27.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/22-14.png","element":"img","alt":" λ∗ ","inline":true,"padRight":true},{"text":"the optimal learning rate and ","element":"span"},{"style":{"height":15.01},"width":39.24,"height":37.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/22-15.png","element":"img","alt":"ˆλ∗ ","inline":true,"padRight":true},{"text":"the empirical optimal learning rate in ","element":"span"},{"style":{"height":11.6},"width":28,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/22-16.png","element":"img","alt":" Λ","inline":true},{"text":". Note that","element":"span"}],[{"style":{"width":"21%"},"width":341,"height":121,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/22-17.png","element":"img"}],[{"text":"We consider three different cases:","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"If ","element":"span"},{"style":{"height":13.39},"width":114.72,"height":33.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/22-18.png","element":"img","alt":" λ∗ ≥ 1","inline":true},{"style":{"fontWeight":"bold"},"text":": ","element":"span"},{"text":"this implies that ","element":"span"},{"style":{"height":25.65},"width":224.08,"height":64.12,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/22-19.png","element":"img","alt":"(c2∆π∗+c3)/(c1T) ≥ 1","inline":true},{"text":". 
Therefore, we have that ","element":"span"},{"style":{"height":25.65},"width":231.32,"height":64.12,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/22-20.png","element":"img","alt":" T ≤ c2∆π∗+c3c1","inline":true,"padRight":true},{"text":". As we assume ","element":"span"},{"style":{"height":16.99},"width":169.52,"height":42.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/22-21.png","element":"img","alt":"f tn ∈ [0, 1]","inline":true,"padRight":true},{"text":"for all time steps ","element":"span"},{"style":{"height":16},"width":131.08,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/22-22.png","element":"img","alt":" n ∈ [N]","inline":true,"padRight":true},{"text":"and episodes ","element":"span"},{"style":{"height":16},"width":202.68,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/22-23.png","element":"img","alt":" t ∈ [T], then","inline":true},{"style":{"height":4},"width":15,"height":10,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/22-24.png","element":"img","alt":"∗","inline":true}],[{"style":{"width":"69%"},"width":1103,"height":87,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/22-25.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"If ","element":"span"},{"style":{"height":18.3},"width":209.6,"height":45.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/23-0.png","element":"img","alt":" λ∗ ≤ 1/√T:","inline":true,"padRight":true},{"text":"this implies that ","element":"span"},{"style":{"height":25.65},"width":224.08,"height":64.12,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/23-1.png","element":"img","alt":"c2∆π∗+c3c1 ≤ 1","inline":true},{"text":". 
Therefore, taking ","element":"span"},{"style":{"height":19.01},"width":273.28,"height":47.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/23-2.png","element":"img","alt":"ˆλ∗ = 1/√T ∈ Λ","inline":true},{"text":", we have that","element":"span"}],[{"style":{"width":"63%"},"width":1009,"height":84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/23-3.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"If ","element":"span"},{"style":{"height":18.3},"width":279.12,"height":45.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/23-4.png","element":"img","alt":" λ∗ ∈ [1/√T, 1]","inline":true},{"style":{"fontWeight":"bold"},"text":": ","element":"span"},{"text":"in this case, given the construction of ","element":"span"},{"style":{"height":11.6},"width":28,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/23-5.png","element":"img","alt":" Λ","inline":true},{"text":", there is a ","element":"span"},{"style":{"height":15.81},"width":145.52,"height":39.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/23-6.png","element":"img","alt":"ˆλ∗ ∈ Λ","inline":true,"padRight":true},{"text":"such that ","element":"span"},{"style":{"height":17.41},"width":385,"height":43.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/23-7.png","element":"img","alt":"λ∗ ≤ ˆλ∗ ≤ 2λ∗. 
Hence,","inline":true}],[{"style":{"width":"58%"},"width":931,"height":84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/23-8.png","element":"img"}],[{"text":"Therefore, by taking ","element":"span"},{"style":{"height":14.99},"width":115.64,"height":37.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/23-9.png","element":"img","alt":" λ = ˆλ∗ ","inline":true,"padRight":true},{"text":"in the analysis, we can ensure that","element":"span"}],[{"style":{"width":"94%"},"width":1501,"height":317,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/23-10.png","element":"img"}]]},{"heading":"F Proof of Thm. 5.1: Main result","paragraphs":[[{"text":"Combining the meta-regret bound and the black-box regret bound, we get the main result of the paper:","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Theorem ","element":"span"},{"text":"(Main result)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":16},"width":157.08,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/23-11.png","element":"img","alt":" δ ∈ (0, 1)","inline":true},{"style":{"fontStyle":"italic"},"text":". Playing MetaCURL with a black-box algorithm whose dynamic regret is as in Eq. ","element":"span"},{"href":"#id-57","text":"(10)","element":"a"},{"style":{"fontStyle":"italic"},"text":", with a learning rate grid ","element":"span"},{"style":{"height":19.2},"width":853.56,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/23-12.png","element":"img","alt":" Λ := {2−j|j = 0, 1, 2, . . . 
, ⌈1/2 log2(T)⌉}, and with","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"EWA as the sleeping expert subroutine, we obtain, with probability at least ","element":"span"},{"style":{"height":11.6},"width":107.56,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/23-13.png","element":"img","alt":" 1 − 2δ","inline":true},{"style":{"fontStyle":"italic"},"text":", for any sequence of policies ","element":"span"},{"style":{"height":18.66},"width":182.64,"height":46.64,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/23-14.png","element":"img","alt":" (πt,∗)t∈[T ],","inline":true}],[{"style":{"width":"69%"},"width":1099,"height":73,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/23-15.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Define a sequence of episodes ","element":"span"},{"style":{"height":16},"width":989.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/23-16.png","element":"img","alt":" 1 = t1 < t2 < . . . 
< tm+1 = T + 1, with Ii := [ti, ti+1 − 1],","inline":true,"padRight":true},{"text":"such that ","element":"span"},{"style":{"height":18.74},"width":471.28,"height":46.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/23-17.png","element":"img","alt":" ∆pIi ≤ ∆p/m for all i ∈ [m].","inline":true}],[{"text":"The dynamic regret of ","element":"span"},{"style":{"height":16},"width":149.32,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/23-18.png","element":"img","alt":" M(E, Λ)","inline":true,"padRight":true},{"text":"with respect to any sequence of policies ","element":"span"},{"style":{"height":18.67},"width":434.04,"height":46.68,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/23-19.png","element":"img","alt":" (πt,∗)t∈[T ], and any λ ∈ Λ,","inline":true,"padRight":true},{"text":"can be decomposed as","element":"span"}],[{"style":{"width":"100%"},"width":1599,"height":342,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/23-20.png","element":"img"}],[{"text":"From Prop. ","element":"span"},{"href":"#id-93","text":"5.3, ","element":"a"},{"text":"we have that with probability at least ","element":"span"},{"style":{"height":14},"width":279.68,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/23-21.png","element":"img","alt":" 1 − 2δ, and C :=","inline":true}],[{"style":{"width":"85%"},"width":1357,"height":150,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/23-22.png","element":"img"}],[{"text":"In addition, for ","element":"span"},{"style":{"height":19.2},"width":807.68,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/23-23.png","element":"img","alt":" Λ := {2−j|j = 0, 1, 2, . . . 
, ⌈1/2 log2(T)⌉}, and λ","inline":true,"padRight":true},{"text":"equal to the best empirical learning rate in ","element":"span"},{"style":{"height":11.6},"width":28,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/23-24.png","element":"img","alt":" Λ","inline":true},{"text":", Prop. ","element":"span"},{"href":"#id-94","text":"5.4 ","element":"a"},{"text":"yields that, if the black-box algorithm has dynamic regret as in Eq. ","element":"span"},{"href":"#id-57","text":"(10) ","element":"a"},{"text":"for any interval in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":", then","element":"span"}],[{"style":{"width":"89%"},"width":1421,"height":101,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/23-25.png","element":"img"}],[{"text":"Therefore, combining both results, we get that","element":"span"}],[{"style":{"width":"95%"},"width":1508,"height":213,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/24-0.png","element":"img"}],[{"text":"Thus, for ","element":"span"},{"style":{"height":33.1},"width":531.24,"height":82.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/24-1.png","element":"img","alt":" m =�2√T ∆pγ �2/3, with γ :=�","inline":true}],[{"style":{"width":"88%"},"width":1403,"height":475,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/24-2.png","element":"img"}]]},{"heading":"G Greedy MD-CURL dynamic regret analysis","paragraphs":[[{"text":"Here we introduce Greedy MD-CURL, developed by [","element":"span"},{"href":"#id-14","referenceIndex":39,"text":"39","element":"a"},{"text":"], a computationally efficient policy-optimization algorithm known for achieving sublinear static regret in online CURL with adversarial objective functions within a stationary MDP. We begin by detailing Greedy MD-CURL as presented in [","element":"span"},{"href":"#id-14","referenceIndex":39,"text":"39","element":"a"},{"text":"] in Alg. 
","element":"span"},{"href":"#id-85","text":"4. ","element":"a"},{"text":"We then provide a new analysis upper bounding the dynamic regret of Greedy MD-CURL in a quasi-stationary interval valid for any learning rate ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/24-3.png","element":"img","alt":" λ","inline":true},{"text":". Hence, Greedy MD-CURL can be used as a black-box for MetaCURL. This is the first dynamic regret analysis for a CURL approach.","element":"span"}],[{"text":"Let ","element":"span"},{"style":{"height":17.33},"width":89.76,"height":43.32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/24-4.png","element":"img","alt":" Mp,∗µ0","inline":true,"padRight":true},{"text":"denote the subset of ","element":"span"},{"style":{"height":17.14},"width":81.24,"height":42.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/24-5.png","element":"img","alt":" Mpµ0","inline":true,"padRight":true},{"text":"where the corresponding policies ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/24-6.png","element":"img","alt":" π","inline":true,"padRight":true},{"text":"are such that ","element":"span"},{"style":{"height":16},"width":203.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/24-7.png","element":"img","alt":" πn(a|x) ̸= 0","inline":true,"padRight":true},{"text":"for all ","element":"span"},{"style":{"height":16},"width":407,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/24-8.png","element":"img","alt":" (x, a) ∈ X × A, n ∈ [N]","inline":true},{"text":". 
For any probability transition ","element":"span"},{"style":{"height":17.33},"width":414.6,"height":43.32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/24-9.png","element":"img","alt":" p, Γ : Mpµ0 × Mp,∗µ0 → R","inline":true,"padRight":true},{"text":"such that, ","element":"span"},{"text":"for all ","element":"span"},{"style":{"height":17.15},"width":153.96,"height":42.88,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/24-10.png","element":"img","alt":" µ ∈ Mpµ0 ","inline":true,"padRight":true},{"text":"with its associated policy ","element":"span"},{"style":{"height":17.55},"width":285.2,"height":43.88,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/24-11.png","element":"img","alt":" π, and µ′ ∈ Mp,∗µ0 ","inline":true,"padRight":true},{"text":"with its associated policy ","element":"span"},{"style":{"height":13.6},"width":185.8,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/24-12.png","element":"img","alt":" π′, we have","inline":true}],[{"id":"id-96","style":{"width":"75%"},"width":1202,"height":116,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/24-13.png","element":"img"}],[{"text":"This divergence ","element":"span"},{"style":{"height":10.8},"width":25,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/24-14.png","element":"img","alt":" Γ","inline":true,"padRight":true},{"text":"is a Bregman divergence (see Proposition 4.3 of [","element":"span"},{"href":"#id-14","referenceIndex":39,"text":"39","element":"a"},{"text":"]). 
Problem ","element":"span"},{"href":"#id-85","text":"(20) ","element":"a"},{"text":"implemented with this Bregman divergence ","element":"span"},{"style":{"height":10.8},"width":25,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/24-15.png","element":"img","alt":" Γ","inline":true,"padRight":true},{"text":"has a closed-form solution, as shown in ","element":"span"},{"href":"#id-14","referenceIndex":39,"text":"[39]","element":"a"},{"text":".","element":"span"}],[{"id":"id-85","style":{"width":"100%"},"width":1587,"height":1367,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/25-0.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"G.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Dynamic regret analysis of Greedy MD-CURL","element":"span"}],[{"text":"We analyze the regret over an interval ","element":"span"},{"style":{"height":16},"width":291.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/25-1.png","element":"img","alt":" I := [ts, te] ⊆ [T]","inline":true},{"text":". We denote by ","element":"span"},{"style":{"height":13.2},"width":45.28,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/25-2.png","element":"img","alt":" RI","inline":true,"padRight":true},{"text":"the dynamic regret of an instance of Greedy MD-CURL started at episode ","element":"span"},{"style":{"height":12.4},"width":29.36,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/25-3.png","element":"img","alt":" ts","inline":true,"padRight":true},{"text":"and run until the end of interval ","element":"span"},{"style":{"fontStyle":"italic"},"text":"I ","element":"span"},{"text":"at episode ","element":"span"},{"style":{"height":12.38},"width":29.4,"height":30.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/25-4.png","element":"img","alt":"te","inline":true},{"text":". 
We denote by ","element":"span"},{"style":{"height":12.98},"width":36.16,"height":32.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/25-5.png","element":"img","alt":" πt ","inline":true,"padRight":true},{"text":"the policy produced by this instance of Greedy MD-CURL at episode ","element":"span"},{"style":{"height":16.18},"width":135.76,"height":40.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/25-6.png","element":"img","alt":" t ∈ I, pt","inline":true,"padRight":true},{"text":"the true probability transition kernel, and","element":"span"}],[{"style":{"width":"43%"},"width":688,"height":119,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/25-7.png","element":"img"}],[{"text":"the empirical estimate of the probability kernel at episode ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", with data from the beginning of the interval ","element":"span"},{"style":{"fontStyle":"italic"},"text":"I","element":"span"},{"text":".","element":"span"}],[{"text":"We define and decompose the dynamic regret ","element":"span"},{"style":{"height":13.18},"width":45.28,"height":32.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/25-8.png","element":"img","alt":" RI","inline":true,"padRight":true},{"text":"with respect to any sequence of policies ","element":"span"},{"style":{"height":16.99},"width":143.48,"height":42.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/25-9.png","element":"img","alt":" (πt,∗)t∈I","inline":true,"padRight":true},{"text":"into three terms as follows:","element":"span"}],[{"id":"id-102","style":{"width":"95%"},"width":1511,"height":362,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/25-10.png","element":"img"}],[{"text":"The terms 
","element":"span"},{"style":{"height":17.78},"width":162.68,"height":44.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/25-11.png","element":"img","alt":" RMDPI (πt)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":17.78},"width":188.48,"height":44.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/25-12.png","element":"img","alt":" RMDPI (πt,∗)","inline":true,"padRight":true},{"text":"pay for our lack of knowledge of the true MDP, forcing us to use its empirical estimate. The term ","element":"span"},{"style":{"height":20.59},"width":99.96,"height":51.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/25-13.png","element":"img","alt":" RpolicyI","inline":true,"padRight":true},{"text":"corresponds to the loss incurred in calculating the policy ","element":"span"},{"text":"by solving the optimization problem given in Eq. ","element":"span"},{"href":"#id-85","text":"(20)","element":"a"},{"text":". Below, we present the first analysis of the dynamic regret for a CURL algorithm. We consider each term separately.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"G.1.1 ","element":"span"},{"style":{"height":17.78},"width":241.6,"height":44.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/26-0.png","element":"img","alt":"RMDPI analysis","inline":true}],[{"text":"In Section ","element":"span"},{"text":"2 ","element":"span"},{"text":"we assume that the deterministic part of the dynamics, given by ","element":"span"},{"style":{"height":10},"width":39,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/26-1.png","element":"img","alt":" gn","inline":true,"padRight":true},{"text":"in equation ","element":"span"},{"href":"#id-50","text":"(8) ","element":"a"},{"text":"for each time step ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"text":", is known in advance. 
The uncertainty and non-stationarity in the MDP come only from the external noise dynamics, which are independent of the agent’s state-action pair. Therefore, we do not need to explore in this setting, so the analysis of the two terms ","element":"span"},{"style":{"height":17.78},"width":162.68,"height":44.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/26-2.png","element":"img","alt":" RMDPI (πt)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":17.78},"width":188.48,"height":44.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/26-3.png","element":"img","alt":"RMDPI (πt,∗)","inline":true,"padRight":true},{"text":"is the same.","element":"span"}],[{"id":"id-100","style":{"fontWeight":"bold"},"text":"Proposition G.1. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"With probability at least ","element":"span"},{"style":{"height":14},"width":98.84,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/26-4.png","element":"img","alt":" 1 − δ,","inline":true}],[{"style":{"width":"63%"},"width":1002,"height":121,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/26-5.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"for all intervals ","element":"span"},{"style":{"height":16},"width":120.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/26-6.png","element":"img","alt":" I ∈ [T]","inline":true},{"style":{"fontStyle":"italic"},"text":". The same result is valid for ","element":"span"},{"style":{"height":17.38},"width":197.92,"height":43.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/26-7.png","element":"img","alt":" RMDPI (πt,∗).","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"We start by using the convexity of ","element":"span"},{"style":{"height":12.99},"width":43.16,"height":32.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/26-8.png","element":"img","alt":" F t","inline":true},{"text":", Hölder’s inequality, the fact that ","element":"span"},{"style":{"height":12.99},"width":43.16,"height":32.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/26-9.png","element":"img","alt":" F t","inline":true,"padRight":true},{"text":"is ","element":"span"},{"style":{"height":13.58},"width":51.12,"height":33.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/26-10.png","element":"img","alt":" LF","inline":true,"padRight":true},{"text":"-Lipschitz, and Lemma ","element":"span"},{"href":"#id-82","text":"C.5 ","element":"a"},{"text":"to obtain that","element":"span"}],[{"id":"id-95","style":{"width":"86%"},"width":1378,"height":123,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/26-11.png","element":"img"}],[{"text":"Applying Lemmas ","element":"span"},{"href":"#id-87","text":"C.1 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-88","text":"C.2, ","element":"a"},{"text":"we have that for any ","element":"span"},{"style":{"height":16},"width":157,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/26-12.png","element":"img","alt":" δ ∈ (0, 1)","inline":true},{"text":", with probability at least ","element":"span"},{"style":{"height":14},"width":97.88,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/26-13.png","element":"img","alt":" 1 − δ,","inline":true}],[{"style":{"width":"83%"},"width":1321,"height":194,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/26-14.png","element":"img"}],[{"text":"Plugging this into the upper bound of Eq. 
","element":"span"},{"href":"#id-95","text":"(22)","element":"a"},{"text":", we conclude the proof:","element":"span"}],[{"style":{"width":"87%"},"width":1393,"height":351,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/26-15.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"G.1.2 ","element":"span"},{"style":{"height":20.05},"width":249.4,"height":50.12,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/26-16.png","element":"img","alt":"RpolicyI analysis","inline":true}],[{"id":"id-101","style":{"fontWeight":"bold"},"text":"Proposition G.2. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"fontStyle":"italic"},"text":"b ","element":"span"},{"style":{"fontStyle":"italic"},"text":"be a constant defined as","element":"span"}],[{"style":{"width":"79%"},"width":1256,"height":50,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/26-17.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Then, Greedy MD-CURL obtains, for any sequence of policies ","element":"span"},{"style":{"height":16.99},"width":143.48,"height":42.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/26-18.png","element":"img","alt":" (πt,∗)t∈I","inline":true},{"style":{"fontStyle":"italic"},"text":", and for any learning rate ","element":"span"},{"style":{"height":13.2},"width":107.32,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/26-19.png","element":"img","alt":"λ > 0,","inline":true}],[{"style":{"width":"47%"},"width":746,"height":81,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/26-20.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"We adapt the proof of Prop. 
","element":"span"},{"text":"5","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"7 ","element":"span"},{"text":"of [","element":"span"},{"href":"#id-14","referenceIndex":39,"text":"39","element":"a"},{"text":"], which upper bounds the static regret incurred when solving the optimization Problem ","element":"span"},{"href":"#id-85","text":"(20)","element":"a"},{"text":", to obtain an upper bound on the dynamic regret. The main difference is that, in the case of static regret, we compare ourselves to the same policy throughout the interval, whereas in the case of dynamic regret, at each episode we compare ourselves to a different policy given by ","element":"span"},{"style":{"height":12.98},"width":61.68,"height":32.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/26-21.png","element":"img","alt":" πt,∗","inline":true},{"text":". Consequently, the analysis remains the same as in [","element":"span"},{"href":"#id-14","referenceIndex":39,"text":"39","element":"a"},{"text":"] for all terms that do not depend on ","element":"span"},{"style":{"height":12.99},"width":61.64,"height":32.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/26-22.png","element":"img","alt":" πt,∗","inline":true},{"text":", but the terms that do depend on it require a new analysis.","element":"span"}],[{"text":"To simplify notation, we take ","element":"span"},{"style":{"height":19.79},"width":302.72,"height":49.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/27-0.png","element":"img","alt":" ℓt := ∇F t(µπt,ˆpt)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":18.99},"width":196,"height":47.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/27-1.png","element":"img","alt":" µt := µπt,ˆpt","inline":true},{"text":". 
We can use the same reasoning as in appendix ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D.","element":"span"},{"text":"5 ","element":"span"},{"text":"of ","element":"span"},{"href":"#id-14","referenceIndex":39,"text":"[39] ","element":"a"},{"text":"to show that","element":"span"}],[{"style":{"width":"100%"},"width":1617,"height":162,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/27-2.png","element":"img"}],[{"text":"Since the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"text":"term does not depend on ","element":"span"},{"style":{"height":12.98},"width":61.64,"height":32.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/27-3.png","element":"img","alt":" πt,∗","inline":true},{"text":", its analysis follows directly from ","element":"span"},{"href":"#id-14","referenceIndex":39,"text":"[39]","element":"a"},{"text":", and is given by","element":"span"}],[{"id":"id-98","style":{"width":"72%"},"width":1148,"height":117,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/27-4.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":13.6},"width":51.12,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/27-5.png","element":"img","alt":" LF","inline":true,"padRight":true},{"text":"is the Lipschitz constant of ","element":"span"},{"style":{"height":13.38},"width":160.32,"height":33.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/27-6.png","element":"img","alt":" F t and αt ","inline":true,"padRight":true},{"text":"is an input parameter of Greedy MD-CURL.","element":"span"}],[{"text":"We then proceed to analyze term ","element":"span"},{"style":{"fontStyle":"italic"},"text":"B","element":"span"},{"text":". 
Again, following the procedure of appendix ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D.","element":"span"},{"text":"5 ","element":"span"},{"text":"of [","element":"span"},{"href":"#id-14","referenceIndex":39,"text":"39","element":"a"},{"text":"], we obtain that","element":"span"}],[{"style":{"width":"90%"},"width":1429,"height":389,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/27-7.png","element":"img"}],[{"text":"Let ","element":"span"},{"style":{"height":17.39},"width":321.12,"height":43.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/27-8.png","element":"img","alt":" ψ : (SX×A)N → R","inline":true,"padRight":true},{"text":"denote the function inducing the Bregman divergence ","element":"span"},{"style":{"height":10.8},"width":25,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/27-9.png","element":"img","alt":" Γ","inline":true,"padRight":true},{"text":"of Eq. ","element":"span"},{"href":"#id-96","text":"(19)","element":"a"},{"text":". 
[","element":"span"},{"href":"#id-14","referenceIndex":39,"text":"39","element":"a"},{"text":"] further shows that:","element":"span"}],[{"style":{"width":"76%"},"width":1210,"height":73,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/27-10.png","element":"img"}],[{"text":"• ","element":"span"},{"style":{"height":17.58},"width":312.48,"height":43.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/27-11.png","element":"img","alt":" (ii) ≤ 2N ∑t∈I αt","inline":true},{"text":", and this upper bound holds independently of ","element":"span"},{"style":{"height":12.98},"width":61.64,"height":32.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/27-12.png","element":"img","alt":" πt,∗","inline":true}],[{"text":"• ","element":"span"},{"text":"(","element":"span"},{"style":{"height":20.19},"width":388.4,"height":50.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/27-13.png","element":"img","alt":"iii) ≤ Γ(µπt−1,∗,ˆpt, µt).","inline":true}],[{"text":"Lemma ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D.","element":"span"},{"text":"6 ","element":"span"},{"text":"of ","element":"span"},{"href":"#id-14","referenceIndex":39,"text":"[39] ","element":"a"},{"text":"shows that,","element":"span"}],[{"style":{"width":"54%"},"width":864,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/27-14.png","element":"img"}],[{"text":"Only term ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":")","element":"span"},{"text":", which involves ","element":"span"},{"style":{"height":20.99},"width":460.96,"height":52.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/27-15.png","element":"img","alt":" ∥µπt−1,∗,ˆpt − µπt,∗,ˆpt+1∥∞,1","inline":true},{"text":", depends on the sequence 
","element":"span"},{"style":{"height":18.67},"width":169.6,"height":46.68,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/27-16.png","element":"img","alt":" (πt,∗)t∈[T ]","inline":true},{"text":", and therefore requires a new analysis. For this purpose, we rely on the following two results:","element":"span"}],[{"text":"• From Lemma ","element":"span"},{"text":"5","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"6 ","element":"span"},{"text":"of ","element":"span"},{"href":"#id-14","referenceIndex":39,"text":"[39]","element":"a"},{"text":", we have that, for all strategies ","element":"span"},{"style":{"height":9.2},"width":34.16,"height":23,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/27-17.png","element":"img","alt":" π,","inline":true}],[{"style":{"width":"32%"},"width":512,"height":89,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/27-18.png","element":"img"}],[{"text":"• From auxiliary Lemma ","element":"span"},{"href":"#id-97","text":"C.6 ","element":"a"},{"text":"proved in Appendix ","element":"span"},{"text":"C ","element":"span"},{"text":"we have that","element":"span"}],[{"style":{"width":"83%"},"width":1328,"height":113,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/27-19.png","element":"img"}],[{"text":"Therefore, using the triangle inequality and the two results above, we obtain that","element":"span"}],[{"style":{"height":21.79},"width":1550.36,"height":54.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/27-20.png","element":"img","alt":"∥µπt−1,∗,ˆpt − µπt,∗,ˆpt+1∥∞,1 ≤ ∥µπt−1,∗,ˆpt − µπt−1,∗,ˆpt+1∥∞,1 + ∥µπt−1,∗,ˆpt+1 − µπt,∗,ˆpt+1∥∞,1","inline":true},{"style":{"height":32.51},"width":275.48,"height":81.28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/27-21.png","element":"img","alt":"≤ 2Nt + N∆π∗t .","inline":true}],[{"text":"Hence, the bound on term 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"B ","element":"span"},{"text":"is given by","element":"span"}],[{"id":"id-99","style":{"width":"87%"},"width":1392,"height":109,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/27-22.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Final step: joining all results","element":"span"}],[{"text":"Joining the upper bounds on term ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"text":"from Eq. ","element":"span"},{"href":"#id-98","text":"(23) ","element":"a"},{"text":"and on term ","element":"span"},{"style":{"fontStyle":"italic"},"text":"B ","element":"span"},{"text":"from Eq. ","element":"span"},{"href":"#id-99","text":"(24)","element":"a"},{"text":", we have that","element":"span"}],[{"style":{"width":"96%"},"width":1524,"height":313,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/28-0.png","element":"img"}],[{"text":"If we take the learning rate as ","element":"span"},{"style":{"height":16},"width":146.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/28-1.png","element":"img","alt":" αt = 1/t","inline":true},{"text":", then, for all ","element":"span"},{"style":{"height":13.2},"width":106.32,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/28-2.png","element":"img","alt":" λ > 0,","inline":true}],[{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"text":"policy","element":"span"},{"style":{"height":35.1},"width":640,"height":87.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/28-3.png","element":"img","alt":"I �(πt,∗)t ∈ I�≤ λL2F |I| + N 2λ ∆π∗I","inline":true}],[{"style":{"width":"100%"},"width":1600,"height":332,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/28-4.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"G.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Final Greedy MD-CURL 
regret analysis","element":"span"}],[{"text":"Replacing the bounds of Prop. ","element":"span"},{"href":"#id-100","text":"G.1 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-101","text":"G.2 ","element":"a"},{"text":"in Eq. ","element":"span"},{"href":"#id-102","text":"(21) ","element":"a"},{"text":"yields the final upper bound on Greedy MD-CURL’s dynamic regret for any interval ","element":"span"},{"style":{"height":13.2},"width":102.76,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/28-5.png","element":"img","alt":" I ⊆ [T]","inline":true,"padRight":true},{"text":"with respect to any sequence of policies ","element":"span"},{"style":{"height":16.98},"width":157.76,"height":42.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/28-6.png","element":"img","alt":" (πt,∗)t∈I:","inline":true,"padRight":true},{"id":"id-84","style":{"fontWeight":"bold"},"text":"Theorem G.3 ","element":"span"},{"text":"(Dynamic regret of Greedy MD-CURL)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"fontStyle":"italic"},"text":"b ","element":"span"},{"style":{"fontStyle":"italic"},"text":"be a constant defined as","element":"span"}],[{"style":{"width":"79%"},"width":1256,"height":50,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/28-7.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":16},"width":157,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/28-8.png","element":"img","alt":" δ ∈ (0, 1)","inline":true},{"style":{"fontStyle":"italic"},"text":". 
With probability at least ","element":"span"},{"style":{"height":11.6},"width":106.56,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/28-9.png","element":"img","alt":" 1 − 2δ","inline":true},{"style":{"fontStyle":"italic"},"text":", for any interval ","element":"span"},{"style":{"height":16},"width":124.68,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/28-10.png","element":"img","alt":" I ⊆ [T]","inline":true},{"style":{"fontStyle":"italic"},"text":", for any sequence of policies ","element":"span"},{"style":{"height":16.98},"width":143.48,"height":42.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/28-11.png","element":"img","alt":"(πt,∗)t∈I","inline":true},{"style":{"fontStyle":"italic"},"text":", for any learning rate ","element":"span"},{"style":{"height":16},"width":400.04,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/28-12.png","element":"img","alt":" λ > 0, and for αt := 1/t","inline":true},{"style":{"fontStyle":"italic"},"text":", Greedy MD-CURL obtains","element":"span"}],[{"style":{"width":"94%"},"width":1497,"height":121,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95035/images/28-13.png","element":"img"}],[{"text":"Hence, Greedy MD-CURL meets the requisite dynamic regret bound from Eq. ","element":"span"},{"href":"#id-57","text":"(10) ","element":"a"},{"text":"to serve as a black-box algorithm for MetaCURL achieving optimal dynamic regret.","element":"span"}]]},{"heading":"NeurIPS Paper Checklist","paragraphs":[[{"text":"1. 
","element":"span"},{"style":{"fontWeight":"bold"},"text":"Claims","element":"span"}],[{"text":"Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?","element":"span"}],[{"text":"Answer: ","element":"span"},{"text":"[Yes]","element":"span"}],[{"text":"Justification: We clearly provide the contributions of our work and our settings in the Introduction. We formally detail our hypothesis in Section ","element":"span"},{"text":"2. ","element":"span"},{"text":"We present the new algorithm in Section ","element":"span"},{"text":"4. ","element":"span"},{"text":"All regret results related to the new algorithm claimed in the abstract and introduction are stated in Section ","element":"span"},{"text":"5 ","element":"span"},{"text":"and proved in the Appendix.","element":"span"}],[{"text":"Guidelines:","element":"span"}],[{"text":"• The answer NA means that the abstract and introduction do not include the claims made in the paper.","element":"span"}],[{"text":"• The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.","element":"span"}],[{"text":"• The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.","element":"span"}],[{"text":"• It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.","element":"span"}],[{"text":"2. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Limitations","element":"span"}],[{"text":"Question: Does the paper discuss the limitations of the work performed by the authors? 
Answer: ","element":"span"},{"text":"[Yes]","element":"span"}],[{"text":"Justification: In Section ","element":"span"},{"text":"2, ","element":"span"},{"text":"we formally detail our assumptions and discuss the strong assumption made about probability transitions, justifying it as a means of providing low-complexity methods. We also discuss real-world applications that satisfy these assumptions. In Section ","element":"span"},{"text":"6, ","element":"span"},{"text":"we discuss the aspirational goal of addressing this limitation as future work. In Remark ","element":"span"},{"href":"#id-103","text":"3.1, ","element":"a"},{"text":"we discuss the complexity of the algorithm.","element":"span"}],[{"text":"Guidelines:","element":"span"}],[{"text":"• The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper.","element":"span"}],[{"text":"• The authors are encouraged to create a separate \"Limitations\" section in their paper.","element":"span"}],[{"text":"• The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.","element":"span"}],[{"text":"• The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.","element":"span"}],[{"text":"• The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. 
Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.","element":"span"}],[{"text":"• The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.","element":"span"}],[{"text":"• If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.","element":"span"}],[{"text":"• While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren’t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.","element":"span"}],[{"text":"3. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Theory Assumptions and Proofs","element":"span"}],[{"text":"Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?","element":"span"}],[{"text":"Answer: ","element":"span"},{"text":"[Yes]","element":"span"}],[{"text":"Justification: All the main results are in the paper, and all proofs are carefully stated in the Appendix. All auxiliary results used to obtain the main results are also included in the Appendix. A proof sketch is provided for the main results in the main paper. All results are properly referenced.","element":"span"}],[{"text":"Guidelines:","element":"span"}],[{"text":"• The answer NA means that the paper does not include theoretical results. 
• All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced.","element":"span"}],[{"text":"• All assumptions should be clearly stated or referenced in the statement of any theorems.","element":"span"}],[{"text":"• The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.","element":"span"}],[{"text":"• Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.","element":"span"}],[{"text":"• Theorems and Lemmas that the proof relies upon should be properly referenced.","element":"span"}],[{"text":"4. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Experimental Result Reproducibility","element":"span"}],[{"text":"Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?","element":"span"}],[{"text":"Answer: ","element":"span"},{"text":"[NA] ","element":"span"},{"text":"Justification: This is a theoretical paper, with no experiments. Guidelines: • The answer NA means that the paper does not include experiments.","element":"span"}],[{"text":"• If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not.","element":"span"}],[{"text":"• If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.","element":"span"}],[{"text":"• Depending on the contribution, reproducibility can be accomplished in various ways. 
For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.","element":"span"}],[{"text":"• While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm.","element":"span"}],[{"text":"(b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully.","element":"span"}],[{"text":"(c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset).","element":"span"}],[{"text":"(d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.","element":"span"}],[{"text":"5. 
","element":"span"},{"style":{"fontWeight":"bold"},"text":"Open access to data and code","element":"span"}],[{"text":"Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?","element":"span"}],[{"text":"Answer: ","element":"span"},{"text":"[NA] ","element":"span"},{"text":"Justification: This is a theoretical paper, with no experiments. Guidelines:","element":"span"}],[{"text":"• The answer NA means that paper does not include experiments requiring code. • Please see the NeurIPS code and data submission guidelines (","element":"span"},{"href":"https://nips.cc/public/guides/CodeSubmissionPolicy","style":{"fontFamily":"monospace"},"text":"https://nips.cc/ ","element":"a"},{"href":"https://nips.cc/public/guides/CodeSubmissionPolicy","style":{"fontFamily":"monospace"},"text":"public/guides/CodeSubmissionPolicy","element":"a"},{"text":") for more details.","element":"span"}],[{"text":"• While we encourage the release of code and data, we understand that this might not be possible, so “No” is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).","element":"span"}],[{"text":"• The instructions should contain the exact command and environment needed to run to reproduce the results. 
See the NeurIPS code and data submission guidelines (","element":"span"},{"href":"https://nips.cc/public/guides/CodeSubmissionPolicy","style":{"fontFamily":"monospace"},"text":"https: ","element":"a"},{"href":"https://nips.cc/public/guides/CodeSubmissionPolicy","style":{"fontFamily":"monospace"},"text":"//nips.cc/public/guides/CodeSubmissionPolicy","element":"a"},{"text":") for more details.","element":"span"}],[{"text":"• The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.","element":"span"}],[{"text":"• The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.","element":"span"}],[{"text":"• At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).","element":"span"}],[{"text":"• Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.","element":"span"}],[{"text":"6. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Experimental Setting/Details","element":"span"}],[{"text":"Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?","element":"span"}],[{"text":"Answer: ","element":"span"},{"text":"[NA] ","element":"span"},{"text":"Justification: This is a theoretical paper, with no experiments. Guidelines:","element":"span"}],[{"text":"• The answer NA means that the paper does not include experiments. 
• The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.","element":"span"}],[{"text":"• The full details can be provided either with the code, in appendix, or as supplemental material.","element":"span"}],[{"text":"7. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Experiment Statistical Significance","element":"span"}],[{"text":"Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?","element":"span"}],[{"text":"Answer: ","element":"span"},{"text":"[NA] ","element":"span"},{"text":"Justification: This is a theoretical paper, with no experiments. Guidelines: • The answer NA means that the paper does not include experiments.","element":"span"}],[{"text":"• The authors should answer \"Yes\" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.","element":"span"}],[{"text":"• The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).","element":"span"}],[{"text":"• The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.)","element":"span"}],[{"text":"• The assumptions made should be given (e.g., Normally distributed errors). • It should be clear whether the error bar is the standard deviation or the standard error of the mean.","element":"span"}],[{"text":"• It is OK to report 1-sigma error bars, but one should state it. 
The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.","element":"span"}],[{"text":"• For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates).","element":"span"}],[{"text":"• If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.","element":"span"}],[{"text":"8. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Experiments Compute Resources","element":"span"}],[{"text":"Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?","element":"span"}],[{"text":"Answer: ","element":"span"},{"text":"[NA] ","element":"span"},{"text":"Justification: This is a theoretical paper, with no experiments. Guidelines:","element":"span"}],[{"text":"• The answer NA means that the paper does not include experiments. • The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage.","element":"span"}],[{"text":"• The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.","element":"span"}],[{"text":"• The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn’t make it into the paper).","element":"span"}],[{"text":"9. 
","element":"span"},{"style":{"fontWeight":"bold"},"text":"Code Of Ethics","element":"span"}],[{"text":"Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics ","element":"span"},{"href":"https://neurips.cc/public/EthicsGuidelines","style":{"fontFamily":"monospace"},"text":"https://neurips.cc/public/EthicsGuidelines","element":"a"},{"text":"?","element":"span"}],[{"text":"Answer: ","element":"span"},{"text":"[Yes] ","element":"span"},{"text":"Justification: This work conform with the NeurIPS Code of Ethics. Guidelines:","element":"span"}],[{"text":"• The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics. • If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.","element":"span"}],[{"text":"• The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).","element":"span"}],[{"text":"10. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Broader Impacts","element":"span"}],[{"text":"Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?","element":"span"}],[{"text":"Answer: ","element":"span"},{"text":"[Yes]","element":"span"}],[{"text":"Justification: The results presented in this paper are largely theoretical. The framework provided in this paper is very general and could be applied to any reinforcement learning or concave utility reinforcement learning problem in a tabular MDP. 
Therefore, as with any reinforcement learning algorithm, it is possible that the algorithms developed from the ideas presented in this paper could be applied in contexts that have negative societal impacts, or in contexts where the reward function has negative societal impacts.","element":"span"}],[{"text":"Guidelines:","element":"span"}],[{"text":"• The answer NA means that there is no societal impact of the work performed. • If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.","element":"span"}],[{"text":"• Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.","element":"span"}],[{"text":"• The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. 
On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster.","element":"span"}],[{"text":"• The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.","element":"span"}],[{"text":"• If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).","element":"span"}],[{"text":"11. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Safeguards","element":"span"}],[{"text":"Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?","element":"span"}],[{"text":"Answer: ","element":"span"},{"text":"[NA] ","element":"span"},{"text":"Justification: This paper poses no such risks. Guidelines:","element":"span"}],[{"text":"• The answer NA means that the paper poses no such risks.","element":"span"}],[{"text":"• Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.","element":"span"}],[{"text":"• Datasets that have been scraped from the Internet could pose safety risks. 
The authors should describe how they avoided releasing unsafe images.","element":"span"}],[{"text":"• We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.","element":"span"}],[{"text":"12. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Licenses for existing assets","element":"span"}],[{"text":"Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?","element":"span"}],[{"text":"Answer: ","element":"span"},{"text":"[NA] ","element":"span"},{"text":"Justification: This paper does not use existing assets. Guidelines:","element":"span"}],[{"text":"• The answer NA means that the paper does not use existing assets. • The authors should cite the original paper that produced the code package or dataset. • The authors should state which version of the asset is used and, if possible, include a URL.","element":"span"}],[{"text":"• The name of the license (e.g., CC-BY 4.0) should be included for each asset. • For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.","element":"span"}],[{"text":"• If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"paperswithcode.com/datasets ","element":"span"},{"text":"has curated licenses for some datasets. 
Their licensing guide can help determine the license of a dataset.","element":"span"}],[{"text":"• For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.","element":"span"}],[{"text":"• If this information is not available online, the authors are encouraged to reach out to the asset’s creators.","element":"span"}],[{"text":"13. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"New Assets","element":"span"}],[{"text":"Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: ","element":"span"},{"text":"[NA] ","element":"span"},{"text":"Justification: This paper does not release new assets. Guidelines:","element":"span"}],[{"text":"• The answer NA means that the paper does not release new assets.","element":"span"}],[{"text":"• Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.","element":"span"}],[{"text":"• The paper should discuss whether and how consent was obtained from people whose asset is used.","element":"span"}],[{"text":"• At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.","element":"span"}],[{"text":"14. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Crowdsourcing and Research with Human Subjects","element":"span"}],[{"text":"Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?","element":"span"}],[{"text":"Answer: ","element":"span"},{"text":"[NA]","element":"span"}],[{"text":"Justification: This paper does not involve crowdsourcing nor research with human subjects. 
Guidelines:","element":"span"}],[{"text":"• The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.","element":"span"}],[{"text":"• Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.","element":"span"}],[{"text":"• According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.","element":"span"}],[{"text":"15. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects","element":"span"}],[{"text":"Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?","element":"span"}],[{"text":"Answer: ","element":"span"},{"text":"[NA] ","element":"span"},{"text":"Justification: This paper does not involve crowdsourcing nor research with human subjects.","element":"span"}],[{"text":"Guidelines:","element":"span"}],[{"text":"• The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.","element":"span"}],[{"text":"• Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. 
If you obtained IRB approval, you should clearly state this in the paper.","element":"span"}],[{"text":"• We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.","element":"span"}],[{"text":"• For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.","element":"span"}]]}],"_version":"3.3.4"},"paperNode":"$28:props:children:props:children:0:props:product"}]]