36:[["$","audio",null,{"id":"tts"}],["$","$L3b",null,{"paperID":"72879","publisher":"neurips","paperJSON":{"title":"Uncoupled and Convergent Learning in Two-Player Zero-Sum Markov Games with Bandit Feedback","paperID":"72879","avgLineHeight":10.96,"imgScale":4,"sections":[{"heading":"Abstract","paragraphs":[[{"text":"We revisit the problem of learning in two-player zero-sum Markov games, focusing on developing an algorithm that is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"uncoupled","element":"span"},{"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"convergent","element":"span"},{"text":", and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"rational","element":"span"},{"text":", with non-asymptotic convergence rates to Nash equilibrium. We start from the case of stateless matrix games with bandit feedback as a warm-up, showing an ","element":"span"},{"style":{"height":19.66},"width":128.76,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/0-0.png","element":"img","alt":" O(t^{-1/8})","inline":true,"padRight":true},{"text":"last-iterate convergence rate. To the best of our knowledge, this is the first result that obtains a finite last-iterate convergence rate given access to only bandit feedback. We extend our result to the case of irreducible Markov games, providing a last-iterate convergence rate of ","element":"span"},{"style":{"height":20.77},"width":162.64,"height":51.92,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/0-1.png","element":"img","alt":" O(t^{-1/(9+ε)})","inline":true,"padRight":true},{"text":"for any ","element":"span"},{"style":{"height":11.6},"width":98.68,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/0-2.png","element":"img","alt":" ε > 0","inline":true},{"text":". 
Finally, we study Markov games without any assumptions on the dynamics, and show a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"path convergence ","element":"span"},{"text":"rate, a new notion of convergence we define, of ","element":"span"},{"style":{"height":19.66},"width":142.32,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/0-3.png","element":"img","alt":" O(t^{-1/10})","inline":true},{"text":". Our algorithm removes the coordination and prior knowledge requirements of [","element":"span"},{"href":"#id-0","referenceIndex":59,"text":"WLZL21a","element":"a"},{"text":"], which pursued the same goals as ours for irreducible Markov games. Our algorithm is related to [","element":"span"},{"href":"#id-1","referenceIndex":11,"text":"CMZ21","element":"a"},{"text":", ","element":"span"},{"href":"#id-2","referenceIndex":14,"text":"CWC21","element":"a"},{"text":"] and also builds on the entropy regularization technique. However, we remove their requirement of communicating the entropy values, making our algorithm entirely uncoupled.","element":"span"}]]},{"heading":"1 Introduction","paragraphs":[[{"text":"In multi-agent learning, a central question is how to design algorithms so that agents can ","element":"span"},{"style":{"fontStyle":"italic"},"text":"independently ","element":"span"},{"text":"learn (i.e., with little coordination overhead) how to interact with each other. Additionally, it is desirable to maximally reuse existing single-agent learning algorithms, so that the multi-agent system can be built in a modular way. Motivated by this question, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"decentralized ","element":"span"},{"text":"multi-agent learning has emerged with the goal of designing decentralized systems, in which no central controller governs the policies of the agents, and each agent learns based on only its local information – just like in a single-agent algorithm. 
In recent years, we have witnessed significant success of this new decentralized learning paradigm. For example, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"self-play","element":"span"},{"text":", where each agent independently deploys the same single-agent algorithm to play against each other without further direct supervision, plays a crucial role in the training of AlphaGo [","element":"span"},{"href":"#id-3","referenceIndex":50,"text":"SSS","element":"a"},{"style":{"height":13.01},"width":63,"height":32.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/0-4.png","element":"img","alt":"+17","inline":true},{"text":"] and AI for Stratego [","element":"span"},{"href":"#id-4","referenceIndex":43,"text":"PDVH","element":"a"},{"style":{"height":15.39},"width":89.6,"height":38.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/0-5.png","element":"img","alt":"+22].","inline":true}],[{"text":"Despite the recent success, many important questions remain open in decentralized multi-agent learning. Indeed, unless the decentralized algorithm is carefully designed, self-play often falls short of attaining certain sought-after global characteristics, such as convergence to the global optimum or stability as seen in, for example, [","element":"span"},{"href":"#id-5","referenceIndex":39,"text":"MPP18","element":"a"},{"text":", ","element":"span"},{"href":"#id-6","referenceIndex":6,"text":"BP18","element":"a"},{"text":"].","element":"span"}],[{"text":"In this work, we revisit the problem of learning in two-player zero-sum Markov games, which has received extensive attention recently. 
Our goal is to design a decentralized algorithm that resembles ","element":"span"},{"text":"standard single-agent reinforcement learning (RL) algorithms, but with an additional crucial assurance, that is, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"guaranteed convergence ","element":"span"},{"text":"when both players deploy the algorithm. The simultaneous pursuit of independence and convergence has been advocated widely [","element":"span"},{"href":"#id-7","referenceIndex":8,"text":"BV01","element":"a"},{"text":", ","element":"span"},{"href":"#id-8","referenceIndex":2,"text":"AY16","element":"a"},{"text":", ","element":"span"},{"href":"#id-0","referenceIndex":59,"text":"WLZL21a","element":"a"},{"text":", ","element":"span"},{"href":"#id-9","referenceIndex":51,"text":"SZL","element":"a"},{"style":{"height":12.8},"width":60.48,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/1-0.png","element":"img","alt":"+21","inline":true},{"text":"], while the results are still not entirely satisfactory. In particular, all of these results rely on assumptions on the dynamics of the Markov game. 
Our paper takes the first step to remove such assumptions.","element":"span"}],[{"text":"More specifically, our goal is to design algorithms that simultaneously satisfy the following three properties (the definitions are adapted from [","element":"span"},{"href":"#id-7","referenceIndex":8,"text":"BV01","element":"a"},{"text":", ","element":"span"},{"href":"#id-10","referenceIndex":17,"text":"DDK11","element":"a"},{"text":"]):","element":"span"}],[{"text":"• ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Uncoupled","element":"span"},{"text":": Each player ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":"’s action is generated by a standalone procedure ","element":"span"},{"style":{"height":13.6},"width":37,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/1-1.png","element":"img","alt":" Pi","inline":true,"padRight":true},{"text":"which, in every round, only receives the current state and player ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":"’s own reward as feedback (in particular, it has no knowledge about the actions or policies used by the opponent). 
There is no communication or shared randomness between the players.","element":"span"}],[{"text":"• ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Convergent","element":"span"},{"text":": The policy pair of the two players converges to a Nash equilibrium.","element":"span"}],[{"text":"• ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Rational","element":"span"},{"text":": If ","element":"span"},{"style":{"height":13.39},"width":37,"height":33.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/1-2.png","element":"img","alt":" Pi","inline":true,"padRight":true},{"text":"competes with an opponent who uses a policy sequence that converges to a stationary one, then ","element":"span"},{"style":{"height":13.6},"width":37,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/1-3.png","element":"img","alt":" Pi","inline":true,"padRight":true},{"text":"converges to the best response of this stationary policy.","element":"span"}],[{"text":"The uncoupledness and rationality property capture the independence of the algorithm, while the convergence property provides a desirable global guarantee. Interestingly, as argued in [","element":"span"},{"href":"#id-0","referenceIndex":59,"text":"WLZL21a","element":"a"},{"text":"], if an algorithm is uncoupled and convergent, then it is also rational, so we only need to ensure that the algorithm is uncoupled and convergent. Regarding the notion of convergence, the standard definition above only allows ","element":"span"},{"style":{"fontStyle":"italic"},"text":"last-iterate ","element":"span"},{"text":"convergence. 
Considering the difficulty of achieving such convergence, in the related work review (","element":"span"},{"text":"Section 2","element":"span"},{"text":") and in the design of our algorithm for general Markov games (","element":"span"},{"text":"Section 6","element":"span"},{"text":"), we also consider weaker notions of convergence, including the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"best-iterate ","element":"span"},{"text":"convergence, which only requires that the Cesàro mean of the duality gap is convergent, and the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"path ","element":"span"},{"text":"convergence, which only requires the convergence of the Cesàro mean of the duality gap ","element":"span"},{"style":{"fontStyle":"italic"},"text":"assuming minimax/maximin policies are followed in future steps","element":"span"},{"text":". The precise definitions of these convergence notions are given at the end of ","element":"span"},{"text":"Section 3","element":"span"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"1.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Our Contributions","element":"span"}],[{"text":"The main results in this work are as follows (see also ","element":"span"},{"href":"#id-11","text":"Table 1 ","element":"a"},{"text":"for comparisons with prior works):","element":"span"}],[{"text":"• As a warm-up, for the special case of matrix games with bandit feedback, we develop an uncoupled algorithm with a last-iterate convergence rate of ","element":"span"},{"style":{"height":19.39},"width":122.48,"height":48.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/1-4.png","element":"img","alt":" O(t^{-1/8})","inline":true,"padRight":true},{"text":"under self-play (","element":"span"},{"text":"Section 4","element":"span"},{"text":"). 
To the best of our knowledge, this is the first algorithm with a provable last-iterate convergence rate in this setting.","element":"span"}],[{"text":"• Generalizing the ideas from matrix games, we further develop an uncoupled algorithm for irreducible Markov games with a last-iterate convergence rate of ","element":"span"},{"style":{"height":20.4},"width":156.52,"height":51,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/1-5.png","element":"img","alt":" O(t^{-1/(9+ε)})","inline":true,"padRight":true},{"text":"for any ","element":"span"},{"style":{"height":11.6},"width":103,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/1-6.png","element":"img","alt":" ε > 0","inline":true,"padRight":true},{"text":"under self-play (","element":"span"},{"text":"Section 5","element":"span"},{"text":").","element":"span"}],[{"text":"• Finally, for general Markov games without additional assumptions, we develop an uncoupled algorithm with a path convergence rate of ","element":"span"},{"style":{"height":19.41},"width":136.48,"height":48.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/1-7.png","element":"img","alt":" O(t^{-1/10})","inline":true,"padRight":true},{"text":"under self-play (","element":"span"},{"text":"Section 6","element":"span"},{"text":").","element":"span"}],[{"text":"Our algorithms leverage recent advances on using entropy to regularize the policy updates [","element":"span"},{"href":"#id-2","referenceIndex":14,"text":"CWC21","element":"a"},{"text":", ","element":"span"},{"href":"#id-1","referenceIndex":11,"text":"CMZ21","element":"a"},{"text":"] and the Nash-V-styled value updates [","element":"span"},{"href":"#id-12","referenceIndex":4,"text":"BJY20","element":"a"},{"text":"]. 
On the one hand, compared to [","element":"span"},{"href":"#id-2","referenceIndex":14,"text":"CWC21","element":"a"},{"text":", ","element":"span"},{"href":"#id-1","referenceIndex":11,"text":"CMZ21","element":"a"},{"text":"], our algorithm has the following advantages: 1) it does not require the two players to exchange their entropy information, which allows our algorithm to be fully uncoupled; 2) it does not require the players to have coordinated policy updates; 3) it naturally extends to general Markov games without any assumptions on the dynamics (e.g., irreducibility). On the other hand, our algorithm inherits appealing properties of Nash-V [","element":"span"},{"href":"#id-12","referenceIndex":4,"text":"BJY20","element":"a"},{"text":"], but additionally guarantees path convergence during execution.","element":"span"}]]},{"heading":"2 Related Work","paragraphs":[[{"text":"The study of two-player zero-sum Markov games originated from [","element":"span"},{"href":"#id-13","referenceIndex":45,"text":"Sha53","element":"a"},{"text":"], with many other works further developing algorithms and establishing convergence properties [","element":"span"},{"href":"#id-14","referenceIndex":27,"text":"HK66","element":"a"},{"text":", ","element":"span"},{"href":"#id-15","referenceIndex":42,"text":"PAI69","element":"a"},{"text":", ","element":"span"},{"href":"#id-16","referenceIndex":55,"text":"VDW78","element":"a"},{"text":", ","element":"span"},{"href":"#id-17","referenceIndex":21,"text":"FT91","element":"a"},{"text":"]. However, these works primarily focused on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"solving ","element":"span"},{"text":"the game with full knowledge of its parameters (i.e., payoff function and transition kernel). 
The problem of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"learning ","element":"span"},{"text":"in zero-sum games was first formalized by [","element":"span"},{"href":"#id-18","referenceIndex":34,"text":"Lit94","element":"a"},{"text":"]. Designing a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"provably ","element":"span"},{"text":"uncoupled, rational, and convergent algorithm","element":"span"}],[{"id":"id-11","text":"Table 1: (Sample-based) Learning algorithms for finding NE in two-player zero-sum games. Our ","element":"figcaption","subtype":"caption"},{"text":"results are shaded. A halfcheck “","element":"figcaption","subtype":"caption"},{"style":{"height":11.79},"width":27,"height":29.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/2-0.png","element":"img","alt":"✓","inline":true},{"text":"” in the convergent column means that the policy convergence is proven only for one player (typically this is a result of asymmetric updates). (L) and (B) stand for last-iterate convergence and best-iterate convergence, respectively. (P) stands for path convergence, a weaker convergence notion we introduce (see ","element":"figcaption","subtype":"caption"},{"text":"Section 3","element":"span","subtype":"caption"},{"text":", ","element":"figcaption","subtype":"caption"},{"href":"#id-19","text":"6.1","element":"a","subtype":"caption"},{"text":").","element":"figcaption","subtype":"caption"}],[{"text":"*: While [","element":"span"},{"href":"#id-0","referenceIndex":59,"text":"WLZL21a","element":"a"},{"text":"] also proposes an uncoupled and convergent algorithm for irreducible Markov games, their algorithm requires coordinated updates and some prior knowledge of the game, while ours does not. 
See ","element":"span"},{"href":"#id-20","text":"Section 2.1 ","element":"a"},{"text":"for a more detailed discussion.","element":"span"}],[{"style":{"width":"93%"},"width":1490,"height":672,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/2-1.png","element":"img"}],[{"text":"is challenging, with many attempts [","element":"span"},{"href":"#id-21","referenceIndex":46,"text":"SL99","element":"a"},{"text":", ","element":"span"},{"href":"#id-7","referenceIndex":8,"text":"BV01","element":"a"},{"text":", ","element":"span"},{"href":"#id-22","referenceIndex":29,"text":"HW03","element":"a"},{"text":", ","element":"span"},{"href":"#id-23","referenceIndex":13,"text":"CS07","element":"a"},{"text":", ","element":"span"},{"href":"#id-8","referenceIndex":2,"text":"AY16","element":"a"},{"text":", ","element":"span"},{"href":"#id-24","referenceIndex":48,"text":"SPO22","element":"a"},{"text":"] falling short in one aspect or another, often lacking either uncoupledness or convergence. Moreover, these works only establish asymptotic convergence without providing a concrete convergence rate.","element":"span"}],[{"id":"id-20","style":{"fontWeight":"bold"},"text":"2.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Non-asymptotic convergence guarantees","element":"span"}],[{"text":"Recently, a large body of works on learning two-player zero-sum Markov games use regret minimization techniques to establish ","element":"span"},{"style":{"fontStyle":"italic"},"text":"non-asymptotic ","element":"span"},{"text":"guarantees. 
They focus on fast computation under full information of payoff and transitions [","element":"span"},{"href":"#id-2","referenceIndex":14,"text":"CWC21","element":"a"},{"text":", ","element":"span"},{"href":"#id-25","referenceIndex":9,"text":"CCDX23","element":"a"},{"text":", ","element":"span"},{"href":"#id-26","referenceIndex":64,"text":"ZLW","element":"a"},{"style":{"height":12.8},"width":64,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/2-2.png","element":"img","alt":"+22","inline":true},{"text":", ","element":"span"},{"href":"#id-27","referenceIndex":47,"text":"SLY23","element":"a"},{"text":", ","element":"span"},{"href":"#id-28","referenceIndex":63,"text":"YM23","element":"a"},{"text":"], though many of their algorithms are decentralized and can be viewed as the first step towards the learning setting.","element":"span"}],[{"text":"With rationality and uncoupledness satisfied, [","element":"span"},{"href":"#id-29","referenceIndex":18,"text":"DFG20","element":"a"},{"text":"] established one-sided policy convergence for players using independent policy gradient with asymmetric learning rates. Such an asymmetric update rule is also adopted by [","element":"span"},{"href":"#id-30","referenceIndex":65,"text":"ZTLD22","element":"a"},{"text":", ","element":"span"},{"href":"#id-31","referenceIndex":1,"text":"AVHC22","element":"a"},{"text":"] to establish one-sided policy convergence guarantees. When using a symmetric update rule, [","element":"span"},{"href":"#id-9","referenceIndex":51,"text":"SZL","element":"a"},{"style":{"height":12.8},"width":61,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/2-3.png","element":"img","alt":"+21","inline":true},{"text":"] developed a decentralized-Q learning algorithm. 
However, the convergence is only shown for the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"V ","element":"span"},{"text":"-function maintained by the players instead of the policies being used, so the policies may still cycle and are not provably convergent in our definition. [","element":"span"},{"href":"#id-32","referenceIndex":20,"text":"ELS","element":"a"},{"style":{"height":13.2},"width":62.48,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/2-4.png","element":"img","alt":"+23","inline":true},{"text":"] studied regret minimization in general-sum Markov games and provided an algorithm with sublinear regret under self-play and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"average-iterate ","element":"span"},{"text":"convergence rates to equilibria, while our work focuses on last-iterate convergence rates to Nash equilibria.","element":"span"}],[{"text":"To our knowledge, [","element":"span"},{"href":"#id-0","referenceIndex":59,"text":"WLZL21a","element":"a"},{"text":"] first provided an uncoupled, rational, and convergent algorithm with non-asymptotic convergence guarantees, albeit only for irreducible Markov games. They achieved this via ","element":"span"},{"style":{"fontStyle":"italic"},"text":"optimistic gradient descent/ascent","element":"span"},{"text":". Despite satisfying all our criteria, their algorithm still has unnatural coordination between the players and requires some prior knowledge of the game, such as the maximum revisiting time of the Markov game. Our algorithm removes all these extra requirements. 
A follow-up work by [","element":"span"},{"href":"#id-1","referenceIndex":11,"text":"CMZ21","element":"a"},{"text":"] improved the rate of [","element":"span"},{"href":"#id-0","referenceIndex":59,"text":"WLZL21a","element":"a"},{"text":"] using entropy regularization; however, this requires their players to inform the opponent about the entropy of their own policy, making the algorithm coupled again. We show that such an exchange of information is unnecessary under entropy regularization.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"2.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Further handling exploration","element":"span"}],[{"text":"The algorithms introduced above all require full information or some assumption on the dynamics of the Markov game. To handle exploration, some works design coupled learning algorithms ","element":"span"},{"text":"which guarantee that the player’s long-term payoff is at least the minimax value [","element":"span"},{"href":"#id-33","referenceIndex":7,"text":"BT02","element":"a"},{"text":", ","element":"span"},{"href":"#id-34","referenceIndex":58,"text":"WHL17","element":"a"},{"text":", ","element":"span"},{"href":"#id-35","referenceIndex":61,"text":"XCWY20","element":"a"},{"text":", ","element":"span"},{"href":"#id-36","referenceIndex":28,"text":"HLWY22","element":"a"},{"text":", ","element":"span"},{"href":"#id-37","referenceIndex":32,"text":"JLY22","element":"a"},{"text":", ","element":"span"},{"href":"#id-38","referenceIndex":30,"text":"JJJN21","element":"a"},{"text":", ","element":"span"},{"href":"#id-39","referenceIndex":62,"text":"XZS","element":"a"},{"style":{"height":12.99},"width":66.4,"height":32.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/3-0.png","element":"img","alt":"+22","inline":true},{"text":"]. 
Interestingly, as shown in [","element":"span"},{"href":"#id-34","referenceIndex":58,"text":"WHL17","element":"a"},{"text":", ","element":"span"},{"href":"#id-36","referenceIndex":28,"text":"HLWY22","element":"a"},{"text":", ","element":"span"},{"href":"#id-37","referenceIndex":32,"text":"JLY22","element":"a"},{"text":", ","element":"span"},{"href":"#id-39","referenceIndex":62,"text":"XZS","element":"a"},{"style":{"height":12.99},"width":66.36,"height":32.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/3-1.png","element":"img","alt":"+22","inline":true},{"text":"], if the player is paired with an optimistic best-response opponent (instead of using the same algorithm), the first player’s strategy can converge to the minimax policy. [","element":"span"},{"href":"#id-35","referenceIndex":61,"text":"XCWY20","element":"a"},{"text":", ","element":"span"},{"href":"#id-40","referenceIndex":3,"text":"BJ20","element":"a"},{"text":", ","element":"span"},{"href":"#id-41","referenceIndex":36,"text":"LYBJ21","element":"a"},{"text":", ","element":"span"},{"href":"#id-42","referenceIndex":16,"text":"CZG22","element":"a"},{"text":"] developed another coupled learning framework to handle exploration, but with symmetric updates on both players. In each round, the players need to jointly solve a general-sum equilibrium problem due to the different exploration bonus added by each player. Hence, the execution of these algorithms is more similar to the Nash-Q algorithm by [","element":"span"},{"href":"#id-22","referenceIndex":29,"text":"HW03","element":"a"},{"text":"].","element":"span"}],[{"text":"So far, exploration has been handled through coupled approaches that are also not rational. To our knowledge, the first uncoupled and rational algorithm that handles exploration is the Nash-V algorithm by [","element":"span"},{"href":"#id-12","referenceIndex":4,"text":"BJY20","element":"a"},{"text":"]. 
Nash-V can output a nearly-minimax policy through weighted averaging [","element":"span"},{"href":"#id-43","referenceIndex":31,"text":"JLWY21","element":"a"},{"text":"]; however, it is not provably convergent during execution. A major remaining open problem is whether one can design a natural algorithm that is provably rational, uncoupled, and convergent with exploration capability. Our work provides the first progress towards this goal.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"2.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Other works on last-iterate convergence","element":"span"}],[{"text":"Uncoupled learning dynamics in normal-form games with provable last-iterate convergence rates have received extensive attention recently. ","element":"span"},{"text":"Most of the works assume that the players receive gradient feedback, and convergence results under bandit feedback remain sparse. ","element":"span"},{"text":"Linear convergence is shown for strongly monotone games or bilinear games under gradient feedback [","element":"span"},{"href":"#id-44","referenceIndex":53,"text":"Tse95","element":"a"},{"text":", ","element":"span"},{"href":"#id-45","referenceIndex":35,"text":"LS19","element":"a"},{"text":", ","element":"span"},{"href":"#id-46","referenceIndex":38,"text":"MOP20","element":"a"},{"text":", ","element":"span"},{"href":"#id-47","referenceIndex":60,"text":"WLZL21b","element":"a"},{"text":"] and sublinear rates are proven for strongly monotone games with bandit feedback [","element":"span"},{"href":"#id-48","referenceIndex":5,"text":"BLM18","element":"a"},{"text":", ","element":"span"},{"href":"#id-49","referenceIndex":26,"text":"HIMM19","element":"a"},{"text":", ","element":"span"},{"href":"#id-50","referenceIndex":37,"text":"LZBZ21","element":"a"},{"text":", ","element":"span"},{"href":"#id-51","referenceIndex":52,"text":"TK22","element":"a"},{"text":", 
","element":"span"},{"href":"#id-52","referenceIndex":19,"text":"DFR22","element":"a"},{"text":", ","element":"span"},{"href":"#id-53","referenceIndex":25,"text":"HH23","element":"a"},{"text":"]. The convergence rate to strict Nash equilibria is analyzed by [","element":"span"},{"href":"#id-54","referenceIndex":24,"text":"GVGM21","element":"a"},{"text":"]. For monotone games, which include two-player zero-sum games as a special case, the last-iterate convergence rate of no-regret learning under gradient feedback has been shown recently [","element":"span"},{"href":"#id-55","referenceIndex":22,"text":"GPD20","element":"a"},{"text":", ","element":"span"},{"href":"#id-56","referenceIndex":12,"text":"COZ22","element":"a"},{"text":", ","element":"span"},{"href":"#id-57","referenceIndex":23,"text":"GTG22","element":"a"},{"text":", ","element":"span"},{"href":"#id-58","referenceIndex":15,"text":"CZ23","element":"a"},{"text":"]. With bandit feedback, [","element":"span"},{"href":"#id-59","referenceIndex":40,"text":"MPS20","element":"a"},{"text":"] showed an impossibility result that certain algorithms with optimal ","element":"span"},{"style":{"height":18.4},"width":120.48,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/3-2.png","element":"img","alt":" O(√T)","inline":true,"padRight":true},{"text":"regret do not converge in last-iterate. 
To the best of our knowledge, there are no natural uncoupled learning dynamics with a provable last-iterate convergence rate in two-player zero-sum games with bandit feedback.","element":"span"}]]},{"heading":"3 Preliminaries","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"Basic Notations ","element":"span"},{"text":"Throughout the paper, we assume for simplicity that the action sets of the two players are the same, denoted by ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"text":"with cardinality ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"|A|","element":"span"},{"text":".","element":"span"},{"text":"1 ","element":"span"},{"text":"We usually call player 1 the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":"-player and player ","element":"span"},{"text":"2 ","element":"span"},{"text":"the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y","element":"span"},{"text":"-player. The set of mixed strategies over an action set ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"text":"is denoted as ","element":"span"},{"style":{"height":17.58},"width":906.88,"height":43.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/3-3.png","element":"img","alt":" ∆_A := {x : ∑_{a∈A} x_a = 1; 0 ≤ x_a ≤ 1, ∀a ∈ A}","inline":true},{"text":". To simplify notation, we denote by ","element":"span"},{"style":{"fontStyle":"italic"},"text":"z ","element":"span"},{"text":"= (","element":"span"},{"style":{"fontStyle":"italic"},"text":"x, y","element":"span"},{"text":") ","element":"span"},{"text":"the concatenated strategy of the players. 
We use ","element":"span"},{"style":{"height":14.4},"width":21.52,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/3-4.png","element":"img","alt":" ϕ","inline":true,"padRight":true},{"text":"as the entropy function such that ","element":"span"},{"style":{"height":17.58},"width":554.32,"height":43.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/3-5.png","element":"img","alt":"ϕ(x) = −∑_{a∈A} x_a ln x_a, and KL","inline":true,"padRight":true},{"text":"as the Kullback–Leibler (KL) divergence such that KL","element":"span"},{"style":{"height":16},"width":147.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/3-6.png","element":"img","alt":"(x, x′) =","inline":true},{"style":{"height":20.64},"width":241.44,"height":51.6,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/3-7.png","element":"img","alt":"∑_{a∈A} x_a ln(x_a / x′_a) ","inline":true,"padRight":true},{"text":". The all-one vector is denoted by ","element":"span"},{"style":{"height":16},"width":299.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/3-8.png","element":"img","alt":" 1 = (1, 1, · · · , 1) .","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Matrix Games ","element":"span"},{"text":"In a two-player zero-sum matrix game with a loss matrix ","element":"span"},{"style":{"height":17.79},"width":236.72,"height":44.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/3-9.png","element":"img","alt":" G ∈ [0, 1]^{A×A}","inline":true},{"text":", when the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":"-player chooses action ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"and the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y","element":"span"},{"text":"-player chooses action ","element":"span"},{"style":{"fontStyle":"italic"},"text":"b","element":"span"},{"text":", the 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":"-player suffers loss ","element":"span"},{"style":{"height":16},"width":70,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/3-10.png","element":"img","alt":" G_{a,b}","inline":true,"padRight":true},{"text":"and the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y","element":"span"},{"text":"-player suffers loss ","element":"span"},{"style":{"height":13.81},"width":99.52,"height":34.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/3-11.png","element":"img","alt":" −G_{a,b}","inline":true},{"text":". A pair of mixed strategies ","element":"span"},{"style":{"height":16},"width":193.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/3-12.png","element":"img","alt":" (x⋆, y⋆) is a","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"Nash equilibrium ","element":"span"},{"text":"for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"G ","element":"span"},{"text":"if for any strategy profile ","element":"span"},{"style":{"height":16},"width":310.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/3-13.png","element":"img","alt":" (x, y) ∈ ∆A × ∆A","inline":true},{"text":", it holds that ","element":"span"},{"style":{"height":17.6},"width":541,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/3-14.png","element":"img","alt":" (x⋆)⊤Gy ≤ (x⋆)⊤Gy⋆ ≤ x⊤Gy⋆","inline":true},{"text":". 
Similarly, ","element":"span"},{"style":{"height":16},"width":122,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/3-15.png","element":"img","alt":"(x⋆, y⋆)","inline":true,"padRight":true},{"text":"is a Nash equilibrium for a two-player zero-sum game with a general convex-concave loss function ","element":"span"},{"style":{"height":16},"width":414.68,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/3-16.png","element":"img","alt":" f(x, y) : ∆A × ∆A → R","inline":true,"padRight":true},{"text":"if for all ","element":"span"},{"style":{"height":16},"width":866.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/3-17.png","element":"img","alt":" (x, y) ∈ ∆A × ∆A, f(x⋆, y) ≤ f(x⋆, y⋆) ≤ f(x, y⋆)","inline":true},{"text":". The celebrated minimax theorem [","element":"span"},{"href":"#id-60","referenceIndex":56,"text":"vN28","element":"a"},{"text":"] guarantees the existence of Nash equilibria in two-player zero-sum games. 
For a pair of strategies ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x, y","element":"span"},{"text":")","element":"span"},{"text":", we use the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"duality gap ","element":"span"},{"text":"defined as G","element":"span"},{"text":"AP","element":"span"},{"style":{"height":17.81},"width":187,"height":44.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/3-18.png","element":"img","alt":"(G, x, y) ≜","inline":true},{"style":{"height":18.4},"width":485.48,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/3-19.png","element":"img","alt":"maxy′ x⊤Gy′ − minx′ x′⊤Gy","inline":true,"padRight":true},{"text":"to measure its proximity to Nash equilibria.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Markov Games ","element":"span"},{"text":"A generalization of matrix games, which models dynamically changing environment, is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Markov games","element":"span"},{"text":". 
We consider infinite-horizon discounted two-player zero-sum Markov games, denoted by a tuple ","element":"span"},{"style":{"height":16},"width":455,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/3-20.png","element":"img","alt":" (S, A, (Gs)s∈S, (P s)s∈S, γ)","inline":true,"padRight":true},{"text":"where (1) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"S ","element":"span"},{"text":"is a finite state space; (2) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"text":"is a finite action space for both players; (3) Player 1 suffers loss ","element":"span"},{"style":{"height":18.59},"width":200.52,"height":46.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/3-21.png","element":"img","alt":" Gsa,b ∈ [0, 1]","inline":true,"padRight":true},{"text":"(respectively player 2 suffers","element":"span"}],[{"text":"loss ","element":"span"},{"style":{"height":18},"width":99.48,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/4-0.png","element":"img","alt":" −Gsa,b","inline":true},{"text":") when player 1 chooses action ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"and player 2 chooses action ","element":"span"},{"style":{"fontStyle":"italic"},"text":"b ","element":"span"},{"text":"at state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":"; (4) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P ","element":"span"},{"text":"is the ","element":"span"},{"text":"transition function such that ","element":"span"},{"style":{"height":18.93},"width":129.68,"height":47.32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/4-1.png","element":"img","alt":" P sa,b(s′)","inline":true,"padRight":true},{"text":"is the probability of transiting to state 
","element":"span"},{"style":{"height":12.19},"width":25.48,"height":30.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/4-2.png","element":"img","alt":" s′","inline":true,"padRight":true},{"text":"when player 1 plays ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"and player 2 plays ","element":"span"},{"style":{"height":19.2},"width":401,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/4-3.png","element":"img","alt":" b at state s; (5) γ ∈ [ 12, 1)","inline":true,"padRight":true},{"text":"is a discount factor.","element":"span"}],[{"text":"A stationary policy for player 1 is a mapping ","element":"span"},{"style":{"height":14.8},"width":150.16,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/4-4.png","element":"img","alt":" S → ∆A","inline":true,"padRight":true},{"text":"that specifies player 1’s strategy ","element":"span"},{"style":{"height":14.8},"width":149.6,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/4-5.png","element":"img","alt":" xs ∈ ∆A","inline":true,"padRight":true},{"text":"at each state ","element":"span"},{"style":{"height":12.21},"width":104,"height":30.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/4-6.png","element":"img","alt":" s ∈ S","inline":true},{"text":". We denote ","element":"span"},{"style":{"height":16},"width":216.56,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/4-7.png","element":"img","alt":" x = (xs)s∈S","inline":true},{"text":". Similar notations apply to player 2. 
We denote ","element":"span"},{"style":{"height":16},"width":218.56,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/4-8.png","element":"img","alt":"zs = (xs, ys)","inline":true,"padRight":true},{"text":"as the concatenated strategy for the players and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"z ","element":"span"},{"text":"= (","element":"span"},{"style":{"fontStyle":"italic"},"text":"x, y","element":"span"},{"text":")","element":"span"},{"text":". The value function ","element":"span"},{"style":{"height":17.39},"width":65,"height":43.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/4-9.png","element":"img","alt":" V sx,y","inline":true,"padRight":true},{"text":"denotes the expected loss of player 1 (or the expected payoff of player 2) given a pair of stationary policies ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x, y","element":"span"},{"text":") ","element":"span"},{"text":"and initial state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":":","element":"span"}],[{"style":{"width":"81%"},"width":1290,"height":126,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/4-10.png","element":"img"}],[{"text":"The ","element":"span"},{"style":{"fontStyle":"italic"},"text":"minimax game value ","element":"span"},{"text":"on state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"is defined as ","element":"span"},{"style":{"height":17.39},"width":702,"height":43.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/4-11.png","element":"img","alt":" V s⋆ = minx maxy V sx,y = maxy minx V sx,y.","inline":true,"padRight":true},{"text":"We ","element":"span"},{"text":"call a pair of policies ","element":"span"},{"style":{"height":16},"width":120.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/4-12.png","element":"img","alt":" 
(x⋆, y⋆)","inline":true,"padRight":true},{"text":"a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Nash equilibrium ","element":"span"},{"text":"if it attains minimax game value of a state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"(such policy pair necessarily attains the minimax game value over all states). The ","element":"span"},{"style":{"fontStyle":"italic"},"text":"duality gap ","element":"span"},{"text":"of ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x, y","element":"span"},{"text":") ","element":"span"},{"text":"is ","element":"span"},{"style":{"height":18.8},"width":549,"height":47,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/4-13.png","element":"img","alt":" maxs (maxy′ V sx,y′ − minx′ V sx′,y)","inline":true},{"text":". The ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":"-function on state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"under policy pair ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x, y","element":"span"},{"text":") ","element":"span"},{"text":"is defined ","element":"span"},{"text":"via ","element":"span"},{"style":{"height":20.35},"width":674.96,"height":50.88,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/4-14.png","element":"img","alt":" Qsx,y(a, b) = Gsa,b + γ · Es′∼P sa,b(·)[V s′x,y]","inline":true},{"text":", which can be rewritten as a matrix ","element":"span"},{"style":{"height":17.14},"width":75.04,"height":42.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/4-15.png","element":"img","alt":" Qsx,y","inline":true,"padRight":true},{"text":"such that ","element":"span"},{"style":{"height":17.79},"width":273.52,"height":44.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/4-16.png","element":"img","alt":"V sx,y = 
xsQsx,yys","inline":true},{"text":". We denote ","element":"span"},{"style":{"height":17.81},"width":203,"height":44.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/4-17.png","element":"img","alt":" Qs⋆ = Qsx⋆,y⋆","inline":true,"padRight":true},{"text":"the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":"-function under a Nash equilibrium ","element":"span"},{"style":{"height":16},"width":120,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/4-18.png","element":"img","alt":" (x⋆, y⋆)","inline":true},{"text":". It is ","element":"span"},{"text":"known that ","element":"span"},{"style":{"height":15.2},"width":45,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/4-19.png","element":"img","alt":" Qs⋆ ","inline":true,"padRight":true},{"text":"is unique for any ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"even when multiple equilibria exist.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Uncoupled Learning with Bandit Feedback ","element":"span"},{"text":"We assume the following uncoupled interaction protocol: at each round ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . . 
, T","element":"span"},{"text":", the players both observe the current state ","element":"span"},{"style":{"height":9.79},"width":27.52,"height":24.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/4-20.png","element":"img","alt":" st","inline":true},{"text":", and then, with the policy ","element":"span"},{"style":{"height":14.19},"width":139.48,"height":35.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/4-21.png","element":"img","alt":" xt and yt","inline":true,"padRight":true},{"text":"in mind, they independently choose actions ","element":"span"},{"style":{"height":16.19},"width":342.48,"height":40.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/4-22.png","element":"img","alt":" at ∼ xstt and bt ∼ ystt ","inline":true,"padRight":true},{"text":", respectively. Both ","element":"span"},{"text":"of them then observe ","element":"span"},{"style":{"height":19.01},"width":488,"height":47.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/4-23.png","element":"img","alt":" σt ∈ [0, 1] with E[σt] = Gstat,bt","inline":true},{"text":", and proceed to the next state ","element":"span"},{"style":{"height":19.01},"width":268.48,"height":47.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/4-24.png","element":"img","alt":" st+1 ∼ P stat,bt(·).","inline":true,"padRight":true},{"text":"Importantly, they do not observe each other’s action.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Notions of Convergence ","element":"span"},{"text":"For Markov games with the irreducible assumption (","element":"span"},{"href":"#id-61","text":"Assumption 1","element":"a"},{"text":"), given players’ history of play ","element":"span"},{"style":{"height":17.6},"width":237.04,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/4-25.png","element":"img","alt":" (st, xt, yt)t∈[T ]","inline":true},{"text":", the 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"best-iterate ","element":"span"},{"text":"convergence rate is measured by the average duality gap ","element":"span"},{"style":{"height":21.97},"width":535.12,"height":54.92,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/4-26.png","element":"img","alt":"1T�Tt=1 maxs,x,y (V sxt,y − V sx,yt)","inline":true},{"text":", while the stronger ","element":"span"},{"style":{"fontStyle":"italic"},"text":"last-iterate ","element":"span"},{"text":"convergence ","element":"span"},{"text":"rate is measured by ","element":"span"},{"style":{"height":18.4},"width":410,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/4-27.png","element":"img","alt":" maxs,x,y (V sxT ,y − V sx,yT )","inline":true},{"text":", i.e., the duality gap of ","element":"span"},{"style":{"height":16},"width":134,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/4-28.png","element":"img","alt":" (xT , yT )","inline":true},{"text":". For general Markov ","element":"span"},{"text":"games, we propose the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"path ","element":"span"},{"text":"convergence rate, which is measured by the average duality gap at the visited states with respect to the optimal ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":"-function:","element":"span"},{"style":{"height":24.61},"width":694,"height":61.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/4-29.png","element":"img","alt":"1T�Tt=1 maxx,y (xs⊤tt Qst⋆ yst − xs⊤t Qst⋆ ystt )","inline":true},{"text":". 
","element":"span"},{"text":"We remark that the path convergence guarantee is weaker than the counterpart of the other two notions of convergence in general Markov games, but still provides meaningful implications (see detailed discussion in ","element":"span"},{"href":"#id-19","text":"Section 6.1 ","element":"a"},{"text":"and ","element":"span"},{"text":"Appendix F","element":"span"},{"text":").","element":"span"}]]},{"heading":"4 Matrix Games","paragraphs":[[{"text":"In this section, we consider two-player zero-sum matrix games. We propose ","element":"span"},{"href":"#id-62","text":"Algorithm 1 ","element":"a"},{"text":"for decentralized learning of Nash equilibria. We only present the algorithm for the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":"-player as the algorithm for the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y","element":"span"},{"text":"-player is symmetric.","element":"span"}],[{"id":"id-62","style":{"width":"100%"},"width":1592,"height":498,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/4-30.png","element":"img"}],[{"text":"The algorithm is similar to the E","element":"span"},{"text":"XP","element":"span"},{"text":"3-IX algorithm by [","element":"span"},{"href":"#id-63","referenceIndex":41,"text":"Neu15","element":"a"},{"text":"] that achieves a high-probability regret bound for adversarial multi-armed bandits, but with several modifications. 
First (and most importantly), in addition to the standard loss estimators used in [","element":"span"},{"href":"#id-63","referenceIndex":41,"text":"Neu15","element":"a"},{"text":"], we add another negative term ","element":"span"},{"style":{"height":15.6},"width":138,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/5-0.png","element":"img","alt":" ϵt ln xt,a","inline":true,"padRight":true},{"text":"to the loss estimator of action ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"(see Line ","element":"span"},{"href":"#id-62","text":"5","element":"a"},{"text":"). This is equivalent to the entropy regularization approach in, e.g., [","element":"span"},{"href":"#id-2","referenceIndex":14,"text":"CWC21","element":"a"},{"text":", ","element":"span"},{"href":"#id-1","referenceIndex":11,"text":"CMZ21","element":"a"},{"text":"], since the gradient of the negative entropy ","element":"span"},{"style":{"height":16},"width":116,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/5-1.png","element":"img","alt":"−ϕ(xt)","inline":true,"padRight":true},{"text":"is ","element":"span"},{"style":{"height":16.4},"width":265,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/5-2.png","element":"img","alt":" (ln xt,a + 1)a∈A","inline":true,"padRight":true},{"text":"and the constant ","element":"span"},{"text":"1 ","element":"span"},{"text":"takes no effect in Line ","element":"span"},{"href":"#id-62","text":"6","element":"a"},{"text":". Like [","element":"span"},{"href":"#id-2","referenceIndex":14,"text":"CWC21","element":"a"},{"text":", ","element":"span"},{"href":"#id-1","referenceIndex":11,"text":"CMZ21","element":"a"},{"text":"], the entropy regularization drives last-iterate convergence; however, while their results require full-information feedback, our result holds in the bandit feedback setting. 
The second difference is that instead of choosing the players’ strategies in the full probability simplex ","element":"span"},{"style":{"height":14.78},"width":59.2,"height":36.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/5-3.png","element":"img","alt":" ∆A","inline":true},{"text":", our algorithm chooses from ","element":"span"},{"style":{"height":14},"width":38,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/5-4.png","element":"img","alt":" Ωt","inline":true},{"text":", a subset of ","element":"span"},{"style":{"height":14.8},"width":59.2,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/5-5.png","element":"img","alt":" ∆A","inline":true,"padRight":true},{"text":"where every coordinate is lower bounded by ","element":"span"},{"style":{"height":19.39},"width":52,"height":48.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/5-6.png","element":"img","alt":"1/(At^2)","inline":true,"padRight":true},{"text":". The third is the choice of ","element":"span"},{"text":"the learning rate ","element":"span"},{"style":{"height":10.8},"width":29.52,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/5-7.png","element":"img","alt":" ηt","inline":true},{"text":", clipping factor ","element":"span"},{"style":{"height":14.59},"width":32.48,"height":36.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/5-8.png","element":"img","alt":" βt","inline":true},{"text":", and the amount of regularization ","element":"span"},{"style":{"height":9.6},"width":25.52,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/5-9.png","element":"img","alt":" ϵt","inline":true},{"text":". 
The main result of this section is the following last-iterate convergence rate of ","element":"span"},{"href":"#id-62","text":"Algorithm 1","element":"a"},{"text":".","element":"span"}],[{"id":"id-104","style":{"fontWeight":"bold"},"text":"Theorem 1 ","element":"span"},{"text":"(Last-Iterate Convergence Rate)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"href":"#id-62","style":{"fontStyle":"italic"},"text":"Algorithm 1 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"guarantees with probability at least ","element":"span"},{"style":{"height":16},"width":385.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/5-10.png","element":"img","alt":"1 − O(δ), for any t ≥ 1,","inline":true}],[{"style":{"width":"57%"},"width":916,"height":84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/5-11.png","element":"img"}],[{"href":"#id-62","text":"Algorithm 1 ","element":"a"},{"text":"also guarantees ","element":"span"},{"style":{"height":19.41},"width":122.52,"height":48.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/5-12.png","element":"img","alt":" O(t^{−1/8})","inline":true,"padRight":true},{"text":"regret even when the other player is adversarial. If we only target an ","element":"span"},{"style":{"fontStyle":"italic"},"text":"expected ","element":"span"},{"text":"bound instead of a high-probability bound, the last-iterate convergence rate can be improved to ","element":"span"},{"style":{"height":19.41},"width":351,"height":48.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/5-13.png","element":"img","alt":" O(√A ln^{3/2}(At) t^{−1/6})","inline":true},{"text":". 
The details are provided in ","element":"span"},{"text":"Appendix C","element":"span"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"4.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Analysis Overview","element":"span"}],[{"text":"We define a regularized zero-sum game with loss function ","element":"span"},{"style":{"height":16},"width":612.12,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/5-14.png","element":"img","alt":" ft(x, y) = x⊤Gy − ϵtϕ(x) + ϵtϕ(y)","inline":true,"padRight":true},{"text":"over domain ","element":"span"},{"style":{"height":13.81},"width":129.48,"height":34.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/5-15.png","element":"img","alt":" Ωt × Ωt","inline":true},{"text":", and denote by ","element":"span"},{"style":{"height":16},"width":215.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/5-16.png","element":"img","alt":" z⋆t = (x⋆t , y⋆t )","inline":true,"padRight":true},{"text":"its unique Nash equilibrium since ","element":"span"},{"style":{"height":14.8},"width":28.48,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/5-17.png","element":"img","alt":" ft ","inline":true,"padRight":true},{"text":"is strongly ","element":"span"},{"text":"convex-strongly concave. 
The regularized game is a slight perturbation of the original matrix game ","element":"span"},{"style":{"fontStyle":"italic"},"text":"G ","element":"span"},{"text":"over a smaller domain ","element":"span"},{"style":{"height":13.81},"width":130,"height":34.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/5-18.png","element":"img","alt":" Ωt × Ωt","inline":true},{"text":", and we prove that ","element":"span"},{"style":{"height":15.79},"width":222,"height":39.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/5-19.png","element":"img","alt":" z⋆t is an O(ϵt)","inline":true},{"text":"-approximate Nash equilibrium of ","element":"span"},{"text":"the original matrix game ","element":"span"},{"style":{"fontStyle":"italic"},"text":"G ","element":"span"},{"text":"(","element":"span"},{"href":"#id-64","text":"Lemma 9","element":"a"},{"text":"). Therefore, it suffices to bound KL","element":"span"},{"style":{"height":16.19},"width":112.48,"height":40.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/5-20.png","element":"img","alt":"(z⋆t , zt)","inline":true,"padRight":true},{"text":"since the duality ","element":"span"},{"text":"gap of ","element":"span"},{"style":{"height":9.79},"width":28,"height":24.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/5-21.png","element":"img","alt":" zt","inline":true,"padRight":true},{"text":"is at most ","element":"span"},{"style":{"height":19.39},"width":361.52,"height":48.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/5-22.png","element":"img","alt":" O(√KL(z⋆t , zt) + ϵt).","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Step 1: Single-Step Analysis ","element":"span"},{"text":"We start with a single-step analysis of ","element":"span"},{"href":"#id-62","text":"Algorithm 1","element":"a"},{"text":", which 
shows:","element":"span"}],[{"style":{"width":"88%"},"width":1408,"height":114,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/5-23.png","element":"img"}],[{"text":"where we define ","element":"span"},{"style":{"height":17.55},"width":596.8,"height":43.88,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/5-24.png","element":"img","alt":" vt = KL(z⋆t+1, zt+1) − KL(z⋆t , zt+1)","inline":true,"padRight":true},{"text":"(see ","element":"span"},{"text":"Appendix B ","element":"span"},{"text":"for definitions of ","element":"span"},{"style":{"height":14.4},"width":130.52,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/5-25.png","element":"img","alt":" λt, ξt, ζt","inline":true},{"text":"). ","element":"span"},{"text":"The instability penalty comes from a local norm of the gradient estimator ","element":"span"},{"style":{"height":10.4},"width":29.52,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/5-26.png","element":"img","alt":" gt","inline":true},{"text":". The estimation error comes from the bias between the gradient estimator ","element":"span"},{"style":{"height":10.4},"width":29.52,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/5-27.png","element":"img","alt":" gt","inline":true,"padRight":true},{"text":"and the real gradient ","element":"span"},{"style":{"height":14.8},"width":60,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/5-28.png","element":"img","alt":" Gyt","inline":true},{"text":". 
We pay the last term ","element":"span"},{"style":{"height":9.6},"width":29,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/5-29.png","element":"img","alt":" vt","inline":true,"padRight":true},{"text":"since the Nash equilibrium ","element":"span"},{"style":{"height":15.01},"width":33,"height":37.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/5-30.png","element":"img","alt":" z⋆t ","inline":true,"padRight":true},{"text":"of the regularized game ","element":"span"},{"style":{"height":14.8},"width":28.52,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/5-31.png","element":"img","alt":" ft ","inline":true,"padRight":true},{"text":"is changing over time.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Step 2: Strategy Convergence to NE of the Regularized Game ","element":"span"},{"text":"Expanding the above recursion up to ","element":"span"},{"style":{"height":13.6},"width":157,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/5-32.png","element":"img","alt":" t0, we get","inline":true}],[{"id":"id-69","style":{"width":"97%"},"width":1546,"height":184,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/5-33.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":22.21},"width":391,"height":55.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/5-34.png","element":"img","alt":" w_t^i ≜ ∏_{j=i+1}^{t} (1 − ηjϵj)","inline":true},{"text":". 
To upper bound ","element":"span"},{"style":{"fontWeight":"bold"},"text":"term","element":"span"},{"style":{"height":7.41},"width":11,"height":18.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/5-35.png","element":"img","alt":"1","inline":true},{"text":"-","element":"span"},{"style":{"fontWeight":"bold"},"text":"term","element":"span"},{"style":{"height":7.6},"width":14,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/5-36.png","element":"img","alt":"4","inline":true},{"text":", we apply careful sequence analysis ","element":"span"},{"text":"(","element":"span"},{"href":"#id-65","text":"Appendix A.1","element":"a"},{"text":") and properties of the E","element":"span"},{"text":"XP","element":"span"},{"text":"3-IX algorithm with changing step size (","element":"span"},{"href":"#id-66","text":"Appendix A.2","element":"a"},{"text":"). The analysis of ","element":"span"},{"style":{"fontWeight":"bold"},"text":"term","element":"span"},{"style":{"height":7.6},"width":12.96,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/5-37.png","element":"img","alt":"5","inline":true,"padRight":true},{"text":"uses ","element":"span"},{"href":"#id-67","text":"Lemma 13","element":"a"},{"text":", which states ","element":"span"},{"style":{"height":17.55},"width":679.28,"height":43.88,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/5-38.png","element":"img","alt":" vt = KL(z⋆t+1, zt+1) − KL(z⋆t , zt+1) ≤","inline":true},{"style":{"height":24},"width":614.48,"height":60,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/5-39.png","element":"img","alt":"O(ln(At)∥z⋆t+1 − z⋆t ∥1) = O(ln^2(At)/t)","inline":true,"padRight":true},{"text":"and is slightly involved as ","element":"span"},{"style":{"height":14},"width":145,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/5-40.png","element":"img","alt":" Ωt and ϵt","inline":true,"padRight":true},{"text":"are 
both changing. With these steps, we conclude that with probability at least ","element":"span"},{"style":{"height":28.99},"width":748.52,"height":72.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/5-41.png","element":"img","alt":" 1 − O(δ), KL(z⋆t , zt) = O(A ln^3(At/δ) t^{−1/4}).","inline":true}]]},{"heading":"5 Irreducible Markov Games","paragraphs":[[{"text":"We now extend our results on matrix games to two-player zero-sum Markov games. Similarly to many previous works, our first result makes the assumption that the Markov game is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"irreducible ","element":"span"},{"text":"with bounded travel time between any pair of states. The assumption is formally stated below:","element":"span"}],[{"id":"id-61","style":{"fontWeight":"bold"},"text":"Assumption 1 ","element":"span"},{"text":"(Irreducible Game)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"We assume that under any pair of stationary policies of the two players, and any pair of states ","element":"span"},{"style":{"height":15.2},"width":62,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/6-0.png","element":"img","alt":" s, s′","inline":true},{"style":{"fontStyle":"italic"},"text":", the expected time to reach ","element":"span"},{"style":{"height":15.6},"width":138.48,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/6-1.png","element":"img","alt":" s′ from s","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is upper bounded by ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"text":"We propose ","element":"span"},{"href":"#id-68","text":"Algorithm 2 ","element":"a"},{"text":"for uncoupled learning in irreducible two-player zero-sum games, which is closely related to the Nash-V algorithm by 
[","element":"span"},{"href":"#id-12","referenceIndex":4,"text":"BJY20","element":"a"},{"text":"], but with additional entropy regularization. It can also be seen as players using ","element":"span"},{"href":"#id-62","text":"Algorithm 1 ","element":"a"},{"text":"on each state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"to update the policies ","element":"span"},{"style":{"height":16.21},"width":282.52,"height":40.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/6-2.png","element":"img","alt":" (xst, yst ) whenever","inline":true,"padRight":true},{"text":"state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"is visited, but with ","element":"span"},{"style":{"height":17.39},"width":197,"height":43.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/6-3.png","element":"img","alt":" σt + γV st+1t","inline":true},{"text":"as the observed loss to construct loss estimators. Importantly, ","element":"span"},{"style":{"height":15.01},"width":175,"height":37.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/6-4.png","element":"img","alt":"V s1 , V s2 , . . .","inline":true,"padRight":true},{"text":"is a slowly changing sequence of value estimations that ensures stable policy updates ","element":"span"},{"text":"[","element":"span"},{"href":"#id-12","referenceIndex":4,"text":"BJY20","element":"a"},{"text":", ","element":"span"},{"href":"#id-0","referenceIndex":59,"text":"WLZL21a","element":"a"},{"text":", ","element":"span"},{"href":"#id-9","referenceIndex":51,"text":"SZL","element":"a"},{"style":{"height":12.8},"width":60.48,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/6-5.png","element":"img","alt":"+21","inline":true},{"text":"]. 
Note that in ","element":"span"},{"href":"#id-68","text":"Algorithm 2","element":"a"},{"text":", the updates of ","element":"span"},{"style":{"height":15.2},"width":44,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/6-6.png","element":"img","alt":" V st","inline":true},{"text":"only use players’ local ","element":"span"},{"text":"information (Line ","element":"span"},{"href":"#id-68","text":"8","element":"a"},{"text":").","element":"span"}],[{"id":"id-68","style":{"width":"100%"},"width":1596,"height":680,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/6-7.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Comparison to Previous Works ","element":"span"},{"text":"Although ","element":"span"},{"href":"#id-68","text":"Algorithm 2 ","element":"a"},{"text":"shares similarity with previous works that also use entropy regularization, we believe that both the design and the analysis of our algorithm are novel and non-trivial. To the best of our knowledge, all previous entropy regularized two-player zero-sum Markov game algorithms are coupled (e.g., [","element":"span"},{"href":"#id-2","referenceIndex":14,"text":"CWC21","element":"a"},{"text":", ","element":"span"},{"href":"#id-1","referenceIndex":11,"text":"CMZ21","element":"a"},{"text":", ","element":"span"},{"href":"#id-25","referenceIndex":9,"text":"CCDX23","element":"a"},{"text":"]), while ours is the first that achieves uncoupledness under entropy regularization. We further discuss this by comparing our algorithm to those in [","element":"span"},{"href":"#id-25","referenceIndex":9,"text":"CCDX23","element":"a"},{"text":"], highlighting the new technical challenges we encounter.","element":"span"}],[{"text":"The entropy-regularized OMWU algorithm in [","element":"span"},{"href":"#id-25","referenceIndex":9,"text":"CCDX23","element":"a"},{"text":"] is tailored to the full-information setting. 
Moreover, in the value function update step both players need to know the entropy value of the other player’s policy, which is unnatural. Indeed, the authors explicitly present the removal of this information sharing as an open question. We answer this open question affirmatively by giving a fully decentralized algorithm for zero-sum Markov games with provable last-iterate convergence rates. In ","element":"span"},{"href":"#id-68","text":"Algorithm 2 ","element":"a"},{"text":"(Line ","element":"span"},{"href":"#id-68","text":"8","element":"a"},{"text":"), the update of the value function ","element":"span"},{"style":{"fontStyle":"italic"},"text":"V ","element":"span"},{"text":"is simple and does not require any entropy information: ","element":"span"},{"style":{"height":18.61},"width":681.48,"height":46.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/6-8.png","element":"img","alt":" V stt+1 ← (1 − ατ)V stt + ατ(σt + γV st+1t )","inline":true},{"text":". This modification results in a discrepancy between the policy update and the value update. While the policy now incorporates a regularization term, the value function does not. Such a mismatch is unprecedented in earlier studies and necessitates a non-trivial approach to resolve. Additionally, ","element":"span"},{"href":"#id-68","text":"Algorithm 2 ","element":"a"},{"text":"operates on bandit feedback instead of full-information feedback, presenting further technical challenges.","element":"span"}],[{"href":"#id-68","text":"Algorithm 2 ","element":"a"},{"text":"also offers improvement over the uncoupled algorithm of [","element":"span"},{"href":"#id-0","referenceIndex":59,"text":"WLZL21a","element":"a"},{"text":"]. 
The algorithm of [","element":"span"},{"href":"#id-0","referenceIndex":59,"text":"WLZL21a","element":"a"},{"text":"] requires coordinated policy updates, in which the players interact with each other using the current policy for several iterations to get an approximately accurate gradient (the number of iterations required depends on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"as defined in ","element":"span"},{"href":"#id-61","text":"Assumption 1","element":"a"},{"text":"), and then simultaneously update the policy pair on all states. We do not require such unnatural coordination between the players or prior knowledge of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":".","element":"span"}],[{"text":"Our main result is the following theorem on the last-iterate convergence rate of ","element":"span"},{"href":"#id-68","text":"Algorithm 2","element":"a"},{"text":".","element":"span"}],[{"id":"id-73","style":{"fontWeight":"bold"},"text":"Theorem 2 ","element":"span"},{"text":"(Last-Iterate Convergence Rate)","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"For any ","element":"span"},{"style":{"height":14.61},"width":147.52,"height":36.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/6-9.png","element":"img","alt":" ε, δ > 0","inline":true},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"href":"#id-68","style":{"fontStyle":"italic"},"text":"Algorithm 2 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"with ","element":"span"},{"style":{"height":20.61},"width":175.52,"height":51.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/6-10.png","element":"img","alt":" kα = 9/(9+ε)","inline":true},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"style":{"height":20.4},"width":336.48,"height":51,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/6-11.png","element":"img","alt":"kϵ = 1/(9+ε), kβ = 3/(9+ε)","inline":true},{"style":{"fontStyle":"italic"},"text":", and ","element":"span"},{"style":{"height":20.61},"width":155.48,"height":51.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/6-12.png","element":"img","alt":" kη = 5/(9+ε)","inline":true},{"style":{"fontStyle":"italic"},"text":"guarantees, with probability at least ","element":"span"},{"style":{"height":16},"width":146.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/6-13.png","element":"img","alt":" 1 − O(δ)","inline":true},{"style":{"fontStyle":"italic"},"text":", for any time ","element":"span"},{"style":{"height":12.8},"width":92.52,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/6-14.png","element":"img","alt":"t ≥ 1,","inline":true}],[{"style":{"width":"82%"},"width":1310,"height":110,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/6-15.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"5.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Analysis Overview","element":"span"}],[{"text":"We introduce some 
notations for simplicity. We denote by ","element":"span"},{"style":{"height":16},"width":378.56,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/7-0.png","element":"img","alt":" Es′∼P s[V s′t ] the A × A","inline":true,"padRight":true},{"text":"matrix such that ","element":"span"},{"style":{"height":23.01},"width":555,"height":57.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/7-1.png","element":"img","alt":"(Es′∼P s[V s′t ])a,b = Es′∼P sa,b[V s′t ]","inline":true},{"text":". Let ","element":"span"},{"style":{"height":16},"width":79.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/7-2.png","element":"img","alt":" tτ(s)","inline":true,"padRight":true},{"text":"be the ","element":"span"},{"style":{"height":7.2},"width":20,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/7-3.png","element":"img","alt":" τ","inline":true},{"text":"-th time the players visit state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":", and define ","element":"span"},{"style":{"height":19.22},"width":454.52,"height":48.04,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/7-4.png","element":"img","alt":"ˆxsτ = xstτ (s) and ˆysτ = ystτ (s)","inline":true},{"text":". Then, define the regularized game for each state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"via the loss function ","element":"span"},{"style":{"height":22.4},"width":962,"height":56,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/7-5.png","element":"img","alt":"f sτ (x, y) = x⊤(Gs + γEs′∼P s[V s′tτ (s)])y − ϵτϕ(x) + ϵτϕ(y)","inline":true},{"text":". 
Furthermore, let ","element":"span"},{"style":{"height":16},"width":266.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/7-6.png","element":"img","alt":" ˆzsτ⋆ = (ˆxsτ⋆, ˆysτ⋆)","inline":true,"padRight":true},{"text":"be ","element":"span"},{"text":"the equilibrium of ","element":"span"},{"style":{"height":16.19},"width":364,"height":40.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/7-7.png","element":"img","alt":" f sτ (x, y) over Ωτ × Ωτ","inline":true},{"text":". In the following analysis, we fix some ","element":"span"},{"style":{"height":13.01},"width":94,"height":32.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/7-8.png","element":"img","alt":" t ≥ 1.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Step 1: Policy Convergence to NE of Regularized Game ","element":"span"},{"text":"Using similar techniques to Step 1 and Step 2 in the analysis of ","element":"span"},{"href":"#id-62","text":"Algorithm 1","element":"a"},{"text":", we can upper bound KL","element":"span"},{"style":{"height":17.55},"width":223.12,"height":43.88,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/7-9.png","element":"img","alt":"(ˆzsτ+1⋆, ˆzsτ+1)","inline":true,"padRight":true},{"text":"like ","element":"span"},{"href":"#id-69","text":"Eq. (1) ","element":"a"},{"text":"with ","element":"span"},{"text":"similar subsequent analysis for ","element":"span"},{"style":{"fontWeight":"bold"},"text":"term","element":"span"},{"style":{"height":12.4},"width":128.76,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/7-10.png","element":"img","alt":"1-term4","inline":true},{"text":". 
The analysis for ","element":"span"},{"style":{"fontWeight":"bold"},"text":"term","element":"span"},{"style":{"height":17.74},"width":520.28,"height":44.36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/7-11.png","element":"img","alt":"5 where vsi = KL(ˆzsi+1⋆, ˆzsi+1)−","inline":true,"padRight":true},{"text":"KL","element":"span"},{"style":{"height":17.74},"width":169.48,"height":44.36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/7-12.png","element":"img","alt":"(ˆzsi⋆, ˆzsi+1)","inline":true,"padRight":true},{"text":"is more challenging compared to the matrix game case since here ","element":"span"},{"style":{"height":18.74},"width":87.4,"height":46.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/7-13.png","element":"img","alt":" V sti(s)","inline":true,"padRight":true},{"text":"is changing ","element":"span"},{"text":"between two visits to state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":". 
To handle this term, we leverage the following facts for any ","element":"span"},{"style":{"height":15.01},"width":156,"height":37.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/7-14.png","element":"img","alt":" s′: (1) the","inline":true,"padRight":true},{"text":"irreducibility assumption ensures that ","element":"span"},{"style":{"height":16},"width":539.56,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/7-15.png","element":"img","alt":" ti+1(s) − ti(s) ≤ O(L ln(St/δ))","inline":true,"padRight":true},{"text":"and thus the number of updates of the value function at state ","element":"span"},{"style":{"height":12.19},"width":25.48,"height":30.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/7-16.png","element":"img","alt":" s′","inline":true},{"text":"is bounded; (2) until time ","element":"span"},{"style":{"height":16},"width":155.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/7-17.png","element":"img","alt":" ti(s) ≥ i","inline":true},{"text":", state ","element":"span"},{"style":{"height":6.8},"width":32.68,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/7-18.png","element":"img","alt":" s′","inline":true,"padRight":true},{"text":"has been visited at least ","element":"span"},{"style":{"height":21.78},"width":213.6,"height":54.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/7-19.png","element":"img","alt":" Ω(i / (L ln(St/δ)))","inline":true,"padRight":true},{"text":"times, and thus each change of the value function between ","element":"span"},{"style":{"height":16},"width":272,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/7-20.png","element":"img","alt":" ti(s) and ti+1(s)","inline":true,"padRight":true},{"text":"is at most 
","element":"span"},{"style":{"height":21.78},"width":312.72,"height":54.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/7-21.png","element":"img","alt":"O((i / (L ln(St/δ)))^(−kα))","inline":true},{"text":". With these arguments, we can bound ","element":"span"},{"style":{"fontWeight":"bold"},"text":"term","element":"span"},{"style":{"height":20.22},"width":596.32,"height":50.56,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/7-22.png","element":"img","alt":"5 by O(ln^4(SAt/δ) L τ^(−kα+kη+2kϵ)).","inline":true,"padRight":true},{"text":"Overall, we have the following policy convergence to the NE of the regularized game (","element":"span"},{"href":"#id-70","text":"Lemma 17","element":"a"},{"text":"): KL","element":"span"},{"style":{"height":20.22},"width":1469.36,"height":50.56,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/7-23.png","element":"img","alt":"(ˆzsτ⋆, ˆzsτ) ≤ O(A ln^4(SAt/δ) L τ^(−k♯)), where k♯ = min{kβ − kϵ, kη − kβ, kα − kη − 2kϵ}.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Step 2: Value Convergence ","element":"span"},{"text":"Unlike in matrix games, policy convergence to the NE of the regularized game is not enough for convergence in duality gap. 
We also need to bound ","element":"span"},{"style":{"height":16},"width":170.84,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/7-24.png","element":"img","alt":" |V st − V s⋆ |","inline":true,"padRight":true},{"text":"since ","element":"span"},{"text":"the regularized game is defined using ","element":"span"},{"style":{"height":15.2},"width":44,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/7-25.png","element":"img","alt":" V st","inline":true},{"text":", the value function maintained by the algorithm, ","element":"span"},{"text":"instead of the minimax game value ","element":"span"},{"style":{"height":14.61},"width":43.48,"height":36.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/7-26.png","element":"img","alt":" V s⋆","inline":true},{"text":". We use the following weighted regret quantities as a proxy: ","element":"span"},{"text":"Reg","element":"span"},{"style":{"height":19.2},"width":1520.92,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/7-27.png","element":"img","alt":"sτ ≜ max_{x,y} ( Σ_{i=1}^τ αiτ (f si (ˆxsi, ˆysi ) − f si (xs, ˆysi )) , Σ_{i=1}^τ αiτ (f si (ˆxsi, ysi ) − f si (ˆxsi, ˆysi )) ), where","inline":true},{"style":{"height":20.16},"width":417,"height":50.4,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/7-28.png","element":"img","alt":"αiτ = αi Π_{j=i+1}^τ (1 − αj)","inline":true},{"text":". We can upper bound the weighted regret Reg","element":"span"},{"style":{"height":15.81},"width":15.52,"height":39.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/7-29.png","element":"img","alt":"sτ ","inline":true,"padRight":true},{"text":"using an analysis similar to that ","element":"span"},{"text":"in Step 1 (","element":"span"},{"href":"#id-71","text":"Lemma 19","element":"a"},{"text":"). 
We then show a contraction for ","element":"span"},{"style":{"height":19.94},"width":215.08,"height":49.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/7-30.png","element":"img","alt":" |V stτ (s) − V s⋆ |","inline":true,"padRight":true},{"text":"with the weighted regret ","element":"span"},{"text":"quantities: ","element":"span"},{"style":{"height":22.75},"width":1062.68,"height":56.88,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/7-31.png","element":"img","alt":" |V stτ (s) − V s⋆ | ≤ γ Σ_{i=1}^τ αiτ max_{s′} |V s′ti(s) − V s′⋆ | + ˜O(ϵτ + Regsτ)","inline":true},{"text":". This leads to the following ","element":"span"},{"text":"convergence of ","element":"span"},{"href":"#id-72","style":{"height":19.6},"width":1273,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/7-32.png","element":"img","alt":" V st (Lemma 20): |V st − V s⋆ | ≤ ˜O(t^(−k∗)), where k∗ = min {kη, kβ, kα − kβ, kϵ}.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Obtaining Last-Iterate Convergence Rate ","element":"span"},{"text":"Fix any ","element":"span"},{"style":{"height":11.6},"width":161.88,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/7-33.png","element":"img","alt":" t and let τ","inline":true,"padRight":true},{"text":"be the number of visits to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"before time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":". 
So far we have shown (1) policy convergence of KL","element":"span"},{"style":{"height":16},"width":142.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/7-34.png","element":"img","alt":"(ˆzsτ⋆, ˆzsτ)","inline":true,"padRight":true},{"text":"in the regularized game; and (2) ","element":"span"},{"text":"value convergence of ","element":"span"},{"style":{"height":16},"width":169.04,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/7-35.png","element":"img","alt":" |V st − V s⋆ |","inline":true},{"text":". Combining the fact that the regularized game is at most ","element":"span"},{"style":{"height":16},"width":289.56,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/7-36.png","element":"img","alt":" O(ϵτ +|V st −V s⋆ |)","inline":true,"padRight":true},{"text":"away from the minimax game matrix ","element":"span"},{"style":{"height":14.61},"width":45,"height":36.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/7-37.png","element":"img","alt":" Q⋆ ","inline":true,"padRight":true},{"text":"with appropriate choices of parameters proves ","element":"span"},{"href":"#id-73","text":"Theorem 2","element":"a"},{"text":".","element":"span"}]]},{"heading":"6 General Markov Games","paragraphs":[[{"text":"In this section, we consider general two-player zero-sum Markov games without ","element":"span"},{"href":"#id-61","text":"Assumption 1","element":"a"},{"text":". We propose ","element":"span"},{"href":"#id-74","text":"Algorithm 3","element":"a"},{"text":", an uncoupled learning algorithm that handles exploration and has a path convergence rate. 
Compared to ","element":"span"},{"href":"#id-68","text":"Algorithm 2","element":"a"},{"text":", the update of the value function in ","element":"span"},{"href":"#id-74","text":"Algorithm 3 ","element":"a"},{"text":"uses a bonus term ","element":"span"},{"style":{"height":14.19},"width":79,"height":35.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/7-38.png","element":"img","alt":" bnsτ","inline":true,"padRight":true},{"text":"based on the optimism principle to handle exploration.","element":"span"}],[{"href":"#id-75","text":"Theorem 3 ","element":"a"},{"text":"below implies that we can achieve a ","element":"span"},{"style":{"height":20.72},"width":770.36,"height":51.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/7-39.png","element":"img","alt":"(1/t) Σ_{τ=1}^t max_{x,y} (xs⊤ττ Qsτ⋆ ysτ − xs⊤τ Qsτ⋆ ysττ ) =","inline":true},{"style":{"height":19.68},"width":142.32,"height":49.2,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/7-40.png","element":"img","alt":"O(t^(−1/10))","inline":true,"padRight":true},{"text":"path convergence rate if we use the doubling trick to tune down ","element":"span"},{"style":{"fontStyle":"italic"},"text":"u ","element":"span"},{"text":"at a rate of ","element":"span"},{"style":{"height":15.81},"width":84.52,"height":39.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/7-41.png","element":"img","alt":" t^(−1/10).","inline":true}],[{"id":"id-75","style":{"fontWeight":"bold"},"text":"Theorem 3. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"For any ","element":"span"},{"style":{"height":28.8},"width":220.04,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/7-42.png","element":"img","alt":" u ∈�0, 11−γ�","inline":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":13.2},"width":106.2,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/7-43.png","element":"img","alt":" T ≥ 1","inline":true},{"style":{"fontStyle":"italic"},"text":", there exists a proper choice of parameters ","element":"span"},{"style":{"height":14.8},"width":95,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/7-44.png","element":"img","alt":" ϵ, β, η","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"such that ","element":"span"},{"href":"#id-74","style":{"fontStyle":"italic"},"text":"Algorithm 3 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"guarantees with probability at least ","element":"span"},{"style":{"height":16},"width":154.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/7-45.png","element":"img","alt":" 1 − O(δ),","inline":true}],[{"id":"id-80","style":{"width":"89%"},"width":1420,"height":122,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/7-46.png","element":"img"}],[{"id":"id-74","style":{"width":"100%"},"width":1586,"height":704,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/8-0.png","element":"img"}],[{"id":"id-19","style":{"fontWeight":"bold"},"text":"6.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Path Convergence","element":"span"}],[{"text":"Path convergence has multiple meaningful game-theoretic implications. 
By definition, it implies that frequent visits to a state bring players’ policies closer to equilibrium, leading to both players using near-equilibrium policies for all but ","element":"span"},{"style":{"fontStyle":"italic"},"text":"o","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":") ","element":"span"},{"text":"steps over time.","element":"span"}],[{"text":"Path convergence also implies that both players have no regret compared to the game value ","element":"span"},{"style":{"height":15.01},"width":43.52,"height":37.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/8-1.png","element":"img","alt":" V s⋆","inline":true,"padRight":true},{"text":", ","element":"span"},{"text":"which has been considered and motivated in previous works such as [","element":"span"},{"href":"#id-33","referenceIndex":7,"text":"BT02","element":"a"},{"text":", ","element":"span"},{"href":"#id-76","referenceIndex":54,"text":"TWYS20","element":"a"},{"text":"]. To see this, we apply the results to the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"episodic ","element":"span"},{"text":"setting, where in every step, with probability ","element":"span"},{"style":{"height":14.4},"width":90.64,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/8-2.png","element":"img","alt":" 1 − γ","inline":true},{"text":", the state is redrawn from ","element":"span"},{"style":{"height":10.59},"width":90.48,"height":26.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/8-3.png","element":"img","alt":" s ∼ ρ","inline":true,"padRight":true},{"text":"for some initial distribution ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/8-4.png","element":"img","alt":" ρ","inline":true},{"text":". 
If the learning dynamics enjoy path convergence, then ","element":"span"},{"style":{"height":20.42},"width":842.8,"height":51.04,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/8-5.png","element":"img","alt":" E[Σ_{t=1}^T xs⊤tt Gstystt ] = (1 − γ)Es∼ρ[V s⋆ ]T ± o(T)","inline":true},{"text":". Hence the one-step average reward is ","element":"span"},{"style":{"height":16.8},"width":279.72,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/8-6.png","element":"img","alt":"(1 − γ)Es∼ρ[V s⋆ ]","inline":true,"padRight":true},{"text":"and both players have no regret compared to the game value. A more important ","element":"span"},{"text":"implication of path convergence is that it guarantees stability of players’ policies, whereas cycling behaviour is inevitable for any FTRL-type algorithm, even in zero-sum matrix games [","element":"span"},{"href":"#id-5","referenceIndex":39,"text":"MPP18","element":"a"},{"text":", ","element":"span"},{"href":"#id-6","referenceIndex":6,"text":"BP18","element":"a"},{"text":"]. We defer the proof and further discussion of path convergence to ","element":"span"},{"text":"Appendix F","element":"span"},{"text":".","element":"span"}],[{"text":"Finally, we remark that our algorithm is built upon Nash V-learning [","element":"span"},{"href":"#id-12","referenceIndex":4,"text":"BJY20","element":"a"},{"text":"], so it inherits properties of Nash V-learning, e.g., one can still output near-equilibrium policies through policy averaging [","element":"span"},{"href":"#id-43","referenceIndex":31,"text":"JLWY21","element":"a"},{"text":"], or have no regret compared to the game value when competing with an arbitrary opponent [","element":"span"},{"href":"#id-76","referenceIndex":54,"text":"TWYS20","element":"a"},{"text":"]. 
We demonstrate extra benefits brought by entropy regularization regarding the stability of the dynamics.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"6.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Analysis Overview of ","element":"span"},{"href":"#id-75","style":{"fontWeight":"bold"},"text":"Theorem 3","element":"a"}],[{"text":"For general Markov games, it no longer holds that every state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"is visited often, and thus the analysis is much more challenging. We first define two regularized games based on ","element":"span"},{"style":{"height":16},"width":45.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/8-7.png","element":"img","alt":" V st","inline":true},{"text":"and the ","element":"span"},{"text":"corresponding quantity ","element":"span"},{"style":{"height":19.41},"width":45.48,"height":48.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/8-8.png","element":"img","alt":" Vst","inline":true},{"text":"for the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y","element":"span"},{"text":"-player. Define ","element":"span"},{"style":{"height":16},"width":205.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/8-9.png","element":"img","alt":" tτ(s), ˆxsτ, ˆysτ","inline":true,"padRight":true},{"text":"the same way as in the previous ","element":"span"},{"text":"section. 
Then define ","element":"span"},{"style":{"height":22.21},"width":1266.4,"height":55.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/8-10.png","element":"img","alt":" f sτ(x, y) ≜ x⊤(Gs +γEs′∼P s[V s′tτ (s)])y−ϵϕ(x)+ϵϕ(y), fsτ(x, y) ≜ x⊤(Gs +","inline":true},{"style":{"height":21.7},"width":602.12,"height":54.24,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/8-11.png","element":"img","alt":"γEs′∼P s[Vs′tτ (s)])y − ϵϕ(x) + ϵϕ(y)","inline":true,"padRight":true},{"text":"and denote ","element":"span"},{"style":{"height":19.74},"width":771.24,"height":49.36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/8-12.png","element":"img","alt":" Jt = maxx,y(xs⊤tt (Gst + γEs′∼P st [Vs′t ]yst −","inline":true}],[{"id":"id-79","style":{"width":"100%"},"width":1592,"height":162,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/8-13.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Value Convergence: Bounding ","element":"span"},{"style":{"height":15.74},"width":150.64,"height":39.36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/8-14.png","element":"img","alt":" V st − V s⋆","inline":true,"padRight":true},{"style":{"fontWeight":"bold"},"text":"and ","element":"span"},{"style":{"height":19.2},"width":147.48,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/8-15.png","element":"img","alt":" V s⋆ − Vst","inline":true},{"text":"This step is similar to Step 2 in the analysis of ","element":"span"},{"href":"#id-68","text":"Algorithm 2","element":"a"},{"text":". 
We first show an upper bound on the weighted regret (","element":"span"},{"href":"#id-77","text":"Lemma 23","element":"a"},{"text":"): ","element":"span"},{"style":{"height":20.96},"width":745.88,"height":52.4,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/8-16.png","element":"img","alt":"Σ_{i=1}^τ α^i_τ (f̲^s_i(x̂^s_i, ŷ^s_i) − f̲^s_i(x^s, ŷ^s_i)) ≤ (1/2) bns_τ","inline":true},{"text":", where ","element":"span"},{"style":{"height":20.18},"width":440.76,"height":50.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/8-17.png","element":"img","alt":" α^i_τ = α_i Π_{j=i+1}^τ (1 − α_j)","inline":true},{"text":". Note that the ","element":"span"},{"text":"value function ","element":"span"},{"style":{"height":16.21},"width":46,"height":40.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/8-18.png","element":"img","alt":" V̲^s_t ","inline":true,"padRight":true},{"text":"is updated using ","element":"span"},{"style":{"height":18.61},"width":331.44,"height":46.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/8-19.png","element":"img","alt":" σ_t + γV̲^{s_{t+1}}_t − bns_τ","inline":true},{"text":". Thus when relating ","element":"span"},{"style":{"height":16},"width":164.28,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/8-20.png","element":"img","alt":" |V̲^s_t − V^s_⋆|","inline":true,"padRight":true},{"text":"to the regret, ","element":"span"},{"text":"the regret term and the bonus term cancel out and we get ","element":"span"},{"href":"#id-78","style":{"height":24.03},"width":680.56,"height":60.08,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/8-21.png","element":"img","alt":" V̲^s_t ≤ V^s_⋆ + O(ϵ ln(AT)/(1−γ)) (Lemma 26). 
The","inline":true,"padRight":true},{"text":"analysis for ","element":"span"},{"style":{"height":19.2},"width":135.52,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/8-22.png","element":"img","alt":" V^s_⋆ − V̄^s_t ","inline":true,"padRight":true},{"text":"is symmetric. With a proper choice of ","element":"span"},{"style":{"height":7.2},"width":13.48,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/8-23.png","element":"img","alt":" ϵ","inline":true},{"text":", both terms are bounded by ","element":"span"},{"style":{"height":19.2},"width":43,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/8-24.png","element":"img","alt":" u/8","inline":true},{"text":". Combining ","element":"span"},{"text":"the above with ","element":"span"},{"href":"#id-79","text":"Eq. (3)","element":"a"},{"text":", we can upper bound the left-hand side of the desired inequality ","element":"span"},{"href":"#id-80","text":"Eq. (2) ","element":"a"},{"text":"by ","element":"span"},{"style":{"height":21.39},"width":293.52,"height":53.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/9-0.png","element":"img","alt":"Σ_{t=1}^T 1[J_t ≥ 3u/4]","inline":true},{"text":", which is further upper bounded in ","element":"span"},{"href":"#id-81","text":"Eq. 
(29) ","element":"a"},{"text":"by","element":"span"}],[{"id":"id-84","style":{"width":"99%"},"width":1572,"height":268,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/9-1.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Policy Convergence to NE of Regularized Games ","element":"span"},{"text":"To bound the first two terms, we show convergence of the policy ","element":"span"},{"style":{"height":16},"width":131.32,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/9-2.png","element":"img","alt":" (x̂^s_τ, ŷ^s_τ)","inline":true,"padRight":true},{"text":"to Nash equilibria of both games ","element":"span"},{"style":{"height":19.2},"width":40.48,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/9-3.png","element":"img","alt":" f̲^s_τ","inline":true},{"text":"and ","element":"span"},{"style":{"height":19.41},"width":40.52,"height":48.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/9-4.png","element":"img","alt":" f̄^s_τ","inline":true},{"text":". 
To this end, for any fixed ","element":"span"},{"style":{"height":16},"width":167.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/9-5.png","element":"img","alt":"p ∈ [0, 1]","inline":true},{"text":", we define ","element":"span"},{"style":{"height":22.19},"width":401.68,"height":55.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/9-6.png","element":"img","alt":" f^s_τ = p f̲^s_τ + (1 − p) f̄^s_τ","inline":true,"padRight":true},{"text":"and let ","element":"span"},{"style":{"height":16},"width":291.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/9-7.png","element":"img","alt":" ẑ^s_{τ⋆} = (x̂^s_{τ⋆}, ŷ^s_{τ⋆})","inline":true,"padRight":true},{"text":"be the equilibrium of ","element":"span"},{"style":{"height":16},"width":133.76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/9-8.png","element":"img","alt":"f^s_τ(x, y)","inline":true},{"text":". The analysis is similar to that of the previous algorithms: we first conduct a single-step analysis ","element":"span"},{"text":"(","element":"span"},{"href":"#id-82","text":"Lemma 22","element":"a"},{"text":") and then carefully bound the weighted recursive terms. We show in ","element":"span"},{"href":"#id-83","text":"Lemma 27 ","element":"a"},{"text":"that for any ","element":"span"},{"style":{"height":14.4},"width":177,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/9-9.png","element":"img","alt":" 0 < ϵ′ ≤ 1","inline":true},{"text":": ","element":"span"},{"style":{"height":26.43},"width":922.72,"height":66.08,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/9-10.png","element":"img","alt":"Σ_s Σ_{τ=1}^{n_{T+1}(s)} 1[KL(ẑ^s_{τ⋆}, ẑ^s_τ) ≥ ϵ′] ≤ O(S^2 A ln^5(SAT/δ) / (ηϵ^2 ϵ′(1−γ)^3))","inline":true},{"text":". 
This proves policy convergence: the number of iterations where the policy is far away from Nash equilibria of the regularized games is bounded, which can then be translated to upper bounds on the first two terms.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Value Convergence: Bounding ","element":"span"},{"style":{"height":18.93},"width":170.12,"height":47.32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/9-11.png","element":"img","alt":" |V̄^s_t − V̲^s_t|","inline":true,"padRight":true},{"text":"It remains to bound the last term in ","element":"span"},{"href":"#id-84","text":"Eq. (4)","element":"a"},{"text":". Define ","element":"span"},{"style":{"height":18.94},"width":656.96,"height":47.36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/9-12.png","element":"img","alt":"c_t = 1[x^{s_t}_t (E_{s′∼P^{s_t}}[V̄^{s′}_t − V̲^{s′}_t]) y^{s_t}_t ≥ ϵ̃]","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"height":16.58},"width":93.12,"height":41.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/9-13.png","element":"img","alt":" ϵ̃ = u/4","inline":true},{"text":". Then we only need to bound ","element":"span"},{"style":{"height":20.59},"width":213.48,"height":51.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/9-14.png","element":"img","alt":" C ≜ Σ_{t=1}^T c_t","inline":true},{"text":". ","element":"span"},{"text":"We use the weighted sum ","element":"span"},{"style":{"height":24.4},"width":690,"height":61,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/9-15.png","element":"img","alt":" P_T ≜ Σ_{t=1}^T c_t x^{s_t}_t (E_{s′∼P^{s_t}}[V̄^{s′}_t − V̲^{s′}_t]) y^{s_t}_t","inline":true},{"text":"as a proxy. On the one hand, ","element":"span"},{"style":{"height":13.2},"width":155.48,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/9-16.png","element":"img","alt":"P_T ≥ Cϵ̃","inline":true},{"text":". 
On the other hand, in ","element":"span"},{"href":"#id-85","text":"Lemma 25","element":"a"},{"text":", by recursively tracking the update of the value function and carefully choosing ","element":"span"},{"style":{"height":14.8},"width":121.76,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/9-17.png","element":"img","alt":" η and β","inline":true},{"text":", we upper bound ","element":"span"},{"style":{"height":26.43},"width":539.52,"height":66.08,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/9-18.png","element":"img","alt":" P_T by ≤ Cϵ̃/2 + O(AS ln^4(AST/δ) / (η(1−γ)^3))","inline":true},{"text":". Combining the upper and lower bounds on ","element":"span"},{"style":{"height":13.41},"width":46.52,"height":33.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/9-19.png","element":"img","alt":" P_T","inline":true,"padRight":true},{"text":"gives ","element":"span"},{"style":{"height":26.45},"width":383.76,"height":66.12,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/9-20.png","element":"img","alt":" C ≤ O(AS ln^4(AST/δ) / (ηu(1−γ)^3))","inline":true,"padRight":true},{"text":"(","element":"span"},{"href":"#id-86","text":"Corollary 2","element":"a"},{"text":"). Plugging appropriate choices of ","element":"span"},{"style":{"height":14.99},"width":166.52,"height":37.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/9-21.png","element":"img","alt":" ϵ, η, and β","inline":true,"padRight":true},{"text":"into the above bounds proves ","element":"span"},{"href":"#id-75","text":"Theorem 3 ","element":"a"},{"text":"(see ","element":"span"},{"text":"Appendix E","element":"span"},{"text":").","element":"span"}]]},{"heading":"7 Conclusion and Future Directions","paragraphs":[[{"text":"In this work, we study decentralized learning in two-player zero-sum Markov games with bandit feedback. 
We propose the first uncoupled and convergent algorithms with non-asymptotic last-iterate convergence rates for matrix games and irreducible Markov games, respectively. We also introduce a novel notion of path convergence and provide an algorithm with path convergence in Markov games without any assumption on the dynamics. Previous results either focus on average-iterate convergence or require stronger feedback/coordination or lack non-asymptotic convergence rates. Our results contribute to the theoretical understanding of the practical success of regularization and last-iterate convergence in multi-agent reinforcement learning.","element":"span"}],[{"text":"Settling the optimal last-iterate convergence rate that is achievable by uncoupled learning dynamics is an important open question. The following directions are promising towards closing the gap between current upper bounds ","element":"span"},{"style":{"height":18.19},"width":168.12,"height":45.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/9-22.png","element":"img","alt":" O(T^{−1/8})","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":18.19},"width":231.24,"height":45.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/9-23.png","element":"img","alt":" O(T^{−1/(9+ε)})","inline":true,"padRight":true},{"text":"and lower bound ","element":"span"},{"style":{"height":19.66},"width":139.12,"height":49.16,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/9-24.png","element":"img","alt":" Ω(T^{−1/2})","inline":true},{"text":". The impossibility result by [","element":"span"},{"href":"#id-59","referenceIndex":40,"text":"MPS20","element":"a"},{"text":"] demonstrates that certain algorithms with ","element":"span"},{"style":{"height":18.3},"width":126.36,"height":45.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/9-25.png","element":"img","alt":" O(√T)","inline":true,"padRight":true},{"text":"regret diverge in 
last-iterate. Their result indicates that the current ","element":"span"},{"style":{"height":22.74},"width":119.16,"height":56.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/9-26.png","element":"img","alt":" Ω(1/√T)","inline":true,"padRight":true},{"text":"lower bound on the convergence rate may not be tight. On ","element":"span"},{"text":"the other hand, our algorithms provide insights and useful templates for potential improvements to the upper bound. For instance, instead of using the EXP3-IX update, adapting optimistic policy updates or other accelerated first-order methods to the bandit feedback setting is an interesting future direction.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Acknowledgement ","element":"span"},{"text":"We thank Chanwoo Park and Kaiqing Zhang for pointing out a mistake in our previous proof. We also thank the anonymous reviewers for their constructive feedback. HL is supported by NSF Award IIS-1943607 and a Google Research Scholar Award.","element":"span"}]]},{"heading":"References","paragraphs":[[{"id":"id-31","text":"[AVHC22] ","element":"span"},{"text":"Ahmet Alacaoglu, Luca Viano, Niao He, and Volkan Cevher. A natural actor-critic framework for zero-sum markov games. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 307–366. PMLR, 2022.","element":"span"}],[{"id":"id-8","text":"[AY16] ","element":"span"},{"text":"Gürdal Arslan and Serdar Yüksel. Decentralized q-learning for stochastic teams and games. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE Transactions on Automatic Control","element":"span"},{"text":", 62(4):1545–1558, 2016.","element":"span"}],[{"id":"id-40","text":"[BJ20] ","element":"span"},{"text":"Yu Bai and Chi Jin. Provable self-play algorithms for competitive reinforcement learning. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International conference on machine learning","element":"span"},{"text":", pages 551–560. PMLR, 2020.","element":"span"}],[{"id":"id-12","text":"[BJY20] ","element":"span"},{"text":"Yu Bai, Chi Jin, and Tiancheng Yu. Near-optimal reinforcement learning with self-play. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in neural information processing systems","element":"span"},{"text":", 33:2159–2170, 2020.","element":"span"}],[{"id":"id-48","text":"[BLM18] Mario Bravo, David Leslie, and Panayotis Mertikopoulos. Bandit learning in concave ","element":"span"},{"text":"n-person games. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 31, 2018.","element":"span"}],[{"id":"id-6","text":"[BP18] ","element":"span"},{"text":"James P Bailey and Georgios Piliouras. Multiplicative weights update in zero-sum games. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 2018 ACM Conference on Economics and Computation","element":"span"},{"text":", 2018.","element":"span"}],[{"id":"id-33","text":"[BT02] ","element":"span"},{"text":"Ronen I Brafman and Moshe Tennenholtz. R-max-a general polynomial time algorithm for near-optimal reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of Machine Learning Research","element":"span"},{"text":", 3(Oct):213–231, 2002.","element":"span"}],[{"id":"id-7","text":"[BV01] ","element":"span"},{"text":"Michael Bowling and Manuela Veloso. Rational and convergent learning in stochastic games. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 17th international joint conference on Artificial intelligence-Volume 2","element":"span"},{"text":", pages 1021–1026, 2001.","element":"span"}],[{"id":"id-25","text":"[CCDX23] ","element":"span"},{"text":"Shicong Cen, Yuejie Chi, Simon Shaolei Du, and Lin Xiao. Faster last-iterate convergence of policy optimization in zero-sum markov games. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Learning Representations","element":"span"},{"text":", 2023.","element":"span"}],[{"id":"id-103","text":"[CLW21] ","element":"span"},{"text":"Liyu Chen, Haipeng Luo, and Chen-Yu Wei. Impossible tuning made possible: A new expert algorithm and its applications. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Conference on Learning Theory","element":"span"},{"text":", pages 1216–1259. PMLR, 2021.","element":"span"}],[{"id":"id-1","text":"[CMZ21] ","element":"span"},{"text":"Ziyi Chen, Shaocong Ma, and Yi Zhou. Sample efficient stochastic policy extragradient algorithm for zero-sum markov game. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Learning Representations","element":"span"},{"text":", 2021.","element":"span"}],[{"id":"id-56","text":"[COZ22] ","element":"span"},{"text":"Yang Cai, Argyris Oikonomou, and Weiqiang Zheng. Finite-time last-iterate convergence for learning in multi-player games. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems (NeurIPS)","element":"span"},{"text":", 2022.","element":"span"}],[{"id":"id-23","text":"[CS07] ","element":"span"},{"text":"Vincent Conitzer and Tuomas Sandholm. Awesome: A general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Machine Learning","element":"span"},{"text":", 67(1-2):23–43, 2007.","element":"span"}],[{"id":"id-2","text":"[CWC21] ","element":"span"},{"text":"Shicong Cen, Yuting Wei, and Yuejie Chi. Fast policy extragradient methods for competitive games with entropy regularization. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 34:27952–27964, 2021.","element":"span"}],[{"id":"id-58","text":"[CZ23] ","element":"span"},{"text":"Yang Cai and Weiqiang Zheng. Doubly optimal no-regret learning in monotone games. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", 2023. to appear.","element":"span"}],[{"id":"id-42","text":"[CZG22] ","element":"span"},{"text":"Zixiang Chen, Dongruo Zhou, and Quanquan Gu. Almost optimal algorithms for two-player zero-sum markov games with linear function approximation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Algorithmic Learning Theory","element":"span"},{"text":", 2022.","element":"span"}],[{"id":"id-10","text":"[DDK11] ","element":"span"},{"text":"Constantinos Daskalakis, Alan Deckelbaum, and Anthony Kim. Near-optimal no-regret algorithms for zero-sum games. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the twenty-second annual ACM-SIAM symposium on Discrete Algorithms","element":"span"},{"text":", pages 235–254. SIAM, 2011.","element":"span"}],[{"id":"id-29","text":"[DFG20] ","element":"span"},{"text":"Constantinos Daskalakis, Dylan J Foster, and Noah Golowich. Independent policy gradient methods for competitive reinforcement learning. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in neural information processing systems","element":"span"},{"text":", 33:5527–5540, 2020.","element":"span"}],[{"id":"id-52","text":"[DFR22] ","element":"span"},{"text":"Dmitriy Drusvyatskiy, Maryam Fazel, and Lillian J Ratliff. Improved rates for derivative free gradient play in strongly monotone games. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE 61st Conference on Decision and Control (CDC)","element":"span"},{"text":", 2022.","element":"span"}],[{"id":"id-32","text":"[ELS","element":"span"},{"style":{"height":15.41},"width":74.48,"height":38.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/11-0.png","element":"img","alt":"+23]","inline":true,"padRight":true},{"text":"Liad Erez, Tal Lancewicki, Uri Sherman, Tomer Koren, and Yishay Mansour. Regret minimization and convergence to equilibria in general-sum markov games. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 9343–9373. PMLR, 2023.","element":"span"}],[{"id":"id-17","text":"[FT91] ","element":"span"},{"text":"Jerzy A Filar and Boleslaw Tolwinski. On the algorithm of pollatschek and avi-ltzhak. 1991.","element":"span"}],[{"id":"id-55","text":"[GPD20] ","element":"span"},{"text":"Noah Golowich, Sarath Pattathil, and Constantinos Daskalakis. Tight last-iterate convergence rates for no-regret learning in multi-player games. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in neural information processing systems","element":"span"},{"text":", 2020.","element":"span"}],[{"id":"id-57","text":"[GTG22] ","element":"span"},{"text":"Eduard Gorbunov, Adrien Taylor, and Gauthier Gidel. Last-iterate convergence of optimistic gradient method for monotone variational inequalities. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 2022.","element":"span"}],[{"id":"id-54","text":"[GVGM21] ","element":"span"},{"text":"Angeliki Giannou, Emmanouil-Vasileios Vlatakis-Gkaragkounis, and Panayotis Mertikopoulos. On the rate of convergence of regularized learning in games: From bandits and uncertainty to optimism and beyond. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 34:22655–22666, 2021.","element":"span"}],[{"id":"id-53","text":"[HH23] ","element":"span"},{"text":"Yuanhanqing Huang and Jianghai Hu. Zeroth-order learning in continuous games via residual pseudogradient estimates. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2301.02279","element":"span"},{"text":", 2023.","element":"span"}],[{"id":"id-49","text":"[HIMM19] ","element":"span"},{"text":"Yu-Guan Hsieh, Franck Iutzeler, Jérôme Malick, and Panayotis Mertikopoulos. On the convergence of single-call stochastic extra-gradient methods. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 2019.","element":"span"}],[{"id":"id-14","text":"[HK66] ","element":"span"},{"text":"Alan J Hoffman and Richard M Karp. On nonterminating stochastic games. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Management Science","element":"span"},{"text":", 1966.","element":"span"}],[{"id":"id-36","text":"[HLWY22] ","element":"span"},{"text":"Baihe Huang, Jason D. Lee, Zhaoran Wang, and Zhuoran Yang. Towards general function approximation in zero-sum markov games. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Learning Representations","element":"span"},{"text":", 2022.","element":"span"}],[{"id":"id-22","text":"[HW03] ","element":"span"},{"text":"Junling Hu and Michael P Wellman. Nash q-learning for general-sum stochastic games. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of machine learning research","element":"span"},{"text":", 4(Nov):1039–1069, 2003.","element":"span"}],[{"id":"id-38","text":"[JJJN21] ","element":"span"},{"text":"Mehdi Jafarnia-Jahromi, Rahul Jain, and Ashutosh Nayyar. Learning zero-sum stochastic games with posterior sampling. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2109.03396","element":"span"},{"text":", 2021.","element":"span"}],[{"id":"id-43","text":"[JLWY21] ","element":"span"},{"text":"Chi Jin, Qinghua Liu, Yuanhao Wang, and Tiancheng Yu. V-learning–a simple, efficient, decentralized algorithm for multiagent rl. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2110.14555","element":"span"},{"text":", 2021.","element":"span"}],[{"id":"id-37","text":"[JLY22] ","element":"span"},{"text":"Chi Jin, Qinghua Liu, and Tiancheng Yu. The power of exploiter: Provable multi-agent rl in large state spaces. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 10251–10279. PMLR, 2022.","element":"span"}],[{"id":"id-140","text":"[LH14] ","element":"span"},{"text":"Tor Lattimore and Marcus Hutter. Near-optimal pac bounds for discounted mdps. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Theoretical Computer Science","element":"span"},{"text":", 558:125–143, 2014.","element":"span"}],[{"id":"id-18","text":"[Lit94] ","element":"span"},{"text":"Michael L Littman. Markov games as a framework for multi-agent reinforcement learning. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Machine learning proceedings 1994","element":"span"},{"text":", pages 157–163. Elsevier, 1994.","element":"span"}],[{"id":"id-45","text":"[LS19] ","element":"span"},{"text":"Tengyuan Liang and James Stokes. Interaction matters: A note on non-asymptotic local convergence of generative adversarial networks. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The 22nd International Conference on Artificial Intelligence and Statistics","element":"span"},{"text":", 2019.","element":"span"}],[{"id":"id-41","text":"[LYBJ21] ","element":"span"},{"text":"Qinghua Liu, Tiancheng Yu, Yu Bai, and Chi Jin. A sharp analysis of model-based reinforcement learning with self-play. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 7001–7010. PMLR, 2021.","element":"span"}],[{"id":"id-50","text":"[LZBZ21] ","element":"span"},{"text":"Tianyi Lin, Zhengyuan Zhou, Wenjia Ba, and Jiawei Zhang. Doubly optimal no-regret online learning in strongly monotone games with bandit feedback. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Available at SSRN 3978421","element":"span"},{"text":", 2021.","element":"span"}],[{"id":"id-46","text":"[MOP20] ","element":"span"},{"text":"Aryan Mokhtari, Asuman E Ozdaglar, and Sarath Pattathil. Convergence rate of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"o","element":"span"},{"text":"(1","element":"span"},{"style":{"fontStyle":"italic"},"text":"/k","element":"span"},{"text":") ","element":"span"},{"text":"for optimistic gradient and extragradient methods in smooth convex-concave saddle point problems. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"SIAM Journal on Optimization","element":"span"},{"text":", 30(4):3230–3251, 2020.","element":"span"}],[{"id":"id-5","text":"[MPP18] ","element":"span"},{"text":"Panayotis Mertikopoulos, Christos Papadimitriou, and Georgios Piliouras. Cycles in adversarial regularized learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms","element":"span"},{"text":", pages 2703–2717. SIAM, 2018.","element":"span"}],[{"id":"id-59","text":"[MPS20] ","element":"span"},{"text":"Vidya Muthukumar, Soham Phade, and Anant Sahai. On the impossibility of convergence of mixed strategies with no regret learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2012.02125","element":"span"},{"text":", 2020.","element":"span"}],[{"id":"id-63","text":"[Neu15] ","element":"span"},{"text":"Gergely Neu. Explore no more: Improved high-probability regret bounds for nonstochastic bandits. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 28, 2015.","element":"span"}],[{"id":"id-15","text":"[PAI69] ","element":"span"},{"text":"MA Pollatschek and B Avi-Itzhak. Algorithms for stochastic games with geometrical interpretation. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Management Science","element":"span"},{"text":", 1969.","element":"span"}],[{"id":"id-4","text":"[PDVH","element":"span"},{"style":{"height":15.41},"width":74.48,"height":38.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/12-0.png","element":"img","alt":"+22]","inline":true,"padRight":true},{"text":"Julien Perolat, Bart De Vylder, Daniel Hennes, Eugene Tarassov, Florian Strub, Vincent de Boer, Paul Muller, Jerome T Connor, Neil Burch, Thomas Anthony, et al. 
Mastering the game of stratego with model-free multiagent reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Science","element":"span"},{"text":", 378(6623):990–996, 2022.","element":"span"}],[{"id":"id-112","text":"[Put14] ","element":"span"},{"text":"Martin L Puterman. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Markov decision processes: discrete stochastic dynamic programming","element":"span"},{"text":". John Wiley & Sons, 2014.","element":"span"}],[{"id":"id-13","text":"[Sha53] ","element":"span"},{"text":"Lloyd S Shapley. Stochastic games. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the national academy of sciences","element":"span"},{"text":", 1953.","element":"span"}],[{"id":"id-21","text":"[SL99] ","element":"span"},{"text":"Csaba Szepesvári and Michael L Littman. A unified analysis of value-function-based reinforcement-learning algorithms. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Neural computation","element":"span"},{"text":", 1999.","element":"span"}],[{"id":"id-27","text":"[SLY23] ","element":"span"},{"text":"Zhuoqing Song, Jason D. Lee, and Zhuoran Yang. Can we find nash equilibria at a linear rate in markov games? In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Learning Representations","element":"span"},{"text":", 2023.","element":"span"}],[{"id":"id-24","text":"[SPO22] ","element":"span"},{"text":"Muhammed O Sayin, Francesca Parise, and Asuman Ozdaglar. Fictitious play in zero-sum stochastic games. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"SIAM Journal on Control and Optimization","element":"span"},{"text":", 60(4):2095–2114, 2022.","element":"span"}],[{"id":"id-100","text":"[SSBD14] ","element":"span"},{"text":"Shai Shalev-Shwartz and Shai Ben-David. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Understanding machine learning: From theory to algorithms","element":"span"},{"text":". Cambridge university press, 2014.","element":"span"}],[{"id":"id-3","text":"[SSS","element":"span"},{"style":{"height":15.41},"width":74.48,"height":38.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/12-1.png","element":"img","alt":"+17]","inline":true,"padRight":true},{"text":"David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Nature","element":"span"},{"text":", 2017.","element":"span"}],[{"id":"id-9","text":"[SZL","element":"span"},{"style":{"height":15.41},"width":74.48,"height":38.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/12-2.png","element":"img","alt":"+21]","inline":true,"padRight":true},{"text":"Muhammed Sayin, Kaiqing Zhang, David Leslie, Tamer Basar, and Asuman Ozdaglar. Decentralized q-learning in zero-sum markov games. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 34:18320–18334, 2021.","element":"span"}],[{"id":"id-51","text":"[TK22] ","element":"span"},{"text":"Tatiana Tatarenko and Maryam Kamgarpour. On the rate of convergence of payoff-based algorithms to nash equilibrium in strongly monotone games. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2202.11147","element":"span"},{"text":", 2022.","element":"span"}],[{"id":"id-44","text":"[Tse95] ","element":"span"},{"text":"Paul Tseng. On linear convergence of iterative methods for the variational inequality problem. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of Computational and Applied Mathematics","element":"span"},{"text":", 60(1-2):237–252, 1995.","element":"span"}],[{"id":"id-76","text":"[TWYS20] ","element":"span"},{"text":"Yi Tian, Yuanhao Wang, Tiancheng Yu, and Suvrit Sra. Provably efficient online agnostic learning in markov games. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2010.15020","element":"span"},{"text":", 2020.","element":"span"}],[{"id":"id-16","text":"[VDW78] ","element":"span"},{"text":"J Van Der Wal. Discounted markov games: Generalized policy iteration method. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of Optimization Theory and Applications","element":"span"},{"text":", 1978.","element":"span"}],[{"id":"id-60","text":"[vN28] ","element":"span"},{"text":"J v. Neumann. Zur theorie der gesellschaftsspiele. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Mathematische annalen","element":"span"},{"text":", 100(1):295–320, 1928.","element":"span"}],[{"id":"id-141","text":"[WDCW20] ","element":"span"},{"text":"Yuanhao Wang, Kefan Dong, Xiaoyu Chen, and Liwei Wang. Q-learning with ucb exploration is sample efficient for infinite-horizon mdp. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Learning Representations","element":"span"},{"text":", 2020.","element":"span"}],[{"id":"id-34","text":"[WHL17] ","element":"span"},{"text":"Chen-Yu Wei, Yi-Te Hong, and Chi-Jen Lu. Online reinforcement learning in stochastic games. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", pages 4987–4997, 2017.","element":"span"}],[{"id":"id-0","text":"[WLZL21a] ","element":"span"},{"text":"Chen-Yu Wei, Chung-Wei Lee, Mengxiao Zhang, and Haipeng Luo. 
Last-iterate convergence of decentralized optimistic gradient descent/ascent in infinite-horizon competitive markov games. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Conference on learning theory","element":"span"},{"text":", pages 4259–4299. PMLR, 2021.","element":"span"}],[{"id":"id-47","text":"[WLZL21b] ","element":"span"},{"text":"Chen-Yu Wei, Chung-Wei Lee, Mengxiao Zhang, and Haipeng Luo. Linear last-iterate convergence in constrained saddle-point optimization. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Learning Representations (ICLR)","element":"span"},{"text":", 2021.","element":"span"}],[{"id":"id-35","text":"[XCWY20] ","element":"span"},{"text":"Qiaomin Xie, Yudong Chen, Zhaoran Wang, and Zhuoran Yang. ","element":"span"},{"text":"Learning zero-sum simultaneous-move markov games using function approximation and correlated equilibrium. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Conference on learning theory","element":"span"},{"text":", pages 3674–3682. PMLR, 2020.","element":"span"}],[{"id":"id-39","text":"[XZS","element":"span"},{"style":{"height":15.41},"width":74.48,"height":38.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/13-0.png","element":"img","alt":"+22]","inline":true,"padRight":true},{"text":"Wei Xiong, Han Zhong, Chengshuai Shi, Cong Shen, and Tong Zhang. A self-play posterior sampling algorithm for zero-sum markov games. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ICLR 2022 Workshop on Gamification and Multiagent Solutions","element":"span"},{"text":", 2022.","element":"span"}],[{"id":"id-28","text":"[YM23] ","element":"span"},{"text":"Yuepeng Yang and Cong Ma. 
","element":"span"},{"style":{"height":17.2},"width":128.52,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/13-1.png","element":"img","alt":"O(T −1)","inline":true,"padRight":true},{"text":"convergence of optimistic-follow-the-regularized-leader in two-player zero-sum markov games. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Learning Representations","element":"span"},{"text":", 2023.","element":"span"}],[{"id":"id-26","text":"[ZLW","element":"span"},{"style":{"height":15.41},"width":74.48,"height":38.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/13-2.png","element":"img","alt":"+22]","inline":true,"padRight":true},{"text":"Runyu Zhang, Qinghua Liu, Huan Wang, Caiming Xiong, Na Li, and Yu Bai. Policy optimization for markov games: Unified framework and faster convergence. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 2022.","element":"span"}],[{"id":"id-30","text":"[ZTLD22] ","element":"span"},{"text":"Yulai Zhao, Yuandong Tian, Jason Lee, and Simon Du. Provably efficient policy optimization for two-player zero-sum markov games. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Artificial Intelligence and Statistics","element":"span"},{"text":", pages 2736–2761. PMLR, 2022.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"A Auxiliary Lemmas ","element":"span"}],[{"text":"A.1 ","element":"span"},{"href":"#id-65","text":"Sequence Properties ","element":"a"}],[{"text":"A.2 ","element":"span"},{"href":"#id-66","text":"Properties Related to E","element":"a"},{"text":"XP","element":"span"},{"text":"3-IX ","element":"span"}],[{"text":"A.3 ","element":"span"},{"href":"#id-87","text":"Markov Games ","element":"a"}],[{"text":"A.4 ","element":"span"},{"href":"#id-88","text":"Online Mirror Descent ","element":"a"}],[{"style":{"fontWeight":"bold"},"text":"B ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Last-Iterate Convergence Rate of Algorithm 1 ","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"C Improved Last-Iterate Convergence under Expectation ","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"D Last-Iterate Convergence Rate of Algorithm 2 ","element":"span"}],[{"text":"D.1 ","element":"span"},{"href":"#id-89","text":"On the Assumption of Irreducible Markov Game ","element":"a"}],[{"text":"D.2 ","element":"span"},{"href":"#id-90","text":"Part I. Basic Iteration Properties ","element":"a"}],[{"text":"D.3 ","element":"span"},{"href":"#id-91","text":"Part II. Policy Convergence to the Nash of Regularized Game ","element":"a"}],[{"text":"D.4 ","element":"span"},{"href":"#id-92","text":"Part III. Value Convergence ","element":"a"}],[{"text":"D.5 ","element":"span"},{"href":"#id-93","text":"Part IV. Combining ","element":"a"}],[{"style":{"fontWeight":"bold"},"text":"E ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Convergent Analysis of Algorithm 3 ","element":"span"}],[{"text":"E.1 ","element":"span"},{"href":"#id-94","text":"Part I. Basic Iteration Properties ","element":"a"}],[{"text":"E.2 ","element":"span"},{"href":"#id-95","text":"Part II. Value Convergence ","element":"a"}],[{"text":"E.3 ","element":"span"},{"href":"#id-96","text":"Part III. Policy Convergence to the Nash of the Regularized Game ","element":"a"}],[{"text":"E.4 ","element":"span"},{"href":"#id-97","text":"Part IV. Combining ","element":"a"}],[{"style":{"fontWeight":"bold"},"text":"F ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Discussions on Convergence Notions for General Markov Games ","element":"span"}]]},{"heading":"A Auxiliary Lemmas","paragraphs":[[{"id":"id-65","style":{"fontWeight":"bold"},"text":"A.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Sequence Properties","element":"span"}],[{"id":"id-106","style":{"fontWeight":"bold"},"text":"Lemma 1. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":34.58},"width":961.52,"height":86.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/14-0.png","element":"img","alt":" 0 < h < 1, 0 ≤ k ≤ 2, and let t ≥ (24/(1−h) · ln(12/(1−h)))^(1/(1−h)). Then","inline":true}],[{"style":{"width":"45%"},"width":728,"height":150,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/14-1.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Define","element":"span"}],[{"style":{"width":"20%"},"width":326,"height":56,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/14-2.png","element":"img"}],[{"text":"We first show that ","element":"span"},{"style":{"height":18.8},"width":90.52,"height":47,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/14-3.png","element":"img","alt":" s ≤ t/2","inline":true},{"text":". Suppose not; then we have","element":"span"}],[{"style":{"width":"75%"},"width":1192,"height":86,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/14-4.png","element":"img"}],[{"text":"and thus ","element":"span"},{"style":{"height":17.38},"width":485,"height":43.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/14-5.png","element":"img","alt":" t^(1−h) < 4(k + 1) ln t ≤ 12 ln t","inline":true},{"text":". 
However, by the condition for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"and ","element":"span"},{"href":"#id-98","text":"Lemma 3","element":"a"},{"text":", it holds that ","element":"span"},{"style":{"height":15.6},"width":228,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/14-6.png","element":"img","alt":"t^(1−h) ≥ 12 ln t","inline":true},{"text":", which leads to a contradiction.","element":"span"}],[{"text":"Then the sum can be decomposed as","element":"span"}],[{"style":{"width":"56%"},"width":888,"height":514,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/15-0.png","element":"img"}],[{"id":"id-107","style":{"fontWeight":"bold"},"text":"Lemma 2. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":34.4},"width":959.48,"height":86,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/15-1.png","element":"img","alt":" 0 < h < 1, 0 ≤ k ≤ 2, and let t ≥ (24/(1−h) · ln(12/(1−h)))^(1/(1−h)). Then","inline":true}],[{"style":{"width":"39%"},"width":630,"height":150,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/15-2.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof.","element":"span"}],[{"style":{"width":"85%"},"width":1362,"height":472,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/15-3.png","element":"img"}],[{"text":"where in ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"a","element":"span"},{"text":") ","element":"span"},{"text":"we use ","element":"span"},{"href":"#id-98","text":"Lemma 3","element":"a"},{"text":". Combining the two inequalities finishes the proof.","element":"span"}],[{"id":"id-98","style":{"fontWeight":"bold"},"text":"Lemma 3. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":34.58},"width":963.32,"height":86.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/15-4.png","element":"img","alt":" 0 < h < 1 and t ≥ (24/(1−h) · ln(12/(1−h)))^(1/(1−h)). Then t^(1−h) ≥ 12 ln t.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"By the condition, we have","element":"span"}],[{"style":{"width":"29%"},"width":460,"height":88,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/15-5.png","element":"img"}],[{"text":"Applying ","element":"span"},{"href":"#id-99","text":"Lemma 12","element":"a"},{"text":", we get","element":"span"}],[{"style":{"width":"34%"},"width":546,"height":90,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/15-6.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Lemma 4 ","element":"span"},{"text":"(Lemma A.1 of [","element":"span"},{"href":"#id-100","referenceIndex":49,"text":"SSBD14","element":"a"},{"text":"])","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":16},"width":677.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/15-7.png","element":"img","alt":" a > 0. Then x ≥ 2a ln(a) ⇒ x ≥ a ln(x).","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Lemma 5 ","element":"span"},{"text":"(Freedman’s Inequality)","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":13.58},"width":347.6,"height":33.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/15-8.png","element":"img","alt":" F_0 ⊂ F_1 ⊂ · · · ⊂ F_n","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be a filtration, and ","element":"span"},{"style":{"height":14.4},"width":309,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/15-9.png","element":"img","alt":" X_1, . . . , X_n be real","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"random variables such that ","element":"span"},{"style":{"height":13.6},"width":125.48,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/15-10.png","element":"img","alt":" X_i is F_i","inline":true},{"style":{"fontStyle":"italic"},"text":"-measurable, ","element":"span"},{"style":{"height":18.18},"width":843.12,"height":45.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/15-11.png","element":"img","alt":" E[X_i | F_{i−1}] = 0, |X_i| ≤ b, and Σ_{i=1}^n E[X_i^2 | F_{i−1}] ≤","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"V ","element":"span"},{"style":{"fontStyle":"italic"},"text":"for some fixed ","element":"span"},{"style":{"fontStyle":"italic"},"text":"b > ","element":"span"},{"text":"0 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"V > ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":". 
Then with probability at least ","element":"span"},{"style":{"height":13.6},"width":90.48,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/15-12.png","element":"img","alt":" 1 − δ,","inline":true}],[{"style":{"width":"39%"},"width":632,"height":116,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/15-13.png","element":"img"}],[{"id":"id-66","style":{"fontWeight":"bold"},"text":"A.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Properties Related to E","element":"span"},{"style":{"fontWeight":"bold"},"text":"XP","element":"span"},{"style":{"fontWeight":"bold"},"text":"3-IX","element":"span"}],[{"text":"In ","element":"span"},{"href":"#id-101","text":"Lemma 6 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-102","text":"Lemma 7","element":"a"},{"text":", we assume that ","element":"span"},{"style":{"height":13.58},"width":345.4,"height":33.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/16-0.png","element":"img","alt":" F0 ⊂ F1 ⊂ F2 ⊂ · · ·","inline":true,"padRight":true},{"text":"is a filtration, and assume that ","element":"span"},{"style":{"height":14.61},"width":79.48,"height":36.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/16-1.png","element":"img","alt":" xi, ℓi","inline":true,"padRight":true},{"text":"are ","element":"span"},{"style":{"height":13.6},"width":78,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/16-2.png","element":"img","alt":" Fi−1","inline":true},{"text":"-measurable, where ","element":"span"},{"style":{"height":17.79},"width":858.88,"height":44.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/16-3.png","element":"img","alt":" xi ∈ ∆A, ℓi ∈ [0, 1]A. 
Besides, ai ∈ [A] and σi are Fi","inline":true},{"text":"-measurable with ","element":"span"},{"style":{"height":25.18},"width":1278,"height":62.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/16-4.png","element":"img","alt":"E[a_i = a | F_{i−1}] = x_{i,a} and E[σ_i | F_{i−1}] = ℓ_i. Define ℓ̂_{i,a} = σ_{i,a} 1[a_i = a] / (x_{i,a} + β_i), where β_i","inline":true,"padRight":true},{"text":"is non-increasing.","element":"span"}],[{"id":"id-101","style":{"fontWeight":"bold"},"text":"Lemma 6 ","element":"span"},{"text":"(Lemma 20 of [","element":"span"},{"href":"#id-12","referenceIndex":4,"text":"BJY20","element":"a"},{"text":"])","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":10.21},"width":203.52,"height":25.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/16-5.png","element":"img","alt":" c1, c2, . . . , ct","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be fixed positive numbers. Then with probability at least ","element":"span"},{"style":{"height":13.6},"width":90,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/16-6.png","element":"img","alt":" 1 − δ,","inline":true}],[{"style":{"width":"62%"},"width":992,"height":154,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/16-7.png","element":"img"}],[{"id":"id-102","style":{"fontWeight":"bold"},"text":"Lemma 7 ","element":"span"},{"text":"(Adapted from Lemma 18 of [","element":"span"},{"href":"#id-12","referenceIndex":4,"text":"BJY20","element":"a"},{"text":"])","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":10.21},"width":203.48,"height":25.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/16-8.png","element":"img","alt":" c1, c2, . . . 
, ct","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be fixed positive numbers. Then for any sequence ","element":"span"},{"style":{"height":15.74},"width":609.96,"height":39.36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/16-9.png","element":"img","alt":" x⋆1, . . . , x⋆t ∈ ∆A such that x⋆i is Fi−1","inline":true},{"style":{"fontStyle":"italic"},"text":"-measurable, with probability at least ","element":"span"},{"style":{"height":14},"width":98.84,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/16-10.png","element":"img","alt":" 1 − δ,","inline":true}],[{"style":{"width":"46%"},"width":740,"height":120,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/16-11.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Lemma 18 of [","element":"span"},{"href":"#id-12","referenceIndex":4,"text":"BJY20","element":"a"},{"text":"] states that for any sequence of coefficients ","element":"span"},{"style":{"height":10.4},"width":244.48,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/16-12.png","element":"img","alt":" w1, w2, · · · , wt","inline":true,"padRight":true},{"text":"such that ","element":"span"},{"style":{"height":17.81},"width":355.48,"height":44.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/16-13.png","element":"img","alt":"wi ∈ [0, 2βi]A is Fi−1","inline":true},{"text":"-measurable, we have with probability ","element":"span"},{"style":{"height":14},"width":97.84,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/16-14.png","element":"img","alt":" 1 − δ,","inline":true}],[{"style":{"width":"40%"},"width":638,"height":122,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/16-15.png","element":"img"}],[{"text":"Since 
","element":"span"},{"style":{"height":15.74},"width":259.68,"height":39.36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/16-16.png","element":"img","alt":" x⋆i ∈ ∆A and βi","inline":true,"padRight":true},{"text":"is decreasing, we know ","element":"span"},{"style":{"height":16.14},"width":290.44,"height":40.36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/16-17.png","element":"img","alt":" 2βt · x⋆i ∈ [0, 2βi]","inline":true},{"text":". Thus we can apply Lemma 18 of ","element":"span"},{"text":"[","element":"span"},{"href":"#id-12","referenceIndex":4,"text":"BJY20","element":"a"},{"text":"] and get with probability ","element":"span"},{"style":{"height":13.81},"width":93,"height":34.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/16-18.png","element":"img","alt":" 1 − δ,","inline":true}],[{"style":{"width":"71%"},"width":1138,"height":120,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/16-19.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Lemma 8 ","element":"span"},{"text":"(Lemma 21 of [","element":"span"},{"href":"#id-12","referenceIndex":4,"text":"BJY20","element":"a"},{"text":"])","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":10.4},"width":203.52,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/16-20.png","element":"img","alt":" c1, c2, . . . , ct","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be fixed positive numbers. 
Then with probability at least ","element":"span"},{"style":{"height":14.8},"width":376.24,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/16-21.png","element":"img","alt":" 1 − δ, for all x⋆ ∈ ∆A,","inline":true}],[{"style":{"width":"47%"},"width":750,"height":120,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/16-22.png","element":"img"}],[{"id":"id-64","style":{"fontWeight":"bold"},"text":"Lemma 9. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":16},"width":119.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/16-23.png","element":"img","alt":" (x1, y1)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":16},"width":119.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/16-24.png","element":"img","alt":" (x2, y2)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be equilibria of ","element":"span"},{"style":{"height":16},"width":102.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/16-25.png","element":"img","alt":" f1(·, ·)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"in the domain ","element":"span"},{"style":{"height":13.39},"width":41,"height":33.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/16-26.png","element":"img","alt":" Z1","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":16},"width":108.76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/16-27.png","element":"img","alt":" f2(·, ·)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"in the domain ","element":"span"},{"style":{"height":13.41},"width":42,"height":33.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/16-28.png","element":"img","alt":" 
Z2","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"respectively. Suppose that ","element":"span"},{"style":{"height":13.2},"width":154.76,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/16-29.png","element":"img","alt":" Z1 ⊆ Z2","inline":true},{"style":{"fontStyle":"italic"},"text":", and that ","element":"span"},{"style":{"height":18.8},"width":617,"height":47,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/16-30.png","element":"img","alt":" sup(x,y)∈Z1 |f1(x, y) − f2(x, y)| ≤ ϵ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"Then for any ","element":"span"},{"style":{"height":16},"width":190,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/16-31.png","element":"img","alt":" (x, y) ∈ Z2,","inline":true}],[{"style":{"width":"78%"},"width":1246,"height":146,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/16-32.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"Since ","element":"span"},{"style":{"height":16},"width":119.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/16-33.png","element":"img","alt":" (x1, y1)","inline":true,"padRight":true},{"text":"is an equilibrium of ","element":"span"},{"style":{"height":14.4},"width":31.48,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/16-34.png","element":"img","alt":" f1","inline":true},{"text":", we have for any ","element":"span"},{"style":{"height":16.19},"width":214.52,"height":40.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/16-35.png","element":"img","alt":" (x′, y′) ∈ Z1,","inline":true}],[{"style":{"width":"28%"},"width":454,"height":50,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/16-36.png","element":"img"}],[{"text":"which implies","element":"span"}],[{"style":{"width":"29%"},"width":468,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/16-37.png","element":"img"}],[{"text":"For any ","element":"span"},{"style":{"height":16},"width":180.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/16-38.png","element":"img","alt":" (x, y) ∈ Z2","inline":true},{"text":", we can find ","element":"span"},{"style":{"height":16},"width":754.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/16-39.png","element":"img","alt":" (x′, y′) ∈ Z1 such that ∥(x, y) − (x′, y′)∥1 ≤ d","inline":true},{"text":". 
Therefore, for any ","element":"span"},{"style":{"height":16},"width":192.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/16-40.png","element":"img","alt":"(x, y) ∈ Z2,","inline":true}],[{"style":{"width":"84%"},"width":1346,"height":196,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/16-41.png","element":"img"}],[{"id":"id-87","style":{"fontWeight":"bold"},"text":"A.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Markov Games","element":"span"}],[{"id":"id-127","style":{"fontWeight":"bold"},"text":"Lemma 10 ","element":"span"},{"text":"([","element":"span"},{"href":"#id-0","referenceIndex":59,"text":"WLZL21a","element":"a"},{"text":"])","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"For any policy pair ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x, y","element":"span"},{"style":{"fontStyle":"italic"},"text":", the duality gap of a two-player zero-sum game can be related to the duality gap on individual states:","element":"span"}],[{"style":{"width":"61%"},"width":970,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/17-0.png","element":"img"}],[{"id":"id-88","style":{"fontWeight":"bold"},"text":"A.4 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Online Mirror Descent","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma 11. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let","element":"span"}],[{"style":{"width":"54%"},"width":862,"height":126,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/17-1.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"for some convex set ","element":"span"},{"style":{"height":21.78},"width":727.32,"height":54.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/17-2.png","element":"img","alt":" Ω ⊆ ∆_A, ℓ ∈ [0, ∞)^A, and ϵ ∈ [0, 1/η]^A. 
Then","inline":true}],[{"style":{"width":"83%"},"width":1318,"height":112,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/17-3.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"for any ","element":"span"},{"style":{"height":13.6},"width":311.8,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/17-4.png","element":"img","alt":" u ∈ Ω, where ϵ ln x","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"denotes the vector ","element":"span"},{"style":{"height":16},"width":231.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/17-5.png","element":"img","alt":" (ϵ_a ln x_a)_{a∈A}.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"By the standard analysis of online mirror descent, we have for any ","element":"span"},{"style":{"height":12.19},"width":97.48,"height":30.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/17-6.png","element":"img","alt":" u ∈ Ω","inline":true}],[{"style":{"width":"88%"},"width":1396,"height":98,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/17-7.png","element":"img"}],[{"text":"Below, we abuse the notation by defining KL","element":"span"},{"style":{"height":21.22},"width":550.12,"height":53.04,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/17-8.png","element":"img","alt":"(x̃, x) = Σ_a (x̃_a ln(x̃_a / x_a) − x̃_a + x_a)","inline":true,"padRight":true},{"text":"without restricting ","element":"span"},{"style":{"height":10.99},"width":20.52,"height":27.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/17-9.png","element":"img","alt":" x̃","inline":true,"padRight":true},{"text":"to be a probability vector. 
Then, following the analysis in the proof of Lemma 1 of [","element":"span"},{"href":"#id-103","referenceIndex":10,"text":"CLW21","element":"a"},{"text":"], we have","element":"span"}],[{"style":{"width":"94%"},"width":1494,"height":1112,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/17-10.png","element":"img"}],[{"id":"id-99","style":{"fontWeight":"bold"},"text":"Lemma 12. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"For ","element":"span"},{"style":{"height":17.2},"width":909.52,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/17-11.png","element":"img","alt":" x ∈ (0, 1) and y > 0, we have x^(1−y) − x ≤ −y x^(1−y) ln x.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof.","element":"span"}],[{"style":{"width":"79%"},"width":1258,"height":344,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/18-0.png","element":"img"}],[{"text":"where the second equality is by the mean value theorem.","element":"span"}]]},{"heading":"B Last-Iterate Convergence Rate of Algorithm 1","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"Proof of ","element":"span"},{"href":"#id-104","style":{"fontWeight":"bold"},"text":"Theorem 1","element":"a"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"text":"The proof is divided into three parts. In Part I, we establish a descent inequality for KL","element":"span"},{"style":{"height":16.02},"width":112.52,"height":40.04,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/18-1.png","element":"img","alt":"(z⋆_t, z_t)","inline":true},{"text":". In Part II, we upper-bound KL","element":"span"},{"style":{"height":16.02},"width":112.52,"height":40.04,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/18-2.png","element":"img","alt":"(z⋆_t, z_t)","inline":true,"padRight":true},{"text":"by recursively applying the descent ","element":"span"},{"text":"inequality. 
Finally, in Part III, we show a last-iterate convergence rate on the duality gap of ","element":"span"},{"style":{"height":16},"width":214.76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/18-3.png","element":"img","alt":" z_t = (x_t, y_t).","inline":true,"padRight":true},{"text":"In the proof, we assume without loss of generality that ","element":"span"},{"style":{"height":27.2},"width":706,"height":68,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/18-4.png","element":"img","alt":" t ≥ t_0 = (24/(1−k_η−k_ϵ) · ln(12/(1−k_η−k_ϵ)))","inline":true},{"style":{"height":17.41},"width":193,"height":43.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/18-5.png","element":"img","alt":"(96 ln(48))^4 ","inline":true,"padRight":true},{"text":"since the theorem holds trivially for constant ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Part I.","element":"span"}],[{"style":{"width":"100%"},"width":1744,"height":638,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/18-6.png","element":"img"}],[{"id":"id-115","style":{"width":"100%"},"width":1728,"height":712,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/18-7.png","element":"img"}],[{"text":"Rearranging the above inequality, we get","element":"span"}],[{"style":{"width":"100%"},"width":1672,"height":120,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/18-8.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":17.55},"width":632.48,"height":43.88,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/19-0.png","element":"img","alt":" v_t ≜ KL(x⋆_{t+1}, x_{t+1}) − KL(x⋆_t, x_{t+1})","inline":true},{"text":". 
Similarly, since the algorithm for the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y","element":"span"},{"text":"-player is ","element":"span"},{"text":"symmetric, we have the following:","element":"span"}],[{"text":"KL","element":"span"},{"style":{"height":17.54},"width":197.04,"height":43.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/19-1.png","element":"img","alt":"(y⋆t+1, yt+1)","inline":true},{"style":{"height":18.8},"width":1635.48,"height":47,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/19-2.png","element":"img","alt":"≤ (1 − ηtϵt)KL(y⋆t , yt) + ηt(ft(xt, yt) − ft(xt, y⋆t )) + 10η2t A ln2 (At) + 2η2t Aλt + ηtξt + ηtζt + vt","inline":true}],[{"text":"where","element":"span"}],[{"id":"id-105","style":{"width":"58%"},"width":926,"height":434,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/19-3.png","element":"img"}],[{"text":"Adding the two inequalities above up and using the fact that ","element":"span"},{"style":{"height":16.19},"width":560.48,"height":40.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/19-4.png","element":"img","alt":" ft(x⋆t , yt) − ft(xt, y⋆t ) ≤ 0, we get","inline":true}],[{"style":{"width":"96%"},"width":1536,"height":56,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/19-5.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16.8},"width":537.52,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/19-6.png","element":"img","alt":" □ ≜ □ + □ for □ = λt, ξt, ζt, vt.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Part II. ","element":"span"},{"text":"Expanding the recursion in ","element":"span"},{"href":"#id-105","text":"Eq. 
(6)","element":"a"},{"text":", and using the fact that ","element":"span"},{"style":{"height":14.4},"width":335.52,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/19-7.png","element":"img","alt":" 1 − η1ϵ1 = 0, we get","inline":true}],[{"style":{"width":"98%"},"width":1562,"height":184,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/19-8.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":22.19},"width":391,"height":55.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/19-9.png","element":"img","alt":" w_t^i ≜ ∏_{j=i+1}^{t} (1 − η_j ϵ_j)","inline":true},{"text":". We can bound each term as follows.","element":"span"}],[{"text":"By ","element":"span"},{"href":"#id-106","text":"Lemma 1 ","element":"a"},{"text":"and the fact that ","element":"span"},{"style":{"height":13.6},"width":248.96,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/19-10.png","element":"img","alt":" t ≥ t0, we have","inline":true}],[{"style":{"width":"98%"},"width":1558,"height":298,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/19-11.png","element":"img"}],[{"text":"where in ","element":"span"},{"href":"#id-101","text":"(","element":"a"},{"style":{"fontStyle":"italic"},"text":"a","element":"span"},{"text":") ","element":"span"},{"text":"we use ","element":"span"},{"href":"#id-107","text":"Lemma 2 ","element":"a"},{"text":"with the fact that ","element":"span"},{"style":{"height":12.8},"width":107,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/19-12.png","element":"img","alt":" t ≥ t0.","inline":true}],[{"text":"Using ","element":"span"},{"href":"#id-101","text":"Lemma 6 ","element":"a"},{"text":"with ","element":"span"},{"style":{"height":17.41},"width":155,"height":43.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/19-13.png","element":"img","alt":" ci = witηi","inline":true},{"text":", we have with probability 
at least ","element":"span"},{"style":{"height":20},"width":110.48,"height":50,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/19-14.png","element":"img","alt":" 1 − δt2 ,","inline":true}],[{"style":{"width":"100%"},"width":1686,"height":518,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/19-15.png","element":"img"}],[{"text":"Using ","element":"span"},{"href":"#id-102","text":"Lemma 7 ","element":"a"},{"text":"with ","element":"span"},{"style":{"height":16.93},"width":157.32,"height":42.32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/19-16.png","element":"img","alt":" ci = witηi","inline":true},{"text":", we get with probability at least ","element":"span"},{"style":{"height":20.18},"width":115.8,"height":50.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/19-17.png","element":"img","alt":" 1 − δt2 ,","inline":true}],[{"style":{"width":"85%"},"width":1354,"height":102,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/19-18.png","element":"img"}],[{"text":"where ","element":"span"},{"href":"#id-107","style":{"height":16},"width":481,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/20-0.png","element":"img","alt":" (a) is by Lemma 2 and t ≥ t0.","inline":true,"padRight":true},{"text":"By ","element":"span"},{"href":"#id-67","text":"Lemma 13 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-106","text":"Lemma 1","element":"a"},{"text":",","element":"span"}],[{"style":{"width":"84%"},"width":1332,"height":126,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/20-1.png","element":"img"}],[{"text":"Combining all terms above, we get that with probability at least ","element":"span"},{"style":{"height":20},"width":114.52,"height":50,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/20-2.png","element":"img","alt":" 1 − 3δt2 
,","inline":true}],[{"id":"id-108","style":{"width":"71%"},"width":1136,"height":78,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/20-3.png","element":"img"}],[{"text":"Using a union bound over ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", we see that ","element":"span"},{"href":"#id-108","text":"Eq. (7) ","element":"a"},{"text":"holds for all ","element":"span"},{"style":{"height":12.8},"width":109,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/20-4.png","element":"img","alt":" t ≥ t0","inline":true,"padRight":true},{"text":"with probability at least ","element":"span"},{"style":{"height":16},"width":156,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/20-5.png","element":"img","alt":"1 − O(δ).","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Part III. ","element":"span"},{"text":"Using ","element":"span"},{"href":"#id-64","text":"Lemma 9 ","element":"a"},{"text":"with ","element":"span"},{"style":{"height":16},"width":126.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/20-6.png","element":"img","alt":" ft(x, y)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16.8},"width":100,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/20-7.png","element":"img","alt":" x⊤Gy","inline":true,"padRight":true},{"text":"with domains ","element":"span"},{"style":{"height":13.2},"width":132.24,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/20-8.png","element":"img","alt":" Ωt × Ωt","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14},"width":162,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/20-9.png","element":"img","alt":" ∆A × ∆A","inline":true},{"text":", we get that for any 
","element":"span"},{"style":{"height":16},"width":312.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/20-10.png","element":"img","alt":" (x, y) ∈ ∆A × ∆A,","inline":true}],[{"style":{"width":"79%"},"width":1264,"height":102,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/20-11.png","element":"img"}],[{"text":"Further using ","element":"span"},{"href":"#id-108","text":"Eq. (7)","element":"a"},{"text":", we get that with probability at least ","element":"span"},{"style":{"height":16},"width":692.12,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/20-12.png","element":"img","alt":" 1−3δ, for any t and any (x, y) ∈ ∆A×∆A,","inline":true}],[{"style":{"width":"85%"},"width":1362,"height":166,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/20-13.png","element":"img"}],[{"text":"where ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"a","element":"span"},{"text":") ","element":"span"},{"text":"is by Pinsker’s inequality. This completes the proof of ","element":"span"},{"href":"#id-104","text":"Theorem 1","element":"a"},{"text":".","element":"span"}],[{"id":"id-67","style":{"fontWeight":"bold"},"text":"Lemma 13. ","element":"span"},{"style":{"height":20.21},"width":378.16,"height":50.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/20-14.png","element":"img","alt":" |v_t| = O(ln^2(At) t^{−1}).","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof.","element":"span"}],[{"style":{"width":"70%"},"width":1111,"height":264,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/20-15.png","element":"img"}],[{"id":"id-118","style":{"fontWeight":"bold"},"text":"Lemma 14. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":14.4},"width":333.12,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/20-16.png","element":"img","alt":" x, x1, x2 ∈ Ωt. Then","inline":true}],[{"style":{"width":"52%"},"width":832,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/20-17.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof.","element":"span"}],[{"style":{"width":"61%"},"width":976,"height":408,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/20-18.png","element":"img"}],[{"text":"Similarly, KL","element":"span"},{"style":{"height":16},"width":755.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/20-19.png","element":"img","alt":"(x2, x) − KL(x1, x) ≤ O (ln(At)∥x1 − x2∥1).","inline":true}],[{"id":"id-119","style":{"fontWeight":"bold"},"text":"Lemma 15. ","element":"span"},{"style":{"height":28.99},"width":456.48,"height":72.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/20-20.png","element":"img","alt":" ∥z⋆_t − z⋆_{t+1}∥_1 = O(ln(At)/t).","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Notice that the feasible sets for the two time steps are different. 
Let ","element":"span"},{"style":{"height":17.55},"width":200.28,"height":43.88,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/21-0.png","element":"img","alt":" (x′t+1, y′t+1)","inline":true,"padRight":true},{"text":"be such that ","element":"span"},{"style":{"height":18.94},"width":1591.2,"height":47.36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/21-1.png","element":"img","alt":"x′t+1 = pt+1A 1 + (1 − pt+1)x⋆t+1 and y′t+1 = pt+1A 1 + (1 − pt+1) y⋆t+1 where pt+1 = min{1, 2t−3}.","inline":true,"padRight":true},{"text":"Since ","element":"span"},{"style":{"height":17.55},"width":446.52,"height":43.88,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/21-2.png","element":"img","alt":" (x∗t+1, y∗t+1) ∈ Ωt+1×Ωt+1","inline":true},{"text":", we have that for any ","element":"span"},{"style":{"height":22.18},"width":712.6,"height":55.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/21-3.png","element":"img","alt":" a, x′t+1,a ≥ pt+1A +(1−pt+1) 1A(t+1)2 ≥ 1At2 .","inline":true,"padRight":true},{"text":"Hence, ","element":"span"},{"style":{"height":17.55},"width":392.8,"height":43.88,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/21-4.png","element":"img","alt":" (x′t+1, y′t+1) ∈ Ωt × Ωt.","inline":true}],[{"text":"Because ","element":"span"},{"style":{"height":17.41},"width":192.48,"height":43.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/21-5.png","element":"img","alt":" (x⋆t+1, y⋆t+1)","inline":true,"padRight":true},{"text":"is the equilibrium of ","element":"span"},{"style":{"height":14.8},"width":72.04,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/21-6.png","element":"img","alt":" ft+1","inline":true},{"text":"in ","element":"span"},{"style":{"height":14.8},"width":218.12,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/21-7.png","element":"img","alt":" Ωt+1 × Ωt+1","inline":true},{"text":", we 
have that for any ","element":"span"},{"style":{"height":16},"width":129.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/21-8.png","element":"img","alt":" (x, y) ∈","inline":true},{"style":{"height":14.8},"width":225.04,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/21-9.png","element":"img","alt":"Ωt+1 × Ωt+1,","inline":true}],[{"style":{"width":"87%"},"width":1392,"height":360,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/21-10.png","element":"img"}],[{"text":"where the first inequality is due to the following calculation:","element":"span"}],[{"style":{"height":17.39},"width":1440,"height":43.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/21-11.png","element":"img","alt":"ϵt+1KL(x, x⋆t+1) = ft+1(x, y⋆t+1) − ft+1(x⋆t+1, y⋆t+1) − ∇xft+1(x⋆t+1, y⋆t+1)⊤(x − x⋆t+1)","inline":true},{"style":{"height":17.54},"width":584.24,"height":43.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/21-12.png","element":"img","alt":"≤ ft+1(x, y⋆t+1) − ft+1(x⋆t+1, y⋆t+1)","inline":true}],[{"text":"where we use ","element":"span"},{"style":{"height":17.57},"width":602.52,"height":43.92,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/21-13.png","element":"img","alt":" ∇xft+1(x⋆t+1, y⋆t+1)⊤(x − x⋆t+1) ≥ 0","inline":true,"padRight":true},{"text":"since ","element":"span"},{"style":{"height":16.59},"width":72,"height":41.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/21-14.png","element":"img","alt":" x⋆t+1","inline":true,"padRight":true},{"text":"is the minimizer of ","element":"span"},{"style":{"height":17.57},"width":208.12,"height":43.92,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/21-15.png","element":"img","alt":" ft+1(·, y⋆t+1)","inline":true,"padRight":true},{"text":"in 
","element":"span"},{"style":{"height":14.8},"width":81.28,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/21-16.png","element":"img","alt":"Ωt+1","inline":true},{"text":". In particular, we have","element":"span"}],[{"id":"id-109","style":{"width":"78%"},"width":1252,"height":86,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/21-17.png","element":"img"}],[{"text":"Similarly, because ","element":"span"},{"style":{"height":16},"width":129.6,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/21-18.png","element":"img","alt":" (x⋆t , y⋆t )","inline":true,"padRight":true},{"text":"is the equilibrium of ","element":"span"},{"style":{"height":14.4},"width":369.28,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/21-19.png","element":"img","alt":" ft in Ωt × Ωt, we have","inline":true}],[{"style":{"width":"49%"},"width":790,"height":86,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/21-20.png","element":"img"}],[{"text":"which implies","element":"span"}],[{"id":"id-110","style":{"width":"100%"},"width":1710,"height":448,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/21-21.png","element":"img"}],[{"text":"In the first inequality, we use the fact that ","element":"span"},{"style":{"height":16},"width":126.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/21-22.png","element":"img","alt":" ft(x, y)","inline":true,"padRight":true},{"text":"is convex in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"and concave in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y ","element":"span"},{"text":"and Hölder’s inequality. 
In the second inequality, we use the triangle inequality, ","element":"span"},{"style":{"height":16.77},"width":554.32,"height":41.92,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/21-23.png","element":"img","alt":" ∥∇xft(x, y)∥∞ ≤ maxa{(Gy)a +","inline":true},{"style":{"height":17.54},"width":1357.44,"height":43.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/21-24.png","element":"img","alt":"ln(xa)} ≤ O(ln(At)), and ∥∇yft(x, y)∥∞ ≤ maxb{(G⊤x)b+ln(yb)} ≤ O(ln(At))","inline":true},{"text":". In the second and third inequality, we use ","element":"span"},{"style":{"height":19.95},"width":406.88,"height":49.88,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/21-25.png","element":"img","alt":" ∥z′t+1 − z⋆t+1∥1 = O( 1t3 )","inline":true,"padRight":true},{"text":"by the definition of ","element":"span"},{"style":{"height":17.6},"width":79,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/21-26.png","element":"img","alt":" z′t+1.","inline":true}],[{"text":"Combining ","element":"span"},{"href":"#id-109","text":"Eq. (8) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-110","text":"Eq. 
(9)","element":"a"},{"text":", we get","element":"span"}],[{"style":{"width":"92%"},"width":1468,"height":426,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/21-27.png","element":"img"}],[{"style":{"width":"78%"},"width":1242,"height":104,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/22-0.png","element":"img"}],[{"text":"Solving the inequality, we get","element":"span"}],[{"id":"id-120","style":{"width":"96%"},"width":1530,"height":510,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/22-1.png","element":"img"}]]},{"heading":"C Improved Last-Iterate Convergence under Expectation","paragraphs":[[{"text":"In this section, we analyze ","element":"span"},{"href":"#id-111","text":"Algorithm 4","element":"a"},{"text":", which is almost identical to ","element":"span"},{"href":"#id-62","text":"Algorithm 1 ","element":"a"},{"text":"but does not involve the parameter ","element":"span"},{"style":{"height":14.4},"width":32.52,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/22-2.png","element":"img","alt":" βt","inline":true},{"text":". The choices of stepsize ","element":"span"},{"style":{"height":10.8},"width":29.48,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/22-3.png","element":"img","alt":" ηt","inline":true,"padRight":true},{"text":"and amount of regularization ","element":"span"},{"style":{"height":9.6},"width":25,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/22-4.png","element":"img","alt":" ϵt","inline":true,"padRight":true},{"text":"are also tuned differently to obtain the best convergence rate.","element":"span"}],[{"id":"id-111","style":{"width":"100%"},"width":1592,"height":498,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/22-5.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Theorem 4. 
","element":"span"},{"href":"#id-111","style":{"fontStyle":"italic"},"text":"Algorithm 4 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"guarantees ","element":"span"},{"style":{"height":28.8},"width":993.6,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/22-6.png","element":"img","alt":" E[max_{x,y∈∆A} (x_t^⊤ G y − x^⊤ G y_t)] = O(√A ln^{3/2}(At) t^{−1/6})","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for any ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"With the same analysis as in Part I of the proof of ","element":"span"},{"href":"#id-104","text":"Theorem 1","element":"a"},{"text":", we have","element":"span"}],[{"style":{"width":"83%"},"width":1316,"height":158,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/22-7.png","element":"img"}],[{"text":"where","element":"span"}],[{"style":{"width":"89%"},"width":1416,"height":234,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/22-8.png","element":"img"}],[{"text":"Unlike in ","element":"span"},{"href":"#id-104","text":"Theorem 1","element":"a"},{"text":", here these three terms all have zero mean. Thus, following the same arguments that obtain ","element":"span"},{"href":"#id-105","text":"Eq. 
(6) ","element":"a"},{"text":"and taking expectations, we get","element":"span"}],[{"style":{"width":"72%"},"width":1146,"height":58,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/22-9.png","element":"img"}],[{"style":{"width":"76%"},"width":1218,"height":108,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/23-0.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":17.55},"width":746,"height":43.88,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/23-1.png","element":"img","alt":" vt = KL(z⋆t+1, zt+1) − KL(z⋆t , zt+1) and Et[·]","inline":true,"padRight":true},{"text":"is the expectation conditioned on history up to ","element":"span"},{"text":"round ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":". Then following the same arguments as in Part II of the proof of ","element":"span"},{"href":"#id-104","text":"Theorem 1","element":"a"},{"text":", we get","element":"span"}],[{"style":{"width":"96%"},"width":1526,"height":272,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/23-2.png","element":"img"}],[{"text":"Finally, following the arguments in Part III, we get","element":"span"}],[{"style":{"width":"96%"},"width":1536,"height":102,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/23-3.png","element":"img"}]]},{"heading":"D Last-Iterate Convergence Rate of Algorithm 2","paragraphs":[[{"id":"id-89","style":{"fontWeight":"bold"},"text":"D.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"On the Assumption of Irreducible Markov Game","element":"span"}],[{"id":"id-113","style":{"fontWeight":"bold"},"text":"Proposition 1. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"If ","element":"span"},{"href":"#id-61","style":{"fontStyle":"italic"},"text":"Assumption 1 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"holds, then for any ","element":"span"},{"style":{"height":16},"width":309.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/23-4.png","element":"img","alt":" L′ = 2L log2(S/δ)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"consecutive steps, under any (non-stationary) policies of the two players, with probability at least ","element":"span"},{"style":{"height":11.6},"width":84,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/23-5.png","element":"img","alt":" 1 − δ","inline":true},{"style":{"fontStyle":"italic"},"text":", every state is visited at least once.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"We first show that for any pair of states ","element":"span"},{"style":{"height":15.39},"width":82.48,"height":38.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/23-6.png","element":"img","alt":" s′, s′′","inline":true},{"text":", under any non-stationary policy pair, the expected time to reach ","element":"span"},{"style":{"height":12.4},"width":35,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/23-7.png","element":"img","alt":" s′′","inline":true},{"text":"from ","element":"span"},{"style":{"height":12.4},"width":25.48,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/23-8.png","element":"img","alt":" s′","inline":true},{"text":"is upper bounded by ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":". 
For a particular pair of states ","element":"span"},{"style":{"height":16},"width":110.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/23-9.png","element":"img","alt":" (s′, s′′)","inline":true},{"text":", consider the following modified MDP: let the reward be ","element":"span"},{"style":{"height":16},"width":316.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/23-10.png","element":"img","alt":" r(s, a) = 1[s ̸= s′′]","inline":true},{"text":", and the transition be the same as the original MDP on all ","element":"span"},{"style":{"height":15.2},"width":116,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/23-11.png","element":"img","alt":" s ̸= s′′","inline":true},{"text":", while ","element":"span"},{"style":{"height":16},"width":265.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/23-12.png","element":"img","alt":" P(s′′|s′′, a) = 1","inline":true,"padRight":true},{"text":"(i.e., making ","element":"span"},{"style":{"height":12.4},"width":34.52,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/23-13.png","element":"img","alt":" s′′","inline":true},{"text":"an absorbing state). Also, let ","element":"span"},{"style":{"height":12.4},"width":25.52,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/23-14.png","element":"img","alt":" s′","inline":true},{"text":"be the initial state. By construction, the expected total reward of this MDP is the travelling time from ","element":"span"},{"style":{"height":12.4},"width":115,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/23-15.png","element":"img","alt":" s′ to s′′","inline":true},{"text":". By Theorem 7.1.9 of [","element":"span"},{"href":"#id-112","referenceIndex":44,"text":"Put14","element":"a"},{"text":"], there exists a stationary optimal policy in this MDP. 
The optimal expected total value is then upper bounded by ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"by ","element":"span"},{"href":"#id-61","text":"Assumption 1","element":"a"},{"text":". Therefore, for any (possibly sub-optimal) non-stationary policies, the travelling time from ","element":"span"},{"style":{"height":12.4},"width":115.52,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/23-16.png","element":"img","alt":" s′ to s′′ ","inline":true,"padRight":true},{"text":"must also be upper bounded by ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":".","element":"span"}],[{"text":"Divide ","element":"span"},{"style":{"height":16.19},"width":368.52,"height":40.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/23-17.png","element":"img","alt":" L′ steps into log2(S/δ)","inline":true,"padRight":true},{"text":"intervals each of length ","element":"span"},{"text":"2","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":", and consider a particular ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":". Conditioned on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"not visited in all intervals ","element":"span"},{"style":{"height":14},"width":224,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/23-18.png","element":"img","alt":" 1, 2, . . . 
, i − 1","inline":true},{"text":", the probability of still not visiting ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"in interval ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"is at most ","element":"span"},{"style":{"height":19.2},"width":16.52,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/23-19.png","element":"img","alt":"1/2","inline":true},{"text":"(because for any ","element":"span"},{"style":{"height":21.78},"width":686.84,"height":54.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/23-20.png","element":"img","alt":" s′, Pr[Ts′→s > 2L] ≤ E[Ts′→s]/(2L) ≤ L/(2L) = 1/2","inline":true},{"text":", where ","element":"span"},{"style":{"height":13.39},"width":94,"height":33.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/23-21.png","element":"img","alt":" Ts′→s","inline":true,"padRight":true},{"text":"denotes ","element":"span"},{"text":"the travelling time from ","element":"span"},{"style":{"height":12.19},"width":95.48,"height":30.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/23-22.png","element":"img","alt":" s′ to s","inline":true},{"text":"). Therefore, the probability of not visiting ","element":"span"},{"style":{"height":15.79},"width":422.48,"height":39.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/23-23.png","element":"img","alt":" s in all log2(S/δ) intervals","inline":true,"padRight":true},{"text":"is upper bounded by ","element":"span"},{"style":{"height":20},"width":264,"height":50,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/23-24.png","element":"img","alt":" 2^(−log2(S/δ)) = δ/S","inline":true,"padRight":true},{"text":". 
Using a union bound, we conclude that with probability at ","element":"span"},{"text":"least ","element":"span"},{"style":{"height":11.6},"width":84,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/23-25.png","element":"img","alt":" 1 − δ","inline":true},{"text":", every state is visited at least once within ","element":"span"},{"style":{"height":15.81},"width":134,"height":39.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/23-26.png","element":"img","alt":" L′ steps.","inline":true}],[{"id":"id-114","style":{"fontWeight":"bold"},"text":"Corollary 1. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"If ","element":"span"},{"href":"#id-61","style":{"fontStyle":"italic"},"text":"Assumption 1 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"holds, then with probability ","element":"span"},{"style":{"height":11.6},"width":84,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/23-27.png","element":"img","alt":" 1 − δ","inline":true},{"style":{"fontStyle":"italic"},"text":", for any ","element":"span"},{"style":{"height":12.99},"width":84,"height":32.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/23-28.png","element":"img","alt":" t ≥ 1","inline":true},{"style":{"fontStyle":"italic"},"text":", players visit every state at least once in every ","element":"span"},{"style":{"height":16},"width":193.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/23-29.png","element":"img","alt":" 6L ln(St/δ)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"consecutive iterations before time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"First, we fix time ","element":"span"},{"style":{"height":12.8},"width":88.52,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/23-30.png","element":"img","alt":" t ≥ 1","inline":true,"padRight":true},{"text":"and define ","element":"span"},{"style":{"height":17.2},"width":295,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/23-31.png","element":"img","alt":" t′ = 3L ln(St3/δ)","inline":true},{"text":". Let us consider the following time intervals: ","element":"span"},{"style":{"height":16},"width":452.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/23-32.png","element":"img","alt":" [1, t′], [t′, 2t′], . . . , [t − t′, t]","inline":true},{"text":". Using ","element":"span"},{"href":"#id-113","text":"Proposition 1","element":"a"},{"text":", we know that for each interval, with probability at least ","element":"span"},{"style":{"height":20},"width":100,"height":50,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/23-33.png","element":"img","alt":" 1 − δt3","inline":true,"padRight":true},{"text":", players visit every state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":". Using a union bound over all intervals, we ","element":"span"},{"text":"have with probability at least ","element":"span"},{"style":{"height":20},"width":99.48,"height":50,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/23-34.png","element":"img","alt":" 1 − δt2","inline":true,"padRight":true},{"text":", in every interval, players visit every state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":". 
Since every ","element":"span"},{"style":{"height":12.4},"width":41.52,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/23-35.png","element":"img","alt":" 2t′","inline":true,"padRight":true},{"text":"consecutive iterations must contain an interval of length ","element":"span"},{"style":{"height":12.21},"width":34.52,"height":30.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/23-36.png","element":"img","alt":" L′","inline":true},{"text":", we have with probability at least ","element":"span"},{"style":{"height":20},"width":110.52,"height":50,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/23-37.png","element":"img","alt":" 1 − δt2 ,","inline":true,"padRight":true},{"text":"players visit every state ","element":"span"},{"style":{"height":15.6},"width":206.52,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/23-38.png","element":"img","alt":" s in every 2t′ ","inline":true,"padRight":true},{"text":"consecutive iterations until time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":". 
Applying a union bound over all ","element":"span"},{"style":{"height":12.8},"width":84,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/23-39.png","element":"img","alt":" t ≥ 1","inline":true,"padRight":true},{"text":"completes the proof.","element":"span"}],[{"text":"According to ","element":"span"},{"href":"#id-114","text":"Corollary 1","element":"a"},{"text":", in the remainder of this section, we assume that for any ","element":"span"},{"style":{"height":12.99},"width":86.48,"height":32.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/23-40.png","element":"img","alt":" t ≥ 1","inline":true},{"text":", players visit every state at least once in every ","element":"span"},{"style":{"height":16},"width":193.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/23-41.png","element":"img","alt":" 6L ln(St/δ)","inline":true,"padRight":true},{"text":"iterations until time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":".","element":"span"}],[{"id":"id-90","style":{"fontWeight":"bold"},"text":"D.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Part I. Basic Iteration Properties","element":"span"}],[{"id":"id-116","style":{"fontWeight":"bold"},"text":"Lemma 16. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"For any ","element":"span"},{"style":{"height":14.8},"width":188.56,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/24-0.png","element":"img","alt":" xs ∈ Ωτ+1,","inline":true}],[{"style":{"width":"94%"},"width":1498,"height":166,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/24-1.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"(see the proof for the definitions of ","element":"span"},{"style":{"height":19.41},"width":212.48,"height":48.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/24-2.png","element":"img","alt":" λsτ, ξsτ, ζsτ(·))","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Consider a fixed ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"and a fixed ","element":"span"},{"style":{"height":16},"width":311.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/24-3.png","element":"img","alt":" τ, and let t = tτ(s)","inline":true,"padRight":true},{"text":"be the time when the players visit ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"at the ","element":"span"},{"style":{"height":11.2},"width":153.48,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/24-4.png","element":"img","alt":"τ-th time.","inline":true}],[{"style":{"width":"100%"},"width":1674,"height":944,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/24-5.png","element":"img"}],[{"text":"where we omit some calculation steps due to the similarity to ","element":"span"},{"href":"#id-115","text":"Eq. (5)","element":"a"},{"text":".","element":"span"}],[{"id":"id-91","style":{"fontWeight":"bold"},"text":"D.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Part II. 
Policy Convergence to the Nash of Regularized Game","element":"span"}],[{"id":"id-70","style":{"fontWeight":"bold"},"text":"Lemma 17. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"With probability at least ","element":"span"},{"style":{"height":16},"width":145,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/24-6.png","element":"img","alt":" 1 − O(δ)","inline":true},{"style":{"fontStyle":"italic"},"text":", for all ","element":"span"},{"style":{"height":13.6},"width":196.48,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/24-7.png","element":"img","alt":" s ∈ S, t ≥ 1","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":12.8},"width":91,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/24-8.png","element":"img","alt":" τ ≥ 1","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"such that ","element":"span"},{"style":{"height":16},"width":150,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/24-9.png","element":"img","alt":" tτ(s) ≤ t","inline":true},{"style":{"fontStyle":"italic"},"text":", we have","element":"span"}],[{"style":{"width":"43%"},"width":696,"height":58,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/24-10.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"style":{"height":16.78},"width":714.76,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/24-11.png","element":"img","alt":"♯ = min{kβ − kϵ, kη − kβ, kα − kη − 2kϵ}.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"In this proof, we abbreviate ","element":"span"},{"href":"#id-116","style":{"height":19.39},"width":816.52,"height":48.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/24-12.png","element":"img","alt":" ζsi(ˆxsi⋆) as ζsi. By Lemma 16, for all i ≤ τ we have","inline":true}],[{"style":{"width":"73%"},"width":1158,"height":166,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/24-13.png","element":"img"}],[{"text":"Similarly, for all ","element":"span"},{"style":{"height":13.41},"width":237.52,"height":33.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/24-14.png","element":"img","alt":" i ≤ τ, we have","inline":true}],[{"style":{"width":"72%"},"width":1144,"height":166,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/24-15.png","element":"img"}],[{"text":"Adding the two inequalities up, and using ","element":"span"},{"style":{"height":16.16},"width":487.48,"height":40.4,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/25-0.png","element":"img","alt":" f si (ˆxsi⋆, ˆysi ) − f si (ˆxsi, ˆysi⋆) ≤ 0","inline":true,"padRight":true},{"text":"because ","element":"span"},{"style":{"height":16.21},"width":143,"height":40.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/25-1.png","element":"img","alt":" (ˆxsi⋆, ˆysi⋆)","inline":true,"padRight":true},{"text":"is the ","element":"span"},{"text":"equilibrium of ","element":"span"},{"style":{"height":15.81},"width":35.52,"height":39.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/25-2.png","element":"img","alt":" f si ","inline":true,"padRight":true},{"text":", we get for ","element":"span"},{"style":{"height":12.8},"width":87.88,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/25-3.png","element":"img","alt":" i ≤ 
τ","inline":true}],[{"id":"id-117","style":{"width":"99%"},"width":1572,"height":150,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/25-4.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":20.56},"width":1187.2,"height":51.4,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/25-5.png","element":"img","alt":" vsi = KL(ˆzsi+1⋆, ˆzsi+1) − KL(ˆzsi⋆, ˆzsi+1) and □s = □s + □s for □ = ξi, ζi.","inline":true}],[{"text":"Expanding ","element":"span"},{"href":"#id-117","text":"Eq. (11)","element":"a"},{"text":", we get","element":"span"}],[{"style":{"width":"100%"},"width":1770,"height":184,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/25-6.png","element":"img"}],[{"text":"These five terms correspond to those in ","element":"span"},{"href":"#id-105","text":"Eq. (6)","element":"a"},{"text":", and can be handled in the same way. For ","element":"span"},{"style":{"fontWeight":"bold"},"text":"term","element":"span"},{"style":{"height":7.39},"width":10.48,"height":18.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/25-7.png","element":"img","alt":"1","inline":true,"padRight":true},{"text":"to ","element":"span"},{"style":{"fontWeight":"bold"},"text":"term","element":"span"},{"style":{"height":7.6},"width":14.52,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/25-8.png","element":"img","alt":"4","inline":true},{"text":", we follow exactly the same arguments there, and bound their sum as with probability at least ","element":"span"},{"style":{"height":20},"width":214,"height":50,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/25-9.png","element":"img","alt":"1 − O� δSτ 2�,","inline":true}],[{"style":{"width":"99%"},"width":1576,"height":128,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/25-10.png","element":"img"}],[{"text":"To bound 
","element":"span"},{"style":{"fontWeight":"bold"},"text":"term","element":"span"},{"href":"#id-118","style":{"height":14.8},"width":658.72,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/25-11.png","element":"img","alt":"5, by Lemma 14 and Lemma 18, we have","inline":true}],[{"style":{"width":"71%"},"width":1140,"height":58,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/25-12.png","element":"img"}],[{"text":"Therefore, by ","element":"span"},{"href":"#id-106","text":"Lemma 1","element":"a"},{"text":",","element":"span"}],[{"style":{"width":"59%"},"width":936,"height":116,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/25-13.png","element":"img"}],[{"text":"Combining all the terms with union bound over ","element":"span"},{"style":{"height":13.41},"width":262,"height":33.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/25-14.png","element":"img","alt":" s ∈ S and τ ≥ 1","inline":true,"padRight":true},{"text":"finishes the proof.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma 18. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"For any ","element":"span"},{"style":{"height":20.29},"width":1288.96,"height":50.72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/25-15.png","element":"img","alt":" sand τ ≥ 0 such that tτ(s) ≤ t, ∥ˆzsτ⋆−ˆzsτ+1⋆∥1 = O�ln3(SAt/δ)L2 · τ −kα+kϵ�.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"The bound holds trivially when ","element":"span"},{"style":{"height":12.99},"width":120,"height":32.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/25-16.png","element":"img","alt":" τ ≤ 2L","inline":true},{"text":". Below we focus on the case with ","element":"span"},{"style":{"height":11.81},"width":120,"height":29.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/25-17.png","element":"img","alt":" τ > 2L","inline":true},{"text":". 
By exactly the same arguments as in the proof of ","element":"span"},{"href":"#id-119","text":"Lemma 15","element":"a"},{"text":", we have an inequality similar to ","element":"span"},{"href":"#id-120","text":"Eq. (10)","element":"a"},{"text":":","element":"span"}],[{"id":"id-121","style":{"width":"88%"},"width":1410,"height":432,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/25-18.png","element":"img"}],[{"text":"Since ","element":"span"},{"style":{"height":16},"width":169.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/25-19.png","element":"img","alt":" tτ(s) ≤ t","inline":true,"padRight":true},{"text":"and we assume that every state is visited at least once in ","element":"span"},{"style":{"height":16},"width":211.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/25-20.png","element":"img","alt":" 6L log(St/δ)","inline":true,"padRight":true},{"text":"steps (","element":"span"},{"href":"#id-114","text":"Corollary 1","element":"a"},{"text":"), we have that for any state ","element":"span"},{"style":{"height":24.4},"width":446.48,"height":61,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/25-21.png","element":"img","alt":" s′, n^{s′}_{tτ(s)} ≥ tτ(s)/(6L log(St/δ)) − 1","inline":true},{"text":". 
Thus, whenever ","element":"span"},{"style":{"height":19.2},"width":52.52,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/25-22.png","element":"img","alt":" V^{s′}_t","inline":true},{"text":"updates between ","element":"span"},{"style":{"height":16},"width":80,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/25-23.png","element":"img","alt":" tτ(s)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":120,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/25-24.png","element":"img","alt":" t_{τ+1}(s)","inline":true},{"text":", the change is upper bounded by","element":"span"},{"style":{"height":24.4},"width":409.48,"height":61,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/25-25.png","element":"img","alt":"1/(1−γ) · (tτ(s)/(6L log(St/δ)) − 1)^{−kα}","inline":true},{"text":". Besides, ","element":"span"},{"text":"between ","element":"span"},{"style":{"height":19.01},"width":360,"height":47.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/25-26.png","element":"img","alt":" tτ(s) and t_{τ+1}(s), V^{s′}_t","inline":true},{"text":"can change at most ","element":"span"},{"style":{"height":16},"width":211.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/25-27.png","element":"img","alt":" 6L log(St/δ)","inline":true,"padRight":true},{"text":"times. Therefore,","element":"span"}],[{"id":"id-122","style":{"width":"100%"},"width":1760,"height":316,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/25-28.png","element":"img"}],[{"text":"where the last inequality holds since ","element":"span"},{"style":{"height":13.6},"width":114.48,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/25-29.png","element":"img","alt":" kα < 1","inline":true},{"text":". Combining ","element":"span"},{"href":"#id-121","text":"Eq. 
(12) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-122","text":"Eq. (13) ","element":"a"},{"text":"with the fact that ","element":"span"},{"style":{"height":21.39},"width":230.48,"height":53.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/25-30.png","element":"img","alt":"ϵτ = 11−γ τ −kϵ","inline":true,"padRight":true},{"text":"finishes the proof.","element":"span"}],[{"id":"id-92","style":{"fontWeight":"bold"},"text":"D.4 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Part III. Value Convergence","element":"span"}],[{"text":"For positive integers ","element":"span"},{"style":{"height":12.8},"width":86.52,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/26-0.png","element":"img","alt":" τ ≥ i","inline":true},{"text":", we define ","element":"span"},{"style":{"height":20.4},"width":423,"height":51,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/26-1.png","element":"img","alt":" αiτ = αi�τj=i+1(1 − αj).","inline":true}],[{"id":"id-71","style":{"fontWeight":"bold"},"text":"Lemma 19 ","element":"span"},{"text":"(weighted regret bound)","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"With probability ","element":"span"},{"style":{"height":16},"width":149,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/26-2.png","element":"img","alt":" 1 − O(δ)","inline":true},{"style":{"fontStyle":"italic"},"text":", for any ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"style":{"fontStyle":"italic"},"text":", any visitation count ","element":"span"},{"style":{"height":15.01},"width":450,"height":37.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/26-3.png","element":"img","alt":"τ ≥ τ0, and any xs ∈ Ωτ+1,","inline":true}],[{"style":{"width":"64%"},"width":1030,"height":126,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/26-4.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"style":{"height":16.78},"width":439.44,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/26-5.png","element":"img","alt":"′ = min {kη, kβ, kα − kβ}.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"We will be considering a weighted sum of the instantaneous regret bound established in ","element":"span"},{"href":"#id-116","text":"Lemma 16","element":"a"},{"text":". 
However, notice that for ","element":"span"},{"style":{"height":15.79},"width":35.52,"height":39.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/26-6.png","element":"img","alt":" f si","inline":true,"padRight":true},{"text":", ","element":"span"},{"href":"#id-116","text":"Lemma 16 ","element":"a"},{"text":"only provides a regret bound with comparators ","element":"span"},{"text":"in ","element":"span"},{"style":{"height":15.2},"width":76.52,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/26-7.png","element":"img","alt":" Ωi+1","inline":true},{"text":". Therefore, for a fixed ","element":"span"},{"style":{"height":15.2},"width":177.48,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/26-8.png","element":"img","alt":" xs ∈ Ωτ+1","inline":true},{"text":", we define the following auxiliary comparators for all ","element":"span"},{"style":{"height":13.81},"width":204,"height":34.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/26-9.png","element":"img","alt":"i = 1, . . . , τ:","inline":true}],[{"style":{"width":"24%"},"width":392,"height":78,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/26-10.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":26.21},"width":348.48,"height":65.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/26-11.png","element":"img","alt":" pi ≜ (τ+1)2−(i+1)2(i+1)2[(τ+1)2−1]","inline":true},{"text":". 
Since ","element":"span"},{"style":{"height":14.8},"width":179.52,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/26-12.png","element":"img","alt":" xs ∈ Ωτ+1","inline":true},{"text":", we have that for any ","element":"span"},{"style":{"height":22.8},"width":431.48,"height":57,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/26-13.png","element":"img","alt":" a, �xsi,a ≥ piA + 1−piA(τ+1)2 =","inline":true},{"style":{"height":22},"width":116.48,"height":55,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/26-14.png","element":"img","alt":"1A(i+1)2","inline":true,"padRight":true},{"text":", and thus ","element":"span"},{"style":{"height":15.79},"width":178,"height":39.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/26-15.png","element":"img","alt":" �xsi ∈ Ωi+1.","inline":true}],[{"text":"Applying ","element":"span"},{"href":"#id-116","text":"Lemma 16 ","element":"a"},{"text":"and considering the weighted sum of the bounds, we get","element":"span"}],[{"style":{"width":"100%"},"width":1622,"height":1340,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/26-16.png","element":"img"}],[{"text":"where ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"a","element":"span"},{"text":") ","element":"span"},{"text":"is by ","element":"span"},{"href":"#id-118","text":"Lemma 14 ","element":"a"},{"text":"and the following calculation:","element":"span"}],[{"style":{"width":"100%"},"width":1630,"height":112,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/26-17.png","element":"img"}],[{"text":"We proceed to bound other terms as follows: with probability at least ","element":"span"},{"style":{"height":20},"width":199.52,"height":50,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/27-0.png","element":"img","alt":" 1 − O� δSτ 
2�","inline":true}],[{"style":{"width":"90%"},"width":1428,"height":1186,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/27-1.png","element":"img"}],[{"text":"Combining all terms, we get","element":"span"}],[{"id":"id-123","style":{"width":"96%"},"width":1532,"height":126,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/27-2.png","element":"img"}],[{"text":"Finally,","element":"span"}],[{"id":"id-124","style":{"width":"91%"},"width":1450,"height":264,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/27-3.png","element":"img"}],[{"text":"Adding up ","element":"span"},{"href":"#id-123","text":"Eq. (14) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-124","text":"Eq. (15) ","element":"a"},{"text":"and applying union bound over all ","element":"span"},{"style":{"height":12.21},"width":189,"height":30.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/27-4.png","element":"img","alt":" s ∈ S and τ","inline":true,"padRight":true},{"text":"finish the proof.","element":"span"}],[{"id":"id-72","style":{"fontWeight":"bold"},"text":"Lemma 20. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"With probability at least ","element":"span"},{"style":{"height":16},"width":152.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/27-5.png","element":"img","alt":" 1 − O(δ)","inline":true},{"style":{"fontStyle":"italic"},"text":", for any state ","element":"span"},{"style":{"height":13.6},"width":487,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/27-6.png","element":"img","alt":" s ∈ S and time t ≥ 1, we have","inline":true}],[{"style":{"width":"84%"},"width":1336,"height":132,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/27-7.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"style":{"height":16.8},"width":500.32,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/27-8.png","element":"img","alt":"∗ = min {kη, kβ, kα − kβ, kϵ}.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Fix an ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"and a visitation count ","element":"span"},{"style":{"height":13.2},"width":130,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/27-9.png","element":"img","alt":" τ. Let ti","inline":true,"padRight":true},{"text":"be the time index when the players visit ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"for the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":"-th time. 
Then with probability at least ","element":"span"},{"style":{"height":19.79},"width":138,"height":49.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/27-10.png","element":"img","alt":" 1 − δ/(Sτ^2),","inline":true}],[{"style":{"width":"100%"},"width":1710,"height":248,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/27-11.png","element":"img"}],[{"style":{"width":"100%"},"width":1714,"height":1054,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/28-0.png","element":"img"}],[{"text":"A similar inequality can also be obtained from the perspective of the other player: with probability at least ","element":"span"},{"style":{"height":20},"width":125,"height":50,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/28-1.png","element":"img","alt":" 1 − δ/(Sτ^2)","inline":true}],[{"style":{"width":"70%"},"width":1124,"height":116,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/28-2.png","element":"img"}],[{"text":"which, combined with the previous inequality and a union bound over ","element":"span"},{"style":{"height":12.19},"width":97.48,"height":30.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/28-3.png","element":"img","alt":" s ∈ S","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.2},"width":101.24,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/28-4.png","element":"img","alt":" τ ≥ 1","inline":true},{"text":", gives the following relation: with probability at least ","element":"span"},{"style":{"height":16},"width":567,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/28-5.png","element":"img","alt":" 1 − O(δ), for any s ∈ S and τ ≥ 1,","inline":true}],[{"id":"id-125","style":{"width":"86%"},"width":1368,"height":118,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/28-6.png","element":"img"}],[{"text":"Before continuing, we first define some auxiliary quantities. 
For a fixed ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", define","element":"span"}],[{"style":{"width":"49%"},"width":778,"height":128,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/28-7.png","element":"img"}],[{"text":"for fixed ","element":"span"},{"style":{"height":16},"width":75.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/28-8.png","element":"img","alt":" (τ, t)","inline":true,"padRight":true},{"text":"we further define","element":"span"}],[{"style":{"width":"33%"},"width":524,"height":102,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/28-9.png","element":"img"}],[{"text":"Now we continue to prove a bound for ","element":"span"},{"style":{"height":16},"width":160.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/28-10.png","element":"img","alt":" |V st − V s⋆ |","inline":true},{"text":". Suppose that ","element":"span"},{"href":"#id-125","text":"Eq. (16) ","element":"a"},{"text":"can be written as","element":"span"}],[{"style":{"width":"84%"},"width":1334,"height":118,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/28-11.png","element":"img"}],[{"text":"for a universal constant ","element":"span"},{"style":{"height":13.2},"width":115,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/28-12.png","element":"img","alt":" C1 ≥ 1","inline":true},{"text":". 
Below we use induction to show that for all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":",","element":"span"}],[{"id":"id-126","style":{"width":"86%"},"width":1370,"height":112,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/28-13.png","element":"img"}],[{"text":"This is trivial for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"= 1","element":"span"},{"text":".","element":"span"}],[{"text":"Suppose that ","element":"span"},{"href":"#id-126","text":"Eq. (18) ","element":"a"},{"text":"holds for all time ","element":"span"},{"style":{"height":14},"width":193.16,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/28-14.png","element":"img","alt":" 1, . . . , t − 1","inline":true,"padRight":true},{"text":"and for all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":". Now we consider time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"and a fixed state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":". We denote ","element":"span"},{"style":{"height":16},"width":297.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/28-15.png","element":"img","alt":" L′ = 6L ln(St/δ)","inline":true},{"text":". 
Let ","element":"span"},{"style":{"height":16.19},"width":156,"height":40.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/28-16.png","element":"img","alt":" τ = nst+1","inline":true},{"text":"and let ","element":"span"},{"style":{"height":13.39},"width":479,"height":33.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/28-17.png","element":"img","alt":" 1 ≤ t1 < t2 < · · · < tτ ≤ t","inline":true,"padRight":true},{"text":"be the time indices when the players visit state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":". If ","element":"span"},{"style":{"height":16.19},"width":276.52,"height":40.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/29-0.png","element":"img","alt":" t ≤ L′(u(t) + 1)","inline":true},{"text":", then ","element":"span"},{"href":"#id-126","text":"Eq. (18) ","element":"a"},{"text":"is trivial. If ","element":"span"},{"style":{"height":18.61},"width":737.04,"height":46.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/29-1.png","element":"img","alt":"t ≥ L′(u(t) + 1), we have τ ≥ tL′ − 1 ≥ u(t)","inline":true},{"text":". 
Therefore,","element":"span"}],[{"style":{"width":"97%"},"width":1540,"height":1356,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/29-2.png","element":"img"}],[{"text":"In ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"a","element":"span"},{"text":") ","element":"span"},{"text":"we use the following property: if ","element":"span"},{"style":{"height":16},"width":479.8,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/29-3.png","element":"img","alt":" τ ≥ u(t) and i ≤ v(τ, t), then","inline":true}],[{"style":{"width":"76%"},"width":1220,"height":238,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/29-4.png","element":"img"}],[{"text":"In ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"b","element":"span"},{"text":") ","element":"span"},{"text":"we use the following calculation:","element":"span"}],[{"style":{"width":"51%"},"width":810,"height":726,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/29-5.png","element":"img"}],[{"style":{"width":"13%"},"width":212,"height":86,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/30-0.png","element":"img"}],[{"text":"where the first inequality is due to the fact that at time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"has only been visited for ","element":"span"},{"style":{"height":7.2},"width":20,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/30-1.png","element":"img","alt":" τ","inline":true,"padRight":true},{"text":"times; the second inequality is because for any ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k > j","element":"span"},{"text":", we have 
","element":"span"},{"style":{"height":15.39},"width":100.48,"height":38.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/30-2.png","element":"img","alt":" tj ≥ j","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":17.01},"width":322.52,"height":42.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/30-3.png","element":"img","alt":" tk − tj ≤ L′(k − j)","inline":true},{"text":"; the third inequality is by ","element":"span"},{"style":{"height":16},"width":168,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/30-4.png","element":"img","alt":" i ≥ v(τ, t)","inline":true},{"text":"; the fourth inequality is by the definition of ","element":"span"},{"style":{"height":16},"width":104.08,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/30-5.png","element":"img","alt":" v(τ, t)","inline":true},{"text":"; the fifth inequality is because ","element":"span"},{"style":{"height":22.4},"width":354,"height":56,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/30-6.png","element":"img","alt":" 4τ kα−1 ln t1−γ ≤ 1−γ4L′","inline":true,"padRight":true},{"text":"since ","element":"span"},{"style":{"height":16},"width":139.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/30-7.png","element":"img","alt":" τ ≥ u(t)","inline":true},{"text":", and ","element":"span"},{"style":{"height":22.8},"width":265,"height":57,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/30-8.png","element":"img","alt":" 1τ < 1u(t) ≤ 1−γ16L′","inline":true,"padRight":true},{"text":"since ","element":"span"},{"style":{"height":22.19},"width":190,"height":55.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/30-9.png","element":"img","alt":" u(t) ≥ 16L′1−γ","inline":true},{"text":"; the last ","element":"span"},{"text":"inequality is 
because","element":"span"},{"style":{"height":28.59},"width":491,"height":71.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/30-10.png","element":"img","alt":"(1 + a/16)/(1 − a/4) ≤ 1 + a/2 for a ∈ [0, 1].","inline":true}],[{"id":"id-93","style":{"fontWeight":"bold"},"text":"D.5 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Part IV. Combining","element":"span"}],[{"text":"In this subsection, we combine the previous lemmas to show the last-iterate convergence rate of ","element":"span"},{"href":"#id-68","text":"Algorithm 2 ","element":"a"},{"text":"and prove ","element":"span"},{"href":"#id-73","text":"Theorem 2","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma 21. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"With probability at least ","element":"span"},{"style":{"height":16},"width":145,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/30-11.png","element":"img","alt":" 1 − O(δ)","inline":true},{"style":{"fontStyle":"italic"},"text":", for any time ","element":"span"},{"style":{"height":12.8},"width":92.48,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/30-12.png","element":"img","alt":" t ≥ 1,","inline":true}],[{"style":{"width":"81%"},"width":1290,"height":121,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/30-13.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"Using ","element":"span"},{"href":"#id-127","text":"Lemma 10","element":"a"},{"text":", we can bound the duality gap of the whole game by the duality gap on an individual state:","element":"span"}],[{"style":{"width":"100%"},"width":1586,"height":401,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/30-14.png","element":"img"}],[{"text":"With probability at least ","element":"span"},{"style":{"height":16},"width":758.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/30-15.png","element":"img","alt":" 1 − O(δ), for any s, xs, ys, and t ≥ 1, denote τ","inline":true,"padRight":true},{"text":"as the number of visits to state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"up to time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", then","element":"span"}],[{"style":{"width":"100%"},"width":1824,"height":570,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/30-16.png","element":"img"}],[{"text":"Combining the above two inequalities with ","element":"span"},{"href":"#id-70","text":"Lemma 17 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-72","text":"Lemma 20 ","element":"a"},{"text":"and the choice of parameters ","element":"span"},{"style":{"height":20.8},"width":518,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/30-17.png","element":"img","alt":"kα = 9/(9+ε), kε = 1/(9+ε), kβ = 3/(9+ε)","inline":true},{"text":", and ","element":"span"},{"style":{"height":20.61},"width":153.48,"height":51.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/30-18.png","element":"img","alt":" kη = 5/(9+ε)","inline":true},{"text":", we have ","element":"span"},{"style":{"height":16.8},"width":640,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/30-19.png","element":"img","alt":" k♯ = min{kβ − kϵ, kη − kβ, kα − kη 
−","inline":true}],[{"style":{"width":"100%"},"width":1904,"height":400,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/30-20.png","element":"img"}],[{"style":{"width":"99%"},"width":1580,"height":126,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/31-0.png","element":"img"}]]},{"heading":"E Convergent Analysis of Algorithm 3","paragraphs":[[{"id":"id-94","style":{"fontWeight":"bold"},"text":"E.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Part I. Basic Iteration Properties","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Definition 1. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":16},"width":80,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/31-1.png","element":"img","alt":" t_τ(s)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be the ","element":"span"},{"style":{"height":7.2},"width":20,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/31-2.png","element":"img","alt":" τ","inline":true},{"style":{"fontStyle":"italic"},"text":"-th time the players visit state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"style":{"fontStyle":"italic"},"text":". Define ","element":"span"},{"style":{"height":18.8},"width":397,"height":47,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/31-3.png","element":"img","alt":" x̂^s_τ = x^s_{t_τ(s)}, ŷ^s_τ = y^s_{t_τ(s)}","inline":true},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"style":{"height":20.8},"width":379.48,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/31-4.png","element":"img","alt":"â^s_τ = a_{t_τ(s)}, b̂^s_τ = b_{t_τ(s)}","inline":true},{"style":{"fontStyle":"italic"},"text":". 
Furthermore, define","element":"span"}],[{"style":{"width":"63%"},"width":1008,"height":168,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/31-5.png","element":"img"}],[{"id":"id-82","style":{"fontWeight":"bold"},"text":"Lemma 22. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"For any ","element":"span"},{"style":{"height":13.2},"width":128.28,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/31-6.png","element":"img","alt":" xs ∈ Ω,","inline":true}],[{"style":{"width":"91%"},"width":1452,"height":170,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/31-7.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"where","element":"span"}],[{"style":{"width":"79%"},"width":1260,"height":384,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/31-8.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"The proof is exactly the same as that of ","element":"span"},{"href":"#id-116","text":"Lemma 16","element":"a"},{"text":".","element":"span"}],[{"id":"id-95","style":{"fontWeight":"bold"},"text":"E.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Part II. Value Convergence","element":"span"}],[{"id":"id-77","style":{"fontWeight":"bold"},"text":"Lemma 23 ","element":"span"},{"text":"(weighted regret bound)","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"There exists a large enough universal constant ","element":"span"},{"style":{"height":7.39},"width":20,"height":18.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/31-9.png","element":"img","alt":" κ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"(used in the definition of ","element":"span"},{"style":{"height":14.19},"width":79,"height":35.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/31-10.png","element":"img","alt":" bnsτ","inline":true},{"style":{"fontStyle":"italic"},"text":") such that with probability ","element":"span"},{"style":{"height":16},"width":145,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/31-11.png","element":"img","alt":" 1 − O(δ)","inline":true},{"style":{"fontStyle":"italic"},"text":", for any state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"style":{"fontStyle":"italic"},"text":", visitation count ","element":"span"},{"style":{"height":7.2},"width":20,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/31-12.png","element":"img","alt":" τ","inline":true},{"style":{"fontStyle":"italic"},"text":", and any ","element":"span"},{"style":{"height":13.2},"width":128.28,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/31-13.png","element":"img","alt":"xs ∈ Ω,","inline":true}],[{"style":{"width":"45%"},"width":728,"height":116,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/31-14.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"Fix state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"and visitation count ","element":"span"},{"style":{"height":12.99},"width":104,"height":32.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/31-15.png","element":"img","alt":" τ ≤ T","inline":true},{"text":". Applying ","element":"span"},{"href":"#id-82","text":"Lemma 22 ","element":"a"},{"text":"and considering the weighted sum of the bounds, we get","element":"span"}],[{"style":{"width":"100%"},"width":1592,"height":444,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/31-16.png","element":"img"}],[{"style":{"width":"87%"},"width":1384,"height":576,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/32-0.png","element":"img"}],[{"text":"We proceed to bound the other terms as follows: with probability at least ","element":"span"},{"style":{"height":19.79},"width":125,"height":49.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/32-1.png","element":"img","alt":" 1 − δ/(Sτ²)","inline":true}],[{"style":{"width":"90%"},"width":1430,"height":1018,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/32-2.png","element":"img"}],[{"text":"Combining all terms and applying a union bound over ","element":"span"},{"style":{"height":11.6},"width":189,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/32-3.png","element":"img","alt":" s ∈ S and τ","inline":true},{"text":", we get that, with probability ","element":"span"},{"style":{"height":16},"width":143,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/32-4.png","element":"img","alt":" 1 − O(δ)","inline":true,"padRight":true},{"text":"the following holds for any ","element":"span"},{"style":{"height":12.19},"width":91.52,"height":30.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/32-5.png","element":"img","alt":" s ∈ S","inline":true},{"text":", visitation count 
","element":"span"},{"style":{"height":13.6},"width":234,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/32-6.png","element":"img","alt":" τ, and xs ∈ Ω,","inline":true}],[{"style":{"width":"94%"},"width":1498,"height":244,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/32-7.png","element":"img"}],[{"text":"This implies the conclusion of the lemma.","element":"span"}],[{"id":"id-132","style":{"fontWeight":"bold"},"text":"Lemma 24. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"For all ","element":"span"},{"style":{"height":18.93},"width":238.76,"height":47.32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/32-8.png","element":"img","alt":" t, s, Vst ≥ V st.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"We prove it by induction on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":". The inequality clearly holds for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"= 1 ","element":"span"},{"text":"by the initialization. Suppose that the inequality holds for ","element":"span"},{"style":{"height":14},"width":223.52,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/32-9.png","element":"img","alt":" 1, 2, . . . , t − 1","inline":true,"padRight":true},{"text":"and for all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":". Now consider time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"and state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":". 
Let ","element":"span"},{"style":{"height":14.59},"width":114,"height":36.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/32-10.png","element":"img","alt":" τ = nst","inline":true},{"text":", and let ","element":"span"},{"style":{"height":13.2},"width":453,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/32-11.png","element":"img","alt":" 1 ≤ t1 < t2 < . . . < tτ < t","inline":true,"padRight":true},{"text":"be the time indices when the players visit state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":". ","element":"span"},{"text":"By the update rule,","element":"span"}],[{"style":{"width":"56%"},"width":900,"height":114,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/32-12.png","element":"img"}],[{"text":"where the inequality is by the induction hypothesis. Therefore,","element":"span"}],[{"style":{"width":"49%"},"width":782,"height":78,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/33-0.png","element":"img"}],[{"text":"In the last inequality we also use the fact that ","element":"span"},{"style":{"height":15.74},"width":147.56,"height":39.36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/33-1.png","element":"img","alt":" ∼V st ≤ H","inline":true,"padRight":true},{"text":"and","element":"span"},{"style":{"height":19.82},"width":130.36,"height":49.56,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/33-2.png","element":"img","alt":"∼V st ≥ 0","inline":true},{"text":". 
Note that by the induction ","element":"span"},{"text":"hypothesis and the update rule of ","element":"span"},{"style":{"height":18.93},"width":1069.64,"height":47.32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/33-3.png","element":"img","alt":" Vst and V st, we have 0 ≤ V si < Vsi ≤ H for all s and 1 ≤ i ≤ t−1.","inline":true,"padRight":true},{"text":"Thus ","element":"span"},{"style":{"height":19.97},"width":629.36,"height":49.92,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/33-4.png","element":"img","alt":" ∼V st = �τi=1 αiτ(γVsti+1ti − bnsi) ≤ H","inline":true,"padRight":true},{"text":"and similarly ","element":"span"},{"style":{"height":20},"width":128,"height":50,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/33-5.png","element":"img","alt":"∼V st ≥ 0.","inline":true}],[{"id":"id-85","style":{"fontWeight":"bold"},"text":"Lemma 25. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":16},"width":262.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/33-6.png","element":"img","alt":" c = (c1, . . . 
, cT )","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be any non-negative sequence with ","element":"span"},{"style":{"height":20.4},"width":497,"height":51,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/33-7.png","element":"img","alt":" c_i ≤ c_max ∀i and ∑_{t=1}^T c_t = C.","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"Then","element":"span"}],[{"style":{"width":"80%"},"width":1276,"height":122,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/33-8.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof.","element":"span"}],[{"id":"id-128","style":{"width":"97%"},"width":1548,"height":1884,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/33-9.png","element":"img"}],[{"text":"Note that ","element":"span"},{"style":{"height":16.19},"width":26.52,"height":40.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/34-0.png","element":"img","alt":" c′t ","inline":true,"padRight":true},{"text":"is another sequence with","element":"span"}],[{"style":{"width":"49%"},"width":790,"height":116,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/34-1.png","element":"img"}],[{"text":"and","element":"span"}],[{"style":{"width":"21%"},"width":334,"height":122,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/34-2.png","element":"img"}],[{"text":"since ","element":"span"},{"style":{"height":18.19},"width":216,"height":45.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/34-3.png","element":"img","alt":"∑_{i=1}^τ α^i_τ = 1","inline":true,"padRight":true},{"text":"for any ","element":"span"},{"style":{"height":12.8},"width":91.48,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/34-4.png","element":"img","alt":" τ ≥ 1","inline":true},{"text":". Thus, we can unroll the inequality ","element":"span"},{"href":"#id-128","text":"Eq. 
(19) ","element":"a"},{"text":"for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H ","element":"span"},{"text":"times, which ","element":"span"},{"text":"gives","element":"span"}],[{"style":{"width":"89%"},"width":1426,"height":228,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/34-5.png","element":"img"}],[{"text":"where in the inequality we use that ","element":"span"},{"style":{"height":19.81},"width":958,"height":49.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/34-6.png","element":"img","alt":" (1 + 1/H)^H ≤ e and γ^H = (1 − (1 − γ))^H ≤ e^{−(1−γ)H} = 1/T.","inline":true}],[{"id":"id-86","style":{"fontWeight":"bold"},"text":"Corollary 2. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"There exists a universal constant ","element":"span"},{"style":{"height":13.79},"width":116.52,"height":34.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/34-7.png","element":"img","alt":" C1 > 0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"such that for any ","element":"span"},{"style":{"height":26.45},"width":421.04,"height":66.12,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/34-8.png","element":"img","alt":" ˜ϵ ≥ C1 A ln³(AST/δ) / (β(1−γ)³), with","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"probability at least ","element":"span"},{"style":{"height":16},"width":154.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/34-9.png","element":"img","alt":" 1 − O(δ),","inline":true}],[{"style":{"width":"74%"},"width":1186,"height":122,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/34-10.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"We apply ","element":"span"},{"href":"#id-85","text":"Lemma 25 ","element":"a"},{"text":"with the following definition of ","element":"span"},{"style":{"height":9.98},"width":37.52,"height":24.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/34-11.png","element":"img","alt":" c_t:","inline":true}],[{"style":{"width":"47%"},"width":754,"height":82,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/34-12.png","element":"img"}],[{"text":"which gives","element":"span"}],[{"id":"id-129","style":{"width":"89%"},"width":1412,"height":122,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/34-13.png","element":"img"}],[{"text":"for some universal constant ","element":"span"},{"style":{"height":13.79},"width":41.52,"height":34.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/34-14.png","element":"img","alt":" C2","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":20.61},"width":228.52,"height":51.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/34-15.png","element":"img","alt":" C = ∑_{t=1}^T c_t","inline":true},{"text":". By Azuma’s inequality, for some universal ","element":"span"},{"text":"constant ","element":"span"},{"style":{"height":14},"width":116.48,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/34-16.png","element":"img","alt":" C3 > 0","inline":true},{"text":", with probability ","element":"span"},{"style":{"height":13.6},"width":93,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/34-17.png","element":"img","alt":" 1 − δ,","inline":true}],[{"id":"id-130","style":{"width":"83%"},"width":1318,"height":452,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/34-18.png","element":"img"}],[{"text":"Combining ","element":"span"},{"href":"#id-129","text":"Eq. (20) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-130","text":"Eq. 
(21)","element":"a"},{"text":", we get","element":"span"}],[{"style":{"width":"61%"},"width":980,"height":242,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/34-19.png","element":"img"}],[{"text":"By the definition of ","element":"span"},{"style":{"height":9.81},"width":27,"height":24.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/35-0.png","element":"img","alt":" c_t","inline":true},{"text":", the left-hand side above is lower bounded by ","element":"span"},{"style":{"height":20.61},"width":275,"height":51.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/35-1.png","element":"img","alt":" ˜ϵ ∑_{t=1}^T c_t = C˜ϵ","inline":true},{"text":". Define ","element":"span"},{"style":{"height":16},"width":302.4,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/35-2.png","element":"img","alt":"C1 = 2(C2 + C3)","inline":true},{"text":". Then by the condition on ","element":"span"},{"style":{"height":12.4},"width":23,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/35-3.png","element":"img","alt":" ϵ′","inline":true},{"text":", the right-hand side of the above inequality is bounded by","element":"span"}],[{"style":{"width":"31%"},"width":500,"height":108,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/35-4.png","element":"img"}],[{"text":"by the condition on ","element":"span"},{"style":{"height":10.99},"width":15.48,"height":27.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/35-5.png","element":"img","alt":" ˜ϵ","inline":true},{"text":". Combining the upper bound and the lower bound, we get","element":"span"}],[{"style":{"width":"33%"},"width":538,"height":108,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/35-6.png","element":"img"}],[{"id":"id-78","style":{"fontWeight":"bold"},"text":"Lemma 26. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"With probability at least ","element":"span"},{"style":{"height":16},"width":394.04,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/35-7.png","element":"img","alt":" 1 − O(δ), for any t ≥ 1,","inline":true}],[{"style":{"width":"65%"},"width":1034,"height":102,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/35-8.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Fix a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":", let ","element":"span"},{"style":{"height":16},"width":166.4,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/35-9.png","element":"img","alt":" τ = nt(s)","inline":true},{"text":", and let ","element":"span"},{"style":{"height":12.59},"width":23.48,"height":31.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/35-10.png","element":"img","alt":" ti","inline":true,"padRight":true},{"text":"be the time index in which ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"is visited the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":"-th time. 
With probability at least ","element":"span"},{"style":{"height":20.18},"width":272.92,"height":50.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/35-11.png","element":"img","alt":" 1 − δST , we have","inline":true}],[{"style":{"width":"100%"},"width":1716,"height":1226,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/35-12.png","element":"img"}],[{"text":"Therefore, using a union bound over ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", we have with probability ","element":"span"},{"style":{"height":14},"width":334.56,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/35-13.png","element":"img","alt":" 1 − δ, for all s and t,","inline":true}],[{"id":"id-131","style":{"width":"94%"},"width":1500,"height":116,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/35-14.png","element":"img"}],[{"text":"for some universal constant ","element":"span"},{"style":{"height":13.81},"width":42,"height":34.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/35-15.png","element":"img","alt":" C4","inline":true},{"text":". Next, we use induction to show the first inequality. Suppose that","element":"span"}],[{"style":{"width":"25%"},"width":412,"height":98,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/35-16.png","element":"img"}],[{"text":"for all ","element":"span"},{"href":"#id-131","style":{"height":15.6},"width":492.52,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/36-0.png","element":"img","alt":" s and t′ < t. Then by Eq. (22),","inline":true}],[{"style":{"width":"85%"},"width":1350,"height":600,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/36-1.png","element":"img"}],[{"text":"which proves the first desired inequality. 
The other inequality can be proven in the same way.","element":"span"}],[{"id":"id-96","style":{"fontWeight":"bold"},"text":"E.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Part III. Policy Convergence to the Nash of the Regularized Game","element":"span"}],[{"id":"id-83","style":{"fontWeight":"bold"},"text":"Lemma 27. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":13.79},"width":162,"height":34.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/36-2.png","element":"img","alt":" 0 ≤ p ≤ 1","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be arbitrarily chosen, and define","element":"span"}],[{"style":{"width":"86%"},"width":1372,"height":150,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/36-3.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Furthermore, let ","element":"span"},{"style":{"height":16},"width":282.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/36-4.png","element":"img","alt":" ˆzsτ⋆ = (ˆxsτ⋆, ˆysτ⋆)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be the equilibrium of ","element":"span"},{"style":{"height":16},"width":133.8,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/36-5.png","element":"img","alt":" f sτ (x, y)","inline":true},{"style":{"fontStyle":"italic"},"text":", and define ","element":"span"},{"style":{"height":14.75},"width":168.36,"height":36.88,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/36-6.png","element":"img","alt":" zst⋆ = ˆzsτ⋆","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"height":16},"width":163.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/36-7.png","element":"img","alt":"τ = nt(s)","inline":true},{"style":{"fontStyle":"italic"},"text":". 
Then with probability at least ","element":"span"},{"style":{"height":16},"width":152.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/36-8.png","element":"img","alt":" 1 − O(δ)","inline":true},{"style":{"fontStyle":"italic"},"text":", the following holds for any ","element":"span"},{"style":{"height":13.2},"width":186.48,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/36-9.png","element":"img","alt":" 0 < ϵ′ ≤ 1:","inline":true}],[{"style":{"width":"59%"},"width":948,"height":130,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/36-10.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"if ","element":"span"},{"style":{"height":14.8},"width":123.92,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/36-11.png","element":"img","alt":" η and β","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"satisfy the following","element":"span"}],[{"id":"id-134","style":{"width":"62%"},"width":990,"height":108,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/36-12.png","element":"img"}],[{"id":"id-135","style":{"width":"62%"},"width":986,"height":106,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/36-13.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"with sufficiently small universal constant ","element":"span"},{"style":{"height":14.59},"width":187.48,"height":36.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/36-14.png","element":"img","alt":" C5, C6 > 0.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"In this proof, we write ","element":"span"},{"href":"#id-82","style":{"height":19.09},"width":608.4,"height":47.72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/36-15.png","element":"img","alt":" ζsi(ˆxsi⋆) as ζi. 
By Lemma 22, we have","inline":true}],[{"style":{"width":"71%"},"width":1134,"height":194,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/36-16.png","element":"img"}],[{"text":"Similarly,","element":"span"}],[{"style":{"width":"70%"},"width":1122,"height":194,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/36-17.png","element":"img"}],[{"text":"Adding the two inequalities up, we get","element":"span"}],[{"style":{"width":"77%"},"width":1234,"height":166,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/36-18.png","element":"img"}],[{"id":"id-133","style":{"width":"83%"},"width":1326,"height":78,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/37-0.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":17.76},"width":654.84,"height":44.4,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/37-1.png","element":"img","alt":" vsi = KL(ˆzsi+1⋆, ˆzsi+1) − KL(ˆzsi⋆, ˆzsi+1)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":17.6},"width":270.52,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/37-2.png","element":"img","alt":" □s = □s + □s","inline":true},{"text":". By ","element":"span"},{"href":"#id-132","text":"Lemma 24","element":"a"},{"text":", we have ","element":"span"},{"style":{"height":22.21},"width":1591.2,"height":55.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/37-3.png","element":"img","alt":"f si(x, y) ≤ fsi(x, y) for all x, y, and thus f si(ˆxsi⋆, ˆysi ) − fsi(ˆxsi, ˆysi⋆) ≤ f si (ˆxsi⋆, ˆysi ) − f si (ˆxsi, ˆysi⋆) ≤ 0.","inline":true,"padRight":true},{"text":"Therefore, ","element":"span"},{"href":"#id-133","text":"Eq. 
(25) ","element":"a"},{"text":"further implies","element":"span"}],[{"style":{"width":"100%"},"width":1708,"height":288,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/37-4.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":22.4},"width":481.48,"height":56,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/37-5.png","element":"img","alt":" ∆si = fsi(ˆxsi, ˆysi ) − f si(ˆxsi, ˆysi )","inline":true,"padRight":true},{"text":"and in the last step we use ","element":"span"},{"style":{"height":16},"width":285.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/37-6.png","element":"img","alt":" a ≤ [a − b]_+ + b.","inline":true}],[{"text":"Unrolling the recursion, we get with probability at least ","element":"span"},{"style":{"height":16},"width":395.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/37-7.png","element":"img","alt":" 1 − O(δ), for all s and τ","inline":true,"padRight":true},{"text":"(we show that the inequality holds for any fixed ","element":"span"},{"style":{"height":11.2},"width":113.48,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/37-8.png","element":"img","alt":" s and τ","inline":true,"padRight":true},{"text":"with probability ","element":"span"},{"style":{"height":20},"width":178,"height":50,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/37-9.png","element":"img","alt":" 1 − O(δ/(ST))","inline":true,"padRight":true},{"text":"and then apply a union bound over ","element":"span"},{"style":{"height":14.4},"width":141.36,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/37-10.png","element":"img","alt":"s and τ),","inline":true}],[{"id":"id-136","style":{"width":"100%"},"width":1764,"height":962,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/37-11.png","element":"img"}],[{"text":"where in 
","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"a","element":"span"},{"text":") ","element":"span"},{"text":"we use the following calculation:","element":"span"}],[{"style":{"width":"95%"},"width":1516,"height":704,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/37-12.png","element":"img"}],[{"text":"and in ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"b","element":"span"},{"text":") ","element":"span"},{"text":"we use the conditions ","element":"span"},{"href":"#id-134","text":"Eq. (23) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-135","text":"Eq. (24)","element":"a"},{"text":".","element":"span"}],[{"style":{"width":"69%"},"width":1096,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/38-0.png","element":"img"}],[{"id":"id-137","style":{"width":"100%"},"width":1612,"height":176,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/38-1.png","element":"img"}],[{"text":"where in the last inequality we use the following calculation:","element":"span"}],[{"id":"id-138","style":{"width":"100%"},"width":1592,"height":2058,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/38-2.png","element":"img"}],[{"text":"From ","element":"span"},{"href":"#id-136","text":"Eq. (26)","element":"a"},{"text":", we have","element":"span"}],[{"style":{"width":"32%"},"width":516,"height":128,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/38-3.png","element":"img"}],[{"style":{"width":"84%"},"width":1338,"height":510,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/39-0.png","element":"img"}],[{"text":"where in the second-to-last inequality we use ","element":"span"},{"href":"#id-137","text":"Eq. (27) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-138","text":"Eq. (28)","element":"a"},{"text":". 
This finishes the proof.","element":"span"}],[{"id":"id-97","style":{"fontWeight":"bold"},"text":"E.4 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Part IV. Combining","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Theorem 5. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"For any ","element":"span"},{"style":{"height":28.99},"width":206,"height":72.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/39-1.png","element":"img","alt":" u ∈�0, 11−γ�","inline":true},{"style":{"fontStyle":"italic"},"text":", there exists a proper choice of parameters ","element":"span"},{"style":{"height":14.8},"width":252.76,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/39-2.png","element":"img","alt":" ϵ, β, η such that","inline":true}],[{"style":{"width":"78%"},"width":1244,"height":122,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/39-3.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"with probability at least ","element":"span"},{"style":{"height":16},"width":161.68,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/39-4.png","element":"img","alt":" 1 − O(δ).","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"We will choose ","element":"span"},{"style":{"height":24.03},"width":419.84,"height":60.08,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/39-5.png","element":"img","alt":" ϵ such that u ≥ C7ϵ ln(AT )1−γ","inline":true},{"text":"with a sufficiently large universal constant ","element":"span"},{"style":{"height":15.01},"width":112,"height":37.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/39-6.png","element":"img","alt":" C7. 
By","inline":true,"padRight":true},{"href":"#id-78","text":"Lemma 26","element":"a"},{"text":", we have","element":"span"}],[{"text":"max ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x,y ","element":"span"},{"style":{"height":28.8},"width":449.28,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/39-7.png","element":"img","alt":"�xs⊤tt Qst⋆ yst − xs⊤t Qst⋆ ystt�","inline":true},{"style":{"height":22.37},"width":116.16,"height":55.92,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/39-8.png","element":"img","alt":"≤ maxx,y","inline":true},{"style":{"height":38.4},"width":1442.12,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/39-9.png","element":"img","alt":"�xs⊤tt �Gst + γEs′∼P st�Vs′t��yst − xs⊤t �Gs + γEs′∼P st�V s′t��ystt�+ O�ϵ ln(AT)1 − γ","inline":true,"padRight":true},{"text":"� ","element":"span"},{"style":{"height":22.37},"width":116.16,"height":55.92,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/39-10.png","element":"img","alt":"≤ maxx,y","inline":true},{"style":{"height":29.42},"width":1265.68,"height":73.56,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/39-11.png","element":"img","alt":"�xs⊤tt �Gst + γEs′∼P st�Vs′t��yst − xs⊤t �Gs + γEs′∼P st�V s′t��ystt�+ u4 .","inline":true}],[{"text":"Therefore, we can upper bound the left-hand side of the desired inequality by","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"style":{"height":22.4},"width":58,"height":56,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/39-12.png","element":"img","alt":"�","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":"=1 
","element":"span"},{"style":{"height":10.4},"width":23,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/39-13.png","element":"img","alt":"1","inline":true},{"style":{"height":38.8},"width":21,"height":97,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/39-14.png","element":"img","alt":"�","inline":true},{"style":{"height":7.2},"width":74.12,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/39-15.png","element":"img","alt":"max","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"x,y ","element":"span"},{"style":{"height":38.8},"width":1300.04,"height":97,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/39-16.png","element":"img","alt":"�xs⊤tt �Gst + γEs′∼P st�Vs′t��yst − xs⊤t �Gs + γEs′∼P st�V s′t��ystt�≥ 34u�","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"≤ ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"style":{"height":22.4},"width":58,"height":56,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/39-17.png","element":"img","alt":"�","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":"=1 ","element":"span"},{"style":{"height":10.4},"width":23,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/39-18.png","element":"img","alt":"1","inline":true},{"style":{"height":38.8},"width":21,"height":97,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/39-19.png","element":"img","alt":"�","inline":true},{"style":{"height":7.2},"width":74.12,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/39-20.png","element":"img","alt":"max","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"y 
","element":"span"},{"style":{"height":29.42},"width":1220.16,"height":73.56,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/39-21.png","element":"img","alt":"xs⊤tt �Gst + γEs′∼P st�Vs′t��yst − xs⊤tt �Gst + γEs′∼P st�Vs′t��ystt ≥ u4","inline":true,"padRight":true},{"text":"� ","element":"span"},{"text":"+ ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"style":{"height":22.4},"width":58,"height":56,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/39-22.png","element":"img","alt":"�","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":"=1 ","element":"span"},{"style":{"height":29.42},"width":1268.52,"height":73.56,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/39-23.png","element":"img","alt":"1�xs⊤tt �Gst + γEs′∼P st�Vs′t��ystt − xs⊤tt �Gst + γEs′∼P st�V s′t��ystt ≥ u4","inline":true,"padRight":true},{"text":"� ","element":"span"},{"text":"+ ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"style":{"height":22.4},"width":58,"height":56,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/39-24.png","element":"img","alt":"�","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":"=1 ","element":"span"},{"style":{"height":29.42},"width":1341.56,"height":73.56,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/39-25.png","element":"img","alt":"1�xs⊤tt �Gst + γEs′∼P st�V s′t��yst − minx xs⊤t �Gst + γEs′∼P st�V s′t��ystt ≥ u4","inline":true,"padRight":true},{"text":"�","element":"span"},{"style":{"height":2},"width":11,"height":5,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/39-26.png","element":"img","alt":".","inline":true,"padRight":true},{"text":"(29)","element":"span"}],[{"text":"For the first term in ","element":"span"},{"href":"#id-81","text":"Eq. 
(29)","element":"a"},{"text":", we can bound it by","element":"span"}],[{"text":"� ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"style":{"height":11.6},"width":115.24,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/39-27.png","element":"img","alt":"nT +1(s)","inline":true},{"style":{"height":22.4},"width":58,"height":56,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/39-28.png","element":"img","alt":"�","inline":true},{"style":{"height":7.6},"width":51.72,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/39-29.png","element":"img","alt":"i=1","inline":true,"padRight":true},{"style":{"fontWeight":"bold"},"text":"1 ","element":"span"},{"style":{"height":38.8},"width":21,"height":97,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/39-30.png","element":"img","alt":"�","inline":true,"padRight":true},{"text":"max ","element":"span"},{"style":{"height":7.2},"width":16,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/39-31.png","element":"img","alt":"y","inline":true},{"style":{"height":38.8},"width":754.8,"height":97,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/39-32.png","element":"img","alt":"fsi(ˆxsi, ys) − fsi(ˆxsi, ˆysi ) ≥ u4 − O (ϵ ln(AT))�","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"≤ ","element":"span"},{"style":{"height":22.4},"width":58,"height":56,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/39-33.png","element":"img","alt":"�","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"s n","element":"span"},{"style":{"height":11.6},"width":95.56,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/39-34.png","element":"img","alt":"T 
+1(s)","inline":true},{"style":{"height":22.4},"width":58,"height":56,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/39-35.png","element":"img","alt":"�","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":"=1 ","element":"span"},{"style":{"height":10.4},"width":23,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/39-36.png","element":"img","alt":"1","inline":true},{"style":{"height":38.8},"width":21,"height":97,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/39-37.png","element":"img","alt":"�","inline":true},{"style":{"height":7.2},"width":74.12,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/39-38.png","element":"img","alt":"max","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"y ","element":"span"},{"style":{"height":28.51},"width":464.32,"height":71.28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/39-39.png","element":"img","alt":"fsi(ˆxsi, ys) − fsi(ˆxsi, ˆysi ) ≥ u8","inline":true,"padRight":true},{"id":"id-81","text":"�","element":"span"}],[{"style":{"width":"92%"},"width":1470,"height":682,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/40-0.png","element":"img"}],[{"text":"The third term in ","element":"span"},{"href":"#id-81","text":"Eq. (29) ","element":"a"},{"text":"can be bounded in the same way. The second term in ","element":"span"},{"href":"#id-81","text":"Eq. 
(29) ","element":"a"},{"text":"can be bounded using ","element":"span"},{"href":"#id-86","text":"Corollary 2 ","element":"a"},{"text":"by","element":"span"}],[{"style":{"width":"25%"},"width":398,"height":108,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/40-1.png","element":"img"}],[{"text":"Overall, we have","element":"span"}],[{"id":"id-139","style":{"width":"88%"},"width":1402,"height":122,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/40-2.png","element":"img"}],[{"text":"Notice that the parameters ","element":"span"},{"style":{"height":14.8},"width":95,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/40-3.png","element":"img","alt":" ϵ, β, η","inline":true,"padRight":true},{"text":"need to satisfy the conditions specified in this lemma and ","element":"span"},{"href":"#id-83","text":"Lemma 27","element":"a"},{"text":", which we apply with ","element":"span"},{"style":{"height":28.8},"width":341.72,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/40-4.png","element":"img","alt":" ϵ′ = Θ� u2(1−γ)2ln2(SAT/δ)�","inline":true},{"text":". The constraints suggest the following parameter choice (under a fixed ","element":"span"},{"style":{"fontStyle":"italic"},"text":"u","element":"span"},{"text":"):","element":"span"}],[{"style":{"width":"57%"},"width":912,"height":334,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/40-5.png","element":"img"}],[{"text":"Using these parameters in ","element":"span"},{"href":"#id-139","text":"Eq. 
(30)","element":"a"},{"text":", we get","element":"span"}],[{"style":{"width":"78%"},"width":1244,"height":122,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/40-6.png","element":"img"}]]},{"heading":"F Discussions on Convergence Notions for General Markov Games","paragraphs":[[{"text":"In general Markov games, learning the equilibrium policy pair ","element":"span"},{"style":{"fontStyle":"italic"},"text":"on every state ","element":"span"},{"text":"is impossible because some state might have exponentially small visitation probability under all policies. Therefore, a reasonable definition of convergence is the convergence of the following quantity to zero:","element":"span"}],[{"id":"id-142","style":{"width":"64%"},"width":1026,"height":124,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/40-7.png","element":"img"}],[{"text":"which is similar to the best-iterate convergence defined in ","element":"span"},{"text":"Section 3","element":"span"},{"text":", but over the state sequence visited by the players instead of taking max over ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":". 
It is also a strict generalization of the sample complexity bound for single-player MDPs under the discounted criterion (see, e.g., [","element":"span"},{"href":"#id-140","referenceIndex":33,"text":"LH14","element":"a"},{"text":", ","element":"span"},{"href":"#id-141","referenceIndex":57,"text":"WDCW20","element":"a"},{"text":"]).","element":"span"}],[{"text":"Path convergence, as defined in our work, instead requires that the following quantity converge to zero:","element":"span"}],[{"id":"id-143","style":{"width":"70%"},"width":1124,"height":124,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/41-0.png","element":"img"}],[{"text":"Since ","element":"span"},{"style":{"height":22.59},"width":851.48,"height":56.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/41-1.png","element":"img","alt":" max_y (x^{s⊤} Q^s_⋆ y^s) ≤ max_y (x^{s⊤} Q^s_{x,y} y^s) = max_y V^s_{x,y}","inline":true},{"text":"for any ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":", the convergence of ","element":"span"},{"href":"#id-142","text":"Eq. (31) ","element":"a"},{"text":"is stronger than ","element":"span"},{"href":"#id-143","text":"Eq. (32)","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Implications of Path Convergence ","element":"span"},{"text":"Although ","element":"span"},{"href":"#id-143","text":"Eq. (32) ","element":"a"},{"text":"does not imply the more standard best-iterate guarantee ","element":"span"},{"href":"#id-142","text":"Eq. (31)","element":"a"},{"text":", it still has meaningful implications. 
By definition, it implies that frequent visits to a state bring players’ policies closer to equilibrium, leading to both players using near-equilibrium policies for all but ","element":"span"},{"style":{"fontStyle":"italic"},"text":"o","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":") ","element":"span"},{"text":"steps over time.","element":"span"}],[{"text":"Path convergence also implies that both players have no regret compared to the game value ","element":"span"},{"style":{"height":14.8},"width":161.48,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/41-2.png","element":"img","alt":" V s⋆ , which","inline":true,"padRight":true},{"text":"has been considered and motivated in previous works such as [","element":"span"},{"href":"#id-33","referenceIndex":7,"text":"BT02","element":"a"},{"text":", ","element":"span"},{"href":"#id-76","referenceIndex":54,"text":"TWYS20","element":"a"},{"text":"]. 
To see this more clearly, we apply the results to the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"episodic ","element":"span"},{"text":"setting, where in every step, with probability ","element":"span"},{"style":{"height":14.19},"width":87,"height":35.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/41-3.png","element":"img","alt":" 1 − γ","inline":true},{"text":", the state is redrawn from ","element":"span"},{"style":{"height":10.59},"width":90.48,"height":26.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/41-4.png","element":"img","alt":" s ∼ ρ","inline":true,"padRight":true},{"text":"for some initial distribution ","element":"span"},{"style":{"height":10.59},"width":19.48,"height":26.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/41-5.png","element":"img","alt":" ρ","inline":true,"padRight":true},{"text":"(every time the state is redrawn from ","element":"span"},{"style":{"height":10.99},"width":84.48,"height":27.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/41-6.png","element":"img","alt":" ρ, we","inline":true,"padRight":true},{"text":"call it a new episode). We can show that if ","element":"span"},{"href":"#id-143","text":"Eq. (32) ","element":"a"},{"text":"vanishes, then every player’s long-term average payoff is at least the game value. First, notice that if ","element":"span"},{"href":"#id-143","text":"Eq. 
(32) ","element":"a"},{"text":"converges to zero, then","element":"span"}],[{"id":"id-144","style":{"width":"87%"},"width":1388,"height":260,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/41-7.png","element":"img"}],[{"text":"Now fix an ","element":"span"},{"style":{"height":13.41},"width":162.52,"height":33.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/41-8.png","element":"img","alt":" i and let ti","inline":true,"padRight":true},{"text":"be the time index at the beginning of episode ","element":"span"},{"style":{"height":13.41},"width":211,"height":33.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/41-9.png","element":"img","alt":" i. Let Et = 1","inline":true,"padRight":true},{"text":"indicate the event that episode ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"has not ended at time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":". Then","element":"span"}],[{"style":{"width":"66%"},"width":1048,"height":690,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/41-10.png","element":"img"}],[{"text":"Combining this with ","element":"span"},{"href":"#id-144","text":"Eq. (33)","element":"a"},{"text":", we get","element":"span"}],[{"style":{"width":"64%"},"width":1020,"height":186,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/41-11.png","element":"img"}],[{"text":"Hence the one-step average reward is at least ","element":"span"},{"style":{"height":16.8},"width":270.52,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/41-12.png","element":"img","alt":" (1 − γ)Es∼ρ[V s⋆ ]","inline":true},{"text":". 
A symmetric analysis shows that it ","element":"span"},{"text":"is also at most ","element":"span"},{"style":{"height":16.8},"width":270.52,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/41-13.png","element":"img","alt":" (1 − γ)Es∼ρ[V s⋆ ]","inline":true},{"text":". This shows that both players have no regret compared to the game ","element":"span"},{"text":"value. Notice that this is only a loose implication of the path convergence guarantee because of the loose second inequality in ","element":"span"},{"href":"#id-144","text":"Eq. (33)","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Remark on the notion of “last-iterate convergence” in general Markov games ","element":"span"},{"text":"While ","element":"span"},{"href":"#id-142","text":"Eq. (31) ","element":"a"},{"text":"corresponds to best-iterate convergence for general Markov games, an even stronger notion one can ","element":"span"},{"text":"pursue is “last-iterate convergence.” As argued above, it is impossible to require that the policies on all states converge to equilibrium. To address this issue, we propose to study this problem under the episodic setting described above, in which the state is reset after every trajectory whose expected length is","element":"span"},{"style":{"height":21.6},"width":60,"height":54,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/42-0.png","element":"img","alt":"1/(1−γ)","inline":true},{"text":". 
In this case, last-iterate convergence will be defined as the convergence of ","element":"span"},{"text":"the following quantity to zero as ","element":"span"},{"style":{"height":10.8},"width":122.48,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/42-1.png","element":"img","alt":" i → ∞:","inline":true}],[{"style":{"width":"31%"},"width":498,"height":100,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/42-2.png","element":"img"}],[{"text":"where we recall that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"is the episode index and ","element":"span"},{"style":{"height":16.19},"width":137.52,"height":40.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/72879/images/42-3.png","element":"img","alt":" (xti, yti)","inline":true,"padRight":true},{"text":"are the policies used by the two players at the beginning of episode ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":". While last-iterate convergence seems reasonable and possibly achievable, we are unaware of such results even for the degenerate case of single-player MDPs: the standard regret bound corresponds to best-iterate convergence, while the techniques we are aware of for proving last-iterate convergence in MDPs require additional assumptions on the dynamics.","element":"span"}]]}],"_version":"3.3.4"},"paperNode":"$28:props:children:props:children:0:props:product"}]]