On Reinforcement Learning for Turn-based Zero-sum Markov Games

2020·Arxiv

Abstract

Abstract

We consider the problem of finding Nash equilibrium for two-player turn-based zero-sum games. Inspired by the AlphaGo Zero (AGZ) algorithm (Silver et al., 2017b), we develop a Reinforcement Learning based approach. Specifically, we propose Explore-Improve-Supervise (EIS) method that combines “exploration”, “policy improvement” and “supervised learning” to find the value function and policy associated with Nash equilibrium. We identify sufficient conditions for convergence and correctness for such an approach. For a concrete instance of EIS where random policy is used for “exploration”, Monte-Carlo Tree Search is used for “policy improvement” and Nearest Neighbors is used for “supervised learning”, we establish that this method finds an -approximate value function of Nash equilibrium in steps when the underlying state-space of the game is continuous and d-dimensional. This is nearly optimal as we establish a lower bound of for any policy.

1 Introduction

In 2016, AlphaGo (Silver et al., 2016) became the first program to defeat the world champion in the game of Go. Soon after, another program, AlphaGo Zero (AGZ) (Silver et al., 2017b), achieved even stronger performance despite learning the game from scratch given only the rules. Starting tabula rasa, AGZ mastered the game of Go entirely through self-play using a new reinforcement learning algorithm. The same algorithm was shown to achieve superhuman performance in Chess and Shogi (Silver et al., 2017a).

One key innovation of AGZ is to learn a policy and value function using supervised learning from samples generated via Monte-Carlo Tree Search. Motivated by the remarkable success of this method, in this work we study the problem of finding Nash Equilibrium for two-player turn-based zero-sum games and in particular consider a reinforcement learning based approach.

Our Contributions. The central contribution of this work is the Explore-Improve-Supervise (EIS) method for finding Nash Equilibrium for two-player turn-based zero-sum games with continuous state space, modeled through the framework of Markov game. It is an iterative method where in each iteration three components are intertwined carefully: “explore” that allows for measured exploration of the state space, “improve” which allows for improving the current value and policy for the state being explored, and “supervise” which learns the improved value and policy over the explored states so as to generalize over the entire state space.

Importantly, we identify sufficient conditions, in terms of each of the “explore”, “improve” and “supervise” modules, under which convergence to the value function of the Nash equilibrium is guaranteed. In particular, we establish finite sample complexity bounds for such a generic method to find the -approximate value function of Nash equilibrium. See Theorem 2 and Proposition 3 for the precise statements.

We establish that when random sampling is used for “explore”, Monte-Carlo-Tree-Search (MCTS) is used for “policy improvement” and Nearest Neighbor is used for “supervised learning”, the theoretical conditions identified for convergence of EIS policy are satisfied. Using our finite sample bound for EIS policy, and quantification of conditions as stated above, we conclude that such an instance of EIS method find approximate value function of Nash equilibrium in steps, where d is the dimension of the state space of the game (cf. Theorem 8). We also establish a mini-max lower bound on the number of steps required for learning -approximate value function of Nash equilibrium as for any method (cf. Theorem 4). This establishes near-optimality of an instance of EIS.

Related Work. The Markov Decision Processes (MDP) provide a canonical framework to study the single-agent setting. Its natural extension, the Markov Games, provide a canonical framework to study multi-agent settings (Littman, 1994). In this work, we consider an instance of it—turn-based two players or agents with zero-sum rewards. Analogous to learning the optimal policy in MDP setting, here we consider finding the Nash Equilibrium in the setting of Markov Games. There has been a rich literature on existence, uniqueness as well as algorithms for finding the Nash Equilibrium. In what follows, we describe the most relevant literature in that regard.

To start with, in Shapley (1953), for finite state and action spaces where game would terminate in a finite number of stages with positive probability, the existence of optimal stationary strategies and uniqueness of the optimal value function are established. For generic state space, the existence of Nash Equilibrium has been established for Markov Games with discounted rewards. Particularly, when the state space is a compact metric space, Maitra & Parthasarathy (1970, 1971) and Parthasarathy (1973) show the uniqueness of value function and existence of optimal stationary policy. The same result has been established by Kumar & Shiau (1981) when the state space is complete, separable metric space. For two-player zero-sum discounted Markov games, the Bellman operator corresponding to the Nash equilibrium is a contraction and hence, the value function is unique and there exists a deterministic stationary optimal policy (Szepesvári & Littman, 1996; Hansen et al., 2013). We also note that existence of Nash equilibrium for a general class of games (stochastic shortest path) is established by Patek (1997). It argues that the optimal value function is unique and can be achieved by mixed stationary strategies.

For computing or finding optimal value function and policy associated with the Nash equilibrium, there are two settings considered in the literature: (i) when system model is entirely known, and (ii) when model is not known but one can sample from the underlying model. In the first setting, classical approaches from the setting of MDPs such as value/policy iteration are adapted to find the optimal value function or policy associated with the Nash equilibrium (Patek, 1997; Hansen et al., 2013). In the second setting which is considered here, various approximate dynamic programming algorithms have been proposed (Szepesvári & Littman, 1996; Bowling & Veloso, 2001; Littman, 2001a,b; Hu et al., 1998; Hu & Wellman, 2003; Lagoudakis & Parr, 2002; Perolat et al., 2015; Pérolat et al., 2016). More recently, with the advance of deep reinforcement learning (Mnih et al., 2015; Lillicrap et al., 2016; Schulman et al., 2015, 2017; Yang et al., 2019a), recent work approximates the value function/policy by deep neural networks (Silver et al., 2016, 2017a,b).

In terms of theoretical results, there has been work establishing asymptotic convergence to the optimal value function when the state space is finite. For example, Q-learning for MDP adapted to the setting of two-player zero-sum games asymptotically converges (Szepesvári & Littman, 1996). Non-asymptotic results are available for model-based algorithms developed for Markov games with finite states, including R-max algorithm (Brafman & Tennenholtz, 2002) and an algorithm that extends upper confidence reinforcement learning algorithm (Wei et al., 2017). Recent work by Sidford et al. (2019) provides an algorithm that computes an -optimal strategy with near-optimal sample complexity for Markov games with finite states. For Markov games where the transition function can be embedded in a given feature space, the work by Jia et al. (2019) analyzes the sample complexity of a Q-learning algorithm. However, non-asymptotic or finite sample analysis for continuous state space without a special structure, such as that considered in this work, receives less attention in the literature.

Comparison with Prior Work. In this work, we develop Explore-Improve-Supervise (EIS) policy when the model is unknown, but one is able to sample from the underlying model. We study the convergence and sample complexity of our approach. Our goal is to provide a formal study on the general framework of EIS. The overall framework is inspired by AlphaGo Zero and inherits similar components. However, we take an important step towards bridging the gap between sound intuitions and theoretical guarantees, which is valuable for a better understanding on applying or extending this framework with different instantiations. We note that EIS bears certain similarities with another AlphaGo-inspired study (Shah et al., 2019). Both works follow the main idea of coupling improvements with supervised learning. However, there are major differences. Shah et al. (2019) focus on approximating value function in deterministic MDPs and only studies a particular instance of the modules. In contrast, we focus on a broader class of algorithms, formulating general principles and studying the guarantees. This poses different challenges and requires generic formulations on properties of the modules that are technically precise and practically implementable.

Finally, as mentioned previously, non-asymptotic analysis for continuous state space, considered in this work, is scarce for Markov games. While there are some results for finite states, the bounds are not directly comparable. For example, the complexity in Perolat et al. (2015) depends on some oracle complexities for linear programming and regression.

For the setting with continuous state space, the sample complexity results in Jia et al. (2019) for Q-learning rely on the assumption of linear structure of the transition kernel. The recent work by Yang et al. (2019b) studies the finite-sample performance of minimax deep Q-learning for two-player zero-sum games, where the convergence rate depends on the family of neural networks. We remark that these belong to a different class of algorithms. We also derive a fundamental mini-max lower bound on sample-complexity for any method (cf. Theorem 4). The lower bound is interesting on its own. Moreover, it shows near optimal dependence on dimension for an instance of our EIS framework.

Organization. The remainder of the paper is organized as follows. We formally introduce the framework of Markov Games and Nash equilibrium in Section 2. Section 3 describes a generic Explore-Improve-Supervise (EIS) algorithm. The precise technical properties for the modules of EIS are then stated in Section 4, under which we establish our main results, convergence and sample complexity of EIS, in Section 5. Finally, a concrete instantiation is provided in Section 6, demonstrating the applicability of the generic EIS algorithm. All the proofs are presented in Appendices.

2 Two-Player Markov Games and Nash Equilibrium

We introduce the framework of Markov Games (MGs) (also called Stochastic Games (Shapley, 1953)) with two players and zero-sum rewards. The goal in this setting is to learn the Nash equilibrium.

2.1 Two-player Zero-sum Markov Game

We consider two-player turn-based Markov games like Go and Chess, where players take turns to make decisions. We denote the two players as P1 and P2. Formally, a Markov game can be expressed as a tuple , where and are the set of states controlled by P1 and P2 respectively, and are the set of actions players can take, r represents reward function, P represents transition kernel and is the discount factor. Specifically, for be the set of feasible actions for player i in a given state . We assume that

Let . For each state , let 1, 2} indicate the current player to play. At state s, upon taking action a by the corresponding player I(s), player 1, 2} receives a reward . In zero-sum games, . Without loss of generality, we let P1 be our reference and use the notation for the definitions of value functions.

Let P(s, a) be the distribution of the new state after playing action a, in state s, by player I(s). In this paper, we focus on the setting where the state transitions are deterministic. This means that P(s, a) is supported on a single state, denotes the state after taking action a at state s.

For each be the policy for player is a probability distribution over Denote by the set of all stationary policies of player i, and let be the set of all polices for the game. A two-player zero-sum game can be seen as player P1 aiming to maximize the accumulated discounted reward while P2 attempting to minimize it. The value function and Q function for a zero-sum Markov game can be defined in a manner analogous to the MDP setting:

where is the expected total discounted reward for P1 if the game starts from state s, players P1 and P2 use the policies respectively. The interpretation for Q-value is similar.

To simplify the notation, we assume that , where A is a finite set. We consider S to be a compact subset of (1). The rewards r(s, a) are independent random variables taking value in for some 0. Define . It follows that absolute value of value function and Q function for any policy is bounded by

Regarding Deterministic Transitions. Let us clarify this assumption. In fact, our approach and main results of EIS framework (i.e., Sections 4 and 5) apply to general non-deterministic cases as well. However, the example in Section 6 considers deterministic cases. In particular, the improvement module is instantiated by a variant of Monte Carlo Tree Search, where a clean non-asymptotical analysis has been only established for the deterministic case (Shah et al., 2019). To facilitate a coherent exposition, we focus on deterministic cases here. Indeed, many games, such as Go and Chess, are deterministic. Additionally, note that one could instantiate our EIS framework with other methods for the non-deterministic cases—for instance, by adapting the sparse sampling oracle (Kearns et al., 2002) as the improvement module—to obtain a similar analysis. As a proof of concept, we provide empirical results in Appendix J on a non-deterministic game with the sparse sampling oracle.

2.2 Nash Equilibrium

Definition 1 (Optimal Counter Policy). Given a policy , policy for P1 is said to be an optimal counter-policy against , if and only if for every , we have . Similarly, a policy is said to be an optimal counter-policy against a given policy P1, if and only if for every

In a two-player zero-sum game, it has been shown that the pairs of optimal policies coincides with the Nash equilibrium of this game (Maitra & Parthasarathy, 1970; Parthasarathy, 1973; Kumar & Shiau, 1981). In particular, a pair of policies is called an equilibrium solution of the game, if is an optimal counter policy against and is an optimal counter policy against . The value function of the optimal policy, fixed point of a -contraction operator. In the sequel, we will simply refer to the strategy as the optimal policy. Finally, we use the concise notation and to denote the optimal value function and the optimal Q-value, respectively, i.e.,

3 EIS: Explore-Improve-Supervise

We describe Explore-Improve-Supervise (EIS) algorithm for learning the optimal value function and optimal policy . The algorithm consists of three separate, but intertwined modules: exploration, improvement and supervised learning. Below is a brief summary of these modules. The precise, formal description of properties

desired from these modules is stated in Section 4, which will lead to convergence and correctness of the EIS algorithm as stated in Theorem 2. Section 6 provides a concrete example of modules of EIS satisfying properties stated in Section 4.

Exploration Module. To extract meaningful information for the entire game, sufficient exploration is required so that enough representative states will be visited. This is commonly achieved by an appropriate exploration policy, such as -greedy policy and Boltzmann policy. We require the existence of an exploration module guaranteeing sufficient exploration.

Improvement Module. For the overall learning to make any progress, the improvement module improves the existing estimates of the optimal solution. In particular, given the current estimates and for , for a state s, a query of the improvement module produces better estimates and that are closer to the optimal

Supervised Learning Module. The previous two modules can be collectively viewed as a data generation process: the exploration module samples sufficient representative states, while a query of the improvement module provides improved estimates for the optimal value and policy. With these as training data, supervised learning module would learn and generalize the improvement of the training data to the entire state space. Subsequently, the trained supervised learning module produces better estimates for and

Combining together, the three modules naturally lead to the following iterative algorithm whose pseudocode is provided in Algorithm 1. Initially, the algorithm starts with an arbitrary model for value function and policy. In each iteration 1, it performs two steps:

Step 1. Data Generation. Given current model : for current state s, query the improvement module to obtain better estimates than the current estimates query the exploration module to arrive at the next state ; repeat the above process to obtain training data of

Step 2. Supervised Learning. Given the improved estimates , , use the supervised learning module to build a new model

Intuitively, the iterative algorithm keeps improving our estimation after each iteration, and eventually converges to optimal solutions. The focus of this paper is to formally understand under what conditions on each of the exploration, improvement and supervised learning module does the algorithm work. Of course,

proof is in the puddling—we provide examples of existence of such modules in Section 6.

4 Properties of Modules

In this section we formally state the desired properties of each of the three modules of EIS. With these properties, we establish convergence and correctness of EIS algorithm in Section 5 to follow. We remark that the properties are not made for the ease of technical analysis. Examples satisfying them shall be provided in Section 6.

4.1 Improvement Module

This module improves both value function and policy. The value function is real-valued, whereas policy for each given state can be viewed as a probability distribution over all possible actions. This requires a careful choice of metric for quantifying improvement. Let be the estimates output by the improvement module in the l-th iteration of EIS. Improvement of value function means . Improvement for policy is measured by the KL divergence between and . Here some care is needed as KL divergence would become infinite if supports of the distributions mismatch.

Note that the optimal policy only assigns positive probability to the optimal actions. On the other hand, there is no guarantee that always has a full support on A. To overcome these challenges, we instead measure the KL divergence with an alternative "optimal policy" that guarantees a full support on A. This naturally leads to the optimal Boltzmann policy: given a temperature 0, the optimal Boltzmann policy is given by

If instead in the above equation to construct the Boltzmann policy (Recall that player P1 is set to be our reference in Section 2). By definition, is guaranteed to be finite for any estimate . Furthermore, converges to 0. Therefore, we could use the KL divergence with a small enough to measure the improvement of the estimates.

Finally, it makes sense to take into account the number of samples (i.e., observed state transitions) required by the module to improve the policy and value function. We now formally lay down the following property for the improvement module.

Property 1. (Improvement Property) Suppose the current model (potentially random) has estimation errors for the value and policy estimates, respectively, i.e.,

where the expectations are taken with respect to the randomness of the model

where the expectations are with respect to the randomness in the model f and the improvement module.

Property 1 allows for a randomized improvement module, but requires that on average, the errors for the value and policy estimates should strictly shrink.

4.2 Supervised Learning Module

To direct the model update in an improving manner, the supervised learning step (line 10 of Algorithm 1) should be able to learn from the training data, and , and generalize to unseen states by preserving the same order of error as the training data. Generically speaking, generalization would require two conditions: (1) sufficiently many training data that are “representative” of the underlying state space; (2) the model itself is expressive enough to capture the characteristics of the function that is desired to be learned.

Before specifying the generalization property, let us provide a few remarks on the above conditions. Condition (1) is typically ensured by using an effective exploration module. Recall that the state space is continuous. The exploration module should be capable of navigating the space until sufficiently many different states are visited. Intuitively, these finite states should properly cover the entire space, i.e., they are representative of the entire space so that learning from these states provide enough information for other states. Formally, this means that given the current estimation errors and for the optimal value and policy, there exists a sufficiently large set of training states, such that supervised learning applied to those training data would generalize to the entire state space with the same order of accuracy. The precise definition of representative states may depend on the particular supervised learning algorithm.

Regarding condition (2), generalization performance of traditional models has been well studied in classical statistical learning theory. More recently, deep neural networks exhibit superior empirical generalization ability, although a complete rigorous proof seems beyond the reach of existing techniques. Our goal is to seek general principle underlying the supervised learning step and as such, we do not limit ourselves to specific models—the learning model could be a parametric model that learns via minimizing empirical squared loss and cross-entropy loss, or it could be a non-parametric model such as nearest neighbors regression. With the above conditions in mind, we state the following general property for the supervised learning module:

Property 2. (Generalization Property) Let temperature , estimation errors be given. There exists at least one set of finite states, denoted by , with size that the following generalization bound holds:

Suppose that a training datasetand the following error guarantees:

where the expectation is taken with respect to the randomness of the value and . Then, there exist non-negative universal constants such that after querying the supervised learning module, i.e., Supervised Learning Module

4.3 Exploration Module

With the above development, it is now straightforward to identify the desired property of the exploration module. In particular, as part of the data generation step, it should be capable of exploring the space so that a set of representative states are visited. Consequently, the supervised learning module can then leverage the training data to generalize. Formally, let E be the set of all possible representative sets that satisfy the Generalization Property:

Denote by the set of states explored by querying the exploration module up to time t, with being the initial state and Exploration Module(cf. line 7 of Algorithm 1). We now state the exploration property, which stipulates that starting at an arbitrary state s, the explored states should contain one of the representative sets in E, within a finite number of steps.

Property 3. (Exploration Property) Given the temperature , and estimation errors for the value and policy, define

Then, the exploration module satisfies that

for some independent of s. The above expectation is taken with respect to the randomness in the exploration module and the environment (i.e., state transitions).

In the sequel, when the context is clear or the initial state does not matter, we usually drop the dependence in s to simplify the notation, i.e.,

5 Main Results: Convergence Guarantees and Sample Complex-

5.1 Convergence Guarantees

As the main result of this paper, we establish convergence of the EIS algorithm under the three desired properties given in Section 4, and quantify the corresponding finite sample complexity. We also provide an algorithm-independent minimax lower bound; in Section 6 we introduce an instance of EIS that essentially matches this lower bound.

Theorem 2. Given a small enough 0, let Properties 1, 2 and 3 hold. Let and be initialization errors. Then for a given 0, 1), with appropriate parameters for Algorithm 1, the output -th iteration satisfies

value function and the optimal Boltzmann policy exponentially with respect to the number of iterations. In particular, after

iterations, we can obtain estimates for both that are within estimation errors. We note that with a sufficiently small temperature, is close to the optimal policy . Therefore, the model can be close to for a large L.

5.2 Sample Complexity

We can also characterize the sample complexity of the EIS algorithm. Recall that the sample complexity is defined as the total number of state transitions required for the algorithm to learn -approximate value/policy function. The sample complexity of Algorithm 1 comes from two parts: the improvement module and the exploration module. Recall that the improvement module requires samples for each call (cf. Property 1). The sample complexity of exploration module is proportional to satisfies (cf. Property 3). The following proposition bounds the sample complexity in terms of the above relevant quantities.

Proposition 3. Consider the setting of Theorem 2. Then, with probability at least 1 , the convergence result (i.e., Eqs (5) and (6)) is achieved with sample complexity M such that

In Section 6, we provide a concrete instance of EIS that finds -approximate value function and policy of Nash equilibrium with transitions.

5.3 A Generic Lower Bound

To understand how good the above sample complexity upper bound is, we establish a lower bound for any algorithm under any sampling policy. In particular, we leverage the the minimax lower bound for the problem of non-parametric regression (Tsybakov, 2009; Stone, 1982) to establish the lower bound, as stated in the following theorem.

Theorem 4. Given an algorithm, let be the estimate of after T samples of transitions for the given Markov game. Then, for each , there exists a two-player zero-sum Markov game such that in order to achieve it must be that

where is an algorithm-independent constant.

6 Implementation: A Concrete Instantiation of the Key Modules

In this section, we demonstrate the applicability of the generic EIS algorithm by giving a concrete instantiation. Specifically, we will use a variant of Monte Carlo Tree Search (MCTS) as the improvement module, nearest neighbor regression as the supervised learning module, and random sampling as the exploration module. We prove that all properties in Section 4 are satisfied. This shows that these properties are reasonable and hence gives a strong support for the generic recipe developed in this paper. Due to space limit, we provide high-level discussions here with informal technical results, and defer precise statements to Appendix E.

Improvement Module: MCTS. Recall that the improvement module should be able to provide improved estimates for the value and policy functions, at the queried state s. Since both the value and policy are related to the Q function, one approach for estimate improvement is to first obtain better estimates and then construct the improved estimates of value and policy from . We will take this approach in this example and use MCTS to obtain the estimates of (see Algorithm 2 in Appendix E). We assume the existence of a generative model (i.e., a simulator). The following theorem states the property of this specific improvement module, which directly implies the desired improvement property, i.e., Property 1.

Theorem 5 (Informal Statement, Theorem 11, Appendix E). Suppose that the state transitions are deterministic. Given the current model such that the value model with appropriately chosen parameters for MCTS, for each query state , the output

MCTS

The above is achieved with a sample complexity of

Supervised Learning Module: Nearest Neighbor Regression. We employ a nearest neighbor algorithm to learn the optimal value function and policy. Intuitively, suppose that the optimal value function and the Boltzmann policy is Lipschitz in the state space, then this algorithm will generalize if there are sufficiently many (say K) training data points around each state in the state space S. Quantitatively, consider covering S with balls of diameter h > 0. We call the training data (h, K)-representative if each covering ball has at least K training data. Here, h and K would depend on the temperature and estimation errors of the training data.

Proposition 6 (Informal Statement, Proposition 12, Appendix E). Under appropriate regularity conditions, if the training data is representative with respect to appropriate chosen h > 0 and K > 0, the nearest neighbor supervised learning satisfies Property 2. In particular, given training data with estimation errors and , we have

where the constant c > 0 is independent of n, the size of training data.

As discussed in Appendix E, the representative number of data points for training required in the above for generalization depends on the property of the state-space. For example, if state space is the unit ball in d dimension, for generalization error scaling with we require representative data points scaling as

Exploration Module: Random Sampling Policy. In the above supervised learning module, the sampled states for nearest neighbor regression should be (h, K)-representative. In other words, to satisfy the exploration property, the exploration module must visit a set of (h, K)-representative states within a finite expected number of steps. We show that a uniformly random sampling policy achieves this. Let N(h) be the h/2-covering number of the compact state space.

Proposition 7 (Informal Statement, Proposition 13, Appendix E). Under appropriate regularity conditions, with h, K chosen as per desired the estimation errors, , for the value and policy, the expected number of steps to obtain a set of (h, K)-representative states under the random sampling policy is upper bounded by

Convergence Guarantees and Sample Complexity of the Instance. For this instance of EIS, we have shown that each module satisfies the desired properties. Therefore, the convergence result stated in Theorem 2 holds for this specific instance. Below we make this result explicit, providing concrete bounds on the estimation errors and sample complexity. In the following, the s denote appropriate constants. Please refer to Appendix E for details.

Theorem 8 (Informal Statement, Theorem 14, Appendix E). For a given , there exist appropriately chosen parameters for this instance such that:

3. In particular, if S is a unit volume hypercube in , then the total sample complexity to achieve value function and policy is given by

Theorem 8 states that for a unit hypercube, the sample complexity of the instance of EIS scales as (omitting the logarithmic factor). Note that the minimax lower bound in Theorem 4 scales as . Hence, in terms of the dependence on the dimension, the instance we consider here is nearly optimal. We note that the sample complexity results from two parts: the MCTS contributes a sample complexity scaling as due to simulating the search tree, while nearest neighbor requires samples due to the need of sufficiently many good neighbors. Obtaining tighter bound with potentially more powerful improvement module or supervised learning module such as neural networks is an interesting future avenue.

7 Conclusion

In this paper, we take theoretical steps towards understanding reinforcement learning for zero-sum turn-based Markov games. We develop the Explore-Improve-Supervise (EIS) method with three intuitive modules intertwined carefully. Such an abstraction of three key modules allows us to isolate the fundamental principles from the implementation details. Importantly, we identify conditions for successfully finding the optimal solutions, backed by a concrete instance satisfying those conditions. Overall, the abstraction and the generic properties developed in this paper could serve as some guidelines, with the potential of finding broader applications with different instantiations. Finally, it would be interesting to extend this framework to general Markov games with simultaneous moves. We believe the generic modeling techniques in Section 4 could be applied, but a key challenge is to develop an improvement module with rigorous non-asymptotic guarantees that satisfies the desired property. We believe that addressing this challenge and formally establishing the framework is a fruitful future direction.

References

Bowling, M. and Veloso, M. Rational and convergent learning in stochastic games. In International joint conference on artificial intelligence, volume 17, pp. 1021–1026. Lawrence Erlbaum Associates Ltd, 2001.

Brafman, R. I. and Tennenholtz, M. R-max-a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213–231, 2002.

Hansen, T. D., Miltersen, P. B., and Zwick, U. Strategy iteration is strongly polynomial for 2-player turn-based stochastic games with a constant discount factor. J. ACM, 60(1):1:1–1:16, February 2013. ISSN 0004-5411.

Hu, J. and Wellman, M. P. Nash q-learning for general-sum stochastic games. Journal of machine learning research, 4(Nov):1039–1069, 2003.

Hu, J., Wellman, M. P., et al. Multiagent reinforcement learning: theoretical framework and an algorithm. In ICML, volume 98, pp. 242–250. Citeseer, 1998.

Jia, Z., Yang, L. F., and Wang, M. Feature-based q-learning for two-player stochastic games, 2019.

Kaufmann, E. and Koolen, W. M. Monte-carlo tree search by best arm identification. In Advances in Neural Information Processing Systems, pp. 4897–4906, 2017.

Kearns, M., Mansour, Y., and Ng, A. Y. A sparse sampling algorithm for near-optimal planning in large markov decision processes. Machine learning, 49(2-3):193–208, 2002.

Kocsis, L., Szepesvári, C., and Willemson, J. Improved Monte-Carlo search. Univ. Tartu, Estonia, Tech. Rep, 2006.

Kumar, P. R. and Shiau, T.-H. Existence of value and randomized strategies in zero-sum discrete-time stochastic dynamic games. SIAM Journal on Control and Optimization, 19(5):617–634, 1981.

Lagoudakis, M. G. and Parr, R. Value function approximation in zero-sum markov games. In Proceedings of the Eighteenth conference on Uncertainty in artificial intelligence, pp. 283–292. Morgan Kaufmann Publishers Inc., 2002.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. 2016.

Littman, M. L. Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings 1994, pp. 157–163. Elsevier, 1994.

Littman, M. L. Friend-or-foe q-learning in general-sum games. In ICML, volume 1, pp. 322–328, 2001a.

Littman, M. L. Value-function reinforcement learning in markov games. Cognitive Systems Research, 2(1): 55–66, 2001b.

Maitra, A. and Parthasarathy, T. On stochastic games. Journal of Optimization Theory and Applications, 5 (4):289–300, 1970.

Maitra, A. and Parthasarathy, T. On stochastic games, ii. Journal of Optimization Theory and Applications, 8(2):154–160, 1971.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Parthasarathy, T. Discounted, positive, and noncooperative stochastic games. International Journal of Game Theory, 2(1):25–37, 1973.

Patek, S. D. Stochastic Shortest Path Games: Theory and Algorithms. PhD dissertation, Massachusetts Institute of Technology, 1997.

Perolat, J., Scherrer, B., Piot, B., and Pietquin, O. Approximate dynamic programming for two-player zero-sum markov games. In Proceedings of the 32nd International Conference on Machine Learning, pp. 1321–1329, 2015.

Pérolat, J., Piot, B., Geist, M., Scherrer, B., and Pietquin, O. Softened approximate policy iteration for markov games. In ICML 2016-33rd International Conference on Machine Learning, 2016.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897, 2015.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Shah, D. and Xie, Q. Q-learning with nearest neighbors. In Advances in Neural Information Processing Systems 31, pp. 3115–3125. 2018.

Shah, D., Xie, Q., and Xu, Z. On reinforcement learning using Monte Carlo tree search with supervised learning: Non-asymptotic analysis. arXiv preprint arXiv:1902.05213, 2019.

Shapley, L. S. Stochastic games. Proceedings of the National Academy of Sciences, 39(10):1095–1100, 1953.

Sidford, A., Wang, M., Yang, L. F., and Ye, Y. Solving discounted stochastic two-player games with near-optimal time and sample complexity, 2019.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. Mastering chess and Shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017a.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017b.

Stone, C. J. Optimal global rates of convergence for nonparametric regression. The Annals of Statistics, pp. 1040–1053, 1982.

Szepesvári, C. and Littman, M. L. Generalized Markov decision processes: Dynamic-programming and reinforcement-learning algorithms. In Proceedings of International Conference of Machine Learning, volume 96, 1996.

Tsybakov, A. B. Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer, 2009.

Wainwright, M. J. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019.

Wei, C.-Y., Hong, Y.-T., and Lu, C.-J. Online reinforcement learning in stochastic games. In Advances in Neural Information Processing Systems, pp. 4987–4997, 2017.

Yang, Y., Zhang, G., Xu, Z., and Katabi, D. Harnessing structures for value-based planning and reinforcement learning. arXiv preprint arXiv:1909.12255, 2019a.

Yang, Z., Xie, Y., and Wang, Z. A theoretical analysis of deep q-learning, 2019b.

Appendices

A Preliminary Facts

The following inequalities are used for developing our technical results: Jensen’s Inequality: Let X be a random variable and be a convex function, then

Pinsker’s Inequality: Let and be two probability distributions, then the total variation distance TVand the KL divergences satisfy the bound

Note that if are discrete distributions, then TVdenotes the total variation (or

then

B Proof of Theorem 2

Proof. With the three detailed properties, the proof is conceptually straightforward. At each iteration, the improvement module would produce better estimates for the explored states, by factors of and . The exploration continues until one of the desired representative sets in E has been visited, and the exploration property guarantees that the exploration time will be finite. The current iteration then ends by calling the supervised learning module to generalize the improvement to the entire state space. In what follows, we make these statements formal.

Let us first introduce some notion. We will use the term iteration to refer to a complete round of improvement, exploration and supervised learning (cf. Line 2 of Algorithm 1). In general, at each iteration, we use a superscript (l) to denote quantities relevant to the l-th iteration, except that for the supervised learning module, we follow the convention in the paper and use a subscript . We denote by all the information during the l-th iteration. Let be the sigma-algebra generated by the stochastic process , where the randomness comes from the environment and any randomness that may be used in the three modules. Let be the estimation errors of the model at the beginning of the iteration, i.e.,

We use , to denote the set of training data generated by the exploration module and querying the improvement module during the l-th iteration. Let be the set of states visited by the exploration module. Correspondingly, the estimation errors for the value function and the optimal policy after querying the improvement module are denoted by , respectively:

At the l-th iteration, the supervised learning modules takes the outputs of the improvement module, as the training data. Let and denote the estimation errors for the new model , after querying the supervised learning module:

For the supervised learning module, according to the generalization property (cf. Property 2), when the size of training set is sufficiently large and the sampled states are representative of the state space, the same order of accuracy of the training data will be generalized to the entire state space. For now, let us assume that this is the case; we will come back to verify the generalization bound in Property 2 can indeed be satisfied by . Then, the following bounds hold:

Hence

Therefore, when querying the improvement module, if we select the improvement factors to be

then we have

It is worth taking note of the fact that and would be larger than 1 (cf. Property 2): a reasonable supervised learning model may generalize the same order of accuracy as training data, but unlikely for it be smaller; hence, are required to be smaller than 1 in Property 1 so that

By definition, . Therefore, we have the desired inequalities:

Finally, to complete the proof, as we mentioned before, we need to verify that we could sample enough representative states at each iteration in finite time steps. This is indeed guaranteed by the exploration property. In particular, note that at the l-th iteration, we would like to sample enough representative states that are of errors and for the value and policy functions (cf. Eqs. (7) and (8)). By a recursive argument (cf. Eqs. (10) and (11)), it is not hard to see that at the l-th iteration, we need to query the exploration module until the sampled states,