Approximate exploitability: Learning a best response in large games

2020·arXiv

Abstract

1 Introduction

The increasing incorporation of artificial intelligence into the systems of human life has created a growing impetus to understand the safety and robustness implications of using these algorithms in critical systems. Unexpected failures of AI systems will erode trust, particularly in high consequence settings like medicine and autonomous vehicles. Researchers have been turning their attention to this problem [Amodei et al., 2016], highlighting how neural networks trained by deep learning techniques can suffer from surprising failure modes due to distribution shift [Quionero-Candela et al., 2009]: the situation where a network is trained based on one distribution of data, but then used under a different distribution. This phenomenon has presented itself in a variety of areas, including adversarial examples in image classification [Athalye et al., 2018], and reinforcement learning in both single agent [Witty et al., 2018] and multi-agent settings [Gleave et al., 2020].

Artificial intelligence research has long used games as benchmarks for progress, including milestone achievements in Checkers [Schaeffer et al., 1996], Chess [Campbell et al., 2002], Poker [Moravˇc´ık et al., 2017; Brown and Sand- holm, 2017], Go [Silver et al., 2018], and Atari [Mnih et al., 2015], among others. In these games, agent evaluation is commonly based on the outcomes generated in practice through play. In single agent domains, a game score such as in Atari [Bellemare et al., 2013] is often used to assess quality. However, in multi-agent domains, the in-practice performance of an agent depends directly on the assumed population of other agents [Bard, 2016]. Furthermore, agent performance is often not strictly transitive [Tuyls et al., 2018] – with agent A beating B, B beating C, and C beating A – making measures such as Elo problematic [Balduzzi et al., 2018; Omidshafiei et al., 2019].

While such evaluation is important, less attention is typically given to the potential worst-case performance of an agent. Dominant algorithms for two-player zero-sum games rely on self-play to train an agent, resulting in agents that have observed a particular distribution of play. However, other agents may induce distribution shift through entirely valid play, either deliberately due to the adversarial nature of these domains, or simply through sub-optimal play. This makes an agent’s robustness to worst-case performance an important complementary dimension of agent evaluation. Furthermore, in light of the low consequence, repeatable, and controllable nature of games, we believe exploring agent robustness and AI safety concerns in these domains is another opportunity for games research to drive AI forward as safety considerations grow in importance.

Prior research in computer poker has attempted to directly evaluate an agent’s performance against a worst-case (i.e. best responding) adversary, as game-theoretic algorithms approximating a Nash equilibrium are standard [Zinkevich et al., 2008] and seek to optimize this (for two-player zero-sum games). Exactly computing a policy’s exploitability – its loss relative to a Nash in the worst-case – was feasible in heads-up limit Texas hold’em poker (HUL) [Johanson et al., 2011], having roughly decision points. Unfortunately, other common games are vastly larger, such as heads-up no-limit Texas hold’em (HUNL) (with around decision points [Johanson, 2013]) or Go (approximately decision points [Allis, 1994]). In these games, computing exploitability or the best response strategy is intractable. However, the local best response (LBR) [Lis´y and Bowl-

ing, 2016] approximation was used to provide a lower-bound on the exploitability of HUNL strategies. LBR uses explicit knowledge of the distribution over its opponent’s private state with a shallow search and simple poker-specific heuristic value function. Though simple, LBR’s reliance on domain knowledge and notable sensitivity to the various choices of abstractions and heuristics make it challenging to use broadly. We generalize the approach introduced by LBR, introducing a scalable domain-independent alternative that uses deep reinforcement learning (RL) combined with search and knowledge of the opponent’s distribution over private states to learn a value function for the best response in a two-player zero-sum game. By fixing the policy of the agent being evaluated, the environment becomes single-agent from the perspective of the agent, and the optimal policy from RL is exactly a best response [Greenwald et al., 2017]. By exploiting access to the agent’s policy during evaluation, we hope to simplify the learning task, reducing the representational power needed by the network, and improving sample effi-ciency and accuracy. Our contributions are as follows:

• In Section 3.2, we introduce our best response algorithm, ISMCTS-BR.

• In Section 4.1, we demonstrate the robustness of ISMCTS-BR across a large number of games & policies by showing that it is able to closely approximate the exact best response.

• In Section 4.2, we demonstrate that ISMCTS-BR can exploit strong, search-based agents, and provide evidence that deep reinforcement learning techniques can be vulnerable to distribution shift, highlighting the importance of worst-case performance measures in agent evaluation.

2 Background and Terminology

An extensive-form game is a sequential interaction between players , where c is the chance player, i.e. a player following a static, stochastic policy to define the transition probabilities given states and actions. We use to refer to the non-chance opponent of player i. In this paper, we focus on the two-player zero-sum setting, and in particular, those where beliefs about the opponent’s private state are tractable, i.e. sufficiently small to be represented in memory. The game starts in the root (or empty) history . On each turn, a player i (possibly chance) chooses an action , changing the history to . The full history is sometimes also called a ground state because it uniquely identifies the true state, as it consists of all actions including those not observed by all players. For example, a history in poker includes all the dealt cards, even private ones, in addition to the sequence of actions taken. We define an information state (or infostate) for player i as the state as perceived by an agent which is consistent with its observations.1 Formally, each s is a set of histories, specifically the sequence of player i’s observations along h and are equal. Informally, this consists of all the histories which the player cannot distinguish given the information available to them. We call Z the set of terminal histories, which correspond to the end of a game. Each has a utility for each player . Since players cannot observe the ground state h, policies are defined as , where is the set of probability distributions over the legal actions available to player i at .

For example, in Texas hold’em poker, the root history is when no cards have been dealt and no betting has happened. The history includes any private cards that have been dealt, while the information state for player i includes the private cards for that player, and the information that the other player has received cards, but not which cards those are. Both the history and the information state include all public cards and all betting actions. A terminal history is a history where either a showdown has occurred or a player has folded. is then the number of chips that each player has won.

In a perfect information game such as Chess or Go, the history is identical to the information state, since everything is public. As is common in perfect information games, we use an encoding of the current board state (or a stack of recent ones) as input for training, following [Silver et al., 2018]. is -1 for a loss, 1 for a win, and 0 for a draw when possible (e.g. in Chess). We assume finite games, so every history h is bounded in length. The expected value of a joint policy (all players’ policies) for player i is defined as , where the terminal histories are composed of actions from the joint policy.

2.1 Optimal Policies

The standard game-theoretic solution concept for optimal policies in two-player zero-sum games is Nash equlibrium, defined as a strategy profile (i.e., joint policy) where none of the players benefits from deviating unilaterally. A strategy profile is a Nash equilibrium iff

In two-player zero-sum games, this concept is equivalent to the minmax solution concept, and the expected value for playing optimally is unique:

The value (2) is referred to as the game value (for player 1), . In two-player zero-sum games, the game value is thus . Both solutions concepts thus optimize against the worst-case opponent — the best response. We define the set of player i best responses to an opponent policy ,

For convenience, we refer to elements in this set, best responses to , as . When a player uses an optimal policy , their utility is then guaranteed (on expectation) to achieve at least the game values: .

2.2 Exploitability

For suboptimal policies , player i’s incentive to deviate is:

One metric to quantify the distance from optimal is NASHCONVand EXPLOITABILITYNASHCONV. In zero-sum games, when summing over players, the second terms sum to zero, so NASHCONV sim-plifies to . Furthermore, the -minmax (or -Nash equilibrium) policy is one where .

Exploitability measures how well a strategy profile approximates a Nash equilibrium — the closer it is to zero, the closer the policy is to optimal. Unfortunately, exploitability can be difficult to compute exactly: in the worst case, it requires traversing the full game tree, making it intractable in large games. However, by approximating exploitability, we can address tractability while still assessing worst-case performance.

3 Approximate Best Response

By fixing the policy of one agent, the environment becomes a (stochastic) single agent environment where one can use standard reinforcement learning methods to learn a best response, as they are optimal policies in that environment [Greenwald et al., 2017]. The fixed agent is exploitable by the reward obtained by the optimal policy.

While exploitability is defined using an exact bestresponding policy , we can approximate this policy and define a corresponding metric, the approximate best response (ABR), with the corresponding APPROXIMATENASHCONV (ANC) as:

We define approximate exploitability similarly. Since it uses an approximate best response, the approximate exploitability is a lower bound on the true value, with equality when the approximation matches the exact best response.

3.1 Local Best Response

The poker community has used local best response (LBR) [Lis´y and Bowling, 2016] as a proxy for the full best response [Moravˇc´ık et al., 2017; Brown et al., 2018]. LBR approximates the best response by performing a local search aided by explicit knowledge of the distribution over its opponent’s private state and a value function, typically a hand-crafted heuristic function. LBR is fast to run and easy to understand; it can often be implemented in a small amount of code.

However, the results LBR obtains can be sensitive to the specific experimental setup. For example, when running against the same fixed policy using the same heuristic value approximator in action-abstracted no-limit poker, the values can vary wildly depending on the choice of action abstraction (e.g. from -867 mbb/h to 46 mbb/h [Gruslys et al., 2020] and from 496 mbb/h to 3763 mbb/h [Lis´y and Bowling, 2016]). Additionally, while the check-call value function has had much empirical success in poker, it is not obvious what a strong choice for value function is in other games.

3.2 IS-MCTS Best Response Search

LBR performs a depth-limited tree search, using a value function to truncate the search. This is similar to MCTS, which has been shown to perform remarkably well with a learned value function [Silver et al., 2018]. A natural extension to LBR then is to combine it with MCTS. Since we want an algorithm that works on imperfect information games, we use a variant of MCTS, information set MCTS (IS-MCTS) [Cowl- ing et al., 2010]. However, IS-MCTS does not sample from the belief distribution when searching at an information state. While empirically this works to exploit basic agents, this does not converge to the exact best response in the limit [Lis´y et al., 2015].

Instead, we calculate the exact posterior distribution over histories with the same information state given the opponent’s policy, and use that to sample from opponent information states.2 To the best of our knowledge, ours is the first algorithm to apply IS-MCTS with the exact posterior distribution over opponent private states and PUCB.

We call this algorithm ISMCTS-BR and present it in Algorithm 1. GETNODE retrieves the node from the search tree corresponding to the information state s. We abuse notation and let GETNODE operate on histories h as well, in which case GETNODE retrieves the node corresponding to the information state for the history from the current player’s perspective. SAMPLEHISTORYFROMINFOSTATE samples a history h (i.e. private information for the opponent) from the exact belief distribution induced by the opponent’s policy. In imperfect information games with perfect recall, the belief Pr(h|s) is a function of the opponent’s policy and can be obtained by applying Bayes’ rule given the reach probabilities of each history (see [Srinivasan et al., 2018, Section 3.2]). T and r are used to updated the search tree before proceeding to the next simulation. UPDATESEARCHTREE and CHOOSEACTION follow the standard MCTS algorithm [Sil- ver et al., 2018]; an IS-MCTS implementation can be found in [Lanctot et al., 2019].

Running ISMCTS-BR requires having access to the oppo-

nent policy, as we query the opponent policy during Alg. 2 when the opponent would act. However, in practice, this is an easy restriction to relax, as replacing the exact opponent policy with a different policy weakens the quality of the resulting BR, but the result is still an approximate best response. For example, if an opponent policy is slow to query, we can train a function approximator to approximate the policy, and query the approximation during each simulation. If we only have access to a “black box” policy that returns an action, rather than a probability distribution, we can either train a function approximator or query the policy repeatedly; either would approximate the probability distribution. This generality is an advantage of the approximate best response framework: the proof is in the pudding, so to speak, as each ABR is a lower bound on how exploitable the given policy is. As such, one can choose from multiple different ABRs depending on the circumstance.

4 Experimental Evaluation

We use an experimental setup identical to [Silver et al., 2018] except where otherwise noted. We train for only 100k learning steps rather than 800k steps. We use a distributed actor/learner setup to train our neural network. Each experiment requires roughly 128 Cloud TPUv4 chips for the actors, and 4 TPUv2 chips for the learners. The TPU version choice was arbitrary and based on internal availability at the time we ran our experiments. The results we report, unless otherwise indicated, are the average results over the last five network updates in the run; as the network is updated every 500 minibatch steps, and each minibatch uses 2048 examples, this is roughly 5 million states of data. On iteration t, network pa-

Table 1: ISMCTS-BR ANC (%) vs UniformRandom.

rameters are updated using a loss function that combines the mean-squared error between predicted expected reward and the Monte Carlo return for the episode , the crossentropy loss between the prior policy predicted by the network pand the policy induced by the normalized visit counts during search , and regularization:

Since the reward and prior are learned for information states, not histories, the network learns an average across histories, allowing the agent to act according to the expected value of information state s.

We present results in imperfect information games, large perfect information games, Scotland Yard (a classic pursuit-evasion game on a graph), and two large variants of Texas hold’em poker: heads-up limit (HUL) and heads-up no-limit (HUNL). As these are widely-used domains, we provide descriptions in the appendix. Opponents range from basic and easy to exploit (e.g. uniform random) to agents that use search and are hard to exploit (e.g. AlphaZero).

4.1 Validation

We begin by validating ISMCTS-BR with policies & domains that are simple enough to allow us to compute the exact exploitability.

vs. Uniform Random. We evaluate ISMCTS-BR versus uniform random in numerous games. Table 1 shows that ISMCTS-BR consistently performs well, achieving 95% of achievable reward in most cases.4 GoofspielN refers to a variant of Goofspiel with N cards. SY refers to Scotland Yard. “Glasses”, “War Museum”, “Queen Mary’s Garden” and “Kennington Double” refer to custom graphs of Scotland

Table 2: ABR vs poker chumps, percentage of the exact BR value achieved. HUNL(fcpa) limits actions to fold, call, pot bet, and all-in.

Table 3: Comparison of (approximate) NashConv values for HUL chumps. Units are in mbb/h. ANC is ISMCTS-BR’s ApproximateNashConv.

Yard that are small enough to compute exact NashConv (see the appendix for the specific graphs).

vs. Poker Chumps. Simple poker policies called “chumps” have been evaluated in prior work. These policies either always fold, always call, always raise, or mix between call and raise with equal probability. Table 2 presents the results as a percentage of the value achieved by an exact best response5. In most cases, the reward achieved by ISMCTS-BR quickly converges to the same value as the exact best response. Practitioners should note that this evaluation was particularly helpful in developing ISMCTS-BR as it enabled testing on a range of behaviour at different game scales using exact exploitability values that were generally known.

We can compare LBR with ISMCTS-BR and the exact BR against chump policies in poker. Table 3 shows that ISMCTSBR is able to improve upon LBR in HUL, except for the case of always call, where LBR’s value function exactly matches the opponent’s behaviour. LBR is notably weaker against uniform random. Results in HUNL echo this, with ISMCTS-BR performing similarly against always call (49.7 bb/h instead of LBR’s 49.0), but performing better than the best LBR result reported against the call/raise chump in [Lis´y and Bowling, 2016]: 47.9 bb/h instead of LBR’s 24.4. This is a shortcoming of LBR: the specific choice of heuristic introduces bias, and it can fail to optimally exploit weak agents.6

vs. Nash Approximations. One limitation of the previous opponents is that they behave the same regardless of the game’s state. To address this, we examine ISMCTS-BR in several small imperfect information games against Nash ap-

Table 4: Approx. NashConv (% of exact NashConv) for ISMCTSBR vs CFR+ with 10 iters.

proximations generated with 10 iterations of CFR+ [Tam- melin et al., 2015], ensuring the policy remains exploitable. Results in Table 4 show that ISMCTS-BR can exploit these more sophisticated policies. Against CFR+ with more than 10 iterations, ISMCTS-BR also yields exploitability close to the exact value. However, as the exploitability approaches zero with more iterations, small absolute variations cause large percentage variations.

Nash approximations also provide an opportunity to demonstrate the richness of evaluation enabled by learned best responses. We trained ISMCTS-BR in Leduc poker against CFR+ with 7 iterations and always fold, which have exact exploitabilities of 1.01 and 1, respectively. If practitioners only consider the exact exploitability, or the final exploitability produced by ABR, these policies would appear similar. However, Figure 1 shows that CFR+7 is more diffi-cult to exploit — reflecting the strategy’s relative complexity.

Finally, we ran ISMCTS-BR vs Slumbot [Jackson, 2013], a strong agent in heads-up no-limit Texas hold’em. We evaluate a best effort reproduction of Slumbot 20167 using ISMCTSBR and compare performance to prior LBR results for the original Slumbot 2016 [Lis´y and Bowling, 2016]. ISMCTSBR managed to exploit Slumbot by 1259 mbb/h when ABR acted using the FCPA betting abstraction. This outperforms the prior LBR results for responses that were not forced to check/call through the first two rounds, which won 522 and 763 mbb/h using the fold/call and 56 bets action abstractions, respectively. By constraining LBR’s actions per round — limiting it to check/call until the third round — LBR was able to win 4020 and 3763 mbb/h using the fold/call/pot/all-in and 56 bets action abstractions. While ISMCTS-BR uses the fcpa abstraction, we have not restricted it per round. Prior LBR results do not evaluate the fcpa abstraction without restrictions, but the contrast in performance when using the 56 bets abstraction is considerable, both against Slumbot and the other agents that were evaluated, suggesting that this domain-specific hand-tuning was important. The generality of ISMCTS-BR allows it to find a substantial degree of exploitability, even in high-quality agents, across disparate domains and despite a lack of such domain knowledge.

Figure 1: ISMCTS-BR training curves in Leduc poker vs. always fold and CFR+ with 7 iterations.

4.2 Approximate Exploitability in Go

We use ISMCTS-BR to evaluate two agents trained following [Silver et al., 2018]: one as described in the paper (AlphaZero MSE), and another using the 25-step TD error to train the value function (AlphaZero TD25).8 During the ISMCTSBR search, we use the AlphaZero policy head as the opponent policy, rather than the full MCTS algorithm. When AlphaZero is selecting actions during play, it acts as normal, and uses 800 simulations.

In head-to-head play, AlphaZero MSE wins 23.25% of games vs AlphaZero TD25, indicating a relative Elo of -207. AlphaZero MSE was found to be less exploitable, with ISMCTS-BR at the end of training winning roughly 20% of games instead of the 40% against AlphaZero TD25 (Fig 2).

As AlphaZero is a MCTS agent with a variable number of simulations, we can create a curriculum — training ISMCTSBR against AlphaZero with 1 simulation, then warm-starting the network and train it against AlphaZero with 10 simulations, then 50, 100, 200, 400, and 800. (Fig. 3). The curriculum improved the exploitability estimate significantly, returning results around 90%.

One common way to determine whether an algorithmic change is an improvement is to compare head-to-head performance play with and without the change. If that was done to choose between the TD25 and MSE variants, one would neglect the agent’s worst-case performance, potentially deploying an agent that is vulnerable to failure in the wild. Head-to-head performance and exploitability are complementary, and practitioners much decide how practical worst-case scenarios are. Perhaps limited access to the agent makes it unrealistic for adversaries to probe its behaviour, allowing for more focus on the in-practice head-to-head performance.

5 Conclusion

We introduce ISMCTS-BR, an algorithm for learning an approximate best response that generalizes the poker specific local best response method, and validate it across a range of

Figure 2: ISMCTS-BR results vs AlphaZero agents in Go.

Figure 3: ISMCTS-BR simulation curriculum results in Go against AlphaZero agents.

domains and agents. In addition to a lower-bound on the true exploitability, a learned best response provides both an assessment of how difficult it is to discover an agent’s flaws, and a mechanism to facilitate debugging by identifying game trajectories where the agent is weak. We believe this technique will support evaluation in large games and drive algorithmic improvements for robust agents, as prior work spurred computer poker research.

Although ISMCTS-BR is limited to tractable belief spaces, this could potentially be relaxed by using a sample model to generate actions for the opponent during search or replacing search with a stochastic planning algorithm such as [Ozair et al., 2021] as the best responder.

Acknowledgements

We would like to thank Eric Jackson for helping us evaluate against Slumbot, Kevin Waugh for repeatedly finding & fix-ing bugs in our code, and Josh Davidson for providing the Scotland Yard images.

References

[Allis, 1994] V. L. Allis. Searching for solutions in games and artificial intelligence. PhD thesis, University of Limburg, 1994.

[Amodei et al., 2016] D. Amodei, C. Olah, J. Steinhardt, P.F. Christiano, J. Schulman, and D. Man´e. Concrete problems in AI safety. CoRR, abs/1606.06565, 2016.

[Athalye et al., 2018] A. Athalye, L. Engstrom, A. Ilyas, and K. Kwok. Synthesizing robust adversarial examples. In Proceedings of the 35th International Conference on Machine Learning (ICML), pages 284–293. PMLR, 2018.

[Balduzzi et al., 2018] D. Balduzzi, K. Tuyls, J. P´erolat, and T. Graepel. Re-evaluating evaluation. CoRR, abs/1806.02643, 2018.

[Bard, 2016] N. DC Bard. Online Agent Modelling in Human-Scale Problems. PhD thesis, U. Alberta, 2016.

[Baudiˇs and Gailly, 2011] Petr Baudiˇs and Jean-loup Gailly. Pachi: State of the art open source go program. In Advances in computer games, pages 24–38. Springer, 2011.

[Bellemare et al., 2013] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Ar-tificial Intelligence Research, 47:253–279, jun 2013.

[Brown and Sandholm, 2017] N. Brown and T. Sandholm. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, 360(6385), 2017.

[Brown et al., 2018] N. Brown, T. Sandholm, and B. Amos. Depth-limited solving for imperfect-information games. Advances in NeurIPS, pages 7663–7674, 2018.

[Campbell et al., 2002] M. Campbell, A. J. Hoane Jr, and F. Hsu. Deep blue. Artif. intell., 134(1-2):57–83, 2002.

[Cowling et al., 2010] P.I. Cowling, E.J. Powley, and D. Whitehouse. Information set Monte Carlo tree search. IEEE Transactions on Computational Intelligence and AI in Games, 4(2):120–143, 2010.

[Gleave et al., 2020] A. Gleave, M. Dennis, C. Wild, N. Kant, S. Levine, and S. Russell. Adversarial policies: Attacking deep reinforcement learning. In 8th Int’l Conference on Learning Representations, ICLR, 2020.

[Greenwald et al., 2017] A. Greenwald, J. Li, and E. Sodomka. Solving for best responses and equilibria in extensive-form games with reinforcement learning methods. In Rohit Parikh on Logic, Language and Society, pages 185–226. Springer, 2017.

[Gruslys et al., 2020] A. Gruslys, M. Lanctot, R. Munos, F. Timbers, M. Schmid, and et al. The advantage regretmatching actor-critic. ArXiv, 2020.

[Jackson, 2013] E.G. Jackson. Slumbot NL: Solving large games with counterfactual regret minimization using sampling and distributed processing. In Workshops of the Twenty-Seventh AAAI Conference, 2013.

[Johanson et al., 2011] M. Johanson, K. Waugh, M. Bowling, and Martin M. Zinkevich. Accelerating best response calculation in large extensive games. In IJCAI, 2011.

[Johanson, 2011] M. Johanson. Accelerating best response calculation in large extensive games. http://johanson.ca/publications/poker/2011-ijcai-abr/2011-ijcai-abr.html, 2011.

[Johanson, 2013] M. Johanson. Measuring the size of large no-limit poker games. arXiv abs/1302.7008, 2013.

[Lanctot et al., 2019] Marc Lanctot, Edward Lockhart, JeanBaptiste Lespiau, Vinicius Zambaldi, and et al. OpenSpiel: A framework for reinforcement learning in games. CoRR, abs/1908.09453, 2019. http://arxiv.org/abs/1908.09453.

[Lis´y and Bowling, 2016] Viliam Lis´y and Michael H. Bowling. Eqilibrium approximation quality of current no-limit poker bots. CoRR, abs/1612.07547, 2016.

[Lis´y et al., 2015] V. Lis´y, M. Lanctot, and M. Bowling. Online Monte Carlo counterfactual regret minimization for search in imperfect information games. In Proceedings of the Int’l Conf. on Autonomous Agents and Multi-Agent Systems (AAMAS), pages 27–36, 2015.

[Mnih et al., 2015] V. Mnih, K. Kavukcuoglu, D. Silver, A. Rusu, and et al. Human-level control through deep reinforcement learning. nature, 518(7540):529–533, 2015.

[Moravˇc´ık et al., 2017] M. Moravˇc´ık, M. Schmid, N. Burch, and et al. Deepstack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 358(6362), 2017.

[Omidshafiei et al., 2019] S. Omidshafiei, C. Papadimitriou, Piliouras G, K. Tuyls, and et al. -rank: Multi-agent evaluation by evolution. Scientific Reports, 9(1):9937, 2019.

[Ozair et al., 2021] S. Ozair, Y. Li, A. Razavi, I. Antonoglou, A. van den Oord, and O. Vinyals. Vector quantized models for planning. arXiv preprint arXiv:2106.04615, 2021.

[Quionero-Candela et al., 2009] J. Quionero-Candela, M. Sugiyama, A. Schwaighofer, and N.D. Lawrence. Dataset Shift in Machine Learning. The MIT Press, 2009.

[Schaeffer et al., 1996] J. Schaeffer, R. Lake, P. Lu, and M. Bryant. Chinook the world man-machine checkers champion. AI Magazine, 17(1):21–21, 1996.

[Schmid et al., 2021] Martin Schmid, Matej Moravcik, Neil Burch, Rudolf Kadlec, Josh Davidson, Kevin Waugh, Nolan Bard, Finbarr Timbers, Marc Lanctot, Zach Holland, et al. Player of games. arXiv preprint arXiv:2112.03178, 2021.

[Silver et al., 2018] David Silver, Thomas Hubert, and Julian Schrittwieser et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 632(6419):1140–1144, 2018.

[Srinivasan et al., 2018] S. Srinivasan, M. Lanctot, V. Zambaldi, and et al. Actor-critic policy optimization in partially observable multiagent environments. In Advances in NeurIPS, pages 3422–3435, 2018.

[Tammelin et al., 2015] O. Tammelin, N. Burch, M. Johanson, and M. Bowling. Solving heads-up limit texas hold’em. In IJCAI, 2015.

[Tuyls et al., 2018] K. Tuyls, J. Perolat, M. Lanctot, J.Z. Leibo, and T. Graepel. A generalized method for empirical

game theoretic analysis. In Proceedings of the Int’l Conf. on Auto. Agents and Multiagent Systems (AAMAS), 2018.

[Witty et al., 2018] S. Witty, J.K. Lee, E. Tosch, A. Atrey, M.L. Littman, and D.D. Jensen. Measuring and characterizing generalization in deep reinforcement learning. CoRR, abs/1812.02868, 2018.

[Zinkevich et al., 2008] M. Zinkevich, M. Johanson, M. Bowling, and C. Piccione. Regret minimization in games with incomplete information. In Adv. in Neural Information Processing Systems 20, pages 905–912, 2008.

A Game Domain Descriptions

Chess, Connect Four, and Go are classic challenge domains used in AI. They are all implemented in the standard way, as available in OpenSpiel [Lanctot et al., 2019]. Go uses the standard rules as used by AlphaZero [Silver et al., 2018].

Leduc Poker is a simplified poker game with a six-card deck containing three ranks and two suits. Players must each ante one chip to play. Each player receives a single private card, and there are two betting rounds with legal actions: fold, call, and raise. There is a limit of two raised per round. Raises are set to 2 chips in the first round and 4 chips in the second round. The utility is the difference in a player’s chip after the game ends and and before the game started.

GoofspielN is a bidding card game where players start with N bid cards numbered . There is also a deck of point cards, face down, numbered . Every turn, the top card of the point card deck is revealed, and each player chooses a single bid card to play simultaneously. The winner of the turn is the player with the highest bid, and that player receives the face-up point card. At the end of the game, the player with the highest number of points wins. We use a common imperfect information variant where the bid cards are never revealed, only whether the turn was won or lost by each player. The exact OpenSpiel game string used is: turn based simultaneous game( game=goofspiel(imp info=true,num cards=N, players=2,points order=descending,

Texas hold’em poker is a popular poker variant played with a 52-card deck. At the start of each round, players receive two private cards and proceed to several betting rounds preceeding and following the revealing of public community cards: flop (three cards), turn (one card), and river (one card). Heads-up Limit (HUL) Texas hold’em poker restricts the bets and amounts to specific sizes in a similarly to Leduc poker. No-limit hold’em (HUNL) removes the fixed bets restriction. However in this paper, we use a restricted variant that allows fold (f), call (c), pot-sized bet (p), and all-in (a) which we abbreviate HUNL(fcpa). The game rules are based on the standard Annual Computer Poker Competition rules, and their OpenSpiel game strings are:

• universal poker(betting=limit,numPlayers=2, numRounds=4,blind=10 5, firstPlayer=2 1,numSuits=4,numRanks=13, numHoleCards=2, numBoardCards=0 3 1 1,raiseSize=10 10 20 20, maxRaises=3 4 4 4,bettingAbstraction=full)

• universal poker(betting=nolimit,numPlayers=2, numRounds=4,blind=100 50, firstPlayer=2 1 1 1, numSuits=4,numRanks=13,numHoleCards=2, numBoardCards=0 3 1 1, stack=20000 20000,bettingAbstraction=fcpa)

A.1 Scotland Yard Scotland Yard is a pursuit-evasion game played on a graph. It is typically played with 6 players— 5 players playing as ”de-

tectives” who chase a fugitive, ”Mr. X” throughout a stylized version of London. The Player of Games paper [Schmid et al., 2021] has a description of the game.

We introduce several smaller variants of Scotland Yard (see Figure 4 below for a visualization):

• Glasses, a custom graph that looks like a pair of glasses.

• War Museum, a subgraph of the full game centered around the War Museum.

• Queen Mary’s Garden, a subgraph of the full game centered around Queen Mary’s Garden.

• Kennington Double, a subgraph of the full game set in Kennington with a double move available to Mr. X.

In each, we create new graphs that the players must traverse. In each, some number of detectives have a limited number of moves to find Mr. X. In Table 5 and 6 we include details about each subgame. Unlike the full game, in the small variants, Mr. X is restricted to only use taxi tickets. The detectives have additional tickets. The full game also includes boats, which are only usable by Mr. X. The shaded nodes are those with a player on them; the grey shaded node is Mr. X, and the other shaded nodes are the detectives. For instance, in Glasses, Mr. X starts on node 6, while detectives start on nodes 1 and 11.

Table 5: Scotland Yard subgame description for Mr. X.

Table 6: Scotland Yard subgame description for the detectives.

Table 7: Scotland Yard edge colors.

B AlphaZero agents

The two AlphaZero agents we train ISMCTS-BR to exploit were trained with the current version of the AlphaZero/MuZero codebase. The code is largely identical to that used to train the agents from [Silver et al., 2018], but the codebase has been actively developed since 2018, and as such, there have been some changes. For our experimental setup, the sole major deviation from [Silver et al., 2018] is in the hardware. As noted in [Silver et al., 2018]:

During training only, 5000 first-generation tensor processing units (TPUs) were used to generate self-play games, and 16 second-generation TPUs were used to train the neural networks.

The two AlphaZero agents we trained used 4000 thirdgeneration TPUs to generate self-play games, and 16 second-generation TPUs were used to train the neural networks. The reason for this is that the first-generation TPUs were unavailable for us to use at the time of training.

The agent we refer to as ”AlphaZero MSE” is the same agent as is referred to as ”AlphaZero” in [Schmid et al., 2021], which includes additional strength comparisons; we include a subset here in Table 8. ”PoG” refers to the Player of Games agent introduced in [Schmid et al., 2021]. Pachi was introduced in [Baudiˇs and Gailly, 2011]; we use version 11.99. We also evaluate against level 10 GnuGo, version 3.8.

In Table 8, we estimate the strength of AlphaZero TD25 here by transitivity in relation to AlphaZero MSE; as Alp-

Table 8: Relative strength comparisons for the Go agents that we trained. All results other than AlphaZero TD25 come from Table 4 of [Schmid et al., 2021], as we use the same agents.

haZero TD25 beats AlphaZero MSE 76.75% of the time, we calculate an Elo value for it as 2099 + 207 = 2306.

Glasses Queen Mary’s Garden

War Museum Kennington Double

Figure 4: Scotland Yard custom maps. The darkest-shaded location is Mr. X.’s location whereas the lighter-colored locations represent the detectives starting locations. Edges colors represent different transport types (taxi, bus, underground). Color meaning is in Table 7.

Designed for Accessibility and to further Open Science