Monte Carlo Game Solver

2020·Arxiv

Abstract

Abstract

We present a general algorithm to order moves so as to speedup exact game solvers. It uses online learning of playout policies and Monte Carlo Tree Search. The learned policy and the information in the Monte Carlo tree are used to order moves in game solvers. They improve greatly the solving time for multiple games.

1 Introduction

Monte Carlo Tree Search (MCTS) associated to Deep Reinforcement learning has superhuman results in the most dif-ficult complete information games (Go, Chess and Shogi) [Silver et al., 2018]. However little has been done to use this kind of algorithms to exactly solve games. We propose to use MCTS associated to reinforcement learning of policies so as to speedup the resolution of various games.

The paper is organized as follows: the second section deals with related work on games. The third section details the move ordering algorithms for various games. The fourth section gives experimental results for these games.

2 Previous Work

In this section we review the different algorithms that have been used to solve games. We then focus on the solver. As we improve with MCTS we show the difference to MCTS Solver. We also expose Depth First Proof Number Search as it has solved multiple games. We finish with a description of online policy learning as the resulting policy is used for our move ordering.

2.1 Solving Games

Iterative Deepening Alpha-Beta associated to a heuristic evaluation function and a transposition table is the standard algorithm for playing games such as Chess and Checkers. Iterative Deepening Alpha-Beta has also been used to solve games such as small board Go [van der Werf and Winands, 2009], Renju [W´agner and Vir´ag, 2001], Lines of Action [Sakuta et al., 2003], Amazons endgames [Kloetzer et al., 2008]. Other researchers have instead used a Depth-first Alpha-Beta without Iterative Deepening and with domain specific algorithms to solve Domineering [Uiterwijk, 2016] and Atarigo [Boissac and Cazenave, 2006]. The advantage of Iterative Deepening associated to a transposition table for solving games is that it finds the shortest win and that it reuses the information of previous iterations for ordering the moves, thus maximizing the cuts. The heuristics usually used to order the move in combination with Iterative Deepening are: trying first the transposition table move, then trying killer moves and then sorting the remaining moves according to the History Heuristic [Schaeffer, 1989]. The advantage of not using Iterative Deepening is that the iterations before the last one are not performed, saving memory and time, however if bad choices on move ordering happen, the search can waste time in useless parts of the search tree and can also find move sequences longer than necessary.

There are various competing algorithms for solving games [van den Herik et al., 2002]. The most simple are Alpha-Beta and Iterative Deepening Alpha-Beta. Other algorithms memorize the search tree in memory and expand it with a best first strategy: Proof Number Search [Allis et al., 1994], PN[Breuker, 1998], Depth-first Proof Number Search (Df-pn) [Nagai, 2002], Monte Carlo Tree Search Solver [Winands et al., 2008; Cazenave and Saffidine, 2011] and Product Propagation [Saffidine and Cazenave, 2013].

Games solved with a best first search algorithm include Go-Moku with Proof Number search and Threat Space Search [Allis et al., 1996], Checkers using various algorithms [Schaeffer et al., 2007], Fanorona with PN[Schadd et al., 2008], Lines of Action with PN[Winands, 2008], Breakthrough with parallel PN[Saffidine et al., 2011], and Hex with parallel Df-pn [Pawlewicz and Hayward, 2013].

Other games such as Awari were solved using retrograde analysis [Romein and Bal, 2003]. Note that retrograde analysis was combined with search to solve Checkers and Fanorona.

2.2 αβ Solver

Iterative Deepening has long been the best algorithm for multiple games. Most of the Chess engines still use it even if the current best algorithm is currently MCTS [Silver et al., 2018].

Depth first is more simple but it can be better than Iter-

ative Deepening for solving games since it does not have to explore a large tree before searching the next depth. More over in the case of games with only two outcomes the results are always either Won or Lost and enable immediate cuts when Iterative Deepening has to deal with unknown values when it reaches depth zero and the state is not terminal.

One interesting property of is that selection sort becomes an interesting sorting algorithm. It is often useful to only try the best move or a few good moves before reaching a cut. Therefore it is not necessary to sort all the moves at first. Selecting move by move as in selection sorting can be more effective.

2.3 MCTS Solver

MCTS has already been used as a game solver [Winands et al., 2008]. The principle is to mark as solved the subtrees that have an exact result. As the method uses playouts it has to go through the tree at each playout and it revisits many times the same states doing costly calculations to choose the move to try according to the bandit. Moreover in order to solve a game a large game tree has to be kept in memory.

The work on MCTS Solver has later been extended to games with multiple outcomes [Cazenave and Saffidine, 2011].

2.4 Depth First Proof Number Search

Proof Number Search is a best first algorithm that keeps the search tree in memory so as to expand the most informative leaf [Allis et al., 1994]. In order to solve the memory problem of Proof Number Search, the PNalgorithm has been used [Breuker, 1998]. PNuses a secondary Proof Number Search at each leaf of the main Proof Number Search tree, thus enabling the square of the total number of nodes of the main search tree to be explored. More recent developments of Proof Number Search focus on Depth-First Proof Number search (DFPN) [Nagai, 2002]. The principle is to use a transposition table and recursive depth first search to efficiently search the game tree and solve the memory problems of Proof Number Search. DFPN can be parallelized to improve the solving time [Hoki et al., 2013]. It has been improved for the game of Hex using a trained neural network [Gao et al., 2017]. It can be applied to many problems, including recently Chemical Synthesis Planning [Kishimoto et al., 2019].

2.5 Online Policy Learning

Playout Policy Adaptation with Move Features (PPAF) has been applied to many games [Cazenave, 2016].

An important detail of the playout algorithm is the code function. In PPAF the same move can have different codes that depend on the presence of features associated to the move. For example in Breakthrough the code also takes into account whether the arriving square is empty or contains an opponent pawn.

The principle of the learning algorithm is to add 1.0 to the weight of the moves played by the winner of the playout. It also decreases the weights of the moves not played by the winner of the playout by a value proportional to the exponential of the weight. This algorithm is given in algorithm 2.

3 Move Ordering

We describe the general tools used for move ordering then their adaptation to different games.

3.1 Outline

In order to collect useful information to order moves we use a combination of the GRAVE algorithm [Cazenave, 2015] and of the PPAF algorithm. Once the Monte Carlo search is finished we use the transposition table of the Monte Carlo search to order the moves, putting first the most simulated ones. When outside the transposition table we use the learned weights to order the moves. The algorithm used to score the moves so as to order them is given in algorithm 3.

3.2 Atarigo

Atarigo is a simplification of the game of Go. The first player to capture has won. It is a game often used to teach Go to beginners. Still it is an interesting games and tactics can be hard to master.

The algorithm for move ordering is given in algorithm 4. It always puts first a capture move since it wins the game. If no such move exist it always plays a move that saves a one liberty string since it is a forced move to avoid losing. Then it favors moves on liberties of opponent strings that have few liberties provided the move has itself sufficient liberties. If none of these are available it returns the evaluation by the Monte Carlo ordering function.

The code associated to a move is built after its four neighboring intersections.

3.3 Nogo

Nogo is the misere version of Atarigo [Chou et al., 2011]. It was introduced at the 2011 Combinatorial Game Theory Workshop in Banff. The first player to capture has lost. It is usually played on small boards. In Banff there was a tournament for programs and Bob Hearn won the tournament using the Fuego framework [Enzenberger et al., 2010] and MonteCarlo Tree Search.

We did not find simple heuristics to order moves at Nogo. So the standard algorithm uses no heuristic and the MC algorithms sort moves according to algorithm 3.

3.4 Go

Go was solved for rectangular boards up to size by the MIGOS program [van der Werf and Winands, 2009]. The algorithm used was an iterative deepening with transposition table.

3.5 Breakthrough and Knightthrough

Breakthrough is an abstract strategy board game invented by Dan Troyka in 2000. It won the 2001 8x8 Game Design Competition and it is played on Little Golem. The game starts with two rows of pawns on each side of the board. Pawns can capture diagonally and go forward either vertically or diagonally. The first player to reach the opposite row has won. Breakthrough has been solved up to size using Job Level Proof Number Search [Saffidine et al., 2011]. The code for a move at Breakthrough contains the starting square, the arrival square and whether it is empty or contains an enemy pawn. The ordering gives priority to winning moves, then to moves to prevent a loss, then Monte Carlo Move Ordering.

Misere Breakthrough is the misere version of Breakthrough, the games is lost if a pawn reaches the opposite side. It is also a difficult game and its is more difficult for MCTS algorithms [Cazenave, 2016]. The code for a move is the same as for Breakthrough and the ordering is Monte Carlo Move Ordering.

Knightthrough emerged as a game invented for the General Game Playing competitions. Pawns are replaced with knights. Misere Knightthrough is the misere version of the game where the goal is to lose. Codes for moves and move ordering are similar to Breakthrough.

3.6 Domineering

Domineering is played on a chess board and two players alternate putting dominoes on the board. The first player puts the dominoes vertically, the second player puts them horizontally. The first player who cannot play loses. In Misere Domineering the first player who cannot play wins.

4 Experimental results

The iterative deepening with a transposition table (ID TT) is called with a null window since it saves much time compared to calling it with a full window. Other algorithms are called with the full window since it they only deal with terminal states values and that the games we solve are either Won or Lost.

A transposition table containing 1 048 575 entries is used for all games. An entry in the transposition table is always replaced by a new one.

An algorithm name finishing with MC denotes the use of Monte Carlo Move Ordering. The times given for MC algorithms include the time for the initial MCTS that learns a policy. The original Proof Number Search algorithm is not included in the experiments since it fails due to being short of memory for complex games. The algorithm solves this memory problem and is included in the experiments. The algorithms that do not use MC still do some move ordering but without Monte Carlo.

Table 1 gives the results for Atarigo. For Atarigo TT MC is the best algorithm and is much better than TT. For Atarigo the best algorithm is again TT MC which is much better than all other algorithms.

Table 2 gives the results for Nogo. Nogo is solved in 49.72 seconds by TT MC with 100 000 playouts. This is 88 times faster than TT the best algorithm not using MC.

Nogo is solved best by TT MC with 1 000 000 playouts before the search. It is 21 times faster than ID TT the best algorithm not using MC.

TT MC with 10 000 000 playouts solves Nogo in 61 430.88 seconds and 46 092 056 485 moves. This is the first time Nogo is solved. The solution is:

Table 1: Different algorithms for solving Atarigo.

Moves Time 14 784 088 742 37 901.56 s. ID TT > 35 540 000 000 > 86 400.00 s. TT > 37 660 000 000 > 86 400.00 s. ID TT MC 62 800 334 126.84 s. TT MC 3 956 049 12.79 s.

Moves Time 33 150 000 000 > 86 400.00 s. ID TT > 37 190 000 000 > 86 400.00 s. TT > 7 090 000 000 > 44 505.91 s. ID TT MC 12 713 931 627 27 298.35 s. TT MC 329 780 434 787.17 s.

Table 2: Different algorithms for solving Nogo.

Table 3: Winner for Nogo boards of various sizes

Table 4: Different algorithms for solving Go.

As it is the first time results about solving Nogo are given we recapitulate in table 3 the winner for various sizes. A one means a first player win and a two a second player win.

Table 4 gives the results for Go. Playouts and depth first can last a very long time in Go since stones are captured and if random play occurs the goban can become almost empty again a number of times before the superko rules forbids states. In order to avoid very long and useless games an

Table 5: Different algorithms for solving Breakthrough.

Moves Time 38 780 000 000 > 86 400.00 s. ID TT 13 083 392 799 33 590.59 s. TT 19 163 127 770 43 406.79 s. ID TT MC 3 866 853 361 11 319.39 s. TT MC 3 499 173 137 9 243.66 s.

Table 6: Different algorithms for solving Misere Breakthrough.

Moves Time 42 630 000 000 > 86 400 s. ID TT > 43 350 000 000 > 86 400 s. TT > 42 910 000 000 > 86 400 s. ID TT MC 1 540 153 635 3 661.50 s. TT MC 447 879 697 1 055.32 s.

artificial limit on the number of moves allowed in a game was set to twice the size of the board. This is not entirely satisfactory since one can imagine weird cases where the limit is not enough. The problem does not appear in the other games we have solved since they converge to a terminal state before a fixed number of moves. The trick we use to address the problem is to send back an evaluation of zero if the search reaches the limit. When searching for a win with a null window this is equivalent to a loss and when searching for a loss it is equivalent to a win. Therefore if the search finds a win it does not rely on the problematic states. The board was solved with a komi of 8.5, the board was solved with a komi of 3.5. Table 5 gives the results for Breakthrough. Using MC improves much the solving time. with MC uses seven times less nodes than the previous algorithm that solved Breakthrough without patterns (i.e. parallel PNwith 64 clients [Saffidine et al., 2011]). Using endgame patterns divides by seven the number of required nodes for parallel PN. Table 6 gives the results for Misere Breakthrough. TT MC is the best algorithm and is much better than all non MC algorithms. The results for Knightthrough are in table 7. ID TT MC is the best algorithm and far better than algorithms not using MC. This is the first time Knightthrough is solved. Table 8 gives the results for Misere Knightthrough. Misere Knightthrough is solved in 20 375 687 163 moves and 42 425.41 seconds by TT MC. This is the first time Misere Knightthrough is solved. Misere Knightthrough is much more difficult to solve than Knightthrough which is solved in seconds by ID TT MC. This is due to Misere Knightthrough being a waiting game with longer games than Knightthrough. Table 9 gives the results for Domineering. The best al-

Table 7: Different algorithms for solving Knightthrough.

Table 8: Different algorithms for solving Misere Knightthrough.

Moves Time 45 290 000 000 > 86 400 s. ID TT > 52 640 000 000 > 86 400 s. TT > 56 230 000 000 > 86 400 s. ID TT MC > 41 840 000 000 > 86 400 s. TT MC 20 375 687 163 42 425.41 s.

Table 9: Different algorithms for solving Domineering.

Moves Time 41 270 000 000 > 86 400 s. ID TT 18 958 604 687 35 196.62 s. TT 197 471 137 376.23 s. ID TT MC 2 342 641 133 5 282.06 s. TT MC 29 803 373 123.76 s.

Table 10: Different algorithms for solving Misere Domineering.

gorithm is TT MC which is 3 times faster than TT without MC.

Table 10 gives the results for Misere Domineering. The best algorithm is TT MC which is much better than all non MC algorithms.

5 Conclusion

For the games we solved, Misere Games are more difficult to solve than normal games. In Misere Games the player waits and tries to force the opponent to play a losing move. This makes the game longer and reduces the number of winning sequences and winning moves.

Monte Carlo Move Ordering improves much the speed of with transposition table compare to depth first and Iterative Deepening with transposition table but without Monte Carlo Move Ordering. The experimental results show significant improvements for nine different games.

In future work we plan to parallelize the algorithms and apply them to other problems. It would also be interesting to test if improved move ordering due to Monte Carlo Move Ordering would improve other popular solving algorithms such as DFPN. The ultimate goal with this kind of algorithms could be to solve exactly the game of Chess which is possible provided we have a very strong move ordering algorithm [Lemoine and Viennot, 2015].

References

[Allis et al., 1994] Louis Victor Allis, Maarten van der Meulen, and H. Jaap van den Herik. Proof-Number Search. Artificial Intelligence, 66(1):91–124, 1994.

[Allis et al., 1996] L. Victor Allis, H. Jaap van den Herik, and M. P. H. Huntjens. Go-Moku solved by new search techniques. Computational Intelligence, 12:7–23, 1996.

[Boissac and Cazenave, 2006] Fr´ed´eric Boissac and Tristan Cazenave. De nouvelles heuristiques de recherche appliqu´ees `a la r´esolution d’Atarigo. In Intelligence artifi-cielle et jeux, pages 127–141. Hermes Science, 2006.

[Breuker, 1998] Dennis M. Breuker. Memory versus Search in Games. PhD thesis, Universiteit Maastricht, 1998.

[Cazenave and Saffidine, 2011] Tristan Cazenave and Ab- dallah Saffidine. Score bounded Monte-Carlo tree search. In H. van den Herik, Hiroyuki Iida, and Aske Plaat, editors, Computers and Games, volume 6515 of Lecture Notes in Computer Science, pages 93–104. SpringerVerlag, Berlin / Heidelberg, 2011.

[Cazenave, 2015] Tristan Cazenave. Generalized rapid ac- tion value estimation. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015, pages 754–760, 2015.

[Cazenave, 2016] Tristan Cazenave. Playout policy adapta- tion with move features. Theor. Comput. Sci., 644:43–52, 2016.

[Chou et al., 2011] Cheng-Wei Chou, Olivier Teytaud, and Shi-Jim Yen. Revisiting Monte-Carlo tree search on a normal form game: NoGo. In Applications of Evolutionary Computation, volume 6624 of Lecture Notes in Computer Science, pages 73–82. Springer Berlin Heidelberg, 2011.

[Enzenberger et al., 2010] Markus Enzenberger, Martin Muller, Broderick Arneson, and Richard Segal. Fuego - an open-source framework for board games and Go engine based on Monte Carlo tree search. IEEE Transactions on Computational Intelligence and AI in Games, 2(4):259–270, 2010.

[Gao et al., 2017] Chao Gao, Martin M¨uller, and Ryan Hayward. Focused depth-first proof number search using convolutional neural networks for the game of hex. In IJCAI, pages 3668–3674, 2017.

[Hoki et al., 2013] Kunihito Hoki, Tomoyuki Kaneko, Akihiro Kishimoto, and Takeshi Ito. Parallel dovetailing and its application to depth-first proof-number search. ICGA Journal, 36(1):22–36, 2013.

[Kishimoto et al., 2019] Akihiro Kishimoto, Beat Buesser, Bei Chen, and Adi Botea. Depth-first proof-numbersearch with heuristic edge cost and application to chemical synthesis planning. In Advances in Neural Information Processing Systems, pages 7224–7234, 2019.

[Kloetzer et al., 2008] Julien Kloetzer, Hiroyuki Iida, and Bruno Bouzy. A comparative study of solvers in amazons endgames. In IEEE Symposium On Computational Intelligence and Games, 2008. CIG’08., pages 378–384. IEEE, 2008.

[Lemoine and Viennot, 2015] Julien Lemoine and Simon Vi- ennot. Il n’est pas impossible de r´esoudre le jeu d’´echecs. 1024 – Bulletin de la soci´et´e informatique de France, 6, Juillet 2015.

[Nagai, 2002] Ayumu Nagai. Df-pn Algorithm for Searching AND/OR Trees and its Applications. PhD thesis, The University of Tokyo, 2002.

[Pawlewicz and Hayward, 2013] Jakub Pawlewicz and Ryan B. Hayward. Scalable parallel DFPN search. In Computers and Games - 8th International Conference, CG 2013, Yokohama, Japan, August 13-15, 2013, Revised Selected Papers, pages 138–150, 2013.

[Romein and Bal, 2003] John W. Romein and Henri E. Bal. Solving awari with parallel retrograde analysis. IEEE Computer, 36(10):26–33, 2003.

[Saffidine and Cazenave, 2013] Abdallah Saffidine and Tris- tan Cazenave. Developments on product propagation. In Computers and Games - 8th International Conference, CG

2013, Yokohama, Japan, August 13-15, 2013, Revised Selected Papers, pages 100–109, 2013.

[Saffidine et al., 2011] Abdallah Saffidine, Nicolas Jouandeau, and Tristan Cazenave. Solving breakthrough with race patterns and job-level proof number search. In Advances in Computer Games - 13th International Conference, ACG 2011, Tilburg, The Netherlands, November 20-22, 2011, Revised Selected Papers, pages 196–207, 2011.

[Sakuta et al., 2003] Makoto Sakuta, Tsuyoshi Hashimoto, Jun Nagashima, Jos W. H. M. Uiterwijk, and Hiroyuki Iida. Application of the killer-tree heuristic and the lambda-search method to lines of action. Inf. Sci., 154(3-4):141–155, 2003.

[Schadd et al., 2008] Maarten P. D. Schadd, Mark H. M. Winands, Jos W. H. M. Uiterwijk, H. Jaap van den Herik, and Maurice H. J. Bergsma. Best play in fanorona leads to draw. New Mathematics and Natural Computation, 4(03):369–387, 2008.

[Schaeffer et al., 2007] Jonathan Schaeffer, Neil Burch, Yngvi Bj¨ornsson, Akihiro Kishimoto, Martin M¨uller, Robert Lake, Paul Lu, and Steve Sutphen. Checkers is solved. Science, 317(5844):1518–1522, 2007.

[Schaeffer, 1989] Jonathan Schaeffer. The history heuristic and alpha-beta search enhancements in practice. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(11):1203–1212, 1989.

[Silver et al., 2018] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and go through selfplay. Science, 362(6419):1140–1144, 2018.

[Uiterwijk, 2016] Jos WHM Uiterwijk. domineering is solved: The first player wins. In International Conference on Computers and Games, pages 129–136. Springer, 2016.

[van den Herik et al., 2002] H. Jaap van den Herik, Jos W. H. M. Uiterwijk, and Jack van Rijswijck. Games solved: Now and in the future. Artif. Intell., 134(1-2):277–311, 2002.

[van der Werf and Winands, 2009] Erik CD van der Werf and Mark HM Winands. Solving go for rectangular boards. ICGA Journal, 32(2):77–88, 2009.

[W´agner and Vir´ag, 2001] J´anos W´agner and Istv´an Vir´ag. Solving renju. ICGA Journal, 24(1):30–35, 2001.

[Winands et al., 2008] Mark HM Winands, Yngvi Bj¨ornsson, and Jahn-Takeshi Saito. Monte-carlo tree search solver. In International Conference on Computers and Games, pages 25–36. Springer, 2008.

[Winands, 2008] Mark H.M. Winands. LOA is solved. ICGA Journal, 31(4):234–238, 2008.