Reinforcement learning combined with deep neural networks has achieved human-level or higher performances in challenging games [14][15]. In the recent past, deep learning has also proved to be an extremely valuable tool for multi agent reinforcement learning (MARL). The advances in deep reinforcement learning have led great interests in MARL to resolve complex and large scale problems.
One naive approach to train a MARL problem could be training all the agents using a centralized controller to obtain the joint action of all agents. However, the number of actions exponentially increases and each agent needs to send its local information to the central controller, which makes the problem intractable. [1] proposed a novel mean field method to deal with the large scale MARL problem by reducing the complexity of the interactions among agents. In contrast to [1], our contribution is twofold. Firstly, a represented value function approach is proposed to replace the mean value function to emphasize more helpful actions around each agent. In our experiments, this contribution is useful especially in the wild war scenario which all agents start with a random position and agent is insensitivity to mean actions of its neighbors. Secondly, we approximate the represented value function by a succinct function, which the omit term is ensured have a small bound.
The rest of the paper is organized as follows: Section 2 propose an new MARL approach. In Section 3 gives a brief review of related works. Experimental results are presented in Section 4. Finally, conclusions are drawn in Section 5.
A. Pairwise local interactions
Following [1], we factorize the Q-function using only the pairwise local interactions.
Furthermore, we use a linear combination of the pairwise value functions to represent the desire value function by
Where is the index set of the neighboring agents of agent
with the size
.
If we denote , then
Inspired by [1], we express the of each neighbor
in terms of the weight sum of
By Taylor’s theorem, the pairwise -function
, if twice-differentiable with respect
to , can be expended and expressed as
For the first term, we have
For the second term,
1) can ensure
under the mild condition of the
-function
being
-smooth.
Thus, with sharing representation weights, the value function of agent is approximated
by
B. Implementation
Following [1], for off-policy learning, we exploit standard Q-learning [5] for discrete action spaces, which we call RFQ (represented value function for Q-learning). Agent is trained by
minimizing the loss function
The new Boltzmann policy is then determined for each that
Following [1], instead of setting up Boltzmann policy using the Q-function as in RFQ, we can explicitly model the policy by neural networks with the weights , which leads to the on-policy actor-critic method [4] that we call RFAC. The policy network
is trained by the sampled
policy gradient
During the training of MF-AC, one needs to alternatively update and
until convergence. The representation weights
can be defined in various way. In this paper, we adopt the
GAT method [6] to compute the weights.
Literature in MARL has argued for a centralized training decentralized execution scheme, e.g., MADDPG [10], BICNET[11] and COMA[12]. Naturally, such a scheme is not scalable as the input size of the critic network grows. As the number of agents grows larger, the above methods are no longer feasible. Another scheme is to approximate the policies of all the agents using meta-learning [13], which is also infeasible when facing a large number of agents. The work in [1] proposed a mean field based approach that take the mean actions of neighbors of each agent into consideration, which can deal with the MARL problem with a large number of agents.
We notice that the work in [7](MAAC) is closest to our work, where the authors use centrally computed critics that share an attention mechanism which selects relevant information for each agent. Our work is inspired by work in [1] which each agent takes its neighbor’s actions into primary consideration. Furthermore, the proposed approach aims to the solve the MARL problem with hundreds of agents. The work in [8] utilize a graph convolutional approach [9] to extract features of samples, while our work also utilize the graph convolutional approach to compute the representation weights.
In this section, we analyze and evaluate our algorithms in the Open-source MAgent system
[2], which is a mixed cooperative-competitive battle game with two armies fighting against each other in a grid world, each empowered by a different multi agent reinforcement learning (MARL) algorithm.
A. Environment
For fair comparison, we adopt the default settings for training Battle game in all experiments (same with [1]): −0.005 for every move, 0.2 for attacking an enemy, 5 for killing an enemy, −0.1 for attacking an empty grid, −0.1 for being attacked or killed, 64 homogeneous agents for each army, 400 iterations for each round and 2000 rounds self-plays without any knowledge of other MARL algorithms. Following [2], the goal of each army is to get more rewards for training and destroy all the opponents in testing (that’s why we do not change the default reward settings to encourage attack).
B. Model Settings
Our RFQ and RFAC are compared against the baselines that are proved successful on the MAgent platform. We focus on the battles between the proposed algorithms and the mean field methods (MFQ, MFAC) [1], independent Q-learning (IL) and advantageous actor critic (AC). Following [1], we exclude MADDPG/MAAC as baselines, as they cannot deal with the varying number of agents for the battle. Also, [1] demonstrated that MAAC tends to scale poorly and fail when the agent number is in hundreds.
C. Results and Discussion
Following all parameter settings in [1], we train all six models in the default battle scenario. For testing, we use them for comparative games in two scenarios, which are the default battle scenario and the wild war scenario. The unique difference between them is that agents start with two divided camps in the default battle scenario and agents start with a random position in the wild war scenario, as shown in Fig. 1. Training in the default battle scenario and testing in both the default battle scenario and the wild war scenario can test the generalization ability of models. Some game silhouettes are shown in Fig. 2.
(a)Start with two divided camps (Battle
Fig. 1 First frame of the Battle scenario and the Wild war scenario.
Fig. 2 Progress of some games.
In training phase, each algorithm is trained three times independently. After training, 18
models are created as 18 independent players (IL_A, IL_B, IL_C, MFQ_A, MFQ _B, MFQ _C, RFQ_A, RFQ_B, RFQ_C, AC_A, AC_B, AC_C, MFAC_A, MFAC_B, MFAC_C, RFAC_A, RFAC_B, RFAC_C). In testing phase, skill levels of all players are estimated by calculating Elo scores (adapted from chess [3]) based on outcomes of battle or wild war. For each testing game, we randomly select two players among the 18 players to play.
For battle scenario, we played about 136,000 games in total. The result demonstrate that actor-critic MARL algorithms outperform Q-learning MARL algorithms, as result shown in Tab. 1. The proposed RFAC algorithm performs well while the proposed RFQ algorithm has a general performance in battle scenario.
As shown in Tab. 2, the proposed RFQ algorithm and MFAC nearly went to deuce, however, the actor-critic MARL algorithms swept RFQ. As shown in Tab. 3, the proposed RFAC algorithm has a good performance when fight with the other algorithms.
For wild war scenario, we played about 100,000 games in total. The result is shown in Tab. 4. It is evident that on all the metrics the proposed RFAC largely outperforms the corresponding baselines, i.e. IL, MFQ, AC and MFAC respectively. The proposed RFQ algorithm largely outperforms IL and MFQ, respectively. The mean filed approach preforms not good in the wild war scenario which is insensitivity to mean actions.
As shown in Tab. 5, the proposed RFQ algorithm outperforms all the other algorithms except RFAC. As shown in Tab. 6, the proposed RFAC algorithm has far superior forces with contrast to the other algorithms.
The win-table among all the 18 players of the two scenarios is shown in Fig. 3, we can find that the actor-critic algorithms always have a good performance. Furthermore, the proposed
algorithms have better performance in the wild war scenario.
Fig. 3 Win-table with all the belligerents. The line of the win-table means the number of wins of
In this paper, we proposed a novel method for large scale MARL problem, which named RFQ and RFAC. Firstly, we studied the weight norm of the value function. Secondly, we ensured the approximated value function’s trivial term is bounded. Thirdly, the proposed algorithms for battle game have a good performance especially on the wild war scenario, which shows the robustness of the proposed algorithms. More approaches to compute the representation weights and more applications will to be investigated in our future work. The code of this paper is at https://github.com/renweiya/RFQ-RFAC.
[1] Yang Y, Luo R, Li M, et al. Mean field multi-agent reinforcement learning[J]. arXiv preprint arXiv:1802.05438, 2018.
[2] Zheng L, Yang J, Cai H, et al. MAgent: A many-agent reinforcement learning platform for artificial collective intelligence[C]//Thirty-Second AAAI Conference on Artificial Intelligence. 2018.
[3] Arpad E Elo. The Rating of Chessplayers, Past and Present. Arco Pub., 1978.
[4] Konda, V. R. and Tsitsiklis, J. N. Actor-critic algorithms. In Solla, S. A., Leen, T. K., and Müller, K. (eds.), Advances in Neural Information Processing Systems 12, pp. 1008–1014. MIT Press, 2000.
[5] Watkins, C. J. and Dayan, P. Q-learning. Machine learning, 8(3-4):279–292, 1992.
[6] Veličković P, Cucurull G, Casanova A, et al. Graph attention networks[J]. arXiv preprint arXiv:1710.10903, 2017.
[7] Iqbal S, Sha F. Actor-attention-critic for multi-agent reinforcement learning[J]. arXiv preprint arXiv:1810.02912, 2018.
[8] Khan A, Tolstaya E, Ribeiro A, et al. {Graph Policy Gradients for Large Scale Robot Control[J]. arXiv preprint arXiv:1907.03822, 2019.
[9] F. Gama, J. Bruna, and A. Ribeiro. Stability properties of graph neural networks. arXiv preprint arXiv:1905.04497, 2019.
[10] Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, O. P., and Mordatch, I. Multi-agent actor-critic for mixed cooperative-competitive environments. In NIPS, pp. 6382–6393, 2017a.
[11] Peng, P., Yuan, Q., Wen, Y., Yang, Y., Tang, Z., Long, H., and Wang, J. Multiagent bidirectionally-coordinated nets for learning to play starcraft combat games. CoRR, abs/1703.10069, 2017.
[12] Foerster, J. N., Farquhar, G., Afouras, T., Nardelli, N., and Whiteson, S. Counterfactual multi-agent policy gradients. In McIlraith & Weinberger (2018).
[13] E. Parisotto, S. Ghosh, S. B. Yalamanchi, V. Chinnaobireddy, Y. Wu, and R. Salakhutdinov. Concurrent meta reinforcement learning. arXiv preprint arXiv:1903.02710, 2019.
[14] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
[15] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.