Represented Value Function Approach for Large Scale Multi Agent Reinforcement Learning

2020Β·Arxiv

Abstract

Abstract

In this paper, we consider the problem of large scale multi agent reinforcement learning. Firstly, we studied the representation problem of the pairwise value function to reduce the complexity of the interactions among agents. Secondly, we adopt a l2-norm trick to ensure the trivial term of the approximated value function is bounded. Thirdly, experimental results on battle game demonstrate the effectiveness of the proposed approach.

Keywords: Multi agent reinforcement learning; value function; actor-critic.

I. INTRODUCTION

Reinforcement learning combined with deep neural networks has achieved human-level or higher performances in challenging games [14][15]. In the recent past, deep learning has also proved to be an extremely valuable tool for multi agent reinforcement learning (MARL). The advances in deep reinforcement learning have led great interests in MARL to resolve complex and large scale problems.

One naive approach to train a MARL problem could be training all the agents using a centralized controller to obtain the joint action of all agents. However, the number of actions exponentially increases and each agent needs to send its local information to the central controller, which makes the problem intractable. [1] proposed a novel mean field method to deal with the large scale MARL problem by reducing the complexity of the interactions among agents. In contrast to [1], our contribution is twofold. Firstly, a represented value function approach is proposed to replace the mean value function to emphasize more helpful actions around each agent. In our experiments, this contribution is useful especially in the wild war scenario which all agents start with a random position and agent is insensitivity to mean actions of its neighbors. Secondly, we approximate the represented value function by a succinct function, which the omit term is ensured have a small bound.

The rest of the paper is organized as follows: Section 2 propose an new MARL approach. In Section 3 gives a brief review of related works. Experimental results are presented in Section 4. Finally, conclusions are drawn in Section 5.

II. REPRESENTED VALUE FUNCTION

A. Pairwise local interactions

Following [1], we factorize the Q-function using only the pairwise local interactions.

Furthermore, we use a linear combination of the pairwise value functions to represent the desire value function by

Where is the index set of the neighboring agents of agent with the size .

If we denote , then

Inspired by [1], we express the of each neighbor in terms of the weight sum of

By Taylorβs theorem, the pairwise -function, if twice-differentiable with respect

to , can be expended and expressed as

For the first term, we have

For the second term,

1) can ensure

under the mild condition of the -function being -smooth.

Thus, with sharing representation weights, the value function of agent is approximated

by

B. Implementation

Following [1], for off-policy learning, we exploit standard Q-learning [5] for discrete action spaces, which we call RFQ (represented value function for Q-learning). Agent is trained by

minimizing the loss function

The new Boltzmann policy is then determined for each that

Following [1], instead of setting up Boltzmann policy using the Q-function as in RFQ, we can explicitly model the policy by neural networks with the weights , which leads to the on-policy actor-critic method [4] that we call RFAC. The policy network is trained by the sampled

policy gradient

During the training of MF-AC, one needs to alternatively update and until convergence. The representation weights can be defined in various way. In this paper, we adopt the

GAT method [6] to compute the weights.

III. RELATED WORK

Literature in MARL has argued for a centralized training decentralized execution scheme, e.g., MADDPG [10], BICNET[11] and COMA[12]. Naturally, such a scheme is not scalable as the input size of the critic network grows. As the number of agents grows larger, the above methods are no longer feasible. Another scheme is to approximate the policies of all the agents using meta-learning [13], which is also infeasible when facing a large number of agents. The work in [1] proposed a mean field based approach that take the mean actions of neighbors of each agent into consideration, which can deal with the MARL problem with a large number of agents.

We notice that the work in [7](MAAC) is closest to our work, where the authors use centrally computed critics that share an attention mechanism which selects relevant information for each agent. Our work is inspired by work in [1] which each agent takes its neighborβs actions into primary consideration. Furthermore, the proposed approach aims to the solve the MARL problem with hundreds of agents. The work in [8] utilize a graph convolutional approach [9] to extract features of samples, while our work also utilize the graph convolutional approach to compute the representation weights.

IV. EXPERIMENT RESULTS

In this section, we analyze and evaluate our algorithms in the Open-source MAgent system

[2], which is a mixed cooperative-competitive battle game with two armies fighting against each other in a grid world, each empowered by a different multi agent reinforcement learning (MARL) algorithm.

A. Environment

For fair comparison, we adopt the default settings for training Battle game in all experiments (same with [1]): β0.005 for every move, 0.2 for attacking an enemy, 5 for killing an enemy, β0.1 for attacking an empty grid, β0.1 for being attacked or killed, 64 homogeneous agents for each army, 400 iterations for each round and 2000 rounds self-plays without any knowledge of other MARL algorithms. Following [2], the goal of each army is to get more rewards for training and destroy all the opponents in testing (thatβs why we do not change the default reward settings to encourage attack).

B. Model Settings

Our RFQ and RFAC are compared against the baselines that are proved successful on the MAgent platform. We focus on the battles between the proposed algorithms and the mean field methods (MFQ, MFAC) [1], independent Q-learning (IL) and advantageous actor critic (AC). Following [1], we exclude MADDPG/MAAC as baselines, as they cannot deal with the varying number of agents for the battle. Also, [1] demonstrated that MAAC tends to scale poorly and fail when the agent number is in hundreds.

C. Results and Discussion

Following all parameter settings in [1], we train all six models in the default battle scenario. For testing, we use them for comparative games in two scenarios, which are the default battle scenario and the wild war scenario. The unique difference between them is that agents start with two divided camps in the default battle scenario and agents start with a random position in the wild war scenario, as shown in Fig. 1. Training in the default battle scenario and testing in both the default battle scenario and the wild war scenario can test the generalization ability of models. Some game silhouettes are shown in Fig. 2.

(a)Start with two divided camps (Battle

Fig. 1 First frame of the Battle scenario and the Wild war scenario.

Fig. 2 Progress of some games.

In training phase, each algorithm is trained three times independently. After training, 18

models are created as 18 independent players (IL_A, IL_B, IL_C, MFQ_A, MFQ _B, MFQ _C, RFQ_A, RFQ_B, RFQ_C, AC_A, AC_B, AC_C, MFAC_A, MFAC_B, MFAC_C, RFAC_A, RFAC_B, RFAC_C). In testing phase, skill levels of all players are estimated by calculating Elo scores (adapted from chess [3]) based on outcomes of battle or wild war. For each testing game, we randomly select two players among the 18 players to play.

For battle scenario, we played about 136,000 games in total. The result demonstrate that actor-critic MARL algorithms outperform Q-learning MARL algorithms, as result shown in Tab. 1. The proposed RFAC algorithm performs well while the proposed RFQ algorithm has a general performance in battle scenario.

As shown in Tab. 2, the proposed RFQ algorithm and MFAC nearly went to deuce, however, the actor-critic MARL algorithms swept RFQ. As shown in Tab. 3, the proposed RFAC algorithm has a good performance when fight with the other algorithms.

For wild war scenario, we played about 100,000 games in total. The result is shown in Tab. 4. It is evident that on all the metrics the proposed RFAC largely outperforms the corresponding baselines, i.e. IL, MFQ, AC and MFAC respectively. The proposed RFQ algorithm largely outperforms IL and MFQ, respectively. The mean filed approach preforms not good in the wild war scenario which is insensitivity to mean actions.

As shown in Tab. 5, the proposed RFQ algorithm outperforms all the other algorithms except RFAC. As shown in Tab. 6, the proposed RFAC algorithm has far superior forces with contrast to the other algorithms.

The win-table among all the 18 players of the two scenarios is shown in Fig. 3, we can find that the actor-critic algorithms always have a good performance. Furthermore, the proposed

algorithms have better performance in the wild war scenario.

Fig. 3 Win-table with all the belligerents. The line of the win-table means the number of wins of

V. CONCLUSION AND FUTURE WORK

In this paper, we proposed a novel method for large scale MARL problem, which named RFQ and RFAC. Firstly, we studied the weight norm of the value function. Secondly, we ensured the approximated value functionβs trivial term is bounded. Thirdly, the proposed algorithms for battle game have a good performance especially on the wild war scenario, which shows the robustness of the proposed algorithms. More approaches to compute the representation weights and more applications will to be investigated in our future work. The code of this paper is at https://github.com/renweiya/RFQ-RFAC.

REFERENCES

[1] Yang Y, Luo R, Li M, et al. Mean field multi-agent reinforcement learning[J]. arXiv preprint arXiv:1802.05438, 2018.

[2] Zheng L, Yang J, Cai H, et al. MAgent: A many-agent reinforcement learning platform for artificial collective intelligence[C]//Thirty-Second AAAI Conference on Artificial Intelligence. 2018.

[3] Arpad E Elo. The Rating of Chessplayers, Past and Present. Arco Pub., 1978.

[4] Konda, V. R. and Tsitsiklis, J. N. Actor-critic algorithms. In Solla, S. A., Leen, T. K., and MΓΌller, K. (eds.), Advances in Neural Information Processing Systems 12, pp. 1008β1014. MIT Press, 2000.

[5] Watkins, C. J. and Dayan, P. Q-learning. Machine learning, 8(3-4):279β292, 1992.

[6] VeliΔkoviΔ P, Cucurull G, Casanova A, et al. Graph attention networks[J]. arXiv preprint arXiv:1710.10903, 2017.

[7] Iqbal S, Sha F. Actor-attention-critic for multi-agent reinforcement learning[J]. arXiv preprint arXiv:1810.02912, 2018.

[8] Khan A, Tolstaya E, Ribeiro A, et al. {Graph Policy Gradients for Large Scale Robot Control[J]. arXiv preprint arXiv:1907.03822, 2019.

[9] F. Gama, J. Bruna, and A. Ribeiro. Stability properties of graph neural networks. arXiv preprint arXiv:1905.04497, 2019.

[10] Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, O. P., and Mordatch, I. Multi-agent actor-critic for mixed cooperative-competitive environments. In NIPS, pp. 6382β6393, 2017a.

[11] Peng, P., Yuan, Q., Wen, Y., Yang, Y., Tang, Z., Long, H., and Wang, J. Multiagent bidirectionally-coordinated nets for learning to play starcraft combat games. CoRR, abs/1703.10069, 2017.

[12] Foerster, J. N., Farquhar, G., Afouras, T., Nardelli, N., and Whiteson, S. Counterfactual multi-agent policy gradients. In McIlraith & Weinberger (2018).

[13] E. Parisotto, S. Ghosh, S. B. Yalamanchi, V. Chinnaobireddy, Y. Wu, and R. Salakhutdinov. Concurrent meta reinforcement learning. arXiv preprint arXiv:1903.02710, 2019.

[14] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484, 2016.

[15] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.