Multi Type Mean Field Reinforcement Learning

2020Β·Arxiv

ABSTRACT

ABSTRACT

Mean field theory provides an effective way of scaling multiagent reinforcement learning algorithms to environments with many agents that can be abstracted by a virtual mean agent. In this paper, we extend mean field multiagent algorithms to multiple types. The types enable the relaxation of a core assumption in mean field reinforcement learning, which is that all agents in the environment are playing almost similar strategies and have the same goal. We conduct experiments on three different testbeds for the field of many agent reinforcement learning, based on the standard MAgents framework. We consider two different kinds of mean field environments: a) Games where agents belong to predefined types that are known a priori and b) Games where the type of each agent is unknown and therefore must be learned based on observations. We introduce new algorithms for each type of game and demonstrate their superior performance over state of the art algorithms that assume that all agents belong to the same type and other baseline algorithms in the MAgent framework.

KEYWORDS

Mean Field Methods; Multiagent Systems; Reinforcement Learning; Many-Agent Learning

ACM Reference Format: Sriram Ganapathi Subramanian, Pascal Poupart, Matthew E. Taylor, and Nidhi Hegde. 2020. Multi Type Mean Field Reinforcement Learning. In Proc. of the 19th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2020), Auckland, New Zealand, May 9β13, 2020, IFAAMAS, 13 pages.

1 INTRODUCTION

Multiagent reinforcement learning (MARL) is a quickly growing field with lots of recent research pushing its boundaries [6, 14]. Yet, scaling the multiagent algorithms to environments with a large number of learning agents continues to be a problem [5]. Research advances in the field of MARL [2, 4] deal with only a limited number of agents and the proposed methods cannot be easily generalized to more complex scenarios with many agents. Some recent research has used the concept of mean field theory for enabling the use of MARL approaches to environments with many agents [12, 17].

Yet, the current algorithms implemented for many agents require some strong assumptions about the game environment to perform adequately. The important mean field approximation reduces a many agent problem into a simplified two agent problem where all the other participating agents are approximated as a single mean field. This mean field approximation, however, would be valid only for scenarios where all the agents in the environment can be considered similar to each other in objectives and abilities. Real world applications often have a set of agents that are diverse, and therefore it is virtually impossible to aggregate them into a single mean field.

In this paper, we introduce a concept of multiple types to model agent diversity in the mean field approximation for many agent reinforcement learning. The types are groupings applied to other agents in the environment in such a way that all members of a particular type play approximately similar strategies and have similar overall goals. Now, the modelling agent can consider each type to be a distinct agent, which has to be modelled separately. Within each type, the mean field approximation should still be reasonable as the agents within one particular type are more related to each other than other agents from different types. Thus, the many agent interaction is effectively reduced to agent interactions where is the number of types. Note that this is more complex than the simple two agent interaction considered by previous research and this can approximate a real world dynamic environment in a much better way. Most real world applications for many agent reinforcement learning can be broadly classified into two categories. The first category involves applications where we have predefined types and the type of each agent is known a priori. Some common applications include games with multiple teams (like quiz competitions), multiple party elections with coalitions, airline price analysis with multiple airlines forming an alliance beforehand, etc. We call these applications predefined known type scenarios. The other category are applications that involve a large number of agents that may have different policies due to differences in their rewards, actions or observations. Common examples are demand and supply scenarios, stock trading scenarios, etc., where one type of agent may be risk averse while another type may be risk seeking. We call these applications pre-defined unknown type scenarios, since there are no true underlying types like the previous case and a suitable type therefore must be assigned through observations. Another important aspect in this paper will be the notion of neighbourhood since each individual agent may be impacted more by agents whose states are βcloserβ (according to some distance measure) to the state of an agent. For instance, in battle environments, nearby agents pose a greater threat than far away agents.

Using the open source test environment for many agent games β MAgents [18] β we consider three testbeds that involve many strategic agents in the environment. Two of these testbeds correspond to the known type and the third one corresponds to the unknown type. We introduce two different algorithms for the known and unknown type scenarios. We demonstrate the superior performance of these algorithms in comparison to previous algorithms that assume a single type of agents in the testbeds.

2 BACKGROUND

Reinforcement Learning: Single agent reinforcement learning (RL) [15] is the most common form of reinforcement learning in the literature [1]. Here the problem is modelled in the framework of a Markov Decision Process (MDP) where the MDP is composed of denotes the set of states that the agent can move into, denotes the set of actions that the agent can take, denotes the transition distribution () and denotes the reward function (). The agents are allowed to explore the environment during the process of training and collect experience tuples . An agent learns a policy , which is a mapping from states to actions where the goal is to maximize the expected cumulative rewards discount factor. An optimal policy obtains the highest cumulative rewards among all possible policies.

In Multiagent reinforcement learning (MARL), there is a notion of stochastic games [3], where the state and action space are defined as Cartesian products of individual states and actions of different agents in the environment. A stochastic game can be considered a special type of normal form game [9], where a particular iteration of the game depends on previous game(s) played and the experiences of all the agents in the previous game(s). A stochastic game can be defined as a tuple is a finite set of states (assumed to be the same for all agents), is a finite set of denotes the set of actions of the agent is the transition distribution is the reward function with denoting the reward received by agent . Here each agent is trying to learn a policy that maximizes its return upon consideration of opponent behaviours. Agents are typically self interested and the combined system moves towards a Nash equilibrium [11]. Scalability in MARL environments is often a bottleneck. Important research efforts are aimed at handling only up to a handful of agents and the solutions or algorithms considered become intractable in many agent scenarios.

Mean Field Reinforcement Learning: Mean field theory approximates many agent interactions in a multiagent environment into two agent interactions [10] where the second agent corresponds to the mean effect of all the other agents. This allows domains with many agents that were previously considered intractable to be revisited and scalable approximate solutions to be devised [12, 17]. Mean field reinforcement learning applies mean field approximation to stochastic games [17]. In the paper by Yang et al. [17], the multi-agent function for stochastic games is decomposed additively into local functions that capture pairwise

interactions:

Here is the number of neighbours of the agent and is the index set of neighbouring agents. Yang et al. [17] showed that this decomposition can be well approximated by the mean field function under certain conditions. The mean action on the neighbourhood is expressed as where is the action of each neighbour . In the case of discrete actions, is a one-hot vector encoding and is a vector of fractions corresponding to the probability that each action may be executed by an agent at random.

The mean field function can be updated in a recurrent manner as follows,

where is the reward obtained. The are the old and new states respectively. is the learning rate. The value function for agent is given by,

Here, the term denotes the mean action of all the other agents apart from . The mean action for all the agents is first calculated using the relation, , where is the policy of the agent at time , and represents the previous mean action for the neighbours of the agent denotes the total number of agents in the neighbourhood of the agent . Then, the Boltzmann policy for each agent is calculated using , which is the Boltzmann softmax parameter,

3 MEAN FIELD MARL AND TYPES

We consider environments where there are types that the neighbouring agents can be classified into. In this paper, we assume that the function decomposes additively according to a partition of the agents into subsets that each include one agent of each type. This decomposition can be viewed as a generalization of pairwise decompositions to multiple types since each term depends on a representative from each type.

Let the standard multi-agent function be

Here we have a total of denotes the action of agent belonging to type in the neighbourhood of agent . Notice that this representation of the function includes the interaction with each one of the types and is not a simple pairwise interaction as done by [17]. Let us assume that we have a scheme in which we can classify each agent into one of these subsets. Note that we do not need each group to contain an equal number of agents as we can always make a new subset that contains one agent of a type and other agents to be placeholder agents (dead agents) of other types. We will relax this requirement of decomposition shortly and in practice we do not need to make these subsets at all.

3.1 Mean Field Approximation

We also assume discrete action spaces and use a one hot representation of the actions as in Yang et al. [17]. The one hot action of each agent belonging to type in the neighbourhood of agent is represented as where is the mean action of all agents in the neighbourhood of agent belonging to type and Λis the deviation between the action of an agent and the mean action of its type.

Let ; Λ; Λbe a vector obtained by the concatenation of all such deviations of the agents in the neighbourhood of the agent belonging to each of the types (all agents of a single subset). Similar to [17], we apply Taylorβs theorem to expand the function in Equation 5 to get,

which is the Taylor polynomial remainder. The summation term sums out to 0. Finally, if we ignore the remainder terms , we obtain the following approximation,

The magnitude of this approximation depends on the deviation Λbetween each actionand its mean field approximation Β―More precisely, we can quantify the overall effect of the mean field approximation by the average deviation . Theorems 3.1 and 3.2 show that the average deviation is reduced as we increase the number of types and Theorem 3.3 provides an explicit bound on the approximation in Eq. 6 based on the average deviation.

Theorem 3.1. When there are two types in the environment, but they have been considered to be the same type, the average deviation induced by the mean field approximation is bounded as follows,

where denotes the total number of agents in the environment; and denote the total number of agents of types 1 and 2, respectively; and are bounds on the average deviation for agents of types 1 and 2 respectively.

Similarly, and denote the errors induced by using a single type (instead of two types) in the mean field approximation.

Furthermore, denotes an action of an agent belonging to type 1 and denotes an action of an agent belonging to type 2. Here denotes the mean field action of all the agents, denotes the mean action of all agents of type 1 and denotes the mean action of all agents of type 2.

Proof. Since we have considered every agent to belong to only one type, the deviation between the agentβs action and the overall mean action is the error estimate. Hence, we will have the following,

The superscript () and subscript (2) have been dropped for simplicity.

Theorem 3.2. When there are two types in the environment, and they have been considered to be different types, the average deviation induced by the mean field approximation is bounded as follows,

We presented Theorems 3.1 and 3.2 to demonstrate the reduction in the bound on the average deviation as we increase the number of types from 1 to 2. Similar derivations can be performed to demonstrate that the bounds on the average deviation decrease as we increase the number of types (regardless of the true number of types).

Theorem 3.3. When the function is additively decomposable according to Equation 5, and it is -smooth, then the Multi Type Mean Field function provides a good approximation bounded by

Proof. We rewrite the expression for the function as . Suppose that is -smooth, where its gradient is Lipschitz-continous with constant such that for all

where denotes the -norm. Note that all the eigenvalues of can be bounded in the symmetric interval . As the Hessian matrix is real symmetric and hence diagonalizable, there exists an orthogonal matrix such that . It then follows that,

with

Let, Λ, consistent with the previous definition. Recall that the term is then ; Λ; Λ, where is the one-hot encoding for actions, and is a -dimensional categorical distribution. Then, it can be shown that,

Consider the term 12. Since is the maximum eigenvalue, from Equation 14 (with a slight abuse of notation),

Therefore, from Equation 16:

Thus,

Thus, in this paper, we modify the mean field function to include a finite number of types that each have a corresponding mean field. The function then considers a finite number of interactions across types.

3.2 Mean Field Update

The mean action represents the mean action of the neighbours of agent belonging to type . As in the paper by Yang et al. [17], the mean field function can be updated in a recurrent manner,

where is the reward obtained. The and are the old and new states respectively. is the learning rate. The value function for agent is given by,

Here, the term denotes the mean action of all the other agents apart from belonging to type . In all of our implementations, the mean action for all the types is first calculated using the relation

where is the policy of agent (in βs neighbourhood) and represents the previous mean action for the neighbours of agent belonging to type is the total number of agents of type in βs neighbourhood. Then, the Boltzmann policy for each agent

where is the Boltzmann softmax parameter.

By iterating through Equations 21, 22 and 23, the mean actions and respective policies of all agents keep improving. We prove in Theorem 3.4 that this approach converges to a fixed point within a small bounded distance of the Nash equilibrium. In Appendix A, we give a specific example for a case in which the Multi Type Mean Field algorithm does better than the simple mean field method.

We first make three mild assumptions about the Multi Type Mean Field update and then state two lemmas before giving Theorem 3.4.

Assumption 1: In the Multi Type Mean Field update, each action-value pair is visited infinitely often, and the reward is bounded by some constant.

Assumption 2: The agentβs policies are Greedy in the Limit with Infinite Exploration (GLIE). In the case of the Boltzmann policy, the policy becomes greedy w.r.t. the function in the limit as the temperature decays asymptotically to zero.

Assumption 3: For each stage game and in state in training, for all the Nash equilibrium is recognized either as a global optimum or a saddle point as expressed as:

Lemma 1. Under Assumption 3, the Nash operator a contraction mapping under the complete metric space from Q to Q with the fixed point being the Nash value of the entire game (), i.e.,

Proof. Refer to Theorem 17 in [7] for a detailed proof.

We define a new operator which is the Multi Type Mean Field operator. We differentiate this from the Nash operator used above. This operator is defined as (where

The Multi Type Mean Field value function can be defined as . This is the value function obtained from Equation 21. Also, . Now using the same principle of Lemma 1 on the Multi Type Mean Field operator, we can show that the Multi Type Mean Field operator also forms a contraction mapping (additionally refer to Proposition 1 in Yang et al. [17]).

converges to a constant with probability 1 (w.p.t 1) when

1) 0 β€

2)

where X is the set of possible states,

where and K is finite.

with constant and finite.

Here denotes the filtration of an increasing sequence of including the history of processes; and is a weighted maximum norm. The value of this constant ,

where and is the value with which the scale invariant iterative process is bounded. is the scale factor applied to the original process.

Proof. This lemma follows from Theorem 1 in [8]. We provide the complete proof of this lemma in the Appendix C, highlighting the changes from the Theorem 1 in [8].

Theorem 3.4. When updating according to Equations 20, 21, 22 and 23, for all agents , the multi-agentwill converge to a bounded distance of the Nash function under the Assumptions 1, 2 and 3, expressed as,

where . Here , from Theorem 3.3. The joint

Nash function is denoted as the Nash Q-value of the agent (value received by the agent in a Nash equilibrium), and

Proof. The proof of convergence of this theorem is structurally similar to that discussed by the authors in [7] and [17]. We provide the proof here, with changes necessitated by the multiple type case.

Note that in Assumption 3, Equation 24 corresponds to the global optimum and Equations 25 and 26 correspond to the saddle point. These assumptions are the same as those considered in [7]. Also note that the authors in [7] and [17] mention that Assumption 3 is a strong assumption to impose, which is needed to show the theoretical convergence, but this is not required to impose in practice.

From [7], we formally define the Nash operator

We need to apply Lemma 2 to prove Theorem 3.4. By subtracting on both sides of Equation 20 and in relation to Equation 28,

where denotes the visited state-joint action pair at the time

In Theorem 3.3, we proved a bound for the actual function and the Multi Type Mean Field function. We apply that to Equation 34, to get the following equation for

where

The aim is to prove that the four conditions of Lemma 2 hold and that in Equation 35 converges to a constant according to Lemma 2 and thus the MTMFfunction in Equation 35 converges to a point whose distance to the Nash Equilibrium is bounded. In Equation 28, refers to the learning rate and hence the first condition of Lemma 2 is automatically satisfied. The second condition is also true as we are dealing with finite state and action spaces.

Let -field generated by all random variables in the history time is a random variable derived from the historical trajectory up to time

To prove the third condition of Lemma 2, from Equation 34,

From Lemma 1, forms a contraction mapping with the norm being the maximum norm on . Thus, from Equation 35 we get,

Now, applying Equation 37 in Equation 36,

Since we are taking the max norm, the last two terms in the righthand side of Equation 38 are both positive and finite. We can prove that the term converges to 0 w.p.1. The proof involves the use of Assumption 3 (refer to Theorem 1 in [17]). We use this fact in the last step of Equation 38. Hence, the third condition of Lemma 2 is satisfied. The value of constant

For the fourth condition we use the fact that the MTMF operator forms a contraction mapping. Hence, and it follows that,

In the last step, the left side of the equation contains the reward and the value function as the variables. The reward is bounded by Assumption 1 and the value function is also bounded by being updated recursively by Equation 21 (MTMF is a contraction operator). So we can choose a positive, finite such that the inequality holds.

Finally, with all conditions met, it follows from Lemma 2 that converges to constant w.p.1. The value of this constant is from Lemma 2 and using the value of derived above. Therefore, from Equation 35 we get,

Hence, the multi-agent function converges to a point within a bounded distance from the Nash equilibrium of the stochastic game. The distance is a function of the error in the type classification and the closeness of resemblance of each agent to its type.

4 IMPLEMENTATION

We propose two algorithms based on Q-learning to estimate the Multi Type Mean Field function for known and unknown types. The algorithms, denoted as MTMFQ (Multi Type Mean Field Qlearning), train an agent to minimize the loss function . Here (from Eq. 20) is the target value used to calculate the temporal difference (T.D.) error using the weights . Now, the gradient is obtained as,

Algorithm 1 describes the Multi Type Mean Field Q-learning (MTMFQ) algorithm when agent types are known. In this algorithm, different groups of agents in the environment are considered as types. An agent models its relation to each type separately and ultimately chooses the action that provides maximum benefit in the face of competition against the different types. This is referred to as version 1 of MTMFQ. This algorithm deals with multiple types in contrast to MFQ described in the paper by Yang et al. [17]. In Line 8, each agent is chosen, and its neighbours are considered. The neighbours are classified into different types and in each type a new mean action is calculated (Line 9). In Lines 14 β 19, the Q networks are updated as done in common practice [13] for all the agents in the environment. At Line 12, the current actions are added to a buffer containing previous mean actions.

Version 2 of MTMFQ (see Algorithm 2) deals with the second type of scenario, where the type of each agent is unknown. An additional step doing K-means clustering is introduced to determine the types. Once we recognize the type of each agent, the algorithm is very similar to Algorithm 1. The clustering does not necessarily recognize the types correctly and the overall process is not as efficient as the known type case. But we will show that this way of approximate type determination is still better than using only a single mean field for all the agents. For the implementation we use neural networks, but it can be done without neural networks too. We only need a way to recursively update Equations 21, 22 and 23.

Note that the total number of types in the environment is unknown, and the agent does not need to guess the correct number of types. Generally, when more types are used, the approximate multi-agent function will be closer to the exact Nash function as shown by the bounds in Theorems 3.1, 3.2, 3.3 and 3.4. There is no risk of overfitting when using more types. In the limit, when there is one type per agent, we recover the exact multi-agent function. The only drawback is an increase in computational complexity. The code repository will appear at https://github.com/BorealisAI/mtmfrl.

5 EXPERIMENTS AND RESULTS

We report results with three games designed within the MAgents framework: the Multi Team Battle, Battle-Gathering and the Predator Prey domains. In the first two games, Multi Team Battle and the Battle-Gathering game, the conditions are such that the different groups (or teams) are fully known upfront. In the third game, the conditions are such that the types of the agents are initially unknown. Hence, the agents must also learn the identity of the opponent agents during game play. Multi Team Battle and Gathering are two separately existing MAgent games, which have been combined to obtain the Battle-Gathering game used in this paper. Predator Prey domain is also obtained from combining Multi Team Battle with another existing MAgent game (Pursuit). In preparation for each game, agents train for 2000 episodes of game play against different groups training using the same algorithm, which is referred to as the first stage. Next, in the second stage, they enter into a faceoff against other agents trained by other algorithms where they fight each other for 1000 games. We report the results for both stages. We repeat all experiments 50 times and report the averages.

As can be seen from the nature of the experiments, variances can be quite large across individual experimental runs.

In the first game (Multi Team Battle, see Figure 1(a)), there are four teams (different colours in Figure 1(a)) fighting against each other to win a battle. Here, agents belonging to one team are competing against agents from other teams and cooperating with agents of the same team. During the first stage, a team does not know how the other teams will play during the faceoff β so it clones itself to

Figure 1: Multi Team Battle Game Results.

Figure 2: Battle-Gathering Game Results.

three additional teams with slightly different rewards in order to induce different strategies. This is similar to self play. Each team receives a reward equal to the sum of the local rewards attributed to each agent in the team. The reward function is defined in such a way that it encourages local cooperation in killing opponents and discourages getting attacked and dying in the game. The reward function for different agent groups are also subtly different (refer to Appendix B for the details). Let us call each of these groups: Group A, Group B, Group C, and Group D. We maintain a notion

Figure 3: Predator Prey Game Results.

of favourable opponents and each group gets slightly higher rewards for killing the favourable opponents than the others. Four algorithms, namely MFQ [17], MFAC [17], Independent Q Learning (IL) [16], and MTMFQ are trained separately where each of these groups trains its own separate network using the same algorithm (all agents within a group train the same network). We start all the battles with 72 agents of each group for training and testing. Group A from MTMFQ, Group B from IL, Group C from MFQ, and Group D from MFAC enter the faceoff stage where they fight each other for 1000 games. Each game (or episode) has a maximum of 5000 steps and a game is considered to be won by the group/groups that have the most number of agents alive at the end of the game.

The results of the first training stage are reported in Figure 1(a). We report the cumulative rewards from all agents in Group A for each algorithm. These rewards are for Group A against the other 3 groups. Since the different groups are just clones of each other the reward curves for other groups are similar to that of Group A. In the training stage, the teams trained by different algorithms did not play against each other, but simply against the cloned teams trained by the same algorithm. At the beginning of training, for about 100 episodes the agents are still exploring, and their strategies are not well differentiated yet. As a result, MTMFQβs performance is still comparable to the performance of other algorithms. At this stage, the assumption of a single type is fine. As training progresses, each group begins to identify the favourable opponents and tries to make a targeted approach in the battle. When such differences exist across a wide range of agents, the MTMFQ algorithm shows a better performance than the other techniques as it explicitly considers the presence of different types in the game. Overall, we observe that MTMFQ has a faster convergence than all other algorithms, and it also produces higher rewards at the end of the complete training. This shows that MTMFQ identifies favourable opponents early, but the other algorithms struggle longer to learn this condition. The MFAC algorithm is the worst overall. This is consistent with the observation by Yang et al. [17], where the authors give particular reasons for this bad performance, including the greedy in the limit with infinite exploration (GLIE) assumption and a positive bias of Q-learning algorithms. This algorithm is not able to earn an overall positive reward upon the complete training. Figure 1(c) shows the win rate of the teams trained by each algorithm (MTMFQ, MFQ, MFAC, and IL) in a direct faceoff of 1000 episodes. In the face off, each group is trained by a different algorithm and with a different reward function that induces different strategies. The only algorithm that handles different strategies among the opponent teams is MTMFQ, and therefore it dominates the other algorithms with a win rate of 60%. Figure 1(d) reinforces this domination.

The second domain is the Battle-Gathering game (Figure 2(a)). In this game, all the agent groups compete for food resources that are limited (red denotes food particles and other colours are competing agents) in addition to killing its opponents as in the Multi Team Battle game. Hence, this game is harder than the first one. All the training and competition are similar to the first game.

Figure 2(b) reports the results of training in the Battle-Gathering game. Like the Multi Team Battle game, we plot the rewards obtained by Group A while fighting other groups for each algorithm. Again, MTMFQ shows the strongest performance in comparison to the other three algorithms. The MFQ technique performs better than both MFAC and IL. In this game, MTMFQ converges in around 1500 episodes, while the other algorithms take around 1800 episodes to converge. The win rates shown in Figure 2(c) and the total rewards reported in Figure 2(d) also show the dominance of MTMFQ.

The third domain is the multiagent Predator Prey (Figure 3(a)). Here we have two distinct types of agents β predator and prey, each having completely different characteristics. The prey are faster than the predators and their ultimate goal is to escape the predators. The predators are slower than the prey, but they have higher attacking abilities than the prey. So the predators try to attack more and kill more prey. We model this game as an unknown type scenario where the types of the other agents in the competition are not known before hand (refer to Appendix B for more details). The MTMFQ algorithm plays the version with unknown types (Algorithm 2). Here we have four groups with the first two groups (Groups A and B) being predators and the other two groups (Groups C and D) being prey. Each algorithm will train two kinds of predator agents and two kinds of prey agents. All these agents are used in the playoff stage. In the playoff stage we have 800 games where we change the algorithm of predator and prey at every 200 games to maintain a fair competition. For the first 200 games, MTMFQ plays Group A, MFQ plays Group B, IL plays Group C, and MFAC plays Group D. In the next 200 games, MFAC plays Group A, MTMFQ plays Group B, MFQ plays Group C and IL plays Group D, and so on. We start all training and testing episodes with 90 prey and 45 predators for each group. Winning a game in the playoff stage is defined in the same way as the previous two games. Notice that this makes it more fair, as predators have to kill a lot more prey to win the game (as we start with more prey than predators) and prey have to survive longer. In this setup, the highly different types of agents make type identification easier for MTMFQ (as the types are initially unknown). The prey execute more move actions while the predators execute more attack actions. This can be well differentiated by clustering.

The results of the first training stage are reported in Figure 3(b). MTMFQ has comparable or even weaker performance than other algorithms in the first 600 episodes of the training and the reasoning is similar to the reasoning in the Multi Team Battle game (the agent strategies are not sufficiently differentiated for multiple types to be useful). Notice that for this game, MTMFQ takes many more episodes than the earlier two games to start dominating. This is because of the inherent hardness of this domain compared to the other domains (very different and unknown types). Similar to observations in the other domains, MTMFQ converges earlier (after around 1300 episodes as opposed to 1700 for the other algorithms). This shows its robustness to the different kinds of opponents in the environment. MTMFQ also gains higher cumulative rewards than the other algorithms. Win rates in Figure 3(c) show that MTMFQ still wins more games than the other algorithms, but the overall percentage of games won is less than the other domains. Thus, irrespective of the difficulty of the challenge we can see that MTMFQ has an upper hand. The lead of MTMFQ is also observed in Figure 3(d).

6 CONCLUSION

In this paper, we extended the notion of mean field theory to multiple types in MARL. We demonstrate that reducing many agent interactions to simple two agent interactions does not give very accurate solutions in environments where there are clearly different teams/types playing different strategies. We perform suitable experiments using MAgents and demonstrate superior performances using a type based approach. We hope that this paper will provide a different dimension to the mean field theory based MARL research.

One limitation of our approach is that it is computationally more expensive than the mean field reinforcement learning method without types. If we really have only one type in the environment, then our method would add more compute time and not necessarily produce a better result. As future work we would like to extend this work for completely heterogeneous agents with different action spaces as well. StarCraft is one example of such a domain. Our work would be well suited for this scenario as clustering would be even easier. Another approach would be to consider sub types, further dividing types.

REFERENCES

[1] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath. 2017. Deep Reinforcement Learning: A Brief Survey. IEEE Signal Processing Magazine 34, 6 (Nov 2017), 26β38. https://doi.org/10.1109/MSP.2017.2743240

[2] Daan Bloembergen, Karl Tuyls, Daniel Hennes, and Michael Kaisers. 2015. Evolutionary dynamics of multi-agent learning: A survey. Journal of Artificial Intelligence Research 53 (2015), 659β697.

[3] Michael Bowling and Manuela Veloso. 2000. An analysis of stochastic game theory for multiagent reinforcement learning. Technical Report. Carnegie-Mellon Univ Pittsburgh Pa School of Computer Science.

[4] Lucian Bu, Robert Babu, Bart De Schutter, et al. 2008. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 38, 2 (2008), 156β172.

[5] Lucian Busoniu, Robert Babuska, and Bart De Schutter. 2010. Multi-agent Reinforcement Learning: An Overview. Vol. 310. Delft University of Technology, 183β221. https://doi.org/10.1007/978-3-642-14435-6_7

[6] Pablo Hernandez-Leal, Bilal Kartal, and Matthew E. Taylor. 2019. A survey and critique of multiagent deep reinforcement learning. Autonomous Agents

and Multi-Agent Systems 33, 6 (01 Nov 2019), 750β797. https://doi.org/10.1007/ s10458-019-09421-1

[7] Junling Hu and Michael P Wellman. 2003. Nash Q-learning for general-sum stochastic games. Journal of machine learning research 4, Nov (2003), 1039β1069.

[8] Tommi Jaakkola, Michael I. Jordan, and Satinder P. Singh. 1994. On the Convergence of Stochastic Iterative Dynamic Programming Algorithms. Neural Computation 6, 6 (1994), 1185β1201. https://doi.org/10.1162/neco.1994.6.6.1185

[9] James S Jordan. 1991. Bayesian learning in normal form games. Games and Economic Behavior 3, 1 (1991), 60β81.

[10] Jean-Michel Lasry and Pierre-Louis Lions. 2007. Mean field games. Japanese journal of mathematics 2, 1 (2007), 229β260.

[11] Eric Maskin. 1999. Nash equilibrium and welfare optimality. The Review of Economic Studies 66, 1 (1999), 23β38.

[12] David Mguni, Joel Jennings, and Enrique Munoz de Cote. 2018. Decentralised learning in systems with many, many strategic agents. In Thirty-Second AAAI Conference on Artificial Intelligence.

[13] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. Nature 518, 7540 (2015), 529.

[14] Thanh Thi Nguyen, Ngoc Duy Nguyen, and Saeid Nahavandi. 2018. Deep reinforcement learning for multi-agent systems: a review of challenges, solutions and applications. arXiv preprint arXiv:1812.11794 (2018).

[15] Richard S Sutton and Andrew G Barto. 1998. Introduction to reinforcement learning. Vol. 135. MIT press Cambridge.

[16] Ming Tan. 1993. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the tenth international conference on machine learning. 330β337.

[17] Yaodong Yang, Rui Luo, Minne Li, Ming Zhou, Weinan Zhang, and Jun Wang. 2018. Mean Field Multi-Agent Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning (Proceedings of Machine Learning Research), Jennifer Dy and Andreas Krause (Eds.), Vol. 80. PMLR, Stockholm, Stockholm Sweden, 5571β5580. http://proceedings.mlr.press/v80/yang18d.html

[18] Lianmin Zheng, Jiacheng Yang, Han Cai, Ming Zhou, Weinan Zhang, Jun Wang, and Yong Yu. 2018. Magent: A many-agent reinforcement learning platform for artificial collective intelligence. In Thirty-Second AAAI Conference on Artificial Intelligence.

APPENDIX A: ILLUSTRATIVE EXAMPLE

This section provides an example of a simulated scenario where the Multi Type Mean Field algorithm is more useful than the simple mean field algorithm. Consider a game in which the central agent has to decide the

direction of spin. The domain is stateless. The spin direction is

influenced by the direction of spin of its A,B,C and D neighbours (four neighbours in total). Here we denote A as the neighbour to the left of the agent, B as the neighbour to the top of the agent, C as the neighbour to the right of the agent, and D as the neighbour to the bottom of the agent. The neighbour agents spin in one direction at random. If the agent spins in the same direction as both of its C and D neighbours, the agent gets a reward of -2 regardless of what the A and B are doing. If the spin is in the same direction as both of its A and B neighbours, the agent gets a +2 reward unless the direction is not the same as the one used by both of its C and D neighbours. All other scenarios result in a reward of 0. So the agent in Grid A in Figure 4 will get a -2 for the spin down (since the C and D neighbours are spinning down) and a +2 for a spin up (since the A and B neighbours are spinning up). In Grid B of Figure 4 the agent will get a -2 for spin up and a +2 for spin down. It is clear that the best action in Grid A is to spin up and the best action in Grid B is to spin down. Here we have a notion of multi stage game with many individual

stages. In each stage of a multi stage game, one or more players

take one action each, simultaneously and obtain rewards. Consider a sequence of stage games in which the agent gets Grid

A for every 2 consecutive stages and then the Grid B for the third

stage. The goal of the agent is to take the best possible action at all stages. Let us assume that the agent continues to learn at all stages, and it starts with Q values of 0. We apply MFQ (Equation 2) and MTMFQ (Equation 20) and show why MFQ fails, but MTMFQ succeeds in this situation. We are going to calculate the Q values for the 3 stages using both the MFQ (from [17]) and the MTMFQ update rules. We approximate the average action using the number of times the neighbourhood agents spin up. In the MTMFQ we use the A, B neighbours as the type 1 and the C, D neighbours as the

Thus, the agent will choose to spin up in the first stage (correct action). For the second stage,

Again the agent will choose to spin up in the second stage (correct action). For the third stage,

Figure 4: A counter example to show the failure of MFQ and success of MTMFQ.

Here again the agent will choose to make the spin up (wrong action). Now coming to MTMFQ updates, for the first stage,

Now it can be seen that the agent will spin down in this case (correct action).

Thus, the MFQ agent will make one wrong move out of 3 moves whereas the MTMFQ agent will make the right move all the time. In situations like these, where the relationship of the agent with different neighbour agents is different, the MFQ algorithm would fail. The differences would be captured by MTMFQ which would take more efficient actions. This shows an example where MTMFQ outperforms MFQ.

APPENDIX B: EXPERIMENTAL DETAILS

This section gives more details about the experimental conditions, especially the reward function.

For the Multi Team Battle domain, the agents in all of these groups get a reward of -0.01 for each move action, and a reward of -1 for dying. Attacking an empty grid has a reward of -0.1. Each of these members has a power of 10 and a damage of 2. When an agent loses all its powers, it dies. The power is like health that the agents maintain which gets depleted on being attacked. The agents can also recover (regain health points) from the attack using a step recovery rate. This is set to be 0.1. The positive rewards for attacking another agent is cyclic in nature with each group preferring to attack and kill a particular opponent group more than others. Group A has a positive reward of 0.2, 0.3, and 0.4 for attacking an agent of group B, C and D respectively, and it gets a reward of 80, 90, and 100 for killing a member of group B, C, and D respectively. Here we will call Group D as the favourable opponent of Group A as A gets more returns for fighting or killing D. The group B has a positive reward of 0.2, 0.3, and 0.4 for attacking a member of group C, D, and A respectively and a positive reward of 80, 90, and 100 for killing a member of Group C, D, and A respectively. Thus, Group A is the favourable opponent of Group B. Group C has a reward of 0.2, 0.3, and 0.4 for attacking Groups D, A, and B respectively. It gets a kill reward of 80, 90, and 100 for killing agents of Group D, A and B respectively. For Group D the order is Group A, B and C with a attack reward of 0.2, 0.3, and 0.4 and kill reward of 80, 90, and 100.

For the Battle-Gathering game, the reward function is similar with an addition of each agent getting a +80 for collecting a food resource. The food resources are stationary objects, but each food resource has a power similar to agents. This power has to be reduced by damage before the food can be captured. Capturing a food will constitute making repeated efforts to gain the food resource by attacking the grid containing food. The agents would get a +0.5 for making such an attack.

For the Predator Prey domain, the predators have a power of 10 and a speed of 2 with an attack range of 2 (they can attack within a distance of 2 units) and they get a dead penalty of -0.1 and an attack penalty of -0.2. The step recovery rate is 0.1. The prey have a faster speed of 2.5 and an attack range of 0. All agents get a reward of +5 for killing agents belonging to other groups. Additionally, Group A gets a reward of +0.5 for attacking a member of Group B (since Group B is also a predator, Group A does not prefer to attack that), +1 for attacking a member of Group C, +3 for attacking a member of Group D (though both Groups C and D are prey, A prefers D to C). Group B gets 0.5 for attacking A, but gets +1 for attacking D and +3 for attacking C. Group B prefers Group C to Group D. Every attack action entails a punishment for the (attack) receiver which is equal in value to the reward for the attacker.

APPENDIX C: PROOF OF LEMMA 2 USED IN THIS PAPER

This section explains the changed Lemma used in this paper compared to Theorem 1 in [8].

We state and prove a general theorem for a stochastic process before we begin the proof.

where goes from 0 to , then the general solution of this process can be given by

where

Proof. Consider the expression,

Z transforms will be applied to solve this expression. Z transform is a generalized version of Discrete Time Fourier Transform that transforms the variables into a new subspace where solving the equation is easier. Once we obtain a solution, we can apply inverse Z transforms to get the solution in terms of the original variables.

Considering 1 we can rewrite the Equation 45 as

Now taking the inverse Z transform

The above result is for the case where 1. If 1, see that the Equation 42 forms an arithmetic progression whose general term is

Now from Theorem 6.1, notice that if we want a general stochastic process of that form to converge we need the coefficient a fraction (as we take a limit to in the solution only a a fraction will give a converged result). Note that this result only needs to be finite, and it can be any finite number. If we iterate to infinity then we can apply the limit to the solution which will converge to

converges to zero w.p.1, if the following conditions are satisfied:

All the random variables are allowed to depend on the past . are assumed to be non-negative and mutually independent given

Proof. This is the same as Lemma 1 in [8]. The proof is based on that fact that we can divide the process (both sides of Equation 48) by a large value the Equation 48 is effectively reduced to

This is because of the conditions 1 and 2 which guarantees that is a fraction and that the variance of the process is finite.

Now if you consider Equation 50, the update is in such a way that the process is equal to a fraction of its previous value. Thus, this process converges to 0 w.p.1.

where is a sequence of functions and is a random process. Let be a probability space. If the following are satisfied: 1) The process is scale invariant. That is w.p.1 for all

2) If we can keep bounded by scaling the process then would converge to a constant w.p.1 under condition 1.

The original process will converge to a constant is the scaling factor that was applied to

Proof. The intuition of the proof is that we have a process that starts at a value and its first difference reduces with time till the value stabilizes at a point. Now, if this process is invariant to scaling, we can start the process with a small value (after scaling by a small fraction ) and then we can select a constant over which the should not increase (we can scale the whole process if it goes above that constant). Now according to the second condition the bounded process should converge to a constant . Here the relation is To show that the net effect of the corrections must stay finite w.p.1, note that if converges then for any 0 there exists such that for all with probability at least 1 . This implies that the norm of the original process does not go above . Thus, convergence of to constant then imply that the scaled version of the original process converges to the same constant w.p.1 under the bound. Now if we remove the scaling factor the convergence point of the original process is

Proof. This is simply an application of Lemma 4.

According to condition 5, we have a process that is scale invariant. Let us assume that we applied a scaling factor of to that process to get the bound as in Equation 53. Now, let us consider the iterative process

Equation 54 is linear in and will converge to a point by Theorem 6.1 w.p.1, to some , where , where is a finite arbitrary constant. From Theorem 6.1 we can see that Equation 54 converges to a constant and hence, Lemma 4 can be applied. The convergence point will be from Theorem 6.1 for Equation 54. If we change the value of the bound then the convergence point will change accordingly. To get the convergence point of the original process, we reapply the scaling factor. Thus, we get that point to be

converges to a constant w.p.1 under the following conditions:

3) finite.

stands for the past at step is allowed to depend on the past. and are assumed to be non negative. The notation ||.|| refers to some weighted maximum norm.

The value of this constant is the constant with which the iterative process is bounded. is the scaling factor that was applied to the original process.

Proof. Be defining we can decompose the iterative process into two parallel processes given by

where . Dividing both the sides of Equation 55 by for each and denoting and we can bound the process by condition 3.

Now we can rewrite the equation pair from condition 3 as

Let us assume that the process stays bounded. Then the variance of is bounded by some constant and thereby converges to zero w.p.1 according to Lemma 3. Hence, there exists such that for all with probability at least 1 This implies that the process can be further bounded by

with probability . If we choose such that 1 then for

Note that in the above relation we do not need the term to be less than 1. We only need the product of this term with to be less that one. Let us represent . Now rewriting Equation 57 we get the following bound,

Let us bound the norm by . Then we get the bound as

Let us assume that is scale invariant (we prove that below). Now we can apply Lemma 5 as this satisfies all the conditions of Lemma 5. The original process converges to a constant w.p.1. Again according to Lemma 5 this constant value is

where is the factor with which the original process was scaled. Let us denote a new fraction . Thus, the value of

. Now, this guarantees w.p.1 convergence of the original process under the boundedness assumption if that process is scale invariant. Let the constant be the point Equation 60 converges to according to Theorem 6.1. Since, the value of is very small for a small (we will show that is small in the application) and a small (since is arbitrary, we can choose a small ), we can get

Now we prove the scale invariance condition same as Jaakola et al. [8]. By Condition 4, can be written as where . Let us now decompose

and converges to zero w.p.1 by Lemma 3. Again by choosing such that 1 we can bound the processes for is then a scale invariant process whose bounded version was proven earlier to converge to This proves the w.p.1 convergence of the triple the original process.