b

DiscoverSearch
About
My stuff
Multi Type Mean Field Reinforcement Learning
2020Β·arXiv
ABSTRACT
ABSTRACT

Mean field theory provides an effective way of scaling multiagent reinforcement learning algorithms to environments with many agents that can be abstracted by a virtual mean agent. In this paper, we extend mean field multiagent algorithms to multiple types. The types enable the relaxation of a core assumption in mean field reinforcement learning, which is that all agents in the environment are playing almost similar strategies and have the same goal. We conduct experiments on three different testbeds for the field of many agent reinforcement learning, based on the standard MAgents framework. We consider two different kinds of mean field environments: a) Games where agents belong to predefined types that are known a priori and b) Games where the type of each agent is unknown and therefore must be learned based on observations. We introduce new algorithms for each type of game and demonstrate their superior performance over state of the art algorithms that assume that all agents belong to the same type and other baseline algorithms in the MAgent framework.

KEYWORDS

Mean Field Methods; Multiagent Systems; Reinforcement Learning; Many-Agent Learning

ACM Reference Format: Sriram Ganapathi Subramanian, Pascal Poupart, Matthew E. Taylor, and Nidhi Hegde. 2020. Multi Type Mean Field Reinforcement Learning. In Proc. of the 19th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2020), Auckland, New Zealand, May 9–13, 2020, IFAAMAS, 13 pages.

Multiagent reinforcement learning (MARL) is a quickly growing field with lots of recent research pushing its boundaries [6, 14]. Yet, scaling the multiagent algorithms to environments with a large number of learning agents continues to be a problem [5]. Research advances in the field of MARL [2, 4] deal with only a limited number of agents and the proposed methods cannot be easily generalized to more complex scenarios with many agents. Some recent research has used the concept of mean field theory for enabling the use of MARL approaches to environments with many agents [12, 17].

image

image

Yet, the current algorithms implemented for many agents require some strong assumptions about the game environment to perform adequately. The important mean field approximation reduces a many agent problem into a simplified two agent problem where all the other participating agents are approximated as a single mean field. This mean field approximation, however, would be valid only for scenarios where all the agents in the environment can be considered similar to each other in objectives and abilities. Real world applications often have a set of agents that are diverse, and therefore it is virtually impossible to aggregate them into a single mean field.

In this paper, we introduce a concept of multiple types to model agent diversity in the mean field approximation for many agent reinforcement learning. The types are groupings applied to other agents in the environment in such a way that all members of a particular type play approximately similar strategies and have similar overall goals. Now, the modelling agent can consider each type to be a distinct agent, which has to be modelled separately. Within each type, the mean field approximation should still be reasonable as the agents within one particular type are more related to each other than other agents from different types. Thus, the many agent interaction is effectively reduced to  𝑀agent interactions where  𝑀is the number of types. Note that this is more complex than the simple two agent interaction considered by previous research and this can approximate a real world dynamic environment in a much better way. Most real world applications for many agent reinforcement learning can be broadly classified into two categories. The first category involves applications where we have predefined types and the type of each agent is known a priori. Some common applications include games with multiple teams (like quiz competitions), multiple party elections with coalitions, airline price analysis with multiple airlines forming an alliance beforehand, etc. We call these applications predefined known type scenarios. The other category are applications that involve a large number of agents that may have different policies due to differences in their rewards, actions or observations. Common examples are demand and supply scenarios, stock trading scenarios, etc., where one type of agent may be risk averse while another type may be risk seeking. We call these applications pre-defined unknown type scenarios, since there are no true underlying types like the previous case and a suitable type therefore must be assigned through observations. Another important aspect in this paper will be the notion of neighbourhood since each individual agent may be impacted more by agents whose states are β€œcloser” (according to some distance measure) to the state of an agent. For instance, in battle environments, nearby agents pose a greater threat than far away agents.

Using the open source test environment for many agent games β€” MAgents [18] β€” we consider three testbeds that involve many strategic agents in the environment. Two of these testbeds correspond to the known type and the third one corresponds to the unknown type. We introduce two different algorithms for the known and unknown type scenarios. We demonstrate the superior performance of these algorithms in comparison to previous algorithms that assume a single type of agents in the testbeds.

Reinforcement Learning: Single agent reinforcement learning (RL) [15] is the most common form of reinforcement learning in the literature [1]. Here the problem is modelled in the framework of a Markov Decision Process (MDP) where the MDP is composed of  ⟨S,𝐴, 𝑃, π‘…βŸ©, where Sdenotes the set of states that the agent can move into,  𝐴denotes the set of actions that the agent can take,  𝑃denotes the transition distribution (𝑃(𝑠′|𝑠,π‘Ž)) and  𝑅denotes the reward function (𝑅(𝑠,π‘Ž)). The agents are allowed to explore the environment during the process of training and collect experience tuples  βŸ¨π‘ ,π‘Ž,𝑠′,π‘ŸβŸ©. An agent learns a policy  πœ‹ : S β†’ 𝐴, which is a mapping from states to actions where the goal is to maximize the expected cumulative rewards �𝑑 𝛾𝑑𝑅(𝑠𝑑,π‘Žπ‘‘), where 𝛾 ∈ [0, 1) is thediscount factor. An optimal policy obtains the highest cumulative rewards among all possible policies.

In Multiagent reinforcement learning (MARL), there is a notion of stochastic games [3], where the state and action space are defined as Cartesian products of individual states and actions of different agents in the environment. A stochastic game can be considered a special type of normal form game [9], where a particular iteration of the game depends on previous game(s) played and the experiences of all the agents in the previous game(s). A stochastic game can be defined as a tuple  ⟨S, 𝑁, A, 𝑃, π‘…βŸ© where Sis a finite set of states (assumed to be the same for all agents),  𝑁is a finite set of  𝑛 agents,A = 𝐴1 Γ—Β· Β· ·×𝐴𝑛 where 𝐴𝑗 denotes the set of actions of the agent  𝑗.𝑃is the transition distribution  𝑃(𝑠′|𝑠, a) where a = (π‘Ž1, . . . ,π‘Žπ‘›) and𝑅𝑗 (𝑠, a) = π‘Ÿ 𝑗is the reward function with  π‘Ÿ 𝑗denoting the reward received by agent  𝑗. Here each agent is trying to learn a policy that maximizes its return upon consideration of opponent behaviours. Agents are typically self interested and the combined system moves towards a Nash equilibrium [11]. Scalability in MARL environments is often a bottleneck. Important research efforts are aimed at handling only up to a handful of agents and the solutions or algorithms considered become intractable in many agent scenarios.

Mean Field Reinforcement Learning: Mean field theory approximates many agent interactions in a multiagent environment into two agent interactions [10] where the second agent corresponds to the mean effect of all the other agents. This allows domains with many agents that were previously considered intractable to be revisited and scalable approximate solutions to be devised [12, 17]. Mean field reinforcement learning applies mean field approximation to stochastic games [17]. In the paper by Yang et al. [17], the multi-agent  𝑄function for stochastic games is decomposed additively into local  𝑄functions that capture pairwise

interactions:

image

Here  𝑛𝑗is the number of neighbours of the agent  𝑗and  πœ‚(𝑗)is the index set of neighbouring agents. Yang et al. [17] showed that this decomposition can be well approximated by the mean field  𝑄function  𝑄 𝑗 (𝑠, a) β‰ˆ 𝑄 𝑗𝑀𝐹 (𝑠,π‘Žπ‘—,π‘Žπ‘—)under certain conditions. The mean action  π‘Žπ‘— on the neighbourhood  πœ‚(𝑗) of agent 𝑗is expressed as  π‘Žπ‘— = 1π‘›π‘—οΏ½π‘˜ βˆˆπœ‚(𝑗) π‘Žπ‘˜where  π‘Žπ‘˜is the action of each neighbour π‘˜. In the case of discrete actions,  π‘Žπ‘˜is a one-hot vector encoding and  π‘Žπ‘— is a vector of fractions corresponding to the probability that each action may be executed by an agent at random.

The mean field  𝑄function can be updated in a recurrent manner as follows,

image

where  π‘Ÿ 𝑗 is the reward obtained. The  𝑠,𝑠′ are the old and new states respectively.  𝛼𝑑is the learning rate. The value function  𝑣 𝑗𝑑 (𝑠′)for agent  𝑗 at time 𝑑is given by,

image

Here, the term  π‘Žπ‘—denotes the mean action of all the other agents apart from  𝑗. The mean action for all the agents is first calculated using the relation,  π‘Žπ‘— = 1𝑁 π‘—οΏ½π‘˜ π‘Žπ‘˜,π‘Žπ‘˜ ∼ πœ‹π‘˜π‘‘ (Β·|𝑠,π‘Žπ‘˜βˆ’), where  πœ‹π‘˜π‘‘is the policy of the agent  π‘˜at time  𝑑, and  π‘Žπ‘˜βˆ’represents the previous mean action for the neighbours of the agent  π‘˜. 𝑁 𝑗 denotes the total number of agents in the neighbourhood of the agent  𝑗. Then, the Boltzmann policy for each agent  𝑗is calculated using  𝛽, which is the Boltzmann softmax parameter,

image

We consider environments where there are  𝑀types that the neighbouring agents can be classified into. In this paper, we assume that the  𝑄function decomposes additively according to a partition of the agents into  𝑋 𝑗 subsets that each include one agent of each type. This decomposition can be viewed as a generalization of pairwise decompositions to multiple types since each term depends on a representative from each type.

Let the standard multi-agent  𝑄function be  𝑄 𝑗 (𝑠, 𝒂), then,

image

Here we have a total of  𝑀 types and π‘Žπ‘˜π‘šdenotes the action of agent π‘˜belonging to type  π‘šin the neighbourhood of agent  𝑗. Notice that this representation of the  𝑄function includes the interaction with each one of the types and is not a simple pairwise interaction as done by [17]. Let us assume that we have a scheme in which we can classify each agent into one of these subsets. Note that we do not need each group to contain an equal number of agents as we can always make a new subset that contains one agent of a type and other agents to be placeholder agents (dead agents) of other types. We will relax this requirement of decomposition shortly and in practice we do not need to make these subsets at all.

3.1 Mean Field Approximation

We also assume discrete action spaces and use a one hot representation of the actions as in Yang et al. [17]. The one hot action of each agent  π‘˜belonging to type  π‘šin the neighbourhood of agent  𝑗is represented as  π‘Žπ‘˜π‘šπ‘š = π‘Žπ‘—π‘š + ˆ𝛿 𝑗,π‘˜π‘šwhere  π‘Žπ‘—π‘šis the mean action of all agents in the neighbourhood of agent  𝑗belonging to type  π‘šand ˆ𝛿 𝑗,π‘˜π‘šis the deviation between the action of an agent and the mean action of its type.

Let  𝛿 𝑗,π‘˜π‘– = [ ˆ𝛿 𝑗,π‘˜1; ˆ𝛿 𝑗,π‘˜2; Β· Β· Β·; ˆ𝛿 𝑗,π‘˜π‘€ ]be a vector obtained by the concatenation of all such deviations of the agents in the neighbourhood of the agent  𝑗belonging to each of the  𝑀types (all agents of a single subset). Similar to [17], we apply Taylor’s theorem to expand the  𝑄function in Equation 5 to get,

image

which is the Taylor polynomial remainder. The summation term [ 1𝑋 𝑗�𝑋 𝑗𝑖=1 𝛿 𝑗,π‘˜π‘– ]sums out to 0. Finally, if we ignore the remainder terms  𝑅𝑗𝑠,π‘Žπ‘—, we obtain the following approximation,

image

The magnitude of this approximation depends on the deviation ˆ𝛿 𝑗,π‘˜π‘šbetween each actionπ‘Žπ‘˜π‘šπ‘š and its mean field approximation Β―π‘Žπ‘—π‘š.More precisely, we can quantify the overall effect of the mean field approximation by the average deviation οΏ½π‘˜ || Λ†π›Ώπ‘˜ ||2/𝑁. Theorems 3.1 and 3.2 show that the average deviation is reduced as we increase the number of types and Theorem 3.3 provides an explicit bound on the approximation in Eq. 6 based on the average deviation.

Theorem 3.1. When there are two types in the environment, but they have been considered to be the same type, the average deviation induced by the mean field approximation is bounded as follows,

image

where  𝑁denotes the total number of agents in the environment;  𝐾1and  𝐾2denote the total number of agents of types 1 and 2, respectively; πœ–1and  πœ–2are bounds on the average deviation for agents of types 1 and 2 respectively.

image

Similarly,  𝛼1and  𝛼2denote the errors induced by using a single type (instead of two types) in the mean field approximation.

image

Furthermore,  π‘Žπ‘˜1denotes an action of an agent belonging to type 1 and  π‘Žπ‘˜2denotes an action of an agent belonging to type 2. Here  π‘Ždenotes the mean field action of all the agents,  π‘Ž1denotes the mean action of all agents of type 1 and  π‘Ž2denotes the mean action of all agents of type 2.

Proof. Since we have considered every agent to belong to only one type, the deviation between the agent’s action and the overall mean action is the error estimate. Hence, we will have the following,

image

The superscript (𝑗) and subscript (2) have been dropped for simplicity.

image

Theorem 3.2. When there are two types in the environment, and they have been considered to be different types, the average deviation induced by the mean field approximation is bounded as follows,

image

We presented Theorems 3.1 and 3.2 to demonstrate the reduction in the bound on the average deviation as we increase the number of types from 1 to 2. Similar derivations can be performed to demonstrate that the bounds on the average deviation decrease as we increase the number of types (regardless of the true number of types).

Theorem 3.3. When the  𝑄function is additively decomposable according to Equation 5, and it is  𝐿-smooth, then the Multi Type Mean Field  𝑄function provides a good approximation bounded by

image

Proof. We rewrite the expression for the  𝑄function as  𝑄(π‘Ž) β‰œπ‘„ 𝑗 (𝑠,π‘Žπ‘—,π‘Žπ‘—1,π‘Žπ‘—2, Β· Β· Β· ,π‘Žπ‘—π‘€). Suppose that  𝑄is  𝐿-smooth, where its gradient  βˆ‡π‘„is Lipschitz-continous with constant  𝐿such that for all  π‘Ž,π‘Ž,

image

where  || Β· ||2denotes the  𝑙2-norm. Note that all the eigenvalues of  βˆ‡2𝑄can be bounded in the symmetric interval  [βˆ’πΏ, 𝐿]. As the Hessian matrix  βˆ‡2𝑄is real symmetric and hence diagonalizable, there exists an orthogonal matrix  π‘ˆsuch that  π‘ˆπ‘‡ [βˆ‡2𝑄]π‘ˆ = Ξ› β‰œπ‘‘π‘–π‘Žπ‘”[πœ†1, . . . , πœ†π·]. It then follows that,

image

with

image

Let, Λ†π›Ώπ‘šπ‘Ž = π‘Žπ‘—π‘š βˆ’ π‘Žπ‘—π‘š, consistent with the previous definition. Recall that the term  𝛿is then  π›Ώπ‘Ž = [ ˆ𝛿1π‘Ž; ˆ𝛿2π‘Ž; Β· Β· Β·; Λ†π›Ώπ‘€π‘Ž], where  π‘Žis the one-hot encoding for  𝐷actions, and  π‘Žis a  𝐷-dimensional categorical distribution. Then, it can be shown that,

image

Consider the term 12π‘‹οΏ½π‘‹π‘˜=1 𝑅𝑗𝑠,π‘Žπ‘— (π‘Žπ‘˜). Since  𝐿is the maximum eigenvalue, from Equation 14 (with a slight abuse of notation),

image

Therefore, from Equation 16:

image

Thus,

image

Thus, in this paper, we modify the mean field  𝑄function to include a finite number of types that each have a corresponding mean field. The  𝑄function then considers a finite number of interactions across types.

3.2 Mean Field Update

The mean action  π‘Žπ‘—π‘–represents the mean action of the neighbours of agent  𝑗belonging to type  𝑖. As in the paper by Yang et al. [17], the mean field  𝑄function can be updated in a recurrent manner,

image

where  π‘Ÿ 𝑗is the reward obtained. The  𝑠and  𝑠′are the old and new states respectively.  𝛼𝑑is the learning rate. The value function  𝑣 𝑗𝑑 (𝑠′)for agent  𝑗 at time 𝑑is given by,

image

Here, the term  π‘Žπ‘—π‘–denotes the mean action of all the other agents apart from  𝑗belonging to type  𝑖. In all of our implementations, the mean action for all the types is first calculated using the relation

image

where  πœ‹π‘˜π‘‘is the policy of agent  π‘˜(in  𝑗’s neighbourhood) and  π‘Žπ‘˜π‘–βˆ’represents the previous mean action for the neighbours of agent  π‘˜belonging to type  𝑖. 𝑁 𝑗𝑖is the total number of agents of type  𝑖in 𝑗’s neighbourhood. Then, the Boltzmann policy for each agent  𝑗 is

image

where  𝛽is the Boltzmann softmax parameter.

By iterating through Equations 21, 22 and 23, the mean actions and respective policies of all agents keep improving. We prove in Theorem 3.4 that this approach converges to a fixed point within a small bounded distance of the Nash equilibrium. In Appendix A, we give a specific example for a case in which the Multi Type Mean Field algorithm does better than the simple mean field method.

We first make three mild assumptions about the Multi Type Mean Field update and then state two lemmas before giving Theorem 3.4.

Assumption 1: In the Multi Type Mean Field update, each action-value pair is visited infinitely often, and the reward is bounded by some constant.

Assumption 2: The agent’s policies are Greedy in the Limit with Infinite Exploration (GLIE). In the case of the Boltzmann policy, the policy becomes greedy w.r.t. the  𝑄function in the limit as the temperature decays asymptotically to zero.

Assumption 3: For each stage game  [𝑄1𝑑 (𝑠), . . . ,𝑄𝑁𝑑 (𝑠)] at time𝑑and in state  𝑠in training, for all  𝑑,𝑠, 𝑗 ∈ {1, . . . , 𝑁 }the Nash equilibrium  πœ‹βˆ— = [πœ‹1βˆ—, . . . , πœ‹π‘βˆ— ]is recognized either as a global optimum or a saddle point as expressed as:

image

Lemma 1. Under Assumption 3, the Nash operator  H π‘π‘Žπ‘ β„Ž formsa contraction mapping under the complete metric space from Q to Q with the fixed point being the Nash  𝑄value of the entire game (π‘Έβˆ—), i.e.,  H π‘π‘Žπ‘ β„Žπ‘Έβˆ— = π‘Έβˆ—.

Proof. Refer to Theorem 17 in [7] for a detailed proof. β–‘

We define a new operator  H 𝑀𝑇𝑀𝐹which is the Multi Type Mean Field operator. We differentiate this from the Nash operator used above. This operator is defined as (where  𝑸 β‰œ [𝑄1, . . . ,𝑄𝑁 ]),

image

The Multi Type Mean Field value function can be defined as v𝑀𝑇𝑀𝐹 (𝑠) β‰œ [𝑣1(𝑠), . . . , 𝑣𝑁 (𝑠)]. This is the value function obtained from Equation 21. Also,  𝒓(𝑠, 𝒂) = [π‘Ÿ1(𝑠, 𝒂), . . . ,π‘Ÿπ‘ (𝑠, 𝒂)]. Now using the same principle of Lemma 1 on the Multi Type Mean Field operator, we can show that the Multi Type Mean Field operator also forms a contraction mapping (additionally refer to Proposition 1 in Yang et al. [17]).

image

converges to a constant  𝑆with probability 1 (w.p.t 1) when

1) 0 ≀  𝛼 ≀ 1, �𝑑 𝛼 = ∞, �𝑑 𝛼2 < ∞. (29)

2) π‘₯ ∈ X ; |X | < ∞ (30)

where X is the set of possible states,

image

where  𝛾 ∈ [0, 1)and K is finite.

image

with constant  𝐾2 > 0and finite.

Here  F𝑑denotes the filtration of an increasing sequence of  𝜎-fieldsincluding the history of processes;  𝛼𝑑, Δ𝑑, 𝐹𝑑 ∈ F𝑑and  || Β· ||𝑀is a weighted maximum norm. The value of this constant  𝑆 = πœ“πΆ1+𝛼 |𝐾 |𝛼𝛽0,

where  πœ“ ∈ (0, 1)and  𝐢1is the value with which the scale invariant iterative process is bounded.  𝛽0is the scale factor applied to the original process.

Proof. This lemma follows from Theorem 1 in [8]. We provide the complete proof of this lemma in the Appendix C, highlighting the changes from the Theorem 1 in [8].

image

Theorem 3.4. When updating  𝑄 𝑗 (𝑠,π‘Žπ‘—, Β―π‘Žπ‘—1, . . . , Β―π‘Žπ‘—π‘€)according to Equations 20, 21, 22 and 23, for all agents  𝑗, the multi-agent𝑄 functionwill converge to a bounded distance of the Nash  𝑄function under the Assumptions 1, 2 and 3, expressed as,

image

where  𝑆 = πœ“πΆ1+𝛼𝛾 |𝐷 |𝛼𝛽0. Here  𝐷 = 12πΏπœ–, from Theorem 3.3. The joint

Nash  𝑄function is denoted as  π‘Έβˆ— = [𝑄1βˆ—, . . . ,π‘„π‘βˆ— ], where 𝑄 π‘—βˆ— denotesthe Nash Q-value of the agent  𝑗(value received by the agent  𝑗in a Nash equilibrium), and  𝑸𝑑 = [𝑄1𝑑 , . . . ,𝑄𝑁𝑑 ].

Proof. The proof of convergence of this theorem is structurally similar to that discussed by the authors in [7] and [17]. We provide the proof here, with changes necessitated by the multiple type case.

Note that in Assumption 3, Equation 24 corresponds to the global optimum and Equations 25 and 26 correspond to the saddle point. These assumptions are the same as those considered in [7]. Also note that the authors in [7] and [17] mention that Assumption 3 is a strong assumption to impose, which is needed to show the theoretical convergence, but this is not required to impose in practice.

From [7], we formally define the Nash operator  H π‘π‘Žπ‘ β„Ž as,

image

We need to apply Lemma 2 to prove Theorem 3.4. By subtracting π‘„βˆ—(𝑠, 𝒂)on both sides of Equation 20 and in relation to Equation 28,

image

where  π‘₯ β‰œ (𝑠𝑑, 𝒂𝑑)denotes the visited state-joint action pair at the time  𝑑.

In Theorem 3.3, we proved a bound for the actual  𝑄function and the Multi Type Mean Field  𝑄function. We apply that to Equation 34, to get the following equation for  Ξ”,

image

where  𝐷 = 12πΏπœ–.

The aim is to prove that the four conditions of Lemma 2 hold and that  Ξ”in Equation 35 converges to a constant  𝑆according to Lemma 2 and thus the MTMF𝑄function in Equation 35 converges to a point whose distance to the Nash Equilibrium is bounded. In Equation 28, 𝛼(𝑑)refers to the learning rate and hence the first condition of Lemma 2 is automatically satisfied. The second condition is also true as we are dealing with finite state and action spaces.

Let  F𝑑 be the 𝜎-field generated by all random variables in the history time  𝑑 - (𝑠𝑑, 𝛼𝑑,π‘Žπ‘‘,π‘Ÿπ‘‘βˆ’1, . . . ,𝑠1, 𝛼1, a1, Q0). Thus, Q𝑑is a random variable derived from the historical trajectory up to time  𝑑.

To prove the third condition of Lemma 2, from Equation 34,

image

From Lemma 1,  𝑭 π‘π‘Žπ‘ β„Žπ‘‘forms a contraction mapping with the norm  ||Β·||∞being the maximum norm on  𝒂. Thus, from Equation 35 we get,

image

Now, applying Equation 37 in Equation 36,

image

Since we are taking the max norm, the last two terms in the righthand side of Equation 38 are both positive and finite. We can prove that the term  ||𝐢𝑑 (𝑠𝑑, 𝒂𝑑)||converges to 0 w.p.1. The proof involves the use of Assumption 3 (refer to Theorem 1 in [17]). We use this fact in the last step of Equation 38. Hence, the third condition of Lemma 2 is satisfied. The value of constant  𝐾 = 𝛾|𝐷| = 𝛾| 12πΏπœ–|.

For the fourth condition we use the fact that the MTMF operator H 𝑀𝑇𝑀𝐹 forms a contraction mapping. Hence,  H 𝑀𝑇𝑀𝐹 Qβˆ— = Qβˆ—and it follows that,

image

In the last step, the left side of the equation contains the reward and the value function as the variables. The reward is bounded by Assumption 1 and the value function is also bounded by being updated recursively by Equation 21 (MTMF is a contraction operator). So we can choose a positive, finite  𝐾2such that the inequality holds.

Finally, with all conditions met, it follows from Lemma 2 that Δ𝑑converges to constant  𝑆w.p.1. The value of this constant is 𝑆 = πœ“πΆ1+𝛼𝛾 |𝐷 |𝛼𝛽0from Lemma 2 and using the value of  𝐾derived above. Therefore, from Equation 35 we get,

image

Hence, the multi-agent  𝑄function converges to a point within a bounded distance from the Nash equilibrium of the stochastic game. The distance is a function of the error in the type classification and the closeness of resemblance of each agent to its type. β–‘

We propose two algorithms based on Q-learning to estimate the Multi Type Mean Field  𝑄function for known and unknown types. The algorithms, denoted as MTMFQ (Multi Type Mean Field Qlearning), train an agent  𝑗to minimize the loss function  L(πœ™ 𝑗) =(𝑦𝑗 βˆ’ π‘„πœ™ 𝑗 (𝑠,π‘Žπ‘—,π‘Žπ‘—1, . . . ,π‘Žπ‘—π‘€))2. Here  𝑦𝑗 = π‘Ÿ 𝑗 + 𝛾𝑣MTMFπœ™ 𝑗_ (𝑠′)(from Eq. 20) is the target value used to calculate the temporal difference (T.D.) error using the weights  πœ™ 𝑗_. Now, the gradient is obtained as,

image

Algorithm 1 describes the Multi Type Mean Field Q-learning (MTMFQ) algorithm when agent types are known. In this algorithm, different groups of agents in the environment are considered as types. An agent models its relation to each type separately and ultimately chooses the action that provides maximum benefit in the face of competition against the different types. This is referred to as version 1 of MTMFQ. This algorithm deals with multiple types in contrast to MFQ described in the paper by Yang et al. [17]. In Line 8, each agent is chosen, and its neighbours are considered. The neighbours are classified into different types and in each type a new mean action is calculated (Line 9). In Lines 14 – 19, the Q networks are updated as done in common practice [13] for all the agents in the environment. At Line 12, the current actions are added to a buffer containing previous mean actions.

Version 2 of MTMFQ (see Algorithm 2) deals with the second type of scenario, where the type of each agent is unknown. An additional step doing K-means clustering is introduced to determine the types. Once we recognize the type of each agent, the algorithm is very similar to Algorithm 1. The clustering does not necessarily recognize the types correctly and the overall process is not as efficient as the known type case. But we will show that this way of approximate type determination is still better than using only a single mean field for all the agents. For the implementation we use neural networks, but it can be done without neural networks too. We only need a way to recursively update Equations 21, 22 and 23.

Note that the total number of types in the environment is unknown, and the agent does not need to guess the correct number of types. Generally, when more types are used, the approximate multi-agent  𝑄function will be closer to the exact Nash  𝑄function as shown by the bounds in Theorems 3.1, 3.2, 3.3 and 3.4. There is no risk of overfitting when using more types. In the limit, when there is one type per agent, we recover the exact multi-agent  𝑄function. The only drawback is an increase in computational complexity. The code repository will appear at https://github.com/BorealisAI/mtmfrl.

image

We report results with three games designed within the MAgents framework: the Multi Team Battle, Battle-Gathering and the Predator Prey domains. In the first two games, Multi Team Battle and the Battle-Gathering game, the conditions are such that the different groups (or teams) are fully known upfront. In the third game, the conditions are such that the types of the agents are initially unknown. Hence, the agents must also learn the identity of the opponent agents during game play. Multi Team Battle and Gathering are two separately existing MAgent games, which have been combined to obtain the Battle-Gathering game used in this paper. Predator Prey domain is also obtained from combining Multi Team Battle with another existing MAgent game (Pursuit). In preparation for each game, agents train for 2000 episodes of game play against different groups training using the same algorithm, which is referred to as the first stage. Next, in the second stage, they enter into a faceoff against other agents trained by other algorithms where they fight each other for 1000 games. We report the results for both stages. We repeat all experiments 50 times and report the averages.

image

As can be seen from the nature of the experiments, variances can be quite large across individual experimental runs.

In the first game (Multi Team Battle, see Figure 1(a)), there are four teams (different colours in Figure 1(a)) fighting against each other to win a battle. Here, agents belonging to one team are competing against agents from other teams and cooperating with agents of the same team. During the first stage, a team does not know how the other teams will play during the faceoff β€” so it clones itself to

image

Figure 1: Multi Team Battle Game Results.

image

Figure 2: Battle-Gathering Game Results.

three additional teams with slightly different rewards in order to induce different strategies. This is similar to self play. Each team receives a reward equal to the sum of the local rewards attributed to each agent in the team. The reward function is defined in such a way that it encourages local cooperation in killing opponents and discourages getting attacked and dying in the game. The reward function for different agent groups are also subtly different (refer to Appendix B for the details). Let us call each of these groups: Group A, Group B, Group C, and Group D. We maintain a notion

image

Figure 3: Predator Prey Game Results.

of favourable opponents and each group gets slightly higher rewards for killing the favourable opponents than the others. Four algorithms, namely MFQ [17], MFAC [17], Independent Q Learning (IL) [16], and MTMFQ are trained separately where each of these groups trains its own separate network using the same algorithm (all agents within a group train the same network). We start all the battles with 72 agents of each group for training and testing. Group A from MTMFQ, Group B from IL, Group C from MFQ, and Group D from MFAC enter the faceoff stage where they fight each other for 1000 games. Each game (or episode) has a maximum of 5000 steps and a game is considered to be won by the group/groups that have the most number of agents alive at the end of the game.

The results of the first training stage are reported in Figure 1(a). We report the cumulative rewards from all agents in Group A for each algorithm. These rewards are for Group A against the other 3 groups. Since the different groups are just clones of each other the reward curves for other groups are similar to that of Group A. In the training stage, the teams trained by different algorithms did not play against each other, but simply against the cloned teams trained by the same algorithm. At the beginning of training, for about 100 episodes the agents are still exploring, and their strategies are not well differentiated yet. As a result, MTMFQ’s performance is still comparable to the performance of other algorithms. At this stage, the assumption of a single type is fine. As training progresses, each group begins to identify the favourable opponents and tries to make a targeted approach in the battle. When such differences exist across a wide range of agents, the MTMFQ algorithm shows a better performance than the other techniques as it explicitly considers the presence of different types in the game. Overall, we observe that MTMFQ has a faster convergence than all other algorithms, and it also produces higher rewards at the end of the complete training. This shows that MTMFQ identifies favourable opponents early, but the other algorithms struggle longer to learn this condition. The MFAC algorithm is the worst overall. This is consistent with the observation by Yang et al. [17], where the authors give particular reasons for this bad performance, including the greedy in the limit with infinite exploration (GLIE) assumption and a positive bias of Q-learning algorithms. This algorithm is not able to earn an overall positive reward upon the complete training. Figure 1(c) shows the win rate of the teams trained by each algorithm (MTMFQ, MFQ, MFAC, and IL) in a direct faceoff of 1000 episodes. In the face off, each group is trained by a different algorithm and with a different reward function that induces different strategies. The only algorithm that handles different strategies among the opponent teams is MTMFQ, and therefore it dominates the other algorithms with a win rate of 60%. Figure 1(d) reinforces this domination.

The second domain is the Battle-Gathering game (Figure 2(a)). In this game, all the agent groups compete for food resources that are limited (red denotes food particles and other colours are competing agents) in addition to killing its opponents as in the Multi Team Battle game. Hence, this game is harder than the first one. All the training and competition are similar to the first game.

Figure 2(b) reports the results of training in the Battle-Gathering game. Like the Multi Team Battle game, we plot the rewards obtained by Group A while fighting other groups for each algorithm. Again, MTMFQ shows the strongest performance in comparison to the other three algorithms. The MFQ technique performs better than both MFAC and IL. In this game, MTMFQ converges in around 1500 episodes, while the other algorithms take around 1800 episodes to converge. The win rates shown in Figure 2(c) and the total rewards reported in Figure 2(d) also show the dominance of MTMFQ.

The third domain is the multiagent Predator Prey (Figure 3(a)). Here we have two distinct types of agents β€” predator and prey, each having completely different characteristics. The prey are faster than the predators and their ultimate goal is to escape the predators. The predators are slower than the prey, but they have higher attacking abilities than the prey. So the predators try to attack more and kill more prey. We model this game as an unknown type scenario where the types of the other agents in the competition are not known before hand (refer to Appendix B for more details). The MTMFQ algorithm plays the version with unknown types (Algorithm 2). Here we have four groups with the first two groups (Groups A and B) being predators and the other two groups (Groups C and D) being prey. Each algorithm will train two kinds of predator agents and two kinds of prey agents. All these agents are used in the playoff stage. In the playoff stage we have 800 games where we change the algorithm of predator and prey at every 200 games to maintain a fair competition. For the first 200 games, MTMFQ plays Group A, MFQ plays Group B, IL plays Group C, and MFAC plays Group D. In the next 200 games, MFAC plays Group A, MTMFQ plays Group B, MFQ plays Group C and IL plays Group D, and so on. We start all training and testing episodes with 90 prey and 45 predators for each group. Winning a game in the playoff stage is defined in the same way as the previous two games. Notice that this makes it more fair, as predators have to kill a lot more prey to win the game (as we start with more prey than predators) and prey have to survive longer. In this setup, the highly different types of agents make type identification easier for MTMFQ (as the types are initially unknown). The prey execute more move actions while the predators execute more attack actions. This can be well differentiated by clustering.

The results of the first training stage are reported in Figure 3(b). MTMFQ has comparable or even weaker performance than other algorithms in the first 600 episodes of the training and the reasoning is similar to the reasoning in the Multi Team Battle game (the agent strategies are not sufficiently differentiated for multiple types to be useful). Notice that for this game, MTMFQ takes many more episodes than the earlier two games to start dominating. This is because of the inherent hardness of this domain compared to the other domains (very different and unknown types). Similar to observations in the other domains, MTMFQ converges earlier (after around 1300 episodes as opposed to 1700 for the other algorithms). This shows its robustness to the different kinds of opponents in the environment. MTMFQ also gains higher cumulative rewards than the other algorithms. Win rates in Figure 3(c) show that MTMFQ still wins more games than the other algorithms, but the overall percentage of games won is less than the other domains. Thus, irrespective of the difficulty of the challenge we can see that MTMFQ has an upper hand. The lead of MTMFQ is also observed in Figure 3(d).

In this paper, we extended the notion of mean field theory to multiple types in MARL. We demonstrate that reducing many agent interactions to simple two agent interactions does not give very accurate solutions in environments where there are clearly different teams/types playing different strategies. We perform suitable experiments using MAgents and demonstrate superior performances using a type based approach. We hope that this paper will provide a different dimension to the mean field theory based MARL research.

One limitation of our approach is that it is computationally more expensive than the mean field reinforcement learning method without types. If we really have only one type in the environment, then our method would add more compute time and not necessarily produce a better result. As future work we would like to extend this work for completely heterogeneous agents with different action spaces as well. StarCraft is one example of such a domain. Our work would be well suited for this scenario as clustering would be even easier. Another approach would be to consider sub types, further dividing types.

[1] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath. 2017. Deep Reinforcement Learning: A Brief Survey. IEEE Signal Processing Magazine 34, 6 (Nov 2017), 26–38. https://doi.org/10.1109/MSP.2017.2743240

[2] Daan Bloembergen, Karl Tuyls, Daniel Hennes, and Michael Kaisers. 2015. Evolutionary dynamics of multi-agent learning: A survey. Journal of Artificial Intelligence Research 53 (2015), 659–697.

[3] Michael Bowling and Manuela Veloso. 2000. An analysis of stochastic game theory for multiagent reinforcement learning. Technical Report. Carnegie-Mellon Univ Pittsburgh Pa School of Computer Science.

[4] Lucian Bu, Robert Babu, Bart De Schutter, et al. 2008. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 38, 2 (2008), 156–172.

[5] Lucian Busoniu, Robert Babuska, and Bart De Schutter. 2010. Multi-agent Reinforcement Learning: An Overview. Vol. 310. Delft University of Technology, 183–221. https://doi.org/10.1007/978-3-642-14435-6_7

[6] Pablo Hernandez-Leal, Bilal Kartal, and Matthew E. Taylor. 2019. A survey and critique of multiagent deep reinforcement learning. Autonomous Agents

and Multi-Agent Systems 33, 6 (01 Nov 2019), 750–797. https://doi.org/10.1007/ s10458-019-09421-1

[7] Junling Hu and Michael P Wellman. 2003. Nash Q-learning for general-sum stochastic games. Journal of machine learning research 4, Nov (2003), 1039–1069.

[8] Tommi Jaakkola, Michael I. Jordan, and Satinder P. Singh. 1994. On the Convergence of Stochastic Iterative Dynamic Programming Algorithms. Neural Computation 6, 6 (1994), 1185–1201. https://doi.org/10.1162/neco.1994.6.6.1185

[9] James S Jordan. 1991. Bayesian learning in normal form games. Games and Economic Behavior 3, 1 (1991), 60–81.

[10] Jean-Michel Lasry and Pierre-Louis Lions. 2007. Mean field games. Japanese journal of mathematics 2, 1 (2007), 229–260.

[11] Eric Maskin. 1999. Nash equilibrium and welfare optimality. The Review of Economic Studies 66, 1 (1999), 23–38.

[12] David Mguni, Joel Jennings, and Enrique Munoz de Cote. 2018. Decentralised learning in systems with many, many strategic agents. In Thirty-Second AAAI Conference on Artificial Intelligence.

[13] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. Nature 518, 7540 (2015), 529.

[14] Thanh Thi Nguyen, Ngoc Duy Nguyen, and Saeid Nahavandi. 2018. Deep reinforcement learning for multi-agent systems: a review of challenges, solutions and applications. arXiv preprint arXiv:1812.11794 (2018).

[15] Richard S Sutton and Andrew G Barto. 1998. Introduction to reinforcement learning. Vol. 135. MIT press Cambridge.

[16] Ming Tan. 1993. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the tenth international conference on machine learning. 330–337.

[17] Yaodong Yang, Rui Luo, Minne Li, Ming Zhou, Weinan Zhang, and Jun Wang. 2018. Mean Field Multi-Agent Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning (Proceedings of Machine Learning Research), Jennifer Dy and Andreas Krause (Eds.), Vol. 80. PMLR, Stockholm, Stockholm Sweden, 5571–5580. http://proceedings.mlr.press/v80/yang18d.html

[18] Lianmin Zheng, Jiacheng Yang, Han Cai, Ming Zhou, Weinan Zhang, Jun Wang, and Yong Yu. 2018. Magent: A many-agent reinforcement learning platform for artificial collective intelligence. In Thirty-Second AAAI Conference on Artificial Intelligence.

This section provides an example of a simulated scenario where the Multi Type Mean Field algorithm is more useful than the simple mean field algorithm. Consider a game in which the central agent has to decide the

direction of spin. The domain is stateless. The spin direction is

influenced by the direction of spin of its A,B,C and D neighbours (four neighbours in total). Here we denote A as the neighbour to the left of the agent, B as the neighbour to the top of the agent, C as the neighbour to the right of the agent, and D as the neighbour to the bottom of the agent. The neighbour agents spin in one direction at random. If the agent spins in the same direction as both of its C and D neighbours, the agent gets a reward of -2 regardless of what the A and B are doing. If the spin is in the same direction as both of its A and B neighbours, the agent gets a +2 reward unless the direction is not the same as the one used by both of its C and D neighbours. All other scenarios result in a reward of 0. So the agent in Grid A in Figure 4 will get a -2 for the spin down (since the C and D neighbours are spinning down) and a +2 for a spin up (since the A and B neighbours are spinning up). In Grid B of Figure 4 the agent will get a -2 for spin up and a +2 for spin down. It is clear that the best action in Grid A is to spin up and the best action in Grid B is to spin down. Here we have a notion of multi stage game with many individual

stages. In each stage of a multi stage game, one or more players

take one action each, simultaneously and obtain rewards. Consider a sequence of stage games in which the agent gets Grid

A for every 2 consecutive stages and then the Grid B for the third

stage. The goal of the agent is to take the best possible action at all stages. Let us assume that the agent continues to learn at all stages, and it starts with Q values of 0. We apply MFQ (Equation 2) and MTMFQ (Equation 20) and show why MFQ fails, but MTMFQ succeeds in this situation. We are going to calculate the Q values for the 3 stages using both the MFQ (from [17]) and the MTMFQ update rules. We approximate the average action using the number of times the neighbourhood agents spin up. In the MTMFQ we use the A, B neighbours as the type 1 and the C, D neighbours as the

image

Thus, the agent will choose to spin up in the first stage (correct action). For the second stage,

image

Again the agent will choose to spin up in the second stage (correct action). For the third stage,

image

image

Figure 4: A counter example to show the failure of MFQ and success of MTMFQ.

image

Here again the agent will choose to make the spin up (wrong action). Now coming to MTMFQ updates, for the first stage,

image

Now it can be seen that the agent will spin down in this case (correct action).

Thus, the MFQ agent will make one wrong move out of 3 moves whereas the MTMFQ agent will make the right move all the time. In situations like these, where the relationship of the agent with different neighbour agents is different, the MFQ algorithm would fail. The differences would be captured by MTMFQ which would take more efficient actions. This shows an example where MTMFQ outperforms MFQ.

This section gives more details about the experimental conditions, especially the reward function.

For the Multi Team Battle domain, the agents in all of these groups get a reward of -0.01 for each move action, and a reward of -1 for dying. Attacking an empty grid has a reward of -0.1. Each of these members has a power of 10 and a damage of 2. When an agent loses all its powers, it dies. The power is like health that the agents maintain which gets depleted on being attacked. The agents can also recover (regain health points) from the attack using a step recovery rate. This is set to be 0.1. The positive rewards for attacking another agent is cyclic in nature with each group preferring to attack and kill a particular opponent group more than others. Group A has a positive reward of 0.2, 0.3, and 0.4 for attacking an agent of group B, C and D respectively, and it gets a reward of 80, 90, and 100 for killing a member of group B, C, and D respectively. Here we will call Group D as the favourable opponent of Group A as A gets more returns for fighting or killing D. The group B has a positive reward of 0.2, 0.3, and 0.4 for attacking a member of group C, D, and A respectively and a positive reward of 80, 90, and 100 for killing a member of Group C, D, and A respectively. Thus, Group A is the favourable opponent of Group B. Group C has a reward of 0.2, 0.3, and 0.4 for attacking Groups D, A, and B respectively. It gets a kill reward of 80, 90, and 100 for killing agents of Group D, A and B respectively. For Group D the order is Group A, B and C with a attack reward of 0.2, 0.3, and 0.4 and kill reward of 80, 90, and 100.

For the Battle-Gathering game, the reward function is similar with an addition of each agent getting a +80 for collecting a food resource. The food resources are stationary objects, but each food resource has a power similar to agents. This power has to be reduced by damage before the food can be captured. Capturing a food will constitute making repeated efforts to gain the food resource by attacking the grid containing food. The agents would get a +0.5 for making such an attack.

For the Predator Prey domain, the predators have a power of 10 and a speed of 2 with an attack range of 2 (they can attack within a distance of 2 units) and they get a dead penalty of -0.1 and an attack penalty of -0.2. The step recovery rate is 0.1. The prey have a faster speed of 2.5 and an attack range of 0. All agents get a reward of +5 for killing agents belonging to other groups. Additionally, Group A gets a reward of +0.5 for attacking a member of Group B (since Group B is also a predator, Group A does not prefer to attack that), +1 for attacking a member of Group C, +3 for attacking a member of Group D (though both Groups C and D are prey, A prefers D to C). Group B gets 0.5 for attacking A, but gets +1 for attacking D and +3 for attacking C. Group B prefers Group C to Group D. Every attack action entails a punishment for the (attack) receiver which is equal in value to the reward for the attacker.

This section explains the changed Lemma used in this paper compared to Theorem 1 in [8].

We state and prove a general theorem for a stochastic process before we begin the proof.

image

where  𝑛goes from 0 to  ∞, then the general solution of this process can be given by

image

where  𝑦0 = π‘Ž.

Proof. Consider the expression,

image

Z transforms will be applied to solve this expression. Z transform is a generalized version of Discrete Time Fourier Transform that transforms the variables into a new subspace where solving the equation is easier. Once we obtain a solution, we can apply inverse Z transforms to get the solution in terms of the original variables.

image

Considering  𝑝 β‰ 1 we can rewrite the Equation 45 as

image

Now taking the inverse Z transform

image

The above result is for the case where  𝑝 β‰ 1. If  𝑝 =1, see that the Equation 42 forms an arithmetic progression whose general term is  π‘Ž + 𝑛𝑑.

image

Now from Theorem 6.1, notice that if we want a general stochastic process of that form to converge we need the coefficient  𝑝 to bea fraction (as we take a limit to  ∞in the solution only a  𝑝 which isa fraction will give a converged result). Note that this result only needs  𝑑to be finite, and it can be any finite number. If we iterate to infinity then we can apply the limit to the solution which will converge to  𝑦𝑛 = 𝑑1βˆ’π‘ .

image

converges to zero w.p.1, if the following conditions are satisfied:

image

All the random variables are allowed to depend on the past  𝑃𝑛. 𝛼𝑛(π‘₯) and 𝛽𝑛(π‘₯)are assumed to be non-negative and mutually independent given  𝑃𝑛.

Proof. This is the same as Lemma 1 in [8]. The proof is based on that fact that we can divide the process  𝑀𝑛+1(both sides of Equation 48) by a large value  π‘Š (π‘₯) such that π‘Ÿπ‘›(π‘₯) β‰ͺ π‘Š (π‘₯). Nowthe Equation 48 is effectively reduced to

image

This is because of the conditions 1 and 2 which guarantees that  𝛽𝑛is a fraction and that the variance of the process  π‘Ÿπ‘›is finite.

Now if you consider Equation 50, the update is in such a way that the process is equal to a fraction of its previous value. Thus, this process converges to 0 w.p.1.

image

where  𝐺𝑛is a sequence of functions and  π‘Œπ‘›is a random process. Let (Ξ©, F, P)be a probability space. If the following are satisfied: 1) The process is scale invariant. That is w.p.1 for all  πœ” ∈ Ξ©

image

2) If we can keep  ||𝑋𝑛||bounded by scaling the process then  𝑋𝑛would converge to a constant  𝐷w.p.1 under condition 1.

The original process will converge to a constant  𝐷1 = 𝐷𝛽0 where 𝛽0is the scaling factor that was applied to  ||𝑋𝑛||.

Proof. The intuition of the proof is that we have a process that starts at a value and its first difference reduces with time till the value stabilizes at a point. Now, if this process is invariant to scaling, we can start the process with a small value (after scaling by a small fraction  𝛽0) and then we can select a constant over which the  ||𝑋𝑛||should not increase (we can scale the whole process if it goes above that constant). Now according to the second condition the bounded process should converge to a constant  𝐷. Here the relation is  𝐷 ≀ 𝐢.To show that the net effect of the corrections must stay finite w.p.1, note that if  ||𝑋𝑛||converges then for any  πœ– >0 there exists  π‘€πœ–such that  ||𝑋𝑛|| ≀ 𝐷 ≀ 𝐢for all  𝑛 > π‘€πœ–with probability at least 1  βˆ’πœ–. This implies that the norm of the original process does not go above  𝐢 after π‘€πœ–. Thus, convergence of  ||𝑋𝑛||to constant  𝐷 wouldthen imply that the scaled version of the original process converges to the same constant  𝐷w.p.1 under the bound. Now if we remove the scaling factor the convergence point of the original process is

𝛽0 . β–‘

image

Proof. This is simply an application of Lemma 4.

According to condition 5, we have a process that is scale invariant. Let us assume that we applied a scaling factor of  𝛽0to that process to get the bound as in Equation 53. Now, let us consider the iterative process

image

Equation 54 is linear in  |𝑋𝑛(π‘₯)|and will converge to a point by Theorem 6.1 w.p.1, to some  𝑋 βˆ—(π‘₯), where  ||𝑋 βˆ—|| ≀ 𝐸1, where  𝐸1is a finite arbitrary constant. From Theorem 6.1 we can see that Equation 54 converges to a constant and hence, Lemma 4 can be applied. The convergence point will be 𝐾+𝛾𝛽𝐢1𝛼from Theorem 6.1 for Equation 54. If we change the value of the bound  𝐢1then the convergence point will change accordingly. To get the convergence point of the original process, we reapply the scaling factor. Thus, we get that point to be 𝐾+𝛾𝛽𝐢1𝛼𝛽0 .

image

converges to a constant  𝐷w.p.1 under the following conditions:

image

3)  || E{𝐹𝑛(π‘₯)|𝑃𝑛, 𝛽}||𝑀 ≀ 𝛾||Δ𝑛||𝑀 + 𝐾, where 𝛾 ∈ (0, 1) and 𝐾 isfinite.

image

stands for the past at step  𝑛. 𝐹𝑛(π‘₯)is allowed to depend on the past. 𝛼and  𝛽are assumed to be non negative. The notation ||.|| refers to some weighted maximum norm.

The value of this constant  𝐷 = πœ“πΆ1+𝛽 |𝐾 |𝛼𝛽0 where πœ“ ∈ (0, 1) and 𝐢1is the constant with which the iterative process is bounded.  𝛽0is the scaling factor that was applied to the original process.

Proof. Be defining  π‘Ÿπ‘›(π‘₯) = 𝐹𝑛(π‘₯) βˆ’ E{𝐹𝑛(π‘₯)|𝑃𝑛, 𝛽}we can decompose the iterative process into two parallel processes given by

image

where  Δ𝑛(π‘₯) = 𝛿𝑛(π‘₯) + 𝑀𝑛(π‘₯). Dividing both the sides of Equation 55 by  𝛽0for each  π‘₯and denoting  𝛿′𝑛(π‘₯) = 𝛿𝑛(π‘₯)/𝛽0, 𝑀 ′𝑛(π‘₯) =𝑀𝑛(π‘₯)/𝛽0and  π‘Ÿ ′𝑛(π‘₯) = π‘Ÿπ‘›(π‘₯)/𝛽0we can bound the  𝛿′𝑛process by condition 3.

Now we can rewrite the equation pair from condition 3 as

image

Let us assume that the  Δ𝑛process stays bounded. Then the variance of  π‘Ÿ ′𝑛(π‘₯)is bounded by some constant  𝐢and thereby  𝑀 ′𝑛converges to zero w.p.1 according to Lemma 3. Hence, there exists 𝑀such that for all  𝑛 > π‘š, ||𝑀 ′𝑛|| < πœ–with probability at least 1  βˆ’ πœ–.This implies that the  𝛿′𝑛process can be further bounded by

image

with probability  > 1 βˆ’ πœ–. If we choose  𝐢such that  𝛾(𝐢 + 1)/𝐢 ≀1 then for  ||𝛿′𝑛|| > πΆπœ–

image

Note that in the above relation we do not need the term 𝐢+1𝐢to be less than 1. We only need the product of this term with  𝛾to be less that one. Let us represent  𝐹 = (𝐢 + 1)/𝐢. Now rewriting Equation 57 we get the following bound,

image

Let us bound the norm by  𝐢1. Then we get the bound as

image

Let us assume that  𝛿𝑛is scale invariant (we prove that below). Now we can apply Lemma 5 as this satisfies all the conditions of Lemma 5. The original process converges to a constant  𝐷w.p.1. Again according to Lemma 5 this constant value is  𝐷 = 𝛾𝛽𝐹𝐢1+𝛽 |𝐾 |𝛼𝛽0

where  𝛽0is the factor with which the original process was scaled. Let us denote a new fraction  πœ“ = 𝛾𝛽𝐹. Thus, the value of  𝐷 =

𝛼𝛽0. Now, this guarantees w.p.1 convergence of the original process under the boundedness assumption if that process is scale invariant. Let the constant  𝐷1 = 𝛾𝛽𝐹𝐢1+𝛽 |𝐾 |𝛼be the point Equation 60 converges to according to Theorem 6.1. Since, the value of 𝐷1is very small for a small  𝐾(we will show that  𝐾is small in the application) and a small  𝐢1(since  𝐢1is arbitrary, we can choose a small  𝐢1), we can get  |𝛿𝑛| β‰ˆ 𝛿𝑛.

Now we prove the scale invariance condition same as Jaakola et al. [8]. By Condition 4,  π‘Ÿ ′𝑛(π‘₯)can be written as  (1+||𝛿𝑛 +𝑀𝑛||)𝑠𝑛(π‘₯),where  𝐸{𝑠2𝑛(π‘₯)|𝑃𝑛} ≀ 𝐢. Let us now decompose  𝑀𝑛 as 𝑒𝑛 + 𝑣𝑛 with

image

and  𝑣𝑛converges to zero w.p.1 by Lemma 3. Again by choosing  𝐢such that  𝛾(𝐢 + 1)/𝐢 <1 we can bound the  𝛿′𝑛 and 𝑒𝑛processes for ||𝛿′𝑛 + 𝑒𝑛|| > πΆπœ–. The pair (𝛿′𝑛,𝑒𝑛)is then a scale invariant process whose bounded version was proven earlier to converge to  𝐷 w.p.1.This proves the w.p.1 convergence of the triple  𝛿′𝑛,𝑒𝑛, 𝑣𝑛 boundingthe original process.

image


Designed for Accessibility and to further Open Science