Prosocial learning agents solve generalized Stag Hunts better than selfish ones

2017·Arxiv

Abstract

Abstract

Deep reinforcement learning has become an important paradigm for constructing agents that can enter complex multi-agent situations and improve their policies through experience. One commonly used technique is reactive training - applying standard RL methods while treating other agents as a part of the learner’s environment. It is known that in general-sum games reactive training can lead groups of agents to converge to inefficient outcomes. We focus on one such class of environments: Stag Hunt games. Here agents either choose a risky cooperative policy (which leads to high payoffs if both choose it but low payoffs to an agent who attempts it alone) or a safe one (which leads to a safe payoff no matter what). We ask how we can change the learning rule of a single agent to improve its outcomes in Stag Hunts that include other reactive learners. We extend existing work on reward-shaping in multi-agent reinforcement learning and show that that making a single agent prosocial, that is, making them care about the rewards of their partners can increase the probability that groups converge to good outcomes. Thus, even if we control a single agent in a group making that agent prosocial can increase our agent’s long-run payoff. We show experimentally that this result carries over to a variety of more complex environments with Stag Hunt-like dynamics including ones where agents must learn from raw input pixels.

1 Introduction

Constructing agents which can make good decisions in complex environments is a cornerstone of modern artificial intelligence research. An increasingly popular paradigm for this task is deep reinforcement learning (deep RL, Kaelbling et al. (1996); Sutton and Barto (1998); Silver et al. (2016); Mnih et al. (2015)). Recent work has become interested in using deep RL to construct agents that can function well in environments which include other agents (Silver et al., 2016; Lazaridou et al., 2017; Das et al., 2017; Lowe et al., 2017; Foerster et al., 2017b; Evtimova et al., 2017; Foerster et al., 2016; Havrylov and Titov, 2017; Lewis et al., 2017; Heinrich and Silver, 2016; Peng et al., 2017; Leibo et al., 2017). In many of these examples deep RL been applied by the method of reactive training where the learner treats other agents as part of the environment and optimizes its own reward. Reactive training works very well in zero-sum situations, where for one agent to gain the other must lose (Silver et al., 2016; Wu and Tian, 2016; Neumann, 1928). However, for general-sum games good outcomes are far from guaranteed (Sandholm and Crites, 1996; Leibo et al., 2017; Lerer and Peysakhovich, 2017; Foerster et al., 2017a). In this paper we will study a simple modification to independent learning that can be applied to a single agent out of a dyad and can lead to better outcomes in a class of coordination games.

As with the works above we are interested in situations where we construct an agent that will go into an initially poorly understood environment (which we refer to as a game). Our agent must learn from its experiences to update its policy and maximize some scalar reward. However, there will also be other agents which we do not control and will also learn from their experiences using a form of reactive learning. Our starting point will be standard independent multi-agent reinforcement learning

(independent MARL). In independent MARL each agent treats the other agents simply as a fixed part of the environment.1

If all agents use independent MARL, and if the system converges to deterministic strategies, then it will have converged to a Nash equilibrium of the game (Fudenberg and Levine, 1998) - a set of strategies such that no player has incentive to unilaterally deviate.2 In general-sum games there can be multiple equilibria and thus initial conditions can alter the long-run payoffs that our agent (and others involved) can earn. In addition, in some classes of games many different kinds of dynamics can converge on equilibria with low payoffs even though high payoff equilibria exist (Kandori et al., 1993; Fudenberg and Levine, 1998). We will focus on one such example: Stag Hunt games.

In the simple, matrix-form, two-player Stag Hunt each player makes a choice between a risky action (hunt the stag) and a safe action (forage for mushrooms). Foraging for mushrooms always yields a safe payoff while hunting yields a high payoff if the other player also hunts but a very low payoff if one shows up to hunt alone. There are two Nash equilibria: either both players show up to hunt (this is called the payoff dominant equilibrium) or both players stay home and forage (this is called the risk-dominant equilibrium Harsanyi et al. (1988)).

In the Stag Hunt, when the payoff to hunting alone is sufficiently low, dyads of independent learners as well as evolving populations converge to the risk-dominant (safe) equilibrium (Kandori et al., 1993; Nowak, 2006; Fudenberg and Levine, 1998; Matignon et al., 2012). The intuition comes from the fact that even a slight amount of doubt about whether one’s partner will show up causes an agent to choose the safe action. This in turn causes partners to be less likely to hunt in the future and the system trends to the inefficient equilibrium.

Failure to coordinate on good outcomes in the Stag Hunt (and related games) has been extensively studied outside of MARL in many fields, particularly in the social and behavioral sciences (Kandori et al., 1993; Carlsson and Van Damme, 1993; Van Huyck et al., 1990; Camerer, 2003; Yoshida et al., 2008) because the Stag Hunt is a simple example of a more general type of coordination dynamic which is ubiquitous in real world interactions. For this reason how agents can avoid the spiral to the inefficient equilibrium is a problem of interest across many fields.

Existing work in RL has proposed modifications to independent MARL to increase the probability of the system converging to a preferred equilibrium (Littman, 2001; Babes et al., 2008; Devlin and Kudenko, 2011; Devlin et al., 2011; Kapetanakis and Kudenko, 2002; Yoshida et al., 2008; Panait et al., 2008; Matignon et al., 2012; Wang and Sandholm, 2003). One prominent set of work has investigated endowing agents with a form of optimism, eg. lenient learning (Panait et al., 2008; Matignon et al., 2012) or Frequency Maximum Q-learning (Kapetanakis and Kudenko, 2002), while another has focused on potential-based reward shaping (Babes et al., 2008; Devlin and Kudenko, 2011; Devlin et al., 2011). We will be interested in building upon ideas from the literature on reward shaping and applying these insights to situations beyond those typically considered. In particular, we are interested in environments where we only control a single agent, where decisions are not single actions but rather temporally extended sequences, and where good policies may need to be learned using function approximation (eg. directly from raw pixels).

Reward shaping methods change the updating rule of all agents in the system using a potential function which encodes heuristic knowledge about ‘good’ states. If we know the structure of the game, we can choose an appropriate potential function (Babes et al., 2008; Devlin and Kudenko, 2011; Devlin et al., 2011). A special property of Stag Hunt games is that changing both agents to optimize the joint reward changes the basin of attraction of the payoff-dominant equilibrium Yoshida et al. (2008). We can restate this idea as making agents prosocial: that is, changing their utility function to care not only about their reward but also their partner’s rewards .

Our first contribution is an extension of existing results to the case where we control a single agent. We show analytically that in Stag Hunts we can improve the probability of convergence to a good equilibrium even if we can only make a single agent prosocial. We extend this result to N strategy 2 player games as long as any sub-game is a Stag Hunt. This occurs because in Stag Hunts equilibria can be ranked by their payoffs and both agents agree on this ranking, thus, causing one agent to ‘learn towards’ one equilibrium causes others to follow.

We then begin to probe some of the limitations of the analytical result experimentally. In particular, we ask what happens when Stag Hunts are played by more than 2 agents. We focus on two studied extension of the Stag Hunt: games on networks (Jackson, 2010; Morris, 2000; Chen et al., 2009) and multi-player ‘weak link’ games (Van Huyck et al., 1990; Camerer, 2003; Van Huyck et al., 1991).

Our second contribution is to experimentally generalize these insights in a domain where analytical solutions are difficult: Markov games with Stag Hunt-like structure and learning via function approximation. We consider a set of two-player grid based games as well as an Atari game and show that in each of these cases adding prosociality helps policy gradient to converge to good outcomes. Each group of agents either converges to always playing the the safe or risky equilibrium, so improving the probability of converging to a good outcome also improves the long run payoff of the prosocial agent. Thus even in complex Stag Hunt-like games, even if we control just a single agent, and even if we only care about that agent’s payoff it can still be better to make that agent prosocial in the long-run.

2 Risk Dominance, the Stag Hunt, and Equilibrium Selection

We begin by illustrating the intuition behind our results with the simple matrix Stag Hunt. In the Stag Hunt players simultaneously decide to either take a risky option (Hunt the Stag) or a safe option (Forage). Foraging always yields a safe payoff, whereas Hunting yields a higher payoff if the other person also hunts but yields a bad payoff if the player shows up to hunt alone (because the Stag gores her). This is illustrated in the payoff matrix below (rows represent strategy choices for player 1, columns represent strategy choices for player 2, entries represent payoffs to each player as a function of strategies chosen):

Definition 1 A generalized Stag Hunt if

Let be the action spaces of the players and let be the reward players receive from a pair of actions. We consider the set of stable points in this game:

Definition 2 Nash equilibria are strategy pairs such that for any

In a Nash equilibrium, neither agent has incentive to deviate from their chosen strategy given what the other agent is doing. For this reason, they are possible convergence points of a learning process. There are two (non-randomized) equilibria here: both choose to Hunt or both choose Forage. Hunting yields higher payoff for both agents (is the payoff-dominant equilibrium) thus one may hope that learning agents converge to Hunt. However this may not always be true.

The generalized Stag Hunt conditions imply two things. First, that (Hunt, Hunt), (Forage, Forage) are the equilibria and (Hunt, Hunt) is payoff dominant. The second part of the condition implies that the coordination game has a sort of asymmetric risk, if one player chooses Hunt but the other chooses Forage, only the Hunt player loses. Thus Forage is a safe action that does well no matter what the other chooses whereas Hunt is a risky action that only works if the other person also hunts.

We can now consider which equilibrium is more likely to be selected by a general class of learning dynamics. This example captures much of the intuition behind existing results about why dynamics often converge to safe outcomes (Kandori et al., 1993; Nowak, 2006; Fudenberg and Levine, 1998) as well as the basic idea behind why certain modifications to independent MARL (eg. Devlin and Kudenko (2011); Devlin et al. (2011); Kapetanakis and Kudenko (2002); Yoshida et al. (2008); Panait et al. (2008); Matignon et al. (2012)) can help drive the system towards payoff dominant equilibria.

We assume that agents play the game repeatedly (either with each other, themselves in a ‘self-play’ form or with a population of others).3 Agents start with beliefs which are the probabilities they expect their partner to choose Hunt. Agents best respond to this belief. Having observed the choice of their partner, they update their beliefs in the correct direction (that is, if the partner chose Hunt, p goes up). This means that agents choose Hunt if We can calculate the critical value of such that if the agent chooses Hunt otherwise they choose

This tells us something about the basins of attraction of each equilibrium. If both agents have beliefs above then they will converge to (Hunt, Hunt) - this is because they will play Hunt this period and only increase their beliefs and thus they will play Hunt next period and so on. A similar argument holds if both agents start below . If one agent is above and the other is below then the specifics of the learning rule come into play and limit behavior depends on which of the two attractor regions they reach first. Note that what matters for the computation of are not just h and m but also the off-equilibrium payoffs This is illustrated in figure 1. Interestingly, in games with more than two strategies the basins of attraction of various equilibria can be informed even by the presence of strategies which are never played in any equilibrium as well as the exact properties of the dynamics in question (Panageas and Piliouras, 2016; Nowak, 2006; Kandori et al., 1993).

With this notion of basin of attraction in mind, we ask whether modifying our agents while keeping the game constant can shift the basin of attraction. We consider a type of agent with social preferences:

Definition 3 A prosocial agent’s total utility from an outcome is given by

We refer to as the agent’s level of prosociality. agents are perfectly selfish, agents are fully prosocial (weigh their and their partner’s utility equally) and agents are selfless.

We now show our main analytical result as well as a corollary which motivate our experiments. We relegate the proofs to the Appendix.

Theorem 1 In a generalized Stag Hunt the size of the basin of attraction of (Hunt, Hunt) increases in both players’ level of prosociality. There exists such that if either agent has unique interior attractor is (Hunt, Hunt).

Corrolary 1 In any symmetric game where the only pure equilibria are symmetric and the payoff matrix subset to any pair of strategies is a generalized Stag Hunt there exists that if for either player the unique convergence point is the payoff dominant equilibrium.

Figure 1: Belief-based learning dynamics in the Stag Hunt game depend on initial states as well as all entries of the payoff matrix, this same intuition applies to RL-based dynamics which simply use values of actions.

Theorem 1 and its Corollary implies that in generalized Stag Hunts, any agent can increase their (expected) payoff after convergence by increasing their . However, in games with more complex strategy spaces where some aspects of the game do not obey the Stag Hunt inequality, increasing may be counterproductive. We discuss this in more detail in the conclusion. When is close to 1 the agent may learn a policy that is both bad for itself and socially inefficient. Considering at least guarantees that the agent will only choose to hurt its own payoffs if the resulting outcome is efficient (increases the sum of the payoffs). Thus we mostly think about prosocial, but not selfless, agents in practice and look at in our experiments.

3 Experiments

3.1 Matrix Stag Hunt

Our theorem implies that if we endow agents (and more importantly, even one agent) with prosocial preferences then dynamics should now more reliably select the payoff dominant equilibria in matrix games. The relative improvement depends on the learning rule used and details of the payoff matrix, so we begin by studying policy gradient in the Stag Hunt. We set h (the payoff from joint hunting) to 2, m and c (the payoff from foraging regardless of what the other person is doing) to 1 and vary the g (hunting alone and getting gored by the stag) payoff as well as the prosociality of both players. We train policy gradient based agents with softmax action implementation. For optimization we use Adam (Kingma and Ba, 2014) with a learning rate of .01 as well as random initialization. We train each pair of agents for 400 rounds and look at average behavior in the last 50 rounds. We train 300 replicates and average them together.

We see from figure 2 (left) that both g and social preferences affect convergence to the payoff dominant equilibrium. There are situations where we control both agents (eg. in using self play to find good policies). In this case, we see that adding social preferences to both agents can help find good optimal equilibria but actually in the dyadic case having a single agent with prosocial preferences is almost as efficient as having both be prosocial.

3.2 Prosociality With More than 2 Players

The results above show that a single prosocial agent can help in the matrix, dyadic Stag Hunts. An important question is how robust these results are to changes in the game complexity as well as the number of players. We move to Stag Hunts with more than 2 agents played on networks.

There is a large literature about coordination games on networks that looks at questions including which agents should be ‘seeded’ with an initial behavior to get it to spread on a network (Jackson,

Figure 2: In the Stag Hunt, selfish agents trained reactively with policy gradients fail to do well but the addition of social preferences to even one of the agents helps guide convergence to a good outcome though does not guarantee it. Expanding to Stag Hunts beyond dyads shows us that sometimes prosociality of a single agent can greatly affect outcomes, such as in the case of the central agent on a star network, but other times even many prosocial agents may not help, such as in the weak-link game or in the fully connected network Stag Hunt.

2010; Morris, 2000; Chen et al., 2009). A major takeaway from this literature is that seeding more central agents leads to larger cascades or higher probability of diffusion.

We ask whether a similar relationship holds for the question of making subsets of agents prosocial. We consider the standard coordination games on graphs setup (Jackson, 2010): agents live on a graph, in each round each agent chooses an action from Hunt or Forage and plays the Stag Hunt with all agents that it is connected to using that single action. The agent receives the average (or total) payoff from these games and updates its strategy for the next iteration. We use the same algorithm, parameters, and analysis as the dyad Stag Hunt above. We consider 5 agents playing on two graphs: either a fully connected graph or a star network. We vary the loss to hunting alone as well as how many (and which) agents we make prosocial.

We also consider the weak-link game which is a type of multiplayer Stag Hunt (Van Huyck et al., 1990; Camerer, 2003; Van Huyck et al., 1991). Here there are 5 players. Each simultaneously chooses an effort level Players pay a linear cost of effort. This effort goes into a project and the project’s success is determined by the weakest link, that is the player who put in the least effort. At the end of the round each agent receives a payoff given by This game has Stag Hunt properties because the best equilibrium is one where all agents choose to put in maximum effort but any identical effort levels form equilibria. We apply policy gradient with 5 players and vary A (which can be thought of as highly related to the risk parameter) as well as agent prosociality. Note that here we have a choice of how to do prosocial payoffs: we can either combine the agent’s reward with the average of everyone’s reward (which corresponds to the agent weighing their own utility against some total ‘global utility’) or use the sum of everyone’s reward (which would cause the prosocial agent to be fully utilitarian but with a large number of other agents would cause the agent to ignore their own reward for the common good). For the reported experiments we use the average and run 100 replicates.

The games and the results are shown in Figure 2. We see that in the case of the star network making just making the center agent prosocial can greatly increase the probability that the whole group

Figure 3: Markov games that have complicated policy spaces but preserve high level properties of the Stag Hunt. In each game agents have access to safe policies (which yield low rewards but work no matter what the other agent does) or high reward policies that require coordination with the other agent and can lead to very poor payoffs if the coordination fails.

converges to the payoff dominant outcome. However, in the fully connected network the power that an individual agent has over where the group converges is quite limited. In the weak-link game even making all agents prosocial and making the payoff multiplier high still results in poor coordination on the best equilibrium. This shows that in more complex situations many aspects of the underlying game can greatly change the ability of a single (or few) prosocial agents to affect outcomes.

3.3 Markov Games with Stag Hunt Properties

We now consider a set of more complex two-player Stag Hunt-like games. Each of these games we model using the framework of Markov games. In each of these cases we apply a policy gradient with state representation constructed using a deep convolutional network with inputs as the raw board pixels. Because we use standard setups from the literature we put the details of the model architecture/training in the Appendix.

We consider 3 different grid-world games: Markov Stag Hunt, Harvest and Escalation (Figure 3). In each of these games two agents move on a grid and can move in any of the 4 cardinal directions. In the Markov Stag Hunt the grid is populated with a Stag and 2 plants. Moving over a plant gives either agent 1 point and causes the plant to disappear and re-appear in another part of the board. Moving over the Stag causes an agent to lose g points but if both agents move over the stag simultaneously they each gain 5 points and the stag disappears and re-appears randomly elsewhere on the board. In each time period the stag moves towards the closest agent to it, although the stag can never catch an agent who continues to move away from it.

Like in the standard Stag Hunt, there are two equilibria here - either both agents try to get the stag or they both try to stay away from it and pick up plants. There is asymmetric risk, if one chooses to try to hunt while the other avoids, the hunter will suffer the consequences.

In the Harvest game at each time step a plant can appear randomly somewhere on the board (up to 4 plants can be on the board at a time). Each plant is born ‘young’, then every time step turns ‘mature’ with probability . While a plant is mature it can die on each time step with probability The probabilities are always selected such that each plant lives for 20 time steps in expectation.

Players can move over plants to pick them up. Players receive 1 point if they up a young plant, however waiting until each plant becomes mature and picking it up yields 2 points to both players. Thus the coordination question is whether to wait or rush for the plants. Choosing to wait is an equilibrium that yields high payoffs for both players. However, just like stag hunting it incurs a risk that one’s partner loses their nerve and grabs the young plant.

Figure 4: We see that the general pattern from the Stag Hunt generalizes to the Markov games. Increases in risk lead to convergence to equilibria with less coordination while giving prosocial preferences to agents increases the dyad’s ability to coordinate on good outcomes. The weakest effect exists in our Markov Stag Hunt as well as Harvest with extremely high risk. Lines reflect average payoffs over replicates smoothed over 1000 episode blocks. Error bars reflect standard errors estimated using independent replicates.

In the Coordinated Escalation game a special marker appears on one of the squares. If the agents step on the square together, they both receive one point, at which point an adjacent square lights up. If the agents step together onto the next square, they receive 1 point. If at any time an agent breaks the streak (eg. by stepping off the path), the other agent receives a penalty of some multiplier times the current length T of the streak and the game ends. The current streak length T is observed (encoded in the state). This game has many equilibria, where both agents play to keep streaks of size T but no more with risk escalating at each time step (as the reward from further escalation is always 1 but the cost from one’s partner failing to continue the pattern increases linearly).

We also consider a version of Escalation where agents must learn from raw pixels. We use methods employed in Tampuu et al. (2017) to adapt Atari Pong to construct Escalation Pong. In Escalation Pong each agent controls a player, each time the ball is hit both agents receives a reward of 1, however if an agent drops the ball (allows it to pass) then the other agent receives a reward of where k is proportional to the number of times the ball has been hit back and forth. The game can end stochastically at any time period and also ends when any player drops the ball.

We again compare the performance (here in terms of payoff to our agent) from situations where both agents are selfish, both agents are prosocial, and where only our agent is prosocial. In all conditions, both agents start with randomly initialized policies and learn by playing with each other. Figure 4 shows that the intuition from the coordination game replicates in these more complex environments: giving agents social preferences can help lead to coordination on payoff-dominant strategies. Both the risk and whether one or both agents have social preferences plays an important role.

While the broad trends are replicated across all four games, Figure 4 also emphasizes that the exact effect of prosociality is dependent on the details of the particular game. For example, in both Escalation games, but not in Markov Stag Hunt or Harvest, payoffs vary quite smoothly with the risk and prosociality. The reason is that in the Escalation games, there is a large set of joint policies which can be ordered (ie. coordinate for longer) and have increasing payoff and risk while in Harvest and Markov Stag Hunt the joint policies of interest are either those which always or never hunt/wait.

We also observe that single-agent prosociality can provide benefit in Harvest but little in Markov Stag Hunt. The reason is the relative basin of attraction of the two strategies for a single player. In the Harvest game, initial random policies have a good chance of stumbling upon corn in the ‘Old’ phase, especially if the corn spends a high fraction of time in that phase. If one agent waits, its easy for the other agent to learn to wait.

However, in the Markov Stag Hunt, the stag chases the agent closest to it and for random exploration to find good strategies the agent who is not being chased has to wander upon the stag. Thus it is more likely that, during initial training, the agent which is being chased steps on the stag by random exploration and will not be coordinated with its partner. In this case even a prosocial agent will learn to avoid the stag. If both agents are prosocial however, they are both encouraged to play H to avoid their partner being gored.

These games illustrate that the basin of attraction for equilibria in a game with more than two strategies are influenced not only by the payoffs of the two equilibria, but also the payoffs of all the non-equilibrium policies. In the Harvest game the fraction of time in the young stage does not affect the payoff to either equilibrium but it does change the payoffs of ‘random’ policies that pick up a mixture of young and old corn proportional to their lifetime. Similarly we observed that in a version of the Stag Hunt where the stag runs away from the agents then even for reasonably large gore penalties the agents never learn H because during exploration they are unlikely to both step on the stag at the same time. This further underscores a point made by Leibo et al. (2017) that the ‘difficulty’ of discovering various strategies can affect equilibrium selection and furthermore that this may be dependent not only on the underlying game itself but on the function approximation used in our RL algorithm.

4 Conclusion

Game theoretic properties of multi-agent systems can hamper the ability of RL methods to converge to payoff-dominant policies even when such policies are equilibria. We have shown that prosocial preferences applied to even a single agent can increase the size of the basin of attraction in matrix games and that this extends to complex Markov games with function approximation.

Prosociality can have benefits but it can also carry costs. First, if the game (or some part of the game) is not a Stag Hunt, then prosociality for a single agent may introduce new equilibria that are suboptimal for that agent. For example, in a social dilemma a prosocial agent may be exploited by its partner.6 Second, in typical environments where an agent’s actions only weakly influence on the payoffs of others, prosocial rewards increase the variance of observed rewards and thus may make reinforcement learning algorithms converge more slowly. Finally, while we assume that the agent’s partner is sufficiently adaptive to learn H, it is better for the agent to play selfishly if the partner is not sufficiently adaptive.

We have focused on prosociality because it is a simple way to modify our agent by changing the reward function. In the related field of social dilemmas other work focuses on modifying agent’s models of the environment to include the fact that other agents are also learning and so our actions today can lead to different policies for our partner tomorrow (Babes et al., 2008; Lerer and Peysakhovich, 2017; Foerster et al., 2017a). A recent contribution to such ‘learning shaping’ uses an explicit model of a partner’s learning process included in the objective function of the agent (Foerster et al., 2017a). Such an approach has been shown to work in simple social dilemmas and so exploring it further, in particular in combination with simpler heuristic methods, is an attractive future direction.

Off-equilibrium payoffs are an important criterion for determining the basin of attraction of multi-agent optimization. However, an equally important criterion which we have not analyzed is function approximation. Deep RL does not work on the state space of a problem directly but rather approximates that state space using neural network techniques (Silver et al., 2016; Mnih et al., 2015). The architectures of these networks can affect the relative complexity and thus relative likelihood of different policies. This, in turn, can affect the basins of attraction for various equilibria (Leibo et al., 2017; Tamar et al., 2016). Good choices of priors (ie. model architectures) have allowed AI to make great strides in fields like computer vision and prosocial important direction in multi-agent systems is further exploring good architectures for learning in these conditions.

We close with a practical question: how can we construct agents that can coordinate productively with humans? It has been shown that humans often fail to coordinate on payoff-dominant equilibria (Van Huyck et al., 1990; Camerer, 2003) but that artificial agents can be added to help groups coordinate better Shirado and Christakis (2017). We have shown that learning algorithms must take into account the learning behavior of other agents if they are to learn to coordinate in multi-agent environments. Understanding how to design human-machine hybrid systems in ways that properly account for human psychology (Crandall et al., 2017; Kleiman-Weiner et al., 2016; Rand et al., 2014; Hauser et al., 2014; Ouss and Peysakhovich, 2015; Arechar et al., 2016; Mao et al., 2017) and also human learning strategies (Erev and Roth, 1998; Fudenberg and Peysakhovich, 2016) is an important practical question to extending this work.

References

Antonio A Arechar, Anna Dreber, Drew Fudenberg, and David G Rand. 2016. I’m Just a Soul Whose Intentions Are Good: The Role of Communication in Noisy Repeated Games. (2016).

Monica Babes, Enrique Munoz De Cote, and Michael L Littman. 2008. Social reward shaping in the prisoner’s dilemma. In Proceedings of the 7th international joint conference on Autonomous agents and multiagent systems-Volume 3. International Foundation for Autonomous Agents and Multiagent Systems, 1389–1392.

Colin Camerer. 2003. Behavioral game theory: Experiments in strategic interaction. Princeton University Press.

Hans Carlsson and Eric Van Damme. 1993. Global games and equilibrium selection. Econometrica: Journal of the Econometric Society (1993), 989–1018.

Wei Chen, Yajun Wang, and Siyu Yang. 2009. Efficient influence maximization in social networks. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 199–208.

6In such cases conditionally cooperative strategies should be considered (Lerer and Peysakhovich, 2017;

De Cote and Littman, 2012; Peysakhovich and Lerer, 2017; Littman and Stone, 2005). We note here that recently

proposed conditonally-cooperative algorithms for deep RL settings can actually be applied to the games studied

here. Further exploring the connections between social dilemmas and coordination is an interesting question for

future work.

Jacob W Crandall, Mayada Oudah, Fatimah Ishowo-Oloko, Sherief Abdallah, Jean-François Bonne- fon, Manuel Cebrian, Azim Shariff, Michael A Goodrich, Iyad Rahwan, et al. 2017. Cooperating with Machines. arXiv preprint arXiv:1703.06207 (2017).

Abhishek Das, Satwik Kottur, José MF Moura, Stefan Lee, and Dhruv Batra. 2017. Learning Cooper- ative Visual Dialog Agents with Deep Reinforcement Learning. arXiv preprint arXiv:1703.06585 (2017).

Constantinos Daskalakis, Paul W Goldberg, and Christos H Papadimitriou. 2009. The complexity of computing a Nash equilibrium. SIAM J. Comput. 39, 1 (2009), 195–259.

Enrique Munoz De Cote and Michael L Littman. 2012. A polynomial-time Nash equilibrium algorithm for repeated stochastic games. arXiv preprint arXiv:1206.3277 (2012).

Sam Devlin and Daniel Kudenko. 2011. Theoretical considerations of potential-based reward shaping for multi-agent systems. In The 10th International Conference on Autonomous Agents and Multiagent Systems-Volume 1. International Foundation for Autonomous Agents and Multiagent Systems, 225–232.

Sam Devlin, Daniel Kudenko, and Marek Grze´s. 2011. An empirical study of potential-based reward shaping and advice in complex, multi-agent systems. Advances in Complex Systems 14, 02 (2011), 251–278.

Ido Erev and Alvin E Roth. 1998. Predicting how people play games: Reinforcement learning in experimental games with unique, mixed strategy equilibria. American economic review (1998), 848–881.

Katrina Evtimova, Andrew Drozdov, Douwe Kiela, and Kyunghyun Cho. 2017. Emergent Language in a Multi-Modal, Multi-Step Referential Game. arXiv preprint arXiv:1705.10369 (2017).

Jakob Foerster, Yannis M Assael, Nando de Freitas, and Shimon Whiteson. 2016. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems. 2137–2145.

Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. 2017b. Counterfactual Multi-Agent Policy Gradients. arXiv preprint arXiv:1705.08926 (2017).

Jakob N Foerster, Richard Y Chen, Maruan Al-Shedivat, Shimon Whiteson, Pieter Abbeel, and Igor Mordatch. 2017a. Learning with Opponent-Learning Awareness. arXiv preprint arXiv:1709.04326 (2017).

Drew Fudenberg and David K Levine. 1998. The theory of learning in games. Vol. 2. MIT press.

Drew Fudenberg and Eric Maskin. 1986. The folk theorem in repeated games with discounting or with incomplete information. Econometrica: Journal of the Econometric Society (1986), 533–554.

Drew Fudenberg and Alexander Peysakhovich. 2016. Recency, records, and recaps: Learning and nonequilibrium behavior in a simple decision problem. ACM Transactions on Economics and Computation (TEAC) 4, 4 (2016), 23.

John C Harsanyi, Reinhard Selten, et al. 1988. A general theory of equilibrium selection in games. MIT Press Books 1 (1988).

Oliver P Hauser, David G Rand, Alexander Peysakhovich, and Martin A Nowak. 2014. Cooperating with the future. Nature 511, 7508 (2014), 220–223.

Serhii Havrylov and Ivan Titov. 2017. Emergence of Language with Multi-agent Games: Learning to Communicate with Sequences of Symbols. arXiv preprint arXiv:1705.11192 (2017).

Johannes Heinrich and David Silver. 2016. Deep reinforcement learning from self-play in imperfect- information games. arXiv preprint arXiv:1603.01121 (2016).

Matthew O Jackson. 2010. Social and economic networks. Princeton university press.

Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. 1996. Reinforcement learning: A survey. Journal of artificial intelligence research 4 (1996), 237–285.

Michihiro Kandori, George J Mailath, and Rafael Rob. 1993. Learning, mutation, and long run equilibria in games. Econometrica: Journal of the Econometric Society (1993), 29–56.

Spiros Kapetanakis and Daniel Kudenko. 2002. Reinforcement learning of coordination in cooperative multi-agent systems. AAAI/IAAI 2002 (2002), 326–331.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).

Max Kleiman-Weiner, MK Ho, JL Austerweil, Michael L Littman, and Josh B Tenenbaum. 2016. Coordinate to cooperate or compete: abstract goals and joint intentions in social interaction. In Proceedings of the 38th annual conference of the cognitive science society.

Marc Lanctot, Vinicius Zambaldi, Audrunas Gruslys, Angeliki Lazaridou, Karl Tuyls, Julien Perolat, David Silver, and Thore Graepel. 2017. A Unified Game-Theoretic Approach to Multiagent Reinforcement Learning. arXiv preprint arXiv:1711.00832 (2017).

Angeliki Lazaridou, Alexander Peysakhovich, and Marco Baroni. 2017. Multi-agent cooperation and the emergence of (natural) language. In International Conference on Learning Representations.

Joel Z Leibo, Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, and Thore Graepel. 2017. Multi- agent Reinforcement Learning in Sequential Social Dilemmas. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems. International Foundation for Autonomous Agents and Multiagent Systems, 464–473.

Adam Lerer and Alexander Peysakhovich. 2017. Maintaining cooperation in complex social dilemmas using deep reinforcement learning. arXiv preprint arXiv:1707.01068 (2017).

Mike Lewis, Denis Yarats, Yann N Dauphin, Devi Parikh, and Dhruv Batra. 2017. Deal or No Deal? End-to-End Learning for Negotiation Dialogues. arXiv preprint arXiv:1706.05125 (2017).

Michael L Littman. 2001. Friend-or-foe Q-learning in general-sum games. In ICML, Vol. 1. 322–328.

Michael L Littman and Peter Stone. 2005. A polynomial-time Nash equilibrium algorithm for repeated games. Decision Support Systems 39, 1 (2005), 55–66.

Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. 2017. Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. arXiv preprint arXiv:1706.02275 (2017).

Andrew Mao, Lili Dworkin, Siddharth Suri, and Duncan J Watts. 2017. Resilient cooperators stabilize long-run cooperation in the finitely repeated Prisoner?s Dilemma. Nature communications 8 (2017), 13800.

Laetitia Matignon, Guillaume J Laurent, and Nadine Le Fort-Piat. 2012. Independent reinforcement learners in cooperative Markov games: a survey regarding coordination problems. The Knowledge Engineering Review 27, 1 (2012), 1–31.

Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning. 1928–1937.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. Nature 518, 7540 (2015), 529–533.

Stephen Morris. 2000. Contagion. The Review of Economic Studies 67, 1 (2000), 57–78.

J v Neumann. 1928. Zur theorie der gesellschaftsspiele. Mathematische annalen 100, 1 (1928), 295–320.

Martin A Nowak. 2006. Evolutionary dynamics. Harvard University Press.

Aurélie Ouss and Alexander Peysakhovich. 2015. When punishment doesn’t pay: ’cold glow’ and decisions to punish. Journal of Law and Economics 58, 3 (2015).

Ioannis Panageas and Georgios Piliouras. 2016. Average case performance of replicator dynamics in potential games via computing regions of attraction. In Proceedings of the 2016 ACM Conference on Economics and Computation. ACM, 703–720.

Liviu Panait, Karl Tuyls, and Sean Luke. 2008. Theoretical advantages of lenient learners: An evolutionary game theoretic perspective. Journal of Machine Learning Research 9, Mar (2008), 423–457.

Peng Peng, Quan Yuan, Ying Wen, Yaodong Yang, Zhenkun Tang, Haitao Long, and Jun Wang. 2017. Multiagent Bidirectionally-Coordinated Nets for Learning to Play StarCraft Combat Games. arXiv preprint arXiv:1703.10069 (2017).

Alexander Peysakhovich and Adam Lerer. 2017. Consequentialist conditional cooperation in social dilemmas with imperfect information. arXiv preprint arXiv:1710.06975 (2017).

David G Rand, Alexander Peysakhovich, Gordon T Kraft-Todd, George E Newman, Owen Wurzbacher, Martin A Nowak, and Joshua D Greene. 2014. Social heuristics shape intuitive cooperation. Nature communications 5 (2014).

Tuomas W Sandholm and Robert H Crites. 1996. Multiagent reinforcement learning in the iterated prisoner’s dilemma. Biosystems 37, 1-2 (1996), 147–166.

Hirokazu Shirado and Nicholas A Christakis. 2017. Locally noisy autonomous agents improve global human coordination in network experiments. Nature 545, 7654 (2017), 370–374.

David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driess- che, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529, 7587 (2016), 484–489.

Richard S Sutton and Andrew G Barto. 1998. Reinforcement learning: An introduction. Vol. 1. MIT press Cambridge.

Aviv Tamar, Yi Wu, Garrett Thomas, Sergey Levine, and Pieter Abbeel. 2016. Value iteration networks. In Advances in Neural Information Processing Systems. 2154–2162.

Ardi Tampuu, Tambet Matiisen, Dorian Kodelja, Ilya Kuzovkin, Kristjan Korjus, Juhan Aru, Jaan Aru, and Raul Vicente. 2017. Multiagent cooperation and competition with deep reinforcement learning. PloS one 12, 4 (2017), e0172395.

John B Van Huyck, Raymond C Battalio, and Richard O Beil. 1990. Tacit coordination games, strategic uncertainty, and coordination failure. The American Economic Review 80, 1 (1990), 234–248.

John B Van Huyck, Raymond C Battalio, and Richard O Beil. 1991. Strategic uncertainty, equilibrium selection, and coordination failure in average opinion games. The Quarterly Journal of Economics 106, 3 (1991), 885–910.

Xiaofeng Wang and Tuomas Sandholm. 2003. Reinforcement learning to play an optimal Nash equilibrium in team Markov games. In Advances in neural information processing systems. 1603– 1610.

Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforce- ment learning. Machine learning 8, 3-4 (1992), 229–256.

Yuxin Wu and Yuandong Tian. 2016. Training agent for first-person shooter game with actor-critic curriculum learning. (2016).

Wako Yoshida, Ray J Dolan, and Karl J Friston. 2008. Game theory of mind. PLoS computational biology 4, 12 (2008), e1000254.

5 Appendix

Proof of Theorem 1: To be able to use basin of attraction arguments, we must have that the structure of Nash equilibria of the game does not change as we change the utility functions. A key aspect of the Stag Hunt is that changing the prosociality parameters of the players does not delete the payoff-dominant equilibrium (since H gives the highest payoff to both players), though if c > m it may delete the payoff-dominated one. In addition, given any and (H, F) are never equilibria (since the player choosing F always wants to deviate). Thus, it is sufficient to look at the basins of attraction of the two original equilibria.

Let p be the probability that the other agent plays Hunt. For an agent with prosociality to (weakly) prefer Hunt over Forage the following must hold:

Solving for , the minimum p that leads the agent to play Hunt yields

By the Stag Hunt inequalities, all three bracketed quantities must be strictly positive; therefore, decreases in (i.e. the basin of attraction increases as increases). Solving for determines the values which makes Hunt a weakly dominant strategy:

By the Stag Hunt inequalities, , and both are positive, therefore,

Proof of Corollary 1: Note that because the game restricted to any two strategies is a Stag Hunt, we can use the same argument as in Theorem 1 to look only at the original set of equilibria: the best possible equilibrium is not removed when we add prosociality (since it optimizes the joint reward), worse equilibria can be removed, and no other strategy pair (i, j) can become an equilibrium.

Consider an payoff matrix U for agent 1; we assume symmetry, i.e. for simplicity. We can order the strategies such that for all without loss of generality. Then every subspace of U is a stag hunt if for all i < j,

for all i < j. Thus for agent 1, i weakly dominates j for all i < j. If agent 1 plays 0, then 0 is a dominant strategy for agent 2 (for any

5.1 Details of Markov Game Training

For grid world Markov games, we train policies modeled by a multi-layer convolutional neural network. For a given board size, the model has repeated layers, each consisting of a 2D convolution with kernel size 3, followed by batch normalization and ReLU. The first layer has stride 1, while the successive layers each have stride 2, which decreases the width and height from k to while doubling the number of channels.

We perform reinforcement learning via policy gradient using Reinforce Williams (1992) and RMSProp for optimization. The model is updated episodically in batches of 64 episodes, with a discount rate of 0.99. Markov stag hunt and Harvest are trained for 200,000 episodes with episode length drawn from exp(250), while Escalation is trained for 500,000 episodes of length at most 50 (since Escalation only extends for the duration of coordination). For Markov Stag Hunt, we use an entropy regularization weight of 0.05 Mnih et al. (2016).

For Escalation Pong, we use the ALE environment modified for 2-player play as proposed in Tampuu et al. (2017). We modify the game so that each paddle hit gives a reward of +1 to each player, and when a ’point’ is scored, the game ends and the player who scored the point receives a reward of is the total number of hits and a is a hyperparameter. We make the escalation level k observable in the state by brightening a patch in the center of the board by an amount proportional to k.

We train policies directly from pixels, using the pytorch-a3c package (https://github.com/ ikostrikov/pytorch-a3c).

Policies are trained directly from pixels via A3C (Mnih et al., 2016). Inputs are rescaled to and normalized, and we augment the state with the difference between successive frames. We use 38 threads for A3C, and train for a total of 380,000 games (10,000 per thread), which takes 3 – 4 hours on a 40-core machine. We use the default settings from pytorch-a3c: a discount rate of 0.99, 20-step returns, and entropy regularization weight of 0.01. We found a slightly lower learning rate of was required for stable convergence.

The policy is implemented as a convolutional neural network with four layers, following pytorch-a3c. Each layer uses a kernel with stride 2, followed by ELU. The network has two heads for the actor and critic. We elide the LSTM layer used in the pytorch-a3c library, as we found it to be unnecessary.

designed for accessibility and to further open science