The recent availability of large datasets collected from various resources, such as digital transactions, location data and government census, is transforming the ways we study and understand social systems1. Researchers and policy makers are able to observe and model social interactions and dynamics in great detail, including the structure of friendship networks2, the behavior of cities3, politically polarized societies4, or the spread of information on social media5. These studies show the behaviors present in the data but do not explore the space of possibilities toward which human dynamics may evolve. Robust policies should consider mechanisms to respond to every type of event6, including those that are very rare7. Therefore, it is crucial to develop simulation environments in which potentially unobserved social dynamics can be assessed empirically.
Agent Based Modeling (ABM) is a generative approach to study social phenomena based on the interaction of individuals8. These models show how different types of individual behavior give rise to emergent macroscopic regularities9, such as unequal wealth distributions10, new political actors11, multipolarity in interstate systems12 and cultural differentiation13. Moreover, ABM allows testing core sociological theories against simulations10 with emphasis on heterogeneous, autonomous actors with bounded, spatial information14. However, the rules of agent interactions are generally fixed, which limits the exploration of the space of possible behaviors.
Reinforcement Learning (RL) is a simulation method where agents learn new, reward-maximizing behaviors based on the state of their environment and a previously defined structure of incentives. This method is referred to as Multi-Agent Reinforcement Learning (MARL) if multiple agents are employed. Recently, the combination of RL with Deep Learning architectures has achieved human-level performance in complex tasks, including video gaming15, motion in harsh environments16, and effective communication networks without assumptions17. Moreover, it has recently been applied to study societal dilemmas and game theory problems18 such as the emergence of cooperation19,20, the Prisoner’s Dilemma21 and payoff matrices in equilibrium22. Although Deep RL algorithms applied to multiple agents (MARL) can shed light on social phenomena, to the best of our knowledge, the applications of these methods have been confined to classical game-theoretic problems23 and drawing connections to real-world examples remains unexplored.
In this paper we extend the standard ABM of social segregation using MARL in order to explore the space of possible behaviors as we modify the structure of incentives and promote the interaction among agents of different kinds. The idea is to observe the behavior of agents that want to segregate from each other when interactions are promoted. We achieve the segregation dynamics by considering the rules from the Schelling model9. The creation of interdependencies among agents of different kinds is inspired by the dynamics of the Predator-Prey model8 where agents hunt each other. Our experiments show that spatial segregation diminishes as more interdependencies among agents of different kinds are added. Moreover, our results shed light on previously unknown behaviors regarding segregation and the age of individuals, which we confirmed using Census data. These methods can be extended to study other types of social phenomena and inform policy makers on possible actions.
Table 1. Training parameters of the Deep Q-Networks used during the experiments.
The organization of the paper is as follows: In Section 2 we explain the experimental setup. Section 3 illustrates the experiment outcomes. In Section 4 we conclude and discuss our results. Future improvements are presented in Section S1 in the Supplement.
We design a game in which agents are encouraged both to self-segregate and to interact with others. By varying the reward of interactions we are able to explore different incentives that affect the self-organizing process of segregation. Our experiments are based on two types of agents: A and B. Agents try to survive in a 50x50 grid where they can move around and interact with other agents. They observe an 11x11 patch of the grid centered around their current position and can live for a total of 100 iterations in isolation. Figure 1 shows a schematic view of the grid world and the agents. Distinct colors indicate the agents’ types and the green square represents the observation window of the agent illustrated in green.
Each type of agent utilizes one Deep Q-Network for maximizing rewards15. The rewards of the game, R, are as follows:
• Segregation reward. This incentive promotes agents to self-segregate. An agent is rewarded +1 for each agent of similar kind that joins its observation window, and -1 for each agent of different kind.
• Interdependence reward. This incentive promotes interactions among agents of different kinds. When an agent meets another agent of different kind, we randomly choose a winner of the interaction (following hunting dynamics). The winner (hunter) receives a positive reward, that we vary across experiments, and an extension of its lifetime by one iteration.
• Vigilance reward. This incentive promotes agents to stay alive by providing +0.1 reward for every time step they survive.
• Death reward. This incentive penalizes agents who die or are hunted by agents of the opposite kind. Agents receive -1 reward when they die.
• Occlusion reward. This incentive rewards movements towards occupied cells negatively. If an agent tries to move towards an occluded area, the agent receives -1 reward.
• Stillness reward. This incentive promotes the exploration of space. Agents who choose to stay still receive -1 reward.
Every agent takes one action at each iteration, and the order in which agents act is chosen randomly. There are five possible actions: stay still, or move left, right, up or down. Agents are confined within the borders of the grid and cannot move onto cells occupied by agents of their own kind. If an agent moves to a location occupied by an agent of the opposite kind, it receives the interdependence reward and the opponent receives the death reward.
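The reward and movement rules above can be sketched in a few lines of Python. This is only an illustrative reading of the text, not the authors' implementation: the function names, the grid encoding (+1 and -1 for the two kinds, 0 for empty), the choice to sum the vigilance reward with the other rewards, the treatment of the border as an occluded cell, and the reading of the segregation reward as a count over the current window are all assumptions. Lifetime bookkeeping is omitted.

```python
# Reward constants quoted in the text; the interdependence reward is passed in
# because it varies across experiments (0, 25, 50, 75).
VIGILANCE, OCCLUSION, STILLNESS = 0.1, -1.0, -1.0
EMPTY = 0  # assumed encoding: +1 / -1 for the two kinds, 0 for empty cells
ACTIONS = {"stay": (0, 0), "left": (0, -1), "right": (0, 1),
           "up": (-1, 0), "down": (1, 0)}


def segregation_reward(grid, pos, radius=5):
    """+1 per same-kind and -1 per other-kind agent in the 11x11 window.

    Assumption: the reward is a count over the current window.
    """
    kind = grid[pos[0]][pos[1]]
    total = 0
    for r in range(max(0, pos[0] - radius), min(len(grid), pos[0] + radius + 1)):
        for c in range(max(0, pos[1] - radius), min(len(grid[0]), pos[1] + radius + 1)):
            if (r, c) != tuple(pos) and grid[r][c] != EMPTY:
                total += 1 if grid[r][c] == kind else -1
    return total


def resolve_move(grid, pos, action, interdependence_reward):
    """Apply one action for the agent at `pos`.

    Returns (new_pos, reward, hunted_pos); hunted_pos is None unless the mover
    stepped onto an agent of the opposite kind (who would get the death reward).
    """
    kind = grid[pos[0]][pos[1]]
    if action == "stay":
        return pos, STILLNESS + VIGILANCE, None
    dr, dc = ACTIONS[action]
    r, c = pos[0] + dr, pos[1] + dc
    inside = 0 <= r < len(grid) and 0 <= c < len(grid[0])
    if not inside or grid[r][c] == kind:
        # Border or own-kind cell: the move is blocked and penalized (assumption).
        return pos, OCCLUSION + VIGILANCE, None
    hunted = (r, c) if grid[r][c] != EMPTY else None
    reward = VIGILANCE + (interdependence_reward if hunted else 0.0)
    grid[r][c], grid[pos[0]][pos[1]] = kind, EMPTY  # mover takes the cell
    return (r, c), reward, hunted
```

In a full simulation loop, the agents would be visited in a freshly shuffled order at each iteration, and the hunted agent would be removed and assigned the death reward of -1.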
Figure 1. Schematic of the model simulation and network architecture. Top panel: Grid world of experiments. The grid size is 50x50 locations. Red and blue squares denote two types of agents respectively. White represents empty regions. Each type has its own Deep Q-Network. Every agent has a field of view of 11x11 locations. Green border denotes the field of view of the agent illustrated in green. Agents can move across empty spaces. Bottom panel: Network structure of the Deep Q-Networks. Each network receives an input of 11x11 locations, runs it through five convolution steps and concatenates the resulting activations with the agent’s remaining age normalized by the maximum initial age. The feature vector is mapped over the action space using a fully connected layer. The action with the maximum Q-value is taken for the agent.
Figure 2. Agents' collective behavior for multiple values of interdependence reward (rows) at multiple times (columns). Rows represent outcomes associated with different values of the interdependence reward (IR). Columns show the state of the system at different points of the simulation. Experiments are initialized with equal initial conditions and random seed. The heat maps are obtained by averaging over the last 1000 iterations. In Panel (a) red regions denote biased occupation of type A agents, where areas fully occupied with type A agents are indicated by a type concentration of +1. Blue regions denote biased occupation of type B agents, where full occupation by type B agents is indicated by a type concentration of -1. White areas indicate uniform mixing across types, indicated by a type concentration of 0. In Panel (b) color indicates the age of agents irrespective of their type. As color shades from blue to red, agent age increases. In Section 3, we introduce a set of videos that represent the experiments used in creating the heat maps.
Figure 3. Segregation dynamics for multiple values of interdependence reward (IR). Colors correspond to the results for multiple values of interdependence reward, ranging from yellow (low) to black (high). The curves are obtained by averaging 50 iterations over 10 experiment realizations. Shades denote the standard deviation across experiments.
Mathematically, agents of type A are represented as $-1$, B as $+1$, empty space as $0$, and the border by a separate value on the grid. Hence every agent’s spatial observation at time $t$ is an 11x11 array $O_t$ of these values. Moreover, every agent has the information of its remaining normalized lifetime, represented as a scalar. The full observation of agent $i$ at time $t$, $o^i_t$, combines both. Let $Q_A$ and $Q_B$ denote the Q-Networks of type A and B. Then the networks’ goal is to satisfy Equations 1 and 2, where $N_X$ denotes the number of agents of type X, $\gamma$ denotes the discount factor, $r_t$ denotes the reward at time $t$ and $Q_X$ denotes the Q-Network of agents of type X.
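Equations 1 and 2 are not legible in this copy of the text. One plausible reading, assuming the standard Bellman optimality condition that Deep Q-Networks are trained to approximate15, is sketched below. The action symbols $a^i_t$ and $a'$, the per-agent reward index on $r^i_t$, and the subscript placement on $Q_A$ and $Q_B$ are notational assumptions rather than the authors' confirmed formulation.

```latex
\begin{align}
Q_A\!\left(o^i_t, a^i_t\right) &= r^i_t + \gamma \max_{a'} Q_A\!\left(o^i_{t+1}, a'\right),
  \qquad i = 1, \dots, N_A \tag{1} \\
Q_B\!\left(o^j_t, a^j_t\right) &= r^j_t + \gamma \max_{a'} Q_B\!\left(o^j_{t+1}, a'\right),
  \qquad j = 1, \dots, N_B \tag{2}
\end{align}
```

Under this reading, each network is shared by all agents of its type, so every agent of type X contributes one such condition per time step to the training of the network for type X.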
Each network is initialized with the same parameters. In order to homogenize the networks’ inputs, we normalize the observation windows by the agents’ own kind, such that positive and negative values respectively represent equal and opposite kind for each agent. Actions are taken by following an $\epsilon$-greedy exploration strategy, with an exploration rate that decays exponentially. In order to stabilize the learning process, we use the Adam optimizer24, Experience Replay25 and Double Q-Learning26. Networks are trained in parallel over 12 CPUs using data parallelism. We run one episode per experiment. Each episode comprises 5000 iterations. Each experiment is repeated 10 times for statistical analysis. Network details are given in Figure 1 (bottom) and training details are given in Table 1.
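As a minimal sketch of the input normalization and the exploration strategy described above: the function names, the decay constants and the rule of leaving non-agent cells (empty space, border) untouched are assumptions made for illustration, and the actual values are those of Table 1.

```python
import math
import random


def normalize_window(window, own_kind):
    """Re-express agent cells relative to the observer.

    After normalization +1 means "same kind as me" and -1 means "other kind".
    Cells that are not +1/-1 (empty space, border) are left unchanged, which is
    an assumption of this sketch.
    """
    return [[cell * own_kind if cell in (1, -1) else cell for cell in row]
            for row in window]


def epsilon_at(step, eps_start=1.0, eps_end=0.05, decay=1000.0):
    """Exponentially decaying exploration rate (hypothetical constants)."""
    return eps_end + (eps_start - eps_end) * math.exp(-step / decay)


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)
```

Normalizing by the agent's own kind lets both networks see the same input convention, so identical initial parameters produce comparable behavior for types A and B.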
Experiments are conducted by setting different values of incentives and observing the emergent collective behavior associated with each experiment. During simulations, agents explore the space of possible behaviors and reveal which behaviors are promoted under certain incentives and environmental rules. As a result, we create an artificial environment for testing hypotheses and for obtaining, through simulation, information that is hard to anticipate given the complexity of the space of possibilities.1
In this case, we create agents who want to segregate from other kinds and provide incentives to create interactions and interdependencies across kinds. For this purpose, we model the Schelling dynamics for segregation and combine them with the interdependence reward. The interdependence reward is given when agents of different kinds compete and win against each other following hunting dynamics. The one who is hunted dies and the hunter gets a positive reward and a life extension. In total, there are four experiments, with interdependence rewards of 0, 25, 50 and 75 respectively. A set of videos is available with one simulation for each setting. In the videos, colors yellow/orange and cyan/magenta denote the types of agents. The color brightness indicates the age of agents for both kinds.
1 Demonstration of the experiments: (IR: 0) https://youtu.be/AgAeYMe2tUE (IR: 25) https://youtu.be/OZbl8qD50Mg (IR: 50) https://youtu.be/Ca2p2cATmlw (IR: 75) https://youtu.be/R32Xu_EUpBQ.
Figure 4. Dynamics of the experiment results. Panel (a) shows the average age of agents. Panel (b) shows the maximum age of agents. Panel (c) shows the percentage of agents that hunt at each iteration. Panel (d) shows the hunters’ cluster size prior to hunting the opposing agent. Panel (e) shows the age of the agents hunting. Color is proportional to the interdependence reward (IR). Darker color indicates higher interdependence reward. Each plot is obtained by averaging results of ten experiments and 100 iterations. Shaded areas denote the standard deviation among the experiments.
Interdependence rewards diminish spatial segregation among different types. In Figure 2a we show the collective behavior of the population, using heat maps proportional to the probability of agents' locations during simulations according to their type. The heat maps are visualized over one trial of the experiments. Blue and red regions show biases towards each kind. White regions show uniform occupation. The dynamics of segregation quickly result in patches of segregated groups (top panels). As interdependence rewards increase, the probability of a grid cell being occupied by an agent of type A or B becomes uniform and the plots become white (bottom right panels). Creating interdependencies among agents increases their interactions and reduces the spatial segregation.
We measure segregation among agents using multiscale entropy. We convolve the grid space with low pass filters of size 6x6, 12x12 and 25x25 using sliding windows whose output is the window average value. We measure the entropy of the distribution of window averages after each convolution across all iterations. The segregation per iteration is defined as the average entropy across the distributions resulting from the different filter sizes. The resulting segregation dynamics is visualized in Figure 3. Segregation is high when interdependencies are not rewarded (yellow curve). As interdependencies increase (purple and black curves), the agents mix and the spatial segregation is significantly reduced (p < 0.001, see Section S2 in the Supplement).
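The multiscale entropy measure can be sketched with NumPy as follows. This is an illustration under stated assumptions rather than the authors' code: the window stride of 1, the number of histogram bins, the natural logarithm, and scoring a single grid snapshot (instead of pooling window averages across iterations) are all choices made here, and the function names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def window_averages(grid, size):
    """Average of every size x size sliding window (stride 1, assumed)."""
    windows = sliding_window_view(grid, (size, size))
    return windows.mean(axis=(-2, -1)).ravel()


def shannon_entropy(values, bins=21, value_range=(-1.0, 1.0)):
    """Entropy of the histogram of window averages (bin count is assumed)."""
    counts, _ = np.histogram(values, bins=bins, range=value_range)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum())


def multiscale_entropy(grid, sizes=(6, 12, 25)):
    """Mean entropy over the three filter sizes named in the text."""
    grid = np.asarray(grid, dtype=float)
    return float(np.mean([shannon_entropy(window_averages(grid, s))
                          for s in sizes]))


if __name__ == "__main__":
    # Fully segregated grid: left half type -1, right half type +1.
    segregated = np.ones((50, 50))
    segregated[:, :25] = -1.0
    # Perfectly mixed grid: checkerboard of the two types.
    rows, cols = np.indices((50, 50))
    mixed = np.where((rows + cols) % 2 == 0, 1.0, -1.0)
    print(multiscale_entropy(segregated), multiscale_entropy(mixed))
```

With these choices, a well-mixed grid yields window averages concentrated near zero and therefore low entropy, while segregated patches spread the averages between -1 and +1 and raise the entropy, which matches the direction of the curves described for Figure 3.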
Interdependencies affect the group dynamics. As we increase the reward for interdependencies, the initially stable patches emerging from promoting segregation become dynamic and mix with the other kind. The properties of the population and associated activities reflect the change of dynamics. Agents create an internal hierarchy where younger agents go out and hunt and elder agents segregate and ensure reproduction. Evidence of such behavior is that the average age of agents decreases and the hunting rate increases (Figure 4a and 4c), and that the average hunter age is much lower than the average agent age (Figure 4a and 4e). Moreover, the maximum age of agents per kind increases (Figure 4b), showing that some agents stay protected and do not hunt. The hunting strategy of agents is also affected by increasing interdependencies. Pack size increases consistently with interdependence rewards. Figure 4d shows the size of hunting clusters one step before hunting an agent. The increase in cluster size with interdependence rewards suggests that agent association yields better results. It also shows that hostile systems favor agglomeration of agents for safety, which can ultimately result in polarization. We also analyzed the effects of the vigilance reward on the dynamics for multiple reward values. Results show that a higher vigilance reward increases intra-kind interaction and results in more segregation (see Section S3).
Diverse areas attract younger people and people are older in segregated areas. We show that older agents are more segregated than younger ones in the model (see Figure 2b). The behavior has been observed in the model and verified against human behavior using Census data. We analyzed the relationship between age and segregation using Census data across the whole US (see Section S4). A segregation metric based on racial entropy correlated positively with median age by census tract (r=0.4). Our simulation sheds light on a non-trivial observation about current societies.
In summary, our experiments show that increasing interdependencies among kinds can be applied to reduce segregation. Moreover, hostile interdependencies will result in in-group cooperation for hunting and competition for sheltering. The emergent behavior of the population can be framed in the exploiter and explorer discussion. One part of the population chooses to segregate and another to go out and explore. The one who explores hunts and is vulnerable to being hunted, but creates spatial integration. The one who segregates lives longer and ensures reproduction of its own kind. In this model, explorers tend to be younger and keepers tend to live longer. Spatial mixing was achieved by increasing interaction rewards but was accompanied by larger clusters of agents of the same kind. Polarization may arise when there is an adversarial relationship between the parts that segregate from each other. More generally, emergent behaviors lie in a non-linear space where interaction properties determine outcomes, which may happen simultaneously and in different combinations.
We created an artificial environment for testing rules of interactions and incentives by observing the behaviors that emerge when applied to multi-agent populations. Incentives can generate surprising behaviors because of the complexity of social systems. As problems become complex, evolutionary computing is necessary to achieve sustainable solutions. We combine system modeling (ABMs) with artificial intelligence (RL) in order to explore the space of solutions associated with promoted incentives. RL provides ABMs with the information processing capabilities that enable the exploration of strategies that satisfy the conditions imposed by the interaction rules. In turn, ABMs provide RL with access to models of collective behavior that achieve emergence and complexity. While ABMs provide access to the complexity of the problem space, RL facilitates the exploration of the solution space. Our methodology opens a new avenue for policy makers to design and test incentives in artificial environments.
We would like to thank Intel AI DevCloud Team for granting access to their cloud with powerful parallel processing capabilities. Also, we would like to thank Dhaval Adjodah for his valuable suggestions on training RL algorithms.
ES, YBY and AJM contributed equally in the conceptualization, development and interpretation of the experiments as well as in the paper write up.
The source code of the model implementation as well as the data generated to create this report will be made available upon publication.
We declare that we have no competing interests.
1. Lazer, D. et al. Computational social science. Science 323, 721–723 (2009).
2. Eagle, N., Pentland, A. S. & Lazer, D. Inferring friendship network structure by using mobile phone data. Proc. national academy sciences 106, 15274–15278 (2009).
3. Morales, A. J., Vavilala, V., Benito, R. M. & Bar-Yam, Y. Global patterns of synchronization in human communications. J. The Royal Soc. Interface 14, 20161048, DOI: 10.1098/rsif.2016.1048 (2017).
4. Morales, A., Borondo, J., Losada, J. C. & Benito, R. M. Measuring political polarization: Twitter shows the two sides of venezuela. Chaos: An Interdiscip. J. Nonlinear Sci. 25, 033114 (2015).
5. Vosoughi, S., Roy, D. & Aral, S. The spread of true and false news online. Science 359, 1146–1151 (2018).
6. Ashby, W. R. Requisite variety and its implications for the control of complex systems. In Facets of systems science, 405–417 (Springer, 1991).
7. Taleb, N. N. Black swans and the domains of statistics. The Am. Stat. 61, 198–200 (2007).
8. Sayama, H. Introduction to the modeling and analysis of complex systems (Open SUNY Textbooks, 2015).
9. Schelling, T. C. Dynamic models of segregation. J. mathematical sociology 1, 143–186 (1971).
10. Epstein, J. M. & Axtell, R. Growing artificial societies: social science from the bottom up (Brookings Institution Press, 1996).
11. Axelrod, R. A model of the emergence of new political actors. In Artificial Societies, 27–44 (Routledge, 2006).
12. Cederman, L.-E. Emergent actors in world politics: how states and nations develop and dissolve, vol. 2 (Princeton University Press, 1997).
13. Axelrod, R. The dissemination of culture: A model with local convergence and global polarization. J. conflict resolution 41, 203–226 (1997).
14. Epstein, J. M. Agent-based computational models and generative social science. Complexity 4, 41–60 (1999).
15. Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529 (2015).
16. Heess, N. et al. Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286 (2017).
17. Sert, E., Sönmez, C., Baghaee, S. & Uysal-Biyikoglu, E. Optimizing age of information on real-life tcp/ip connections through reinforcement learning. In 2018 26th Signal Processing and Communications Applications Conference (SIU), 1–4 (IEEE, 2018).
18. Lanctot, M. et al. A unified game-theoretic approach to multiagent reinforcement learning. In Advances in Neural Information Processing Systems, 4190–4203 (2017).
19. de Cote, E. M., Lazaric, A. & Restelli, M. Learning to cooperate in multi-agent social dilemmas. In Proceedings of the fifth international joint conference on Autonomous agents and multiagent systems, 783–785 (ACM, 2006).
20. Leibo, J. Z., Zambaldi, V., Lanctot, M., Marecki, J. & Graepel, T. Multi-agent reinforcement learning in sequential social dilemmas. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, 464–473 (International Foundation for Autonomous Agents and Multiagent Systems, 2017).
21. Sandholm, T. W. & Crites, R. H. Multiagent reinforcement learning in the iterated prisoner’s dilemma. Biosystems 37, 147–166 (1996).
22. Wunder, M., Littman, M. L. & Babes, M. Classes of multiagent q-learning dynamics with epsilon-greedy exploration. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), 1167–1174 (Citeseer, 2010).
23. Zawadzki, E., Lipson, A. & Leyton-Brown, K. Empirically evaluating multiagent learning algorithms. arXiv preprint arXiv:1401.8074 (2014).
24. Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
25. Lin, L.-J. Self-improving reactive agents based on reinforcement learning, planning and teaching. Mach. learning 8, 293–321 (1992).
26. Van Hasselt, H., Guez, A. & Silver, D. Deep reinforcement learning with double q-learning. In Thirtieth AAAI Conference on Artificial Intelligence (2016).
27. Nikolov, N., Kirschner, J., Berkenkamp, F. & Krause, A. Information-directed exploration for deep reinforcement learning. arXiv preprint arXiv:1812.07544 (2018).
28. Tang, H. et al. # exploration: A study of count-based exploration for deep reinforcement learning. In Advances in Neural Information Processing Systems, 2753–2762 (2017).
29. Fu, J., Co-Reyes, J. & Levine, S. Ex2: Exploration with exemplar models for deep reinforcement learning. In Advances in Neural Information Processing Systems, 2577–2587 (2017).
30. Bansal, T., Pachocki, J., Sidor, S., Sutskever, I. & Mordatch, I. Emergent complexity via multi-agent competition. arXiv preprint arXiv:1710.03748 (2017).
31. Macy, M. W. & Willer, R. From factors to actors: computational sociology and agent-based modeling. Annu. review sociology 28, 143–166 (2002).
There are many potential improvements to our work. We classify directions of future work under three categories: representation, training and experimentation. Our method can be advanced by representing agents more realistically, for example by introducing heterogeneous personalities or by imposing a network structure over agents to promote alliances. Moreover, training RL agents yields better results with sophisticated exploration strategies27–29. In addition to exploration strategies, MARL has been shown to perform better with curriculum learning30. Our aim is to extend the work on multi-agent curriculum learning to our problem.
The Schelling and Predator-Prey models cover only a small portion of the ABM domain31. We are currently working on extending this artificial environment to other ABMs, e.g. the Axelrod model13. Our goal is to develop an easy interface where policy makers and AI researchers can collaborate on solving societal problems.
We validate the significance of the patterns we observe along the execution of the simulation as we change the IR incentive in Figure 3. We analyze the distribution of values across the last 1000 iterations for each IR value and test the difference among their averages. In Table S1 we summarize the results of the statistical tests. The differences in averages are statistically significant (p < 0.001) across all pairs of curves.
We analyze the effects of the Vigilance Reward (VR) on the dynamics of agents. In Figures S1 and S2 we show the impact on segregation and age distribution for multiple values of VR. The results show that increased VR increases intra-kind behavior and, as a result, increases segregation. Therefore, segregation may also be fostered by other types of behaviors.
We analyze the significance of the spatial distribution of agent ages shown in Figure 2b. In Figure S3 we show the entropy of the spatial distribution of agent ages at multiple iteration times and interdependence rewards (IR), together with a randomized case for comparison. The randomized case is constructed by drawing agent ages on the grid from a uniform distribution between 0 and 1 and calculating the entropy. The difference between the random case and the empirical results is significant in all cases. We tested significance by comparing each curve. A summary of the test results is presented in Table S2.
The behavior has been observed in the model and verified against human behavior using Census data. We analyzed the relationship between age and segregation using Census data across the whole US. A segregation metric based on racial entropy correlated positively with median age by census tract (r=0.4). In Figure S4 we present a scatter plot of the segregation metric (x-axis) and average age (y-axis) of each census tract (dots).
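One way to read the age-entropy comparison is sketched below with NumPy. It assumes that the grid of ages is normalized into a probability distribution over cells and that the entropy is divided by the logarithm of the number of cells, as stated in the caption of Figure S3; the function names and the seed are hypothetical, and the exact procedure used for Table S2 may differ.

```python
import numpy as np


def normalized_age_entropy(ages):
    """Entropy of ages treated as a distribution over grid cells.

    Returns a value in [0, 1]: 1 when age is spread evenly over the grid,
    lower when age is concentrated in a few locations.
    """
    ages = np.asarray(ages, dtype=float)
    total = ages.sum()
    if total <= 0:
        return 0.0
    p = ages.ravel() / total
    p = p[p > 0]  # empty cells contribute nothing to the sum
    return float(-(p * np.log(p)).sum() / np.log(ages.size))


def random_baseline(shape=(50, 50), seed=0):
    """Randomized case: ages drawn uniformly between 0 and 1 on every cell."""
    rng = np.random.default_rng(seed)
    return normalized_age_entropy(rng.uniform(0.0, 1.0, size=shape))
```

Under this reading, a simulated grid whose entropy falls clearly below the random baseline indicates that ages are spatially clustered rather than scattered.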
Table S1. Statistical tests results comparing segregation outcomes for different values of Interdependence Reward denoted as R1 and R2. We tested the difference in segregation over the last 1000 iterations for every pair of curves. The tests show that the average segregation differs across curves of different rewards.
Figure S1. Spatial distribution of agent types with varying Vigilance Reward (VR) (vertical) and Iteration (horizontal). Color indicates concentration of each type.
Figure S2. Spatial distribution of agent ages with varying Vigilance Reward (VR) (vertical) and Iteration (horizontal).
Figure S3. Entropy of Spatial distribution of agent ages with iteration. Random case is constructed by drawing agent ages on the grid from a uniform distribution and calculating the entropy. The difference in entropy corresponds to spatial clustering of ages. Entropy is normalized by the logarithm of the grid size.
Table S2. Statistical significance of comparing the average age entropy for different Interdependence Rewards, denoted by V1 and V2. Random curves are generated by sampling an entropy value per iteration from a uniform distribution between zero and one.
Figure S4. Age and racial segregation by census tract (dots). Pearson correlation r annotated in the Figure.