With the advent of deep reinforcement learning, there has been a resurgence of interest in situated emergent communication (EC) research [1, 2, 3, 4, 5]. However, there remain many open design decisions, each of which may significantly bias the nature of the constructed language, and any agent policy which makes use of it.
In this work, we consider how incoming messages are integrated into a listening agent’s policy. A common approach is to concatenate the message and state observation together, or to pass the message to the agent policy directly [6, 7]. But what type of communication does this design decision cater to? We contend that this approach gives the speaker too great an influence in shaping the listener’s policy, potentially allowing the speaker to use the communication channel to encode an optimal policy. Auxiliary losses that pressure the listening agent to consistently alter its short-term behaviour in response to messages (e.g., the causal influence of communication (CIC) loss [4, 6, 8]) may further bias the agent towards this behaviour. In this situation, messages become merely commands.
In contrast, we aim to develop agents that utilize information consistently, regardless of whether that information was obtained from their observations or via communication with other agents. By constraining the agent in this way, we aim to shape the use of the communication channel towards conveying more substantive information. We formalize this guiding assumption as follows:
In a partially-observed setting, let o be an observation beyond the listener’s field-of-view. Assuming the speaker is as reliable as the listener’s perception, the listener’s actions upon receiving full information regarding o should be identical to those if given the perceptual ability to observe o themselves.
Figure 1: A Language-Conditional World Model, adapted from Scott McCloud’s Understanding Comics [10] and World Models [11]. Here the cyclist has limited observability of the world around him (blindfolded), and conceptualizes danger by interpreting language within the context of his world model.
Figure 2: The partially-observable worlds that the agents interact in. The speaker (unseen) is able to view the entire map, whereas the listener (blue) only views a pixel in each direction. At the start of each game, a flag (green) is randomly placed in one of two paths. Both the speaker and listener receive a reward if the listener is able to find the flag by choosing the correct corridor.
Therefore, we propose to explicitly separate the process of message interpretation from the agent’s decision-making. By forcing the agent to ground incoming messages prior to taking action, we decrease the odds of the message directly controlling the agent’s policy. However, accomplishing this raises a difficult modelling problem: how does an agent learn to ground messages which refer to objects outside their field of view at the time when that message is received?
We introduce Language World Models (LWMs), an adaptation of world models [9] to partially-observable worlds, trained to predict future states based on messages. This, in turn, creates a visual grounding for agents to exploit: upon receiving a message, an agent can immediately “visualize” information beyond its field-of-view and act accordingly. These visualizations begin as a reflection of entire trajectories, but refine their focus and may eventually home in on the intended message semantics if the model is exposed to sufficient variation in the environment.
Our contributions are the following:
1. We introduce the LWM and apply it to a 2D gridworld EC task, in which an all-seeing speaking agent must guide a listening agent to a goal location. We show that agents trained with access to an LWM exhibit greater success in this domain than those without, especially in sparse-messaging, longer-trajectory scenarios.
2. Reflecting on the aforementioned experiments, we discuss necessary conditions for successfully training an LWM.
3. By virtue of this, we provide a solution to the long-standing challenge of how we can interpret or evaluate the listener’s understanding of incoming messages (see [6] for an overview).
4. We provide two additional losses formulated around this model-based approach, one to promote efficient speaking and the other to promote effective listening, and show that they accelerate learning along both dimensions.
Revisiting the motivating principles of World Models [9], we return to the following quote from Jay Wright Forrester:
“The image of the world around us, which we carry in our head, is just a model. Nobody in his head imagines all the world, government or country. He has only selected concepts, and relationships between them, and uses those to represent the real system.”
Table 1: A comparison of various world models.
To this end, a world model (WM) aims to identify such important concepts through its ability to predict future states. Formally, given a latent code $z_t$ for time $t$, a WM models a future state as $P(z_{t+1} \mid z_t, a_t)$, the outcome of action $a_t$. Thus, a world model is useful because it presents to the agent a glimpse into a future world, beyond its current sights.
A similar problem arises in partially-observable worlds, where important concepts lie not in a possible future, but outside of the agent’s field-of-view. Here we must reconsider the modelling objective, and swap short-term dependencies in time for potentially long-term dependencies in space. At a given time $t$, let $o_t$ be the observation by an omniscient agent (speaker), $m_t$ be a message sent by that agent, and $\hat{o}_t$ be a partial observation (of the listener) of $o_t$. A straightforward model for visually grounding the message is $P(o_t \mid \hat{o}_t, m_t)$. The disadvantage of this formulation is that estimating it directly from $(o_t, m_t)$ pairs violates important assumptions of emergent communication, making it unsuitable for our purposes.
We instead aim to approximate this by using the message to predict future states along the listening agent’s trajectory. As in WMs, we project an observation into a latent code $z_t$, and introduce Language World Models as:
$$P(z_{t'} \mid z_t, m_t), \quad t' \in T,$$
where $T$ is the set of future time-points greater than $t$. LWMs (Fig 1) provide useful information for agent planning by interpreting incoming messages, and passing the resulting latent code to the agent as a form of enhanced observation: one that contains important objects both within its field of vision and in more distant locations.
This formulation has connections to previous work, most notably other generative models of the world. Table 1 provides an overview. In comparison to WMs, the predictive power of the action $a_t$ is replaced by that of the message $m_t$. Conceptually we are replacing “What do I expect will happen if I do this?” with “What do I expect to find if I hear this?”. But as the WM conditions on $a_t$ and models purely local phenomena ($t + 1$), it captures the physics of the world in ways that the LWM does not. Instead the LWM has a broader temporal window, as found in a temporal difference variational auto-encoder [12], but differs in that we do not know or specify how far in the future to predict.
2.1 Guiding Assumptions
Several important guiding assumptions must hold for the effective training of a LWM in the EC setting, where supervision comes only from the task loss.
Trustworthiness LWMs make strong assumptions linking the task reward to the grounding of the message: it is assumed that the information conveyed in the message is important for the completion of the task, and that the agent is therefore more likely to observe the intended target of the message along trajectories for which it receives high reward than along those for which it does not. Thus, it is assumed that the speaker is trustworthy, and aims to send useful information.
Contrast and Consistency Second, while we do not know which observation(s) the speaker is trying to communicate, it is assumed that the observations which correspond to the true target of the message will, over time, be observed more often than those which do not. While we show that the LWM is effective even when there are only two possible trajectories (one resulting in success, one in failure), an LWM undesirably captures both the important objects and closely-related (but unimportant) states. Only over many trials and permutations of the environment will the intended semantics of the message emerge.
Object Permanence A final constraint in our current formulation is that objects must remain fixed across time. This allows an observation at $t + m$, some arbitrary $m$ steps into the future, to provide an understanding about what the speaker implied at time $t$.
Returning to the task of emergent communication, we now provide an overview of our system.
3.1 Model of the Listening Agent
The proposed modelling contributions which involve the LWM are contained within the listening agent. We describe this agent’s components in terms of the WM categories defined in [9]: a vision network (V), a memory network (M) and a controller (C).
VAE-Seq (V) The environment produces an observation $o_t$ at each step $t$. In our case $o_t$ is a 2D image from a grid-world game. From this observation, the listening agent observes a partial view, $\hat{o}_t$. The VAE-Seq component compresses the observation $\hat{o}_t$ into a latent code $z_t$. We use a simple 2D convolutional network for this component.
Latent Belief Network (M) The role of the Latent Belief network (LBF) is to compress the latent codes across the time axis, and to tie this representation to a message. Here a message is a sequence of discrete tokens. For each sequence of observations, when a message is sent at $t = i$, we take the pair $(z_i, m_i)$ and use it to infer all future observations.
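The pairing described above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `lwm_training_pairs`, the representation of latent codes as tuples of floats, and the integer message token are all assumptions made for the example rather than details of the actual implementation.

```python
from typing import List, Sequence, Tuple

Latent = Tuple[float, ...]


def lwm_training_pairs(
    latents: Sequence[Latent], message: int, sent_at: int
) -> List[Tuple[Latent, int, Latent]]:
    """Pair the latent code and message at step `sent_at` with every later code."""
    if not 0 <= sent_at < len(latents):
        raise ValueError("sent_at must index into the trajectory")
    context = latents[sent_at]
    # One prediction target per future time-point t' > sent_at.
    return [(context, message, target) for target in latents[sent_at + 1:]]


# Hypothetical four-step trajectory of 2-dimensional latent codes.
trajectory = [(0.0, 0.1), (0.2, 0.1), (0.4, 0.3), (0.9, 0.8)]
pairs = lwm_training_pairs(trajectory, message=1, sent_at=1)
```

Each resulting triple holds the context code, the message, and one future code to be predicted, so a message sent early in an episode contributes more targets than one sent late.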
Controller (C) The controller takes an action based on the input features $z_t$ and the current belief state. Following previous work [6], we model the agent policy using a feed-forward network, which we train via REINFORCE [14].
Listener Agent Summary The encoder $P(z \mid o)$ compresses the observation, and the model then grounds the message with future compressed observations. We further decompose the model by introducing a latent variable that acts as a persistent memory. The case where no message is received is defined separately; otherwise the model uses amortised VI [15] for inference. Lastly, the controller selects an action conditioned on this latent variable and the compressed observation.
3.2 Model of the Speaking Agent
The speaker agent takes in the global observation $o_t$ and produces a message $m_t$. In this work, the speaker is modelled using a convolutional network, a linear layer, and a softmax to produce a 1-hot representation, which becomes the message $m_t$.
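The final stage of the speaker, turning logits into a 1-hot message, might look like the following sketch. The convolutional and linear layers are omitted and their output is stood in for by a plain list of logits; selecting the token greedily (argmax) rather than by sampling is an assumption of this example, as are the function names.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot_message(logits: List[float]) -> List[int]:
    """Return a 1-hot vector marking the most probable token (greedy choice)."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return [1 if i == best else 0 for i in range(len(probs))]


# Hypothetical logits for a three-token vocabulary.
message = one_hot_message([0.1, 2.0, -1.0])
```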
Previous work [6] defines an effective communication strategy as one that achieves positive signalling and positive listening. We detail these metrics and introduce a loss function that explicitly or implicitly promotes each.
Figure 3: A depiction of how the listener agent grounds incoming messages, using variational approximations to the model’s distributions. See Sec 3.1 for a reminder of the listener agent’s derivation.
4.0.1 Listener
An agent exhibits positive listening if the distribution over its actions is correlated with the messages it receives, i.e., that messages influence the agent’s behaviour. A previously-proposed method for biasing agents towards positive listening is the causal influence of communication (CIC) loss [4, 6]. We describe a variant of this loss below:
While this loss has been shown empirically to be effective in improving communication, it is difficult to apply such intuition to natural language. Humans process vast collections of new information daily, but seldom use it to adjust their immediate behaviour. In contrast, we would like to push different messages to have separate groundings, and allow the agent to act on them accordingly. In theory, this might entail taking a particular action many time steps later, or even ignoring the message completely.
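For intuition about what a CIC-style measure captures, the sketch below estimates the mutual information between received messages and the listener's actions from sampled pairs. It assumes, following [6], that CIC is measured as a mutual information between message and action; the plug-in estimator, the base-2 logarithm, and the function name are simplifications chosen for this example and not the loss variant used in this work.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def mutual_information_bits(
    messages: Sequence[Hashable], actions: Sequence[Hashable]
) -> float:
    """Plug-in estimate of I(message; action) in bits from paired samples."""
    if len(messages) != len(actions) or not messages:
        raise ValueError("need equally many messages and actions, at least one")
    n = len(messages)
    joint = Counter(zip(messages, actions))
    count_m = Counter(messages)
    count_a = Counter(actions)
    mi = 0.0
    for (m, a), count in joint.items():
        p_joint = count / n
        p_indep = (count_m[m] / n) * (count_a[a] / n)
        mi += p_joint * math.log2(p_joint / p_indep)
    return mi


# Actions fully determined by the message vs. actions unrelated to it.
dependent = mutual_information_bits(
    [0, 0, 1, 1], ["left", "left", "right", "right"]
)
independent = mutual_information_bits(
    [0, 0, 1, 1], ["left", "right", "left", "right"]
)
```

With two equiprobable messages that fully determine a binary action the estimate is 1.0 bit, and it is 0.0 when actions are independent of the message.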
Proposition 1 Our formulation of the LWM yields the following optimisation problem.
Here, one term is the objective function for the acting policy-gradient agent [6]. It comprises the REINFORCE update for the action policy, with a learned value function V(o) used to reduce variance; the mean squared error loss of the value function; and an entropy term to encourage exploration. The terms are weighted by scalar hyper-parameters.
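The components of the policy-gradient objective listed above can be combined as in the following sketch. It is a generic REINFORCE-with-baseline loss written under stated assumptions: the function name, the argument layout, and the default weights `value_coef` and `entropy_coef` are hypothetical and are not the hyper-parameters used in the experiments.

```python
import math
from typing import Sequence


def policy_gradient_loss(
    log_probs: Sequence[float],
    returns: Sequence[float],
    values: Sequence[float],
    action_dists: Sequence[Sequence[float]],
    value_coef: float = 0.5,
    entropy_coef: float = 0.01,
) -> float:
    """REINFORCE term with a value baseline, plus value MSE, minus entropy bonus."""
    n = len(log_probs)
    # Policy term: log-probability of the taken action weighted by the advantage.
    policy_term = -sum(
        lp * (g - v) for lp, g, v in zip(log_probs, returns, values)
    ) / n
    # Value term: mean squared error between returns and predicted values.
    value_term = sum((g - v) ** 2 for g, v in zip(returns, values)) / n
    # Entropy of each action distribution, averaged over steps.
    entropy = sum(
        -sum(p * math.log(p) for p in dist if p > 0.0) for dist in action_dists
    ) / n
    return policy_term + value_coef * value_term - entropy_coef * entropy


# Single hypothetical step: uniform two-action policy, return 1, value estimate 0.
loss = policy_gradient_loss(
    log_probs=[math.log(0.5)],
    returns=[1.0],
    values=[0.0],
    action_dists=[[0.5, 0.5]],
)
```

In an automatic-differentiation framework the advantage would be treated as a constant with respect to the policy parameters; this plain-Python version only evaluates the scalar loss.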
4.0.2 Speaker
Positive signalling is defined as a positive correlation between the speaker’s observation and the corresponding message it sends, i.e., the speaker should produce similar messages when in similar situations. Various methods exist to measure positive signalling, such as speaker consistency, context independence, and instantaneous coordination [6]. However, there is an absence of methods that promote positive signalling. We propose Concept-Clustering (CC), in which the speaker is encouraged to maintain consistency between messages and visual information.
This is directly inspired by the Observation model from Table 1, re-interpreted for a speaker agent. In this loss, the message is the speaker’s own; a function of the observation outputs the speaker agent’s message policy logits, which enter a softmax with a temperature parameter; the decoder may be any architecture which reconstructs the observation conditioned on a message; and the loss is computed over a batch of sender observations. Intuitively, the CC loss encourages diversity across the set of possible messages, clustering the observations using a reconstruction loss.
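To make the clustering intuition concrete, the sketch below shows one possible reading of a message-conditioned reconstruction loss: the expected squared reconstruction error under a temperature-softmax distribution over messages, averaged over a batch. This is an illustrative assumption and not the CC loss as defined in this work; the decoder is reduced to a lookup table `decoded` with one reconstruction per message, and all names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float) -> List[float]:
    """Temperature-scaled, numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    shift = max(scaled)
    exps = [math.exp(x - shift) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def reconstruction_clustering_loss(
    batch_logits: Sequence[Sequence[float]],
    batch_obs: Sequence[Sequence[float]],
    decoded: Sequence[Sequence[float]],
    temperature: float = 1.0,
) -> float:
    """Expected squared error between each observation and the decoded messages."""
    total = 0.0
    for logits, obs in zip(batch_logits, batch_obs):
        probs = softmax(logits, temperature)
        for p, recon in zip(probs, decoded):
            total += p * sum((r - o) ** 2 for r, o in zip(recon, obs))
    return total / len(batch_obs)


# Two hypothetical observations and the reconstruction decoded from each message.
observations = [[1.0, 0.0], [0.0, 1.0]]
decoded = [[1.0, 0.0], [0.0, 1.0]]
consistent = reconstruction_clustering_loss(
    [[5.0, -5.0], [-5.0, 5.0]], observations, decoded
)
swapped = reconstruction_clustering_loss(
    [[-5.0, 5.0], [5.0, -5.0]], observations, decoded
)
```

Under this reading, a speaker whose messages line up with what the decoder reconstructs incurs a much lower loss than one whose messages are swapped, which is the consistency pressure described above.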
We evaluate our models on a set of navigation tasks in a 2-dimensional gridworld (Fig 2). The speaker receives an observation of the entire map, while the listener can only observe one pixel in any direction. Each map contains a forking path, one of which contains a flag. Which path contains the flag is randomly assigned, and the listening agent is given only enough time to fully explore one path, thus requiring an effective communication protocol to achieve greater than 50% success.
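The difference between the speaker's and listener's observations can be illustrated with a small helper that crops the listener's local window from the full map. The sketch assumes that "one pixel in any direction" means a 3x3 window centred on the agent, and the cell encoding (0 free, 1 wall, 2 flag), the map layout, and the function name are hypothetical.

```python
from typing import List

Grid = List[List[int]]
FREE, WALL, FLAG = 0, 1, 2


def partial_view(grid: Grid, row: int, col: int, radius: int = 1) -> Grid:
    """Return the square window centred on (row, col), padding with walls."""
    height, width = len(grid), len(grid[0])
    view = []
    for r in range(row - radius, row + radius + 1):
        line = []
        for c in range(col - radius, col + radius + 1):
            inside = 0 <= r < height and 0 <= c < width
            line.append(grid[r][c] if inside else WALL)
        view.append(line)
    return view


# Hypothetical map with two corridors; the flag sits at the end of the upper one.
world = [
    [1, 1, 1, 1, 1],
    [1, 0, 0, 2, 1],
    [1, 0, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
]
listener_view = partial_view(world, row=2, col=1)
```

From the fork at position (2, 1) the flag lies outside the listener's window, so only the speaker, who sees `world` in full, can tell which corridor to take.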
We repeated each experiment a minimum of three times, allowing each experiment to run for 200,000 episodes. We plot the 95% confidence interval for each experiment.
5.1 Evaluating Positive Signalling
Fig 4a shows that even in this simple game, without the positive signalling loss to help promote the speaker’s message consistency, the baseline models are unable to produce a useful communication protocol which solves the task. Similarly, Fig 4b shows the LWM and the baseline model have comparable performance. The increase of success above the non-communicative agent (~50% success) supports the claim that the agents learn effective communication, while we find the recurrent baseline model (LSTM) is unable to surpass the performance of a non-communicative agent. We believe this is due to the increase in parameters introduced to the learning problem. Fig 4c shows a dramatic drop in performance when the game complexity is increased.
Figure 4: (a) evaluates the use of CC loss to promote positive signalling on a classification game, (b) compares the performance of our LWM framework to existing methods, with and without our positive signalling loss, (c) compares the same models on a game with a longer horizon (i.e., the flag placed further away from the agent).
Figure 5: Value distribution of the linear weights during training, specifically those of the listener agent’s communication channel (first-layer units).
Figure 6: CIC measure during training of the LWM.
5.2 Evaluating Positive Listening
As shown previously [6], positive signalling does not imply positive listening. We provide two additional sources of evidence for positive listening in our LWM agent. Fig 5 supports the claim that our listener satisfies the positive listening criterion: we see a widening of the input communication weights when the reward increases above that of the non-communicative agent (~50% success), suggesting it has become more sensitive to the communication channel. Fig 6 provides further support for positive listening, as we directly plot the CIC metric. A CIC value of 1.0 implies the listener agent’s policy is extremely sensitive to communication.
5.3 Visualizing grounding
Fig 7 sheds light on the type of message-conditional representations which have been incorporated into the belief state, a visual depiction of the learned message semantics. We see a visual depiction of the partial trajectory associated with success in the task scenario, as conveyed by the message. In these figures the world is stationary, and only the location of the flag differs. Therefore, the learned semantics of each message are quite coarse-grained, but it is expected that additional permutations in world state will hone message semantics into the specific object of interest.
Figure 7: Visualisations of the LWM belief, conditioned on a discrete message token and a latent observation code, obtained by applying the decoder directly to the belief code. We enhanced saturation and exposure for clarity. An analysis of learned visualizations throughout various stages of optimization is provided in the appendix.
We believe that as the community begins to pursue EC with more complex tasks, such methods for grounding messages in a persistent memory will become increasingly more valuable.
There are two widely studied classes of learning algorithms in the field of emergent communication, distinguished by their training scheme: decentralised training [4, 7, 16, 17] or centralised training [7], both with decentralised acting. In our work we focus on a decentralised training regime. A recent focus has been on the language developed, such as investigating emergent compositional language [18, 19], emergent referential language [20], and large-scale multi-agent communicative systems [21].
A typical method for promoting the emergence of communication is to encode biases via auxiliary losses, such as the Causal Influence of Communication loss [4, 6]. Instead of adding an auxiliary loss to the policy, we continuously ground our language in observations. It has also been shown that grounding the messages is a way of combating language drift [22].
There is a wealth of work on agents using a persistent memory for tasks [12, 23, 24, 25]. The work of [12] is most relevant, as their memory unit also contains a belief over future and past states that is learned through “jumpy prediction”, which breaks the single-step transition modelling limitation of typical recurrent models and allows modelling of distant, temporally separated points. Other forms of multi-agent belief exist, such as public belief [26].
In the same way humans collect knowledge from a myriad of sources, from language and observation alike, only to compose facts and ultimately act upon them at a later time, we argue for the need to develop more persistent information in EC. We introduce the Language World Model as a way to accomplish this in a manner that is compatible with the strict constraints of EC work, and that is grounded in a domain which is amenable to available supervision. This allows us to develop effective auxiliary loss functions for guiding the emergence of effective communication.
This is not to discount the importance of current trends in EC, including EC in which communication more closely resembles the commander/follower paradigm we cited as a motivation for developing our method. Indeed, many simple EC games, including the ones presented here, may be solved equally well in such a paradigm. Here the distinction between “go left” and “the flag is down the left corridor” is not currently an important one. Yet to develop complex multi-agent strategy in games like Capture-the-Flag or Counter-Strike, it is likely that a combination of both directive and object-property communication will be necessary. We present this work as a first step towards these goals.
In future work, we aim to extend this method to more complex, multi-object scenarios, where we relax some of the assumptions necessary for success in the current work. For instance, we aim to extend this work to scenarios where the reward function is not as deeply connected with the target of message grounding, where objects are dynamic, and where the visual grounding is not a projection of the observation alone, but may require additional steps of reasoning.
We would like to thank Yuta Tsuboi, Sosuke Kobayashi, Prabhat Nagarajan, and the anonymous reviewers for helpful discussion and feedback on the draft of this work.
[1] A. Das, S. Kottur, J. M. Moura, S. Lee, and D. Batra, “Learning cooperative visual dialog agents with deep reinforcement learning,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2951–2960.
[2] A. Lazaridou, A. Peysakhovich, and M. Baroni, “Multi-agent cooperation and the emergence of (natural) language,” arXiv preprint arXiv:1612.07182, 2016.
[3] S. Kottur, J. M. Moura, S. Lee, and D. Batra, “Natural language does not emerge ‘naturally’ in multi-agent dialog,” arXiv preprint arXiv:1706.08502, 2017.
[4] N. Jaques, A. Lazaridou, E. Hughes, C. Gulcehre, P. Ortega, D. Strouse, J. Z. Leibo, and N. De Freitas, “Social influence as intrinsic motivation for multi-agent deep reinforcement learning,” in International Conference on Machine Learning, 2019, pp. 3040–3049.
[5] S. Havrylov and I. Titov, “Emergence of language with multi-agent games: Learning to communicate with sequences of symbols,” in Advances in neural information processing systems, 2017, pp. 2149–2159.
[6] R. Lowe, J. Foerster, Y.-L. Boureau, J. Pineau, and Y. Dauphin, “On the pitfalls of measuring emergent communication,” arXiv preprint arXiv:1903.05168, 2019.
[7] J. N. Foerster, Y. M. Assael, N. de Freitas, and S. Whiteson, “Learning to communicate with deep multi-agent reinforcement learning,” CoRR, vol. abs/1605.06676, 2016. [Online]. Available: http://arxiv.org/abs/1605.06676
[8] T. Eccles, Y. Bachrach, G. Lever, A. Lazaridou, and T. Graepel, “Biases for emergent communication in multi-agent reinforcement learning,” in Advances in Neural Information Processing Systems 32, 2019.
[9] D. Ha and J. Schmidhuber, “Recurrent world models facilitate policy evolution,” in Advances in Neural Information Processing Systems 31. Curran Associates, Inc., 2018, pp. 2451–2463, https://worldmodels.github.io. [Online]. Available: https://papers.nips.cc/paper/7512-recurrent-world-models-facilitate-policy-evolution
[10] S. McCloud and A. Manning, “Understanding comics: The invisible art,” IEEE Transactions on Professional Communications, vol. 41, no. 1, pp. 66–69, 1998.
[11] D. Ha and J. Schmidhuber, “World models,” arXiv preprint arXiv:1803.10122, 2018.
[12] K. Gregor, G. Papamakarios, F. Besse, L. Buesing, and T. Weber, “Temporal difference variational auto-encoder,” arXiv preprint arXiv:1806.03107, 2018.
[13] D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson, “Learning latent dynamics for planning from pixels,” arXiv preprint arXiv:1811.04551, 2018.
[14] R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine learning, vol. 8, no. 3-4, pp. 229–256, 1992.
[15] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
[16] A. Singh, T. Jain, and S. Sukhbaatar, “Learning when to communicate at scale in multiagent cooperative and competitive tasks,” arXiv preprint arXiv:1812.09755, 2018.
[17] I. Mordatch and P. Abbeel, “Emergence of grounded compositional language in multi-agent populations,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[18] B. Bogin, M. Geva, and J. Berant, “Emergence of communication in an interactive world with consistent speakers,” arXiv preprint arXiv:1809.00549, 2018.
[19] E. Choi, A. Lazaridou, and N. de Freitas, “Compositional obverter communication learning from raw visual input,” arXiv preprint arXiv:1804.02341, 2018.
[20] A. Lazaridou, K. M. Hermann, K. Tuyls, and S. Clark, “Emergence of linguistic communication from referential games with symbolic and pixel input,” CoRR, vol. abs/1804.03984, 2018. [Online]. Available: http://arxiv.org/abs/1804.03984
[21] A. Das, T. Gervet, J. Romoff, D. Batra, D. Parikh, M. Rabbat, and J. Pineau, “Tarmac: Targeted multi-agent communication,” CoRR, vol. abs/1810.11187, 2018. [Online]. Available: http://arxiv.org/abs/1810.11187
[22] J. Lee, K. Cho, and D. Kiela, “Countering language drift via grounding,” 2018.
[23] J. Weston, S. Chopra, and A. Bordes, “Memory networks,” arXiv preprint arXiv:1410.3916, 2014.
[24] A. Graves, G. Wayne, and I. Danihelka, “Neural turing machines,” arXiv preprint arXiv:1410.5401, 2014.
[25] A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. Grabska-Barwińska, S. G. Colmenarejo, E. Grefenstette, T. Ramalho, J. Agapiou et al., “Hybrid computing using a neural network with dynamic external memory,” Nature, vol. 538, no. 7626, p. 471, 2016.
[26] J. N. Foerster, F. Song, E. Hughes, N. Burch, I. Dunning, S. Whiteson, M. Botvinick, and M. Bowling, “Bayesian action decoder for deep multi-agent reinforcement learning,” arXiv preprint arXiv:1811.01458, 2018.
Figure 8: Visualisations of the LWM’s belief, conditioned on a fixed discrete message token and a latent observation code, obtained by applying the decoder directly to the belief code. We enhanced saturation and exposure for clarity. We track one agent’s belief state over training while fixing the message over which we visualise the belief, in order to see how the belief associated with that message token changes over time. In (a)-(c) the agent is just learning the visual representation of the game. In (d) the agent has not quite learned to link the message with the flag’s location. This link is first seen in (e), where the agent successfully learns that the message token is connected visually with the flag in the rightmost corridor.