Visual Reasoning with Multi-hop Feature Modulation

Recent breakthroughs in computer vision and natural language processing have spurred interest in challenging multi-modal tasks such as visual question-answering and visual dialogue. For such tasks, one successful approach is to condition image-based convolutional network computation on language via Feature-wise Linear Modulation (FiLM) layers, i.e., per-channel scaling and shifting. We propose to generate the parameters of FiLM layers going up the hierarchy of a convolutional network in a multi-hop fashion rather than all at once, as in prior work. By alternating between attending to the language input and generating FiLM layer parameters, this approach is better able to scale to settings with longer input sequences such as dialogue. We demonstrate that multi-hop FiLM generation achieves state-of-the-art performance on the short input sequence task ReferIt, on par with single-hop FiLM generation, while also significantly outperforming prior state-of-the-art and single-hop FiLM generation on the GuessWhat?! visual dialogue task.

Keywords: Deep Learning, Computer Vision, Natural Language Understanding, Multi-modal Learning

Computer vision has witnessed many impressive breakthroughs over the past decades in image classification [27,15], image segmentation [30], and object detection [12] by applying convolutional neural networks to large-scale, labeled datasets, often exceeding human performance. These systems give outputs such as class labels, segmentation masks, or bounding boxes, but it would be more natural for humans to interact with these systems through natural language. To this end, the research community has introduced various multi-modal tasks, such as image captioning [48], referring expressions [23], visual question-answering [1,34], visual reasoning [21], and visual dialogue [6,5].

These tasks require models to effectively integrate information from both vision and language. One common approach is to process both modalities independently with large unimodal networks before combining them through concatenation [34], element-wise product [25,31], or bilinear pooling [11]. Inspired by the success of attention in machine translation [3], several works have proposed to incorporate various forms of spatial attention to bias models towards focusing on question-specific image regions [48,47]. However, spatial attention sometimes only gives modest improvements over simple baselines for visual question answering [20] and can struggle on questions involving multi-step reasoning [21].

Fig. 1: The ReferIt task identifies a selected object (in the bounding box) using a single expression, while in GuessWhat?!, a speaker localizes the object with a series of yes or no questions.

More recently, [44,38] introduced Feature-wise Linear Modulation (FiLM) layers as a promising approach for vision-and-language tasks. These layers apply a per-channel scaling and shifting to a convolutional network's visual features, conditioned on an external input such as language, e.g., captions, questions, or full dialogues. Such feature-wise affine transformations allow models to dynamically highlight the key visual features for the task at hand. The parameters of FiLM layers, which scale and shift features or feature maps, are determined by a separate network, the so-called FiLM generator, which predicts these parameters using the external conditioning input. Within various architectures, FiLM has outperformed prior state-of-the-art for visual question-answering [44,38], multi-modal translation [7], and language-guided image segmentation [40].

However, the best way to design the FiLM generator is still an open question. For visual question-answering and visual reasoning, prior work uses single-hop FiLM generators that predict all FiLM parameters at once [38,44]. That is, a Recurrent Neural Network (RNN) sequentially processes input language tokens and then outputs all FiLM parameters via a Multi-Layer Perceptron (MLP). In this paper, we argue that using a Multi-hop FiLM Generator is better suited for tasks involving longer input sequences and multi-step reasoning such as dialogue. Even for shorter input sequence tasks, single-hop FiLM generators can require a large RNN to achieve strong performance; on the CLEVR visual reasoning task [21], which only involves a small vocabulary and templated questions, the FiLM generator in [38] uses an RNN with 4096 hidden units that comprises almost 90% of the model's parameters. Models with Multi-hop FiLM Generators may thus be easier to scale to more difficult tasks in which human-generated language brings larger vocabularies and more ambiguity.

As an intuitive example, consider the dialogue in Fig. 1 through which one speaker localizes the second girl in the image, the one who does not “have a blue frisbee.” For this task, a single-hop model must determine upfront what steps of reasoning to carry out over the image and in what order; thus, it might decide in a single shot to highlight feature maps throughout the visual network detecting either non-blue colors or girls. In contrast, a multi-hop model may first determine the most immediate step of reasoning necessary (i.e., locate the girls), highlight the relevant visual features, and then determine the next immediate step of reasoning necessary (i.e., locate the blue frisbee), and so on. While it may be appropriate to reason in either way, the latter approach may scale better to longer language inputs and/or to ambiguous images where the full sequence of reasoning steps is hard to determine upfront. It can be further enhanced by intermediate feedback while processing the image.

In this paper, we therefore explore several approaches to generating FiLM parameters in multiple hops. These approaches introduce an intermediate context embedding that controls the language and visual processing, and they alternate between updating the context embedding via an attention mechanism over the language sequence (and optionally by incorporating image activations) and predicting the FiLM parameters. We evaluate Multi-hop FiLM generation on ReferIt [23] and GuessWhat?! [6], two vision-and-language tasks illustrated in Fig. 1. We show that Multi-hop FiLM models significantly outperform their single-hop counterparts and prior state-of-the-art for the longer input sequence, dialogue-based GuessWhat?! task while matching the state-of-the-art performance of other models on ReferIt. Our best GuessWhat?! model only updates the context embedding using the language input, while for ReferIt, incorporating visual feedback to update the context embedding improves performance. In summary, this paper makes the following contributions:

We introduce the Multi-hop FiLM architecture and demonstrate that our approach matches or significantly improves state-of-the-art on the GuessWhat?! Oracle task, GuessWhat?! Guesser task, and ReferIt Guesser task.

We show Multi-hop FiLM models outperform their single-hop counterparts on vision-and-language tasks involving complex visual reasoning.

We find that updating the context embedding of the Multi-hop FiLM Generator based on visual feedback may be helpful in some cases, such as for tasks which do not include object category labels like ReferIt.

In this section, we explain the prerequisites to understanding our model: RNNs, attention mechanisms, and FiLM. We subsequently use these building blocks to propose a Multi-hop FiLM model.

2.1 Recurrent Neural Networks

One common approach in natural language processing is to use a Recurrent Neural Network (RNN) to encode some linguistic input sequence $l$ into a fixed-size embedding. The input (such as a question or dialogue) consists of a sequence of words $\omega_{1:T}$ of length $T$, where each word $\omega_t$ is contained within a predefined vocabulary $V$. We embed each input token via a learned look-up table $e$ and obtain a dense word-embedding $e_{\omega_t} = e(\omega_t)$. The sequence of embeddings $\{e_{\omega_t}\}_{t=1}^{T}$ is then fed to an RNN, which produces a sequence of hidden states $\{s_t\}_{t=1}^{T}$ by repeatedly applying a transition function $f$: $s_{t+1} = f(s_t, e_{\omega_t})$. To better handle long-term dependencies in the input sequence, we use a Gated Recurrent Unit (GRU) [4] with layer normalization [2] as transition function. In this work, we use a bidirectional GRU, which consists of one forward GRU, producing hidden states $\overrightarrow{s_t}$ by running from $\omega_1$ to $\omega_T$, and a second backward GRU, producing states $\overleftarrow{s_t}$ by running from $\omega_T$ to $\omega_1$. We concatenate both unidirectional GRU states $s_t = [\overrightarrow{s_t}; \overleftarrow{s_t}]$ at each step $t$ to get a final GRU state, which we then use as the compressed embedding $e_l$ of the linguistic sequence $l$.
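As a rough illustration of the bidirectional encoding described above, the sketch below runs one recurrent pass left-to-right and one right-to-left, then concatenates the two states at every step. It is a simplification under stated assumptions: a plain tanh transition stands in for the layer-normalized GRU, and the vocabulary, embedding table, and weights are hypothetical toy values.

```python
import math


def rnn_step(state, x, W_s, W_x):
    """Plain tanh transition f(s_t, e_t); an assumed stand-in for the GRU."""
    return [
        math.tanh(
            sum(w * s for w, s in zip(row_s, state))
            + sum(w * v for w, v in zip(row_x, x))
        )
        for row_s, row_x in zip(W_s, W_x)
    ]


def run_rnn(embeddings, W_s, W_x):
    """Return the hidden state after each input embedding, starting from zeros."""
    state = [0.0] * len(W_s)
    states = []
    for embedding in embeddings:
        state = rnn_step(state, embedding, W_s, W_x)
        states.append(state)
    return states


def encode(tokens, table, forward_weights, backward_weights):
    """Concatenate forward and backward hidden states at every step t."""
    embeddings = [table[token] for token in tokens]
    forward = run_rnn(embeddings, *forward_weights)
    # Run the second RNN right-to-left, then restore left-to-right order.
    backward = run_rnn(embeddings[::-1], *backward_weights)[::-1]
    return [f + b for f, b in zip(forward, backward)]


# Hypothetical toy vocabulary: 2-d embeddings, 2 hidden units per direction.
table = {"is": [0.1, 0.3], "it": [-0.2, 0.5], "red": [0.7, -0.4]}
W_s = [[0.5, -0.1], [0.2, 0.3]]  # hidden-to-hidden weights
W_x = [[1.0, 0.0], [0.0, 1.0]]   # input-to-hidden weights
states = encode(["is", "it", "red"], table, (W_s, W_x), (W_s, W_x))
e_l = states[-1]  # concatenated state at the last step
```

Each entry of `states` has twice the per-direction hidden size, which is why a Bi-GRU with 512 units per direction yields 1024-dimensional states.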

2.2 Attention

The form of attention we consider was first introduced in the context of machine translation [3,33]. This mechanism takes a weighted average of the hidden states of an encoding RNN based on their relevance to a decoding RNN at various decoding time steps. Subsequent spatial attention mechanisms have extended the original mechanism to image captioning [48] and other vision-and-language tasks [47,24]. More formally, given an arbitrary linguistic embedding $e_l$ and image activations $F_{w,h,c}$, where $w$, $h$, $c$ are the width, height, and channel indices, respectively, of the image features $F$ at one layer, we obtain a final visual embedding $e_v$ as follows:

$$\xi_{w,h} = \mathrm{MLP}\big(g(F_{w,h,\cdot}, e_l)\big), \qquad \alpha_{w,h} = \frac{\exp(\xi_{w,h})}{\sum_{w',h'} \exp(\xi_{w',h'})}, \qquad e_v = \sum_{w,h} \alpha_{w,h} F_{w,h,\cdot}$$

where MLP is a multi-layer perceptron and $g(\cdot, \cdot)$ is an arbitrary fusion mechanism (concatenation, element-wise product, etc.). We will use Multi-modal Low-rank Bilinear (MLB) attention [24], which defines $g(\cdot, \cdot)$ as:

$$g(F_{w,h,\cdot}, e_l) = \tanh(U^{T} F_{w,h,\cdot}) \circ \tanh(V^{T} e_l)$$

where $\circ$ denotes an element-wise product and where $U$ and $V$ are trainable weight matrices. We choose MLB attention because it is parameter efficient and has shown strong empirical performance [24,22].
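The spatial attention with MLB fusion described above can be sketched in a few lines. This is only an illustrative sketch: the tanh nonlinearities follow the usual MLB formulation and are an assumption here, the scoring MLP is reduced to a single hypothetical linear layer `w_out`, and spatial positions are flattened into one list.

```python
import math


def matvec_t(M, x):
    """Compute M^T x for a matrix M stored as a list of len(x) rows."""
    return [sum(M[i][j] * x[i] for i in range(len(x))) for j in range(len(M[0]))]


def mlb_fusion(f, e_l, U, V):
    """Assumed MLB form: tanh(U^T f) multiplied element-wise by tanh(V^T e_l)."""
    a = matvec_t(U, f)
    b = matvec_t(V, e_l)
    return [math.tanh(x) * math.tanh(y) for x, y in zip(a, b)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(F, e_l, U, V, w_out):
    """F: list of spatial positions, each a C-dim vector. Returns (alpha, e_v)."""
    scores = [
        sum(w * g for w, g in zip(w_out, mlb_fusion(f, e_l, U, V))) for f in F
    ]
    alpha = softmax(scores)
    channels = len(F[0])
    e_v = [sum(a * f[c] for a, f in zip(alpha, F)) for c in range(channels)]
    return alpha, e_v


# Toy inputs: 4 spatial positions with C = 2 channels, a 3-d language embedding.
F = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
e_l = [0.5, -0.5, 1.0]
U = [[1.0, 0.0], [0.0, 1.0]]              # C x D projection
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # L x D projection
alpha, e_v = attend(F, e_l, U, V, w_out=[1.0, 1.0])
uniform_alpha, mean_e_v = attend(F, e_l, U, V, w_out=[0.0, 0.0])
```

With a zero scoring vector every position receives the same weight, so the attended embedding reduces to plain mean pooling, the pooling used by the simplest baseline.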

2.3 Feature-wise Linear Modulation

Feature-wise Linear Modulation was introduced in the context of image stylization [8] and extended and shown to be highly effective for multi-modal tasks such as visual question-answering [44,38,7].

Fig. 2: The Multi-hop FiLM architecture, illustrating inputs (green), layers (blue), and activations (purple). In contrast, Single-hop FiLM models predict FiLM parameters directly from $e_{l,T}$.

A Feature-wise Linear Modulation (FiLM) layer applies a per-channel scaling and shifting to the convolutional feature maps. Such layers are parameter efficient (only two scalars per feature map) while still retaining high capacity, as they are able to scale up or down, zero-out, or negate whole feature maps. In vision-and-language tasks, another network, the so-called FiLM generator $h$, predicts these modulating parameters from the linguistic input $e_l$. More formally, a FiLM layer computes a modulated feature map $\hat{F}_{w,h,c}$ as follows:

$$\hat{F}_{w,h,c} = \gamma_c F_{w,h,c} + \beta_c$$

where $\gamma$ and $\beta$ are the scaling and shifting parameters which modulate the activations of the original feature map $F_{\cdot,\cdot,c}$. We will use the superscript $k \in [1; K]$ to refer to the $k$-th FiLM layer in the network.
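To make the per-channel scaling and shifting concrete, here is a minimal sketch of a FiLM layer on a toy feature map. The nested-list layout `[w][h][c]` and the numeric values are illustrative assumptions; in a real model the two parameter vectors would come from the FiLM generator.

```python
def film(features, gamma, beta):
    """Apply a per-channel scale (gamma) and shift (beta) to features[w][h][c]."""
    return [
        [
            [gamma[c] * value + beta[c] for c, value in enumerate(pixel)]
            for pixel in column
        ]
        for column in features
    ]


# Toy 2x2 feature map with 3 channels (arbitrary values).
F = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    [[7.0, 8.0, 9.0], [1.0, 1.0, 1.0]],
]
gamma = [2.0, 0.0, -1.0]  # scale channel 0 up, zero out channel 1, negate channel 2
beta = [0.5, 0.0, 0.0]
F_hat = film(F, gamma, beta)
```

Only two scalars per channel are needed regardless of the spatial size, which is what makes the layer parameter efficient.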

FiLM layers may be inserted throughout the hierarchy of a convolutional network, either pre-trained and fixed [6] or trained from scratch [38]. Prior FiLM-based models [44,38,7] have used a single-hop FiLM generator to predict the FiLM parameters in all layers, e.g., an MLP which takes the language embedding $e_l$ as input.

In this section, we introduce the Multi-hop FiLM architecture (shown in Fig. 2) to predict the parameters of FiLM layers in an iterative fashion, to better scale to longer input sequences such as in dialogue. Another motivation was to better disentangle the linguistic reasoning from the visual one by iteratively attending to both pipelines.

We introduce a context vector $c^k$ that acts as a controller for the linguistic and visual pipelines. We initialize the context vector with the final state of a bidirectional RNN $s_T$ and repeat the following procedure for each of the FiLM layers in sequence (from lowest to highest convolutional layer): first, the context vector is updated by performing attention over RNN states (extracting relevant language information), and second, the context is used to predict a layer's FiLM parameters (dynamically modulating the visual information). Thus, the context vector enables the model to perform multi-hop reasoning over the linguistic pipeline while iteratively modulating the image features. More formally, the context vector is computed as follows:

$$c^0 = s_T, \qquad c^k = \sum_{t} \kappa_t^k(c^{k-1}, s_t)\, s_t$$

$$\kappa_t^k(c^{k-1}, s_t) = \frac{\exp(\chi_t^k)}{\sum_{t'} \exp(\chi_{t'}^k)}, \qquad \chi_t^k(c^{k-1}, s_t) = \mathrm{MLP}_{Attn}\big(g'(c^{k-1}, s_t)\big)$$

where the dependence of $\chi_t^k$ on $(c^{k-1}, s_t)$ may be omitted to simplify notation. $\mathrm{MLP}_{Attn}$ is a network (shared across layers) which aids in producing attention weights. $g'$ can be any fusion mechanism that facilitates selecting the relevant context to attend to; here we use a simple dot-product following [33], so $g'(c^{k-1}, s_t) = c^{k-1} \circ s_t$. Finally, FiLM is carried out using a layer-dependent neural network $\mathrm{MLP}^k_{FiLM}$:

$$\gamma^k, \beta^k = \mathrm{MLP}^k_{FiLM}(c^k)$$

As a regularization, we append a layer normalization [2] on top of the context vector after each attention step.
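The alternation between attending to the RNN states and predicting FiLM parameters can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the shared attention network is reduced to a single hypothetical scoring vector `w_attn`, each layer's FiLM network is a single linear projection, and all weights are toy values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(W, b, x):
    """Affine map; W is a list of output rows."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def hop(context, states, w_attn):
    """One hop: score each state by the element-wise product with the context."""
    scores = [
        sum(w * c * s for w, c, s in zip(w_attn, context, state)) for state in states
    ]
    kappa = softmax(scores)
    dim = len(context)
    new_context = [
        sum(k * state[d] for k, state in zip(kappa, states)) for d in range(dim)
    ]
    return layer_norm(new_context), kappa


def multi_hop_film(states, w_attn, film_layers):
    """Return one (gamma, beta) pair per FiLM layer, lowest layer first."""
    context = states[-1]  # the context starts from the last RNN state
    params = []
    for W, b in film_layers:
        context, _ = hop(context, states, w_attn)  # update context from language
        out = linear(W, b, context)                # predict this layer's parameters
        half = len(out) // 2
        params.append((out[:half], out[half:]))
    return params


states = [[0.2, -0.1], [0.9, 0.4], [-0.3, 0.8]]  # T = 3 RNN states of size 2
w_attn = [1.0, 1.0]
film_layers = [
    ([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 0.0]], [1.0, 1.0, 0.0, 0.0]),
    ([[0.0, 1.0], [1.0, 0.0], [0.0, 0.0], [0.5, -0.5]], [1.0, 1.0, 0.0, 0.0]),
]
params = multi_hop_film(states, w_attn, film_layers)
```

A single-hop generator corresponds to skipping the `hop` call, so that every layer's parameters are predicted from the same initial context.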

External information. Some tasks provide additional information which may be used to further improve the visual modulation. For instance, GuessWhat?! provides spatial features of the ground truth object to models which must answer questions about that object. Our model incorporates such features by concatenating them to the context vector before generating FiLM parameters.

Visual feedback. Inspired by the co-attention mechanism [31,54], we also explore incorporating visual feedback into the Multi-hop FiLM architecture. To do so, we first extract the image or crop features $F^k$ (immediately before modulation) and apply a global mean-pooling over spatial dimensions. We then concatenate this visual state into the context vector $c^k$ before generating the next set of FiLM parameters.
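The visual feedback step amounts to pooling the feature map over its spatial dimensions and appending the result to the context vector. The sketch below shows this with toy values; the `[w][h][c]` layout and the function names are assumptions made for illustration.

```python
def global_mean_pool(features):
    """Average features[w][h][c] over all spatial positions, giving C values."""
    pixels = [pixel for column in features for pixel in column]
    channels = len(pixels[0])
    return [sum(p[c] for p in pixels) / len(pixels) for c in range(channels)]


def context_with_feedback(context, features):
    """Concatenate the pooled visual state onto the context vector."""
    return context + global_mean_pool(features)


# Toy 2x2 feature map with 2 channels, taken before modulation.
F_k = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]
c_k = [0.1, -0.2]
augmented = context_with_feedback(c_k, F_k)
```

The augmented vector is longer than the context alone, so the projection that predicts the FiLM parameters must accept the extra channels.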

In this section, we first introduce the ReferIt and GuessWhat?! datasets and respective tasks and then describe our overall Multi-hop FiLM architecture.



Fig. 3: Overall model, consisting of a visual pipeline (red and yellow) and linguistic pipeline (blue) and incorporating additional contextual information (green).

4.1 Dataset

ReferIt [23,51] is a cooperative two-player game. The first player (the Oracle) selects an object in a rich visual scene, for which they must generate an expression that refers to it (e.g., “the person eating ice cream”). Based on this expression, the second player (the Guesser) must then select an object within the image. Four ReferIt datasets exist: RefClef, RefCOCO, RefCOCO+ and RefCOCOg. The first dataset contains 130K references over 20K images from the ImageClef dataset [35], while the three other datasets respectively contain 142K, 142K and 86K references over 20K, 20K and 27K images from the MSCOCO dataset [29]. Each dataset has small differences. RefCOCO and RefClef were constructed using different image sets. RefCOCO+ forbids certain words to prevent object references from being too simplistic, and RefCOCOg only relies on images containing 2-4 objects from the same category. RefCOCOg also contains longer and more complex sentences than RefCOCO (8.4 vs. 3.5 average words). Here, we will show results on both the Guesser and Oracle tasks.

GuessWhat?! [6] is a cooperative three-agent game in which players see the picture of a rich visual scene with several objects. One player (the Oracle) is randomly assigned an object in the scene. The second player (Questioner) aims to ask a series of yes-no questions to the Oracle to collect enough evidence to allow the third player (Guesser) to correctly locate the object in the image. The GuessWhat?! dataset is composed of 131K successful natural language dialogues containing 650K question-answer pairs on over 63K images from MSCOCO [29]. Dialogues contain 5.2 question-answer pairs and 34.4 words on average. Here, we will focus on the Guesser and Oracle tasks.

4.2 Task Descriptions

Game Features. Both games consist of triplets $(I, l, o)$, where $I \in \mathbb{R}^{3 \times M \times N}$ is an RGB image and $l$ is some language input (i.e., a series of words) describing an object $o$ in $I$. The object $o$ is defined by an object category, a pixel-wise segmentation, an RGB crop of $I$ based on bounding box information, and handcrafted spatial information $x_{spatial}$, where

$$x_{spatial} = [x_{min}, y_{min}, x_{max}, y_{max}, x_{center}, y_{center}, w_{box}, h_{box}]$$

We replace words with two or fewer occurrences with an <unk> token.
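The rare-word rule can be illustrated with a short preprocessing sketch. It assumes the language inputs are already tokenized into lists of words; the function name and the toy sentences are hypothetical.

```python
from collections import Counter


def replace_rare_words(sentences, max_rare_count=2, unk="<unk>"):
    """Replace every word occurring max_rare_count times or fewer with unk."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return [
        [word if counts[word] > max_rare_count else unk for word in sentence]
        for sentence in sentences
    ]


# Toy corpus: "is" and "it" occur three times, "dog" twice, the rest once.
sentences = [
    ["is", "it", "a", "dog"],
    ["is", "it", "red"],
    ["is", "it", "the", "dog"],
]
cleaned = replace_rare_words(sentences)
```

Counts are taken over the whole corpus, so a word that appears twice in total is replaced everywhere it occurs.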

The Oracle task. Given an image $I$, an object $o$, a question $q$, and a sequence of $\delta$ previous question-answer pairs $(q, a)_{1:\delta}$ where $a \in \{\text{Yes}, \text{No}, \text{N/A}\}$, the oracle's task is to produce an answer $a$ that correctly answers the question $q$.

The Guesser task. Given an image $I$, a list of objects $O = o_{1:\Phi}$, a target object $o^* \in O$ and the dialogue $D$, the guesser needs to output a probability $\sigma_\phi$ that each object $o_\phi$ is the target object $o^*$. Following [17], the Guesser is evaluated by selecting the object with the highest probability of being correct. Note that even if the individual probabilities $\sigma_\phi$ are between 0 and 1, their sum can be greater than 1. More formally, the Guesser loss and error are computed as follows:


where $\mathbb{1}$ is the indicator function and $\Phi_n$ the number of objects in the $n$-th game.
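One reading of the Guesser objective that is consistent with the description above (independent per-object probabilities, selection by highest probability) is sketched below. The exact loss is an assumption: the sketch uses a binary cross-entropy averaged over the objects of each game and then over games, and the toy probabilities are hypothetical.

```python
import math


def guesser_loss_and_error(games):
    """games: list of (probs, target_index), with probs scored independently."""
    total_loss = 0.0
    errors = 0
    for probs, target in games:
        game_loss = 0.0
        for index, p in enumerate(probs):
            label = 1.0 if index == target else 0.0
            # Binary cross-entropy for this object (assumed form of the loss).
            game_loss -= label * math.log(p) + (1.0 - label) * math.log(1.0 - p)
        total_loss += game_loss / len(probs)  # average over objects in the game
        predicted = max(range(len(probs)), key=lambda i: probs[i])
        errors += int(predicted != target)    # indicator of a wrong selection
    return total_loss / len(games), errors / len(games)


# Two toy games; the probabilities in the first game sum to 1.8, not 1.
games = [([0.9, 0.2, 0.7], 0), ([0.6, 0.8], 0)]
loss, error = guesser_loss_and_error(games)
```

Because each object is scored on its own, no normalization across objects is applied, which is why the probabilities within a game need not sum to 1.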

4.3 Model

We use similar models for both ReferIt and GuessWhat?! and provide their architectural details in this subsection.

Object embedding The object category is fed into a dense look-up table $e_{cat}$, and the spatial information is scaled to [-1;1] before being up-sampled via nonlinear projection to $e_{spat}$. We do not use the object category in ReferIt models.
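The scaling of the spatial information can be sketched as follows. The eight-value layout (corners, center, width, height) is an assumption borrowed from the GuessWhat?! formulation [6], and the pixel-coordinate inputs and function name are hypothetical; the nonlinear projection to the final embedding is omitted.

```python
def spatial_features(box, image_width, image_height):
    """Scale a pixel box (x_min, y_min, x_max, y_max) so coordinates lie in [-1, 1]."""
    x_min, y_min, x_max, y_max = box

    def scale_x(x):
        return 2.0 * x / image_width - 1.0

    def scale_y(y):
        return 2.0 * y / image_height - 1.0

    x0, y0, x1, y1 = scale_x(x_min), scale_y(y_min), scale_x(x_max), scale_y(y_max)
    # Center and size are derived from the scaled corners; sizes lie in [0, 2].
    return [x0, y0, x1, y1, (x0 + x1) / 2.0, (y0 + y1) / 2.0, x1 - x0, y1 - y0]


# A box covering the top-left quarter of a hypothetical 200x100 image.
x_spatial = spatial_features((0, 0, 100, 50), image_width=200, image_height=100)
```

Placing the origin at the image center keeps the features comparable across images of different sizes.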

Visual Pipeline We first resize the image and object crop to 448×448 before extracting 14×14×1024 dimensional features from a ResNet-152 [15] (block3) pre-trained on ImageNet [41]. Following [38], we feed these features to a 3×3 convolution layer with Batch Normalization [19] and Rectified Linear Unit [37] (ReLU). We then stack four modulated residual blocks (shown in Fig. 2), each producing a set of feature maps $F^k$ via (in order) a 1×1 convolutional layer (128 units), ReLU activations, a 3×3 convolutional layer (128 units), and an untrainable Batch Normalization layer. The residual block then modulates $F^k$ with a FiLM layer to get $\hat{F}^k$, before again applying ReLU activations. Lastly, a residual connection sums the activations of both ReLU outputs. After the last residual block, we use a 1×1 convolution layer (512 units) with Batch Normalization and ReLU followed by MLB attention [24] (256 units and 1 glimpse) to obtain the final embedding $e_v$. Note our model uses two independent visual pipeline modules: one to extract modulated image features $e_v^{img}$, one to extract modulated crop features $e_v^{crop}$.

To incorporate spatial information, we concatenate two coordinate feature maps indicating relative x and y spatial position (scaled to [−1,1]) with the image features before each convolution layer (except for convolutional layers followed by FiLM layers). In addition, the pixel-wise segmentations $S \in \{0, 1\}^{M \times N}$ are rescaled to 14×14 floating point masks before being concatenated to the feature maps.
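The coordinate feature maps mentioned above can be built as in the sketch below, which appends a relative x and y position to every spatial location. The `[w][h][c]` layout and function names are illustrative assumptions.

```python
def coordinate_maps(width, height):
    """Return x_map and y_map, each indexed [w][h], with values in [-1, 1]."""

    def scale(index, size):
        return 0.0 if size == 1 else 2.0 * index / (size - 1) - 1.0

    x_map = [[scale(w, width) for _ in range(height)] for w in range(width)]
    y_map = [[scale(h, height) for h in range(height)] for _ in range(width)]
    return x_map, y_map


def append_coordinates(features):
    """Concatenate the two coordinate channels onto features[w][h][c]."""
    width, height = len(features), len(features[0])
    x_map, y_map = coordinate_maps(width, height)
    return [
        [features[w][h] + [x_map[w][h], y_map[w][h]] for h in range(height)]
        for w in range(width)
    ]


x_map, y_map = coordinate_maps(14, 14)  # matches the 14x14 feature grid
toy = [[[0.5], [0.5]], [[0.5], [0.5]]]  # 2x2 map with a single channel
with_coords = append_coordinates(toy)
```

Each location gains two channels, so a convolution that consumes these maps sees two more input channels than the raw features provide.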

Linguistic Pipeline We compute the language embedding by using a word-embedding look-up (200 dimensions) with dropout followed by a Bi-GRU (512×2 units) with Layer Normalization [2]. As described in Section 3, we initialize the context vector with the last RNN state $c^0 = s_T$. We then attend to the other Bi-GRU states via an attention mechanism with a linear projection and ReLU activations and regularize the new context vector with Layer Normalization.

FiLM parameter generation We concatenate spatial information $e_{spat}$ and object category information $e_{cat}$ to the context vector. In some experiments, we also concatenate a fourth embedding consisting of intermediate visual features $F^k$ after mean-pooling. Finally, we use a linear projection to map the embedding to FiLM parameters.

Final Layers We first generate our final embedding by concatenating the output of the visual pipelines $e_{final} = [e_v^{img}; e_v^{crop}]$ before applying a linear projection (512 units) with ReLU and a softmax layer.

Training Process We train our model end-to-end with Adam [26] (learning rate 3e−4), dropout (0.5), weight decay (5e−6) for convolutional network layers, and a batch size of 64. We report results after early stopping on the validation set with a maximum of 15 epochs.

4.4 Baselines

In our experiments, we re-implement several baseline models to benchmark the performance of our models. The standard Baseline NN simply concatenates the image and object crop features after mean pooling, the linguistic embedding, the spatial embedding, and the category embedding (GuessWhat?! only), passing those features to the same final layers described in our proposed model. We refer to a model which uses the MLB attention mechanism to pool the visual features as Baseline NN+MLB. We also implement a Single-hop FiLM mechanism which is equivalent to setting all context vectors equal to the last state of the Bi-GRU $e_{l,T}$. Finally, we experiment with injecting intermediate visual features into the FiLM Generator input, and we refer to the model as Multi-hop FiLM (+img).

Table 1: ReferIt Guesser Error.


4.5 Results

ReferIt Guesser We report the best test error of the outlined methods on the ReferIt Guesser task in Tab. 1. Note that RefCOCO and RefCOCO+ split test sets into TestA and TestB, which only include expressions referring to people and objects, respectively. We do not report [50] and [52] scores on RefCOCOg as the authors use a different split (umd). Our initial baseline achieves 77.6%, 60.8%, 63.1%, 73.4% on the RefCOCO, RefCOCO+, RefCOCOg, RefClef datasets, respectively, performing comparably to state-of-the-art models. We observe a significant improvement using a FiLM-based architecture, jumping to 84.9%, 87.4%, 73.8%, 71.5%, respectively, outperforming most prior methods and achieving comparable performance with the concurrent MAttN [50] model. Interestingly, MAttN and Multi-hop FiLM are built in two different manners; while the former has three specialized reasoning blocks, our model uses a generic feature modulation approach. These architectural differences surface when examining test splits: MAttN achieves excellent results on referring expressions towards objects while Multi-hop FiLM performs better on referring expressions towards people.

GuessWhat?! Oracle We report the best test error of several variants of GuessWhat?! Oracle models in Tab. 2. First, we baseline any visual or language biases by predicting the Oracle’s target answer using only the image (46.7% error) or the question (41.1% error). As first reported in [6], we observe that the baseline methods perform worse when integrating the image and crop inputs (21.1%) rather than solely using the object category and spatial location (20.6%). On the other hand, concatenating previous question-answer pairs to answer the current question is beneficial in our experiments. Finally, using Single-hop FiLM reduces the error to 17.6% and Multi-hop FiLM further to 16.9%, outperforming the previous best model by 2.4%.

Table 2: GuessWhat?! Oracle Error by Model and Input Type.

GuessWhat?! Guesser We provide the best test error of the outlined methods on the GuessWhat?! Guesser task in Tab. 3. As a baseline, we find that random object selection achieves an error rate of 82.9%. Our initial model baseline performs significantly worse (38.3%) than concurrent models (36.6%), highlighting that successfully jointly integrating crop and image features is far from trivial. However, Single-hop FiLM manages to lower the error to 35.6%. Finally, the Multi-hop FiLM architecture outperforms other models with a final error of 30.5%.

Single-hop FiLM vs. Multi-hop FiLM In the GuessWhat?! task, Multi-hop FiLM outperforms Single-hop FiLM by 6.1% on the Guesser task but only 0.7% on the Oracle task. We think that the small performance gain for the Oracle task is due to the nature of the task; to answer the current question, it is often not necessary to look at previous question-answer pairs, and in most cases this task does not require a long chain of reasoning. On the other hand, the Guesser task needs to gather information across the whole dialogue in order to correctly retrieve the object, and it is therefore more likely to benefit from multi-hop reasoning. The same trend can be observed for ReferIt. Single-hop FiLM and Multi-hop FiLM perform similarly on RefClef and RefCOCO, while we observe 1.3% and 2% gains on RefCOCO+ and RefCOCOg, respectively. This pattern of performance is intuitive, as the former datasets consist of shorter referring expressions (3.5 average words) than the latter (8.4 average words in RefCOCOg), and the latter datasets also consist of richer, more complex referring expressions due, e.g., to taboo words (RefCOCO+). In short, our experiments demonstrate that Multi-hop FiLM is better able to reason over complex linguistic sequences.

Reasoning mechanism We conduct several experiments to better understand our method. First, we assess whether Multi-hop FiLM performs better because of increased network capacity. We remove the attention mechanism over the linguistic sequence and update the context vector via a shared MLP. We observe that this change significantly hurts performance across all tasks, e.g., increasing the Multi-hop FiLM error of the Guesser from 30.5% to 37.3%. Second, we investigate how the model attends to GuessWhat?! dialogues for the Oracle and Guesser tasks, providing more insight into how the model reasons over the language input. We first look at the top activation in the (crop) attention layers to observe where the most prominent information is. Note that similar trends are observed for the image pipeline. As one would expect, the Oracle is focused on a specific word in the last question 99.5% of the time, one which is crucial to answer the question at hand. However, this ratio drops to 65% in the Guesser task, suggesting the model is reasoning in a different way. If we then extract the top 3 activations per layer, the attention points to <yes> or <no> tokens (respectively) at least once, 50% of the time for the Oracle and Guesser, showing that the attention is able to correctly split the dialogue into question-answer pairs. Finally, we plot the attention masks for each FiLM layer in Fig. 4 to give a better intuition of this reasoning process.

Crop vs. Image. We also evaluate the impact of using the image and/or crop on the final error for the Guesser task 3. Using the image alone (while still including object category and spatial information) performs worse than using the crop. However, using image and crop together inarguably gives the lowest errors, though prior work has not always used the crop due to architecture-specific GPU limitations [44].

Visual feedback We explore whether adding visual feedback to the context embedding improves performance. While it has little effect on the GuessWhat?! Oracle and Guesser tasks, it improves the accuracy on ReferIt by 1-2%. Note that ReferIt does not include class labels of the selected object, so the visual feedback might act as a surrogate for this information. To further investigate this hypothesis, we remove the object category from the GuessWhat?! task and report results in Tab. 5 in the supplementary material. In this setup, we indeed observe a relative improvement of 0.4% on the Oracle task, which supports this hypothesis.

Pointing Task In GuessWhat?!, the Guesser must select an object from among a list of items. A more natural task would be to have the Guesser directly point out the object as a human might. Thus, in the supplementary material, we introduce this task and provide initial baselines (Tab. 7) which include FiLM models. This task shows ample room for improvement with a best test error of 84.0%.


Fig. 4: Guesser (left) and Oracle (right) attention visualizations for the visual pipeline which processes the object crop.

The ReferIt game [23] has been a testbed for various vision-and-language tasks over the past years, including object retrieval [36,51,52,54,32,50], semantic image segmentation [16,39], and generating referring descriptions [51,32,52]. To tackle object retrieval, [36,51,50] extract additional visual features such as relative object locations and [52,32] use reinforcement learning to iteratively train the object retrieval and description generation models. Closer to our work, [17,54] use the full image and the object crop to locate the correct object. While some previous work relies on task-specific modules [51,50], our approach is general and can be easily extended to other vision-and-language tasks.

The GuessWhat?! game [6] can be seen as a dialogue version of the ReferIt game, one which additionally draws on visual question answering ability. [42,28,53] make headway on the dialogue generation task via reinforcement learning. However, these approaches are bottlenecked by the accuracy of Oracle and Guesser models, despite existing modeling advances [54,44]; accurate Oracle and Guesser models are crucial for providing a meaningful learning signal for dialogue generation models, so we believe the Multi-hop FiLM architecture will facilitate high quality dialogue generation as well.

A special case of Feature-wise Linear Modulation was first successfully applied to image style transfer [8], whose approach modulates image features according to some image style (e.g., cubism or impressionism). [44] extended this approach to vision-and-language tasks, injecting FiLM-like layers along the entire visual pipeline of a pre-trained ResNet. [38] demonstrates that a convolutional network with FiLM layers achieves strong performance on CLEVR [21], a task that focuses on answering reasoning-oriented, multi-step questions about synthetic images. Subsequent work has demonstrated that FiLM and variants thereof are effective for video object segmentation where the conditioning input is the first image's segmentation (instead of language) [49] and language-guided image segmentation [40]. Even more broadly, [9] overviews the strength of FiLM-related methods across machine learning domains, ranging from reinforcement learning to generative modeling to domain adaptation.

There are other notable models that decompose reasoning into different modules. For instance, Neural Turing Machines [13,14] divide a model into a controller with read and write units. Memory networks use an attention mechanism to answer a query by reasoning over a linguistic knowledge base [45,43] or image features [46]. A memory network updates a query vector by performing several attention hops over the memory before outputting a final answer from this query vector. Although Multi-hop FiLM computes a similar context vector, this intermediate embedding is used to predict FiLM parameters rather than the final answer. Thus, Multi-hop FiLM includes a second reasoning step over the image.

Closer to our work, [18] designed networks composed of Memory, Attention, and Control (MAC) cells to perform visual reasoning. Similar to Neural Turing Machines, each MAC cell is composed of a control unit that attends over the language input, a read unit that attends over the image and a write unit that fuses both pipelines. Though conceptually similar to Multi-hop FiLM models, Compositional Attention Networks differ structurally, for instance using a dynamic neural architecture and relying on spatial attention rather than FiLM.

In this paper, we introduce a new way to exploit Feature-wise Linear Modulation (FiLM) layers for vision-and-language tasks. Our approach generates the parameters of FiLM layers going up the visual pipeline by attending to the language input in multiple hops rather than all at once. We show Multi-hop FiLM Generator architectures are better able to handle longer sequences than their single-hop counterparts. We outperform state-of-the-art vision-and-language models significantly on the long input sequence GuessWhat?! tasks, while maintaining state-of-the-art performance for the shorter input sequence ReferIt task. Finally, this Multi-hop FiLM Generator approach uses few problem-specific priors, and thus we believe it can be extended to a variety of vision-and-language tasks, particularly those requiring complex visual reasoning.

Acknowledgements The authors would like to acknowledge the stimulating research environment of the SequeL Team. We also thank Vincent Dumoulin for helpful discussions. We acknowledge the following agencies for research funding and computing support: Project BabyRobot (H2020-ICT-24-2015, grant agreement no. 687831), CHISTERA IGLU and CPER Nord-Pas de Calais/FEDER DATA Advanced data science and technologies 2015-2020, NSERC, Calcul Québec, Compute Canada, the Canada Research Chairs, and CIFAR.

1. Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C., Parikh, D.: Vqa: Visual question answering. In: Proc. of ICCV (2015)

2. Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. Deep Learning Symposium (NIPS) (2016)

3. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. In: Proc. of ICLR (2015)

4. Chung, J., Gulcehre, C., Cho, K., Bengio, Y.: Empirical evaluation of gated recurrent neural networks on sequence modeling. In: Proc. of ICML (2015)

5. Das, A., Kottur, S., Gupta, K., Singh, A., Yadav, D., Moura, J.M., Parikh, D., Batra, D.: Visual dialog. In: Proc. of CVPR (2017)

6. De Vries, H., Strub, F., Chandar, S., Pietquin, O., Larochelle, H., Courville, A.: Guesswhat?! visual object discovery through multi-modal dialogue. In: Proc. of CVPR (2017)

7. Delbrouck, J.B., Dupont, S.: Modulating and attending the source image during encoding improves multimodal translation. Visually-Grounded Interaction and Language Workshop (NIPS) (2017)

8. Dumoulin, V., Shlens, J., Kudlur, M.: A Learned Representation For Artistic Style. In: Proc. of ICLR (2017)

9. Dumoulin, V., Perez, E., Schucher, N., Strub, F., Vries, H.d., Courville, A., Bengio, Y.: Feature-wise transformations. Distill (2018). https://doi.org/10.23915/distill.00011, https://distill.pub/2018/feature-wise-transformations

10. Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes (voc) challenge. International journal of computer vision 88(2), 303–338 (2010)

11. Fukui, A., Park, D.H., Yang, D., Rohrbach, A., Darrell, T., Rohrbach, M.: Multimodal compact bilinear pooling for visual question answering and visual grounding. In: Proc. of EMNLP (2016)

12. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proc. of CVPR (2014)

13. Graves, A., Wayne, G., Danihelka, I.: Neural turing machines. arXiv preprint arXiv:1410.5401 (2014)

14. Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I., Grabska-Barwińska, A., Colmenarejo, S.G., Grefenstette, E., Ramalho, T., Agapiou, J., et al.: Hybrid computing using a neural network with dynamic external memory. Nature 538(7626), 471 (2016)

15. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proc. of CVPR (2016)

16. Hu, R., Rohrbach, M., Darrell, T.: Segmentation from natural language expressions. In: Proc. of ECCV (2016)

17. Hu, R., Xu, H., Rohrbach, M., Feng, J., Saenko, K., Darrell, T.: Natural language object retrieval. In: Proc. of CVPR (2016)

18. Hudson, D.A., Manning, C.D.: Compositional attention networks for machine reasoning. In: Proc. of ICLR (2018)

19. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: Proc. of ICML (2015)

20. Jabri, A., Joulin, A., van der Maaten, L.: Revisiting visual question answering baselines. In: Proc. of ECCV (2016)

21. Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Zitnick, C.L., Girshick, R.: Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In: Proc. of CVPR (2017)

22. Kafle, K., Kanan, C.: Visual question answering: Datasets, algorithms, and future challenges. Computer Vision and Image Understanding 163, 3–20 (2017)

23. Kazemzadeh, S., Ordonez, V., Matten, M., Berg, T.: Referitgame: Referring to objects in photographs of natural scenes. In: Proc. of EMNLP (2014)

24. Kim, J.H., On, K.W., Lim, W., Kim, J., Ha, J.W., Zhang, B.T.: Hadamard Product for Low-rank Bilinear Pooling. In: Proc. of ICLR (2017)

25. Kim, J.H., Lee, S.W., Kwak, D., Heo, M.O., Kim, J., Ha, J.W., Zhang, B.T.: Multimodal residual learning for visual qa. In: Proc. of NIPS (2016)

26. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proc. of ICLR (2014)

27. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Proc. of NIPS (2012)

28. Lee, S.W., Heo, Y.J., Zhang, B.T.: Answerer in questioner’s mind for goal-oriented visual dialogue. Visually-Grounded Interaction and Language Workshop (NIPS) (2018)

29. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: Proc. of ECCV (2014)

30. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proc. of CVPR (2015)

31. Lu, J., Yang, J., Batra, D., Parikh, D.: Hierarchical question-image co-attention for visual question answering. In: Proc. of NIPS (2016)

32. Luo, R., Shakhnarovich, G.: Comprehension-guided referring expressions. In: Proc. of CVPR (2017)

33. Luong, M.T., Pham, H., Manning, C.D.: Effective approaches to attention-based neural machine translation. In: Proc. of EMNLP (2015)

34. Malinowski, M., Rohrbach, M., Fritz, M.: Ask your neurons: A neural-based approach to answering questions about images. In: Proc. of ICCV (2015)

35. Müller, H., Clough, P., Deselaers, T., Caputo, B.: ImageCLEF: Experimental Evaluation in Visual Information Retrieval. Springer (2012)

36. Nagaraja, V.K., Morariu, V.I., Davis, L.S.: Modeling context between objects for referring expression understanding. In: Proc. of ECCV (2016)

37. Nair, V., Hinton, G.E.: Rectified linear units improve restricted boltzmann machines. In: Proc. of ICML (2010)

38. Perez, E., Strub, F., De Vries, H., Dumoulin, V., Courville, A.: Film: Visual reasoning with a general conditioning layer. In: Proc. of AAAI (2018)

39. Rohrbach, A., Rohrbach, M., Hu, R., Darrell, T., Schiele, B.: Grounding of textual phrases in images by reconstruction. In: Proc. of ECCV (2016)

40. Rupprecht, C., Laina, I., Navab, N., Hager, G.D., Tombari, F.: Guide me: Interacting with deep networks. In: Proc. of CVPR (2018)

41. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115(3), 211–252 (2015)

42. Strub, F., De Vries, H., Mary, J., Piot, B., Courville, A., Pietquin, O.: End-to-end optimization of goal-driven and visually grounded dialogue systems. In: Proc. of IJCAI (2017)

43. Sukhbaatar, S., Weston, J., Fergus, R., et al.: End-to-end memory networks. In: Proc. of NIPS (2015)

44. de Vries, H., Strub, F., Mary, J., Larochelle, H., Pietquin, O., Courville, A.C.: Modulating early visual processing by language. In: Proc. of NIPS (2017)

45. Weston, J., Chopra, S., Bordes, A.: Memory networks. arXiv preprint arXiv:1410.3916 (2014)

46. Xiong, C., Merity, S., Socher, R.: Dynamic memory networks for visual and textual question answering. In: Proc. of ICML (2016)

47. Xu, H., Saenko, K.: Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In: Proc. of ECCV (2016)

48. Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., Bengio, Y.: Show, attend and tell: Neural image caption generation with visual attention. In: Proc. of ICML (2015)

49. Yang, L., Wang, Y., Xiong, X., Yang, J., Katsaggelos, A.K.: Efficient video object segmentation via network modulation. In: Proc. of CVPR (2018)

50. Yu, L., Lin, Z., Shen, X., Yang, J., Lu, X., Bansal, M., Berg, T.L.: Mattnet: Modular attention network for referring expression comprehension. In: Proc. of CVPR (2018)

51. Yu, L., Poirson, P., Yang, S., Berg, A.C., Berg, T.L.: Modeling context in referring expressions. In: Proc. of ECCV (2016)

52. Yu, L., Tan, H., Bansal, M., Berg, T.L.: A joint speaker-listener-reinforcer model for referring expressions. In: Proc. of CVPR (2016)

53. Zhu, Y., Zhang, S., Metaxas, D.: Reasoning about fine-grained attribute phrases using reference games. In: Visually-Grounded Interaction and Language Workshop (NIPS) (2017)

54. Zhuang, B., Wu, Q., Shen, C., Reid, I.D., van den Hengel, A.: Parallel attention: A unified framework for visual object discovery through dialogs and queries. In: Proc. of CVPR (2018)

ReferIt ImageClef

Table 4: ReferIt Guesser Test Error.



Table 5: GuessWhat?! Oracle Test Error without Object Category Label.



Table 6: GuessWhat?! Guesser Test Error without Object Category Label.


Table 7: Guesser pointing error for different IoU thresholds.


For existing tasks on the GuessWhat?! dataset, the Guesser selects its predicted target object from among a provided list of possible answers. A more natural task would be for the Guesser to directly point out the object, much as a human might. Thus, we introduce a pointing task as a new benchmark for GuessWhat?!. The task is to locate the intended object based on a series of questions and answers; however, instead of selecting the object from a list, the Guesser must output a bounding box around the object of its guess, making the task more challenging. This task also excludes important side information, namely object category and (x,y)-position [6], which makes object retrieval more difficult than in the originally introduced Guesser task. Specifically, given an input dialogue, the Guesser outputs a bounding box as the 4-tuple (x, y, width, height), where (x, y) is the coordinate of the top-left corner of the box within the original image I.

We assess bounding box accuracy using the Intersection over Union (IoU) metric: the area of the intersection of the predicted and ground-truth bounding boxes, divided by the area of their union. Prior work [10,12] generally considers an object found if the IoU exceeds 0.5.
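For concreteness, the IoU metric and the resulting pointing error can be sketched as follows. This is an illustrative sketch rather than the evaluation code used for our experiments: the function names `iou` and `pointing_error` are hypothetical, and boxes are assumed to be (x, y, width, height) tuples with (x, y) the top-left corner, as defined above.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the overlapping region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


def pointing_error(predicted, ground_truth, threshold=0.5):
    """Fraction of predictions whose IoU with the ground truth
    does not exceed the threshold (i.e., objects not found)."""
    misses = sum(
        1 for p, g in zip(predicted, ground_truth) if iou(p, g) <= threshold
    )
    return misses / len(predicted)


# Toy example (assumed boxes): one exact match, one partial overlap.
predicted = [(0, 0, 2, 2), (0, 0, 2, 2)]
ground_truth = [(0, 0, 2, 2), (1, 1, 2, 2)]
error_at_half = pointing_error(predicted, ground_truth, threshold=0.5)
```

Varying `threshold` gives the error at different IoU thresholds; in the toy example, the second pair overlaps with an IoU of 1/7 and therefore counts as a miss at the 0.5 threshold.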


We report model error in Table 7. Interestingly, the baseline obtains 92.0% error while Multi-hop FiLM obtains 84.0% error. As previously mentioned, reinjecting visual features into the Multi-hop FiLM Generator’s context cell is beneficial. The error rates are relatively high but still in line with those of similar pointing tasks such as SCRC [16,17] (around 90%) on ReferIt.


Fig. 5: The crop pipeline Oracle’s attention over the last question when the model succeeds.


Fig. 6: The crop pipeline Oracle’s attention over the last question, showing more advanced reasoning.


Fig. 7: The crop pipeline Oracle’s attention over the last question when the model fails.


Fig. 8: The crop pipeline Guesser’s attention when the model succeeds.


Fig. 9: The crop pipeline Guesser’s attention when the model fails.
