Recent years have witnessed rapid progress in Computer Vision (CV) and Natural Language Processing (NLP), thanks to the fast development of deep neural networks. Many applications have been successfully commercialised owing to the reliability and speed of their backbone algorithms. Visual Grounding (VG), a research direction lying across CV and NLP, has also attracted much attention recently [6, 16, 25, 41, 42]. Generally, VG requires a machine to respond to a natural language query by specifying the most relevant region in an image. In practice, this technology could be widely used in the human-computer interaction systems of new-generation intelligent devices, such as home robots and autonomous vehicles; it could also be embedded into applications on PCs and smartphones, such as virtual assistants. However, we have not yet seen any such application, mainly because neither the efficiency nor the accuracy of current methods is good enough.
After a careful review, we find that almost all previous approaches follow a common two-stage pipeline: i) use an object detector such as Faster RCNN [28] or YOLO [27] to generate a set of object proposals based on the input image; ii) compute a matching score between each proposal and the query, after which the proposal with the highest score is adopted as the model prediction. However, these two-stage solutions have two notable issues:
Low efficiency: Two-stage methods can be unbearably slow, since every object proposal, as well as the query, needs to be embedded into the same space (in stage-ii) in order to obtain a matching score. Stage-i normally generates tens or even hundreds of proposals, which makes these methods infeasible for real-time VG applications.
Low accuracy: The quality of the object proposals obtained in stage-i can severely affect the final performance. Since even state-of-the-art object detectors yield limited performance in practice, the generated proposals are often misaligned with the ground-truth objects, which may impede the learning procedure in stage-ii. More critically, the object detector may miss the target altogether in stage-i (which is quite common), in which case stage-ii is a waste of time since the VG task can never succeed.
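The two-stage pipeline and its two issues can be illustrated with a minimal sketch. It assumes that stage-i has already produced a list of (box, embedding) pairs and that matching uses cosine similarity; the function names, the similarity measure and the toy vectors are illustrative assumptions, not the scoring used by any specific prior method.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def two_stage_grounding(proposals, query_embedding):
    """Stage-ii: score every stage-i proposal against the query, keep the best.

    `proposals` is a list of (box, embedding) pairs from a detector (hypothetical format).
    Returns None when stage-i produced nothing, i.e. grounding cannot succeed.
    """
    best_box, best_score = None, float("-inf")
    for box, embedding in proposals:  # cost grows with the number of proposals
        score = cosine(embedding, query_embedding)
        if score > best_score:
            best_box, best_score = box, score
    return best_box

# Toy example: two proposals with 2-D embeddings.
proposals = [((0, 0, 10, 10), [1.0, 0.0]), ((5, 5, 20, 20), [0.0, 1.0])]
print(two_stage_grounding(proposals, [0.1, 0.9]))
```

The loop makes the efficiency issue visible (one matching step per proposal), and the `None` return corresponds to the case where the detector misses the target in stage-i.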
Figure 1. Two-stage vs. one-stage visual grounding pipeline. The two-stage framework consists of an object proposal module and a matching module. Our proposed one-stage model detects the object conditioned on the image-query attention map without proposing multiple regions, and it omits the redundant matching stage.
To alleviate the above issues and facilitate the application of VG in real-world scenarios, we propose a simple, fast and accurate one-stage VG method named YOLLO (You Only Look & Listen Once), which generates the bounding box of the target object directly from the raw image and query input. The key innovations of YOLLO are the Relation-to-Attention module, which learns to focus on the regions in the input image that best match the input query, and an RPN-like (Region Proposal Network [28]) Target Detection Network, which predicts the target bounding box directly without generating any irrelevant proposals. The separate matching module used in [6, 29, 45] can thus be eliminated. Figure 1 compares our one-stage pipeline with the commonly used two-stage pipeline.
To evaluate the effectiveness of our proposed model, we conduct experiments on three benchmark visual grounding datasets, namely RefCOCO [17], RefCOCO+ [17] and RefCOCOg [25]. Compared with previous state-of-the-art two-stage VG methods, our YOLLO model not only significantly improves the successful grounding rate, but also greatly accelerates inference on NVIDIA Titan XP GPUs (i.e., it achieves real-time inference).
As a research direction across vision and language, visual grounding has benefited from the development of convolutional neural networks (CNNs) [13, 19], recurrent neural networks (RNNs) [4, 11], and other research directions such as image captioning [35, 39], visual question answering (VQA) [36, 38], and object detection [23, 27, 28]. In the following, we first survey the related work in visual grounding, and then review region proposal models and the attention mechanism, which are also related to our proposed method.
Visual Grounding. Visual grounding, also known as Referring Expression Comprehension, aims to respond to a query by specifying the corresponding object in an image. Some methods [16, 25, 29] view VG as the inverse of image captioning: they feed each proposal into an image captioning model to calculate the probability of generating the input query as its caption. The proposal that yields the largest probability is then selected as the target object.
On the other hand, some methods [8, 42] construct a vector embedding for the query as well as for each proposal, and calculate a matching score for each query-object embedding pair. In [42], the authors introduce a Speaker-Listener-Reinforcer model, where both the listener and the speaker have the ability to ground the query in the image. Specifically, the listener tries to align the query embedding with the embedding of the target object, while the speaker tries to reconstruct the query based on the target object embedding. By jointly training the speaker, the listener and the reinforcer, their model achieves state-of-the-art results.
The attention mechanism later became popular in this topic. For example, some recent works [15, 40, 44] use a self-attention mechanism to decompose the query into sub-components and learn separate features for each of the resulting parts. Co-attention has also been used [6, 45] to extract features from the language queries and the image regions jointly by letting them attend to each other.
However, nearly all of the above VG methods follow a common two-stage pipeline, which is inherently a matching problem over a pre-defined set of object proposals. This means that the region proposal stage is an isolated step that does not consider the input query. In practice, these methods use either the ground-truth candidate bounding boxes or the region proposals detected by object detectors such as Faster RCNN [28], which largely restricts the visual grounding speed. In contrast, we treat visual grounding as a detection problem conditioned on the language query, where we directly predict the region most related to the query, without generating tens or even hundreds of proposals and then selecting one of them via matching.
Region Proposal and Detection. Most previous VG methods require a pre-trained object detector to generate a set of object proposals, which makes their performance very sensitive to the quality of those proposals. Some object detectors [5, 12, 23, 28] can generate relatively accurate proposals for objects in the image, but only for the object categories that have been seen in the training data. There are also many faster but less accurate object proposal models, such as RPN [28], MultiBox [7], BING [3], and Selective Search [33], which have to increase the number of proposals to improve the recall rate. Instead of using these pre-built stand-alone region proposal models for VG, we devise a network that predicts the object bounding box according to the language query.
Attention Mechanism. Recently, the attention mechanism has become a research hot-spot, owing to its simplicity and effectiveness in dealing with multi-modal information. In practice, it has already been applied to a wide range of CV and NLP tasks [1, 24, 32, 39, 43], and has significantly boosted their performance. In this paper, we adopt Scaled Dot-Product Attention [34] to capture the relationship between image and language.
The self-attention mechanism [21, 34, 43] learns to distribute attention over a piece of information under the guidance of that information itself. In practice, self-attention has proved effective in capturing the latent relations between different parts of the information. For example, in [14], an object relation module built upon self-attention is developed to model the spatial relationships between objects in an image. Self-attention can also be viewed as a specific instantiation of the non-local operation, which can be used to capture the space-time relations within a video [37]. In [31], self-attention is used for relational reasoning in the VQA task.
Formally, given an image $I$ containing $k$ objects $O = \{o_1, \dots, o_k\}$ and a query $Q$, visual grounding requires a model to map $Q$ and $I$ to a target object of interest in $O$. Many previous works [6, 15, 25, 42, 45] treat $O$ as a given list (either provided as ground truth or detected by a stand-alone object detector). In practice, however, $O$ is not given and needs to be proposed during the visual grounding process.
In this section, we present a simple, fast and accurate one-stage model (YOLLO), illustrated in Figure 2(a), which solves the visual grounding problem without the need for a candidate list. YOLLO consists of three steps: 1) a feature encoder extracts and embeds feature vectors for dense regions in the input image and for each word in the input query; 2) based on the extracted features, a stack of Relation-to-Attention (Rel2Att) modules captures dense relations among the regions in the image and the words in the query, and derives attention masks for both the image and the query from the relation maps; 3) a target detection network takes the attended image feature as input and generates the bounding box of the target object. We now describe the proposed YOLLO in detail.
3.1. The Feature Encoder
Given an image $I$ and a natural language query $Q$ as input, we extract a feature sequence $V = \{v_1, \dots, v_m\}$ from $I$ using the feature map $M$ obtained from a pre-trained CNN, where each feature vector $v_i$ corresponds to a dense grid region in $M$ as well as a small patch in the original image $I$. The sequence length $m$ denotes the number of regions.
For the words $w_1, \dots, w_n$ in the query $Q$, we first obtain the distributed representation $e_i$ for each word $w_i$ with a word embedding layer. Following [9] and [34], we also equip our model with a sense of order by embedding the absolute positions of the input words in the query $Q$, denoted as $p_i$, where $p_i$ has the same dimension as $e_i$. Then, the sum of $e_i$ and $p_i$, i.e., $t_i = e_i + p_i$, is adopted as the final feature vector for $w_i$. As a result, we obtain a feature sequence $T = \{t_1, \dots, t_n\}$ from $Q$, where $n$ is the number of words in the query.
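As a rough illustration of the query branch of the feature encoder, the sketch below sums a word embedding and an absolute-position embedding for every token. The table sizes, the random initialisation and the function name `build_query_features` are assumptions made for the example; in the paper the word embeddings are pre-trained and 512-dimensional.

```python
import random

def build_query_features(token_ids, word_table, pos_table):
    """Return one feature vector per word: word embedding + position embedding."""
    features = []
    for position, token_id in enumerate(token_ids):
        word_vec = word_table[token_id]  # distributed representation of the word
        pos_vec = pos_table[position]    # embedding of the absolute position
        features.append([w + p for w, p in zip(word_vec, pos_vec)])
    return features

# Toy tables (hypothetical sizes): 5 vocabulary entries, up to 4 positions, 3-D vectors.
rng = random.Random(0)
dim = 3
word_table = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(5)]
pos_table = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(4)]

T = build_query_features([2, 0, 4], word_table, pos_table)
print(len(T), len(T[0]))  # number of words and feature dimension
```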
3.2. The Relation-to-Attention (Rel2Att) Module
On top of the image feature sequence V and the query feature sequence T , we propose a Relation-to-Attention (Rel2Att) module which employs the Scaled Dot-product Attention mechanism [34] to construct a dense relation map among the elements in V and T and derive attention masks for them to pay more attention to the useful information in the image and the query.
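As illustrated in Figure 2(b), we use four two-layer Feed-Forward Networks (FFNs), two for $V$ and two for $T$, to obtain two sets of feature matrices: $V_q, V_k \in \mathbb{R}^{m \times d}$ from $V$ and $T_q, T_k \in \mathbb{R}^{n \times d}$ from $T$, where $d$ indicates the dimension of the feature vectors and the weights of the FFNs are learnable parameters. We then concatenate $V_q$ and $T_q$ along the first dimension to obtain a fused feature matrix $F_q = [V_q; T_q] \in \mathbb{R}^{k \times d}$, where $k = m + n$. Meanwhile, we obtain $F_k$ from $V_k$ and $T_k$ analogously. Then, we construct a dense relation map $R \in \mathbb{R}^{k \times k}$ for the elements in $V$ and $T$ by applying Scaled Dot-Product Attention [34] to $F_q$ and $F_k$.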
As shown in Figure 2(b), $R$ can be split into four parts, i.e., $R_{VV}$, $R_{TT}$, $R_{VT}$ and $R_{TV}$, where $R_{VV}$ and $R_{TT}$ are for self-attention, capturing the internal relations within the patches in the image and within the words in the query, respectively. $R_{VT}$ and $R_{TV}$ are built for co-attention, capturing the external relations between the image patches and the query words. Each of these attention types has individually been shown to be important for the success of the VG task in previous works [6, 40, 45]; here we use a compact representation to cover all of them.
Figure 2. (a) The YOLLO architecture. YOLLO is a one-stage method which generates the region proposal for the target object directly from the input image and query. Moreover, YOLLO stacks the Relation-to-Attention (Rel2Att) module multiple times to obtain a better feature representation. (b) The architecture of the Relation-to-Attention module.
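The relation map and its four blocks can be sketched in a few lines. The example assumes that the relation map is the plain scaled dot product between two fused matrices (whether a softmax follows is not recoverable from the text, so none is applied here), and all function and variable names are hypothetical.

```python
import math

def matmul_transposed(a, b):
    """Return a @ b^T for row-major nested lists."""
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in b] for row_a in a]

def relation_map(fused_q, fused_k):
    """Scaled dot products between every pair of rows (no softmax; an assumption)."""
    scale = 1.0 / math.sqrt(len(fused_q[0]))
    return [[scale * v for v in row] for row in matmul_transposed(fused_q, fused_k)]

def split_blocks(rel, m):
    """Split a k x k relation map into self- and co-attention blocks."""
    r_vv = [row[:m] for row in rel[:m]]  # image-to-image (self-attention)
    r_vt = [row[m:] for row in rel[:m]]  # image-to-query (co-attention)
    r_tv = [row[:m] for row in rel[m:]]  # query-to-image (co-attention)
    r_tt = [row[m:] for row in rel[m:]]  # query-to-query (self-attention)
    return r_vv, r_vt, r_tv, r_tt

# Toy input: m = 2 image regions, n = 1 word, d = 4.
v_q, t_q = [[1, 0, 0, 0], [0, 1, 0, 0]], [[1, 1, 0, 0]]
v_k, t_k = [[2, 0, 0, 0], [0, 2, 0, 0]], [[0, 0, 2, 0]]
R = relation_map(v_q + t_q, v_k + t_k)  # list "+" concatenates along the first dimension
r_vv, r_vt, r_tv, r_tt = split_blocks(R, m=2)
print(r_vv, r_tv)
```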
Afterwards, the relation map $R$ is utilised to derive the attention masks for the feature sequences $V$ and $T$ simultaneously: we simply average $R$ over its first and over its second dimension to obtain two $k$-dimensional vectors. These two vectors are then summed elementwise into $att$, where the first $m$ elements of $att$ are adopted as the attention mask for $V$ and the remaining $n$ elements as the attention mask for $T$, denoted as $att_V$ and $att_T$, respectively. We then weight the input feature sequences by elementwise multiplication with their corresponding attention masks, so that each feature vector is scaled by its mask entry, to obtain the updated feature sequences:

$$\hat{V} = att_V \odot V, \tag{4}$$
$$\hat{T} = att_T \odot T. \tag{5}$$
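The mask derivation described above can be sketched as follows: average the relation map along each axis, sum the two resulting vectors, split the sum into an image part and a query part, and rescale the feature vectors. Function names and the toy numbers are illustrative assumptions; any normalisation of the masks that the original model may apply is not shown.

```python
def attention_masks(rel, m):
    """Derive the image and query attention masks from a k x k relation map."""
    k = len(rel)
    row_mean = [sum(row) / k for row in rel]  # average over the second dimension
    col_mean = [sum(rel[i][j] for i in range(k)) / k for j in range(k)]  # over the first
    att = [r + c for r, c in zip(row_mean, col_mean)]  # elementwise sum
    return att[:m], att[m:]  # first m entries for the image, the rest for the query

def reweight(features, mask):
    """Scale each feature vector by its attention weight."""
    return [[weight * x for x in vec] for vec, weight in zip(features, mask)]

# Toy relation map with m = 1 image region and n = 1 word.
rel = [[2.0, 4.0], [6.0, 8.0]]
att_v, att_t = attention_masks(rel, m=1)
print(att_v, att_t)
print(reweight([[1.0, 2.0]], att_v))
```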
Typically, we can stack our Rel2Att module multiple times as in [34], where each module takes as input the attended feature sequences of the image and the query from the previous Rel2Att module, so as to obtain more effective and representative features for both the image and the query. Shortcut connections are built among the attention modules to facilitate information propagation. Note that in the last Rel2Att module we only compute the new image feature sequence with Equation (4), since the query feature sequence is not needed in the subsequent steps. Moreover, we reconstruct the updated feature map from the updated image feature sequence as the output of the last attention module.
During training, we utilise the ground-truth bounding box of the target object to guide the attention. We first scale the ground-truth box down to match the size of the updated feature map, obtaining the bounding box of the corresponding region of interest on the feature map. We then assign a non-zero value to the grid cells that lie inside this box and set the grid values outside it to zero, which gives us a ground-truth attention mask. Then, we transform this mask into a distribution with a softmax layer. The loss function for our Rel2Att modules, the attention loss $L_{att}$, is computed between this ground-truth distribution and the predicted image attention mask.
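The construction of the attention target can be sketched as below. The stride, the inside value of 1.0, the floor/ceil rounding of the scaled box and, in particular, the cross-entropy form of the loss are all assumptions for illustration; the source only states that a non-zero value is assigned inside the box and that the mask is passed through a softmax.

```python
import math

def scale_box(box, stride):
    """Map an (x1, y1, x2, y2) image box to integer grid coordinates."""
    x1, y1, x2, y2 = box
    return (int(x1 // stride), int(y1 // stride),
            int(math.ceil(x2 / stride)), int(math.ceil(y2 / stride)))

def target_distribution(box, stride, grid_h, grid_w, inside_value=1.0):
    """Fill cells inside the scaled box, zero elsewhere, then apply a softmax."""
    gx1, gy1, gx2, gy2 = scale_box(box, stride)
    mask = [inside_value if (gx1 <= x < gx2 and gy1 <= y < gy2) else 0.0
            for y in range(grid_h) for x in range(grid_w)]
    peak = max(mask)
    exps = [math.exp(v - peak) for v in mask]  # numerically stable softmax
    total = sum(exps)
    return [e / total for e in exps]

def attention_loss(predicted, target, eps=1e-12):
    """Cross-entropy between two distributions (an assumed form of the loss)."""
    return -sum(t * math.log(p + eps) for p, t in zip(predicted, target))

# Toy setting: 64 x 64 image, stride 16, hence a 4 x 4 grid.
dist = target_distribution((16, 16, 48, 48), stride=16, grid_h=4, grid_w=4)
uniform = [1.0 / 16] * 16
print(attention_loss(dist, dist) < attention_loss(uniform, dist))
```

Under this assumed loss, a predicted attention that matches the target distribution scores lower than a uniform one, which is the behaviour the guidance is meant to encourage.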
3.3. Target Detection Network
In this step, we feed the attended image feature map obtained from the stacked Rel2Att modules into a target detection network to generate the region proposal for the target object. We adopt an RPN-like architecture (Region Proposal Network [28]) for our target detection network.
Specifically, we first employ two convolution layers to map the attended feature map to a lower-dimensional space. The output is then fed into two sibling fully-connected layers, i.e., a binary classification layer and a bounding box regression layer. As in RPN, we define $K$ anchors with different scales and aspect ratios for each sliding window on the feature map. Our target detection network then predicts, for each anchor, a confidence score of being the target object and, simultaneously, a bounding box offset tuple that is used to refine the anchor into the corresponding region proposal. After that, we simply pick the top-1 scored region proposal as the final prediction.
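A minimal sketch of the anchor and top-1 selection logic follows. It assumes the offset parameterisation of Faster R-CNN [28] (centre shifts scaled by the anchor size, log-space width and height changes), which the text does not spell out; the scales, ratios and scores are toy values and the function names are hypothetical.

```python
import math

def make_anchors(center_x, center_y, scales, ratios):
    """Build len(scales) * len(ratios) anchors as (cx, cy, w, h); ratio = h / w."""
    anchors = []
    for scale in scales:
        for ratio in ratios:
            w = scale / math.sqrt(ratio)
            h = scale * math.sqrt(ratio)
            anchors.append((center_x, center_y, w, h))
    return anchors

def decode(anchor, offsets):
    """Refine an anchor with an offset tuple (tx, ty, tw, th)."""
    cx, cy, w, h = anchor
    tx, ty, tw, th = offsets
    return (cx + tx * w, cy + ty * h, w * math.exp(tw), h * math.exp(th))

def pick_top1(anchors, scores, offsets):
    """Return the refined box of the highest-scoring anchor."""
    best = max(range(len(anchors)), key=lambda i: scores[i])
    return decode(anchors[best], offsets[best])

anchors = make_anchors(8, 8, scales=[16, 32], ratios=[1.0])
scores = [0.2, 0.9]
offsets = [(0.0, 0.0, 0.0, 0.0), (0.25, 0.0, 0.0, 0.0)]
print(pick_top1(anchors, scores, offsets))
```

Because only the single best anchor is decoded, no separate matching stage over multiple proposals is needed.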
As illustrated in Figure 3, if the stacked Rel2Att modules have successfully attended to the feature vectors that are relevant to the target object, i.e., the regions corresponding to the target look much “brighter” than other parts of the feature map, then the classification branch of our target detection network can simply assign a larger confidence score to the anchor with larger grid values inside it to make a relatively good prediction.
To train our target detection network, we mark the anchors whose IoU (Intersection over Union) with the target object is higher than a positive threshold as positive samples (label 1), while an anchor is marked as a negative sample (label 0) if its IoU with the target object is lower than a negative threshold. During training, we sample $N$ anchors from the positive and negative anchors to form a mini-batch. In our experiments, we heuristically set $N$, the positive threshold and the negative threshold to 256, 0.5 and 0.25, respectively.
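The anchor labelling rule can be sketched as below, using the thresholds 0.5 and 0.25 and the mini-batch size 256 from the text. Ignoring anchors whose IoU falls between the two thresholds (label -1) and sampling uniformly at random are assumptions, since the source only defines the positive and negative cases.

```python
import random

def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def label_anchors(anchors, target, pos_thr=0.5, neg_thr=0.25):
    """1 = positive, 0 = negative, -1 = ignored (assumption for the middle band)."""
    labels = []
    for anchor in anchors:
        overlap = iou(anchor, target)
        if overlap > pos_thr:
            labels.append(1)
        elif overlap < neg_thr:
            labels.append(0)
        else:
            labels.append(-1)
    return labels

def sample_minibatch(labels, n=256, seed=0):
    """Sample up to n anchor indices among the positive and negative anchors."""
    candidates = [i for i, label in enumerate(labels) if label != -1]
    if len(candidates) <= n:
        return candidates
    return random.Random(seed).sample(candidates, n)

target = (0, 0, 10, 10)
anchors = [(0, 0, 10, 10), (0, 0, 10, 6), (0, 0, 10, 4), (20, 20, 30, 30)]
labels = label_anchors(anchors, target)
print(labels, sample_minibatch(labels))
```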
As in [28], the loss function for the target detection network consists of a classification loss $L_{cls}$ and a regression loss $L_{reg}$, where $L_{cls}$ is the binary softmax cross-entropy between the predicted scores and the ground-truth labels (being the target object or not) of the anchors, and $L_{reg}$ is the smooth $L_1$ loss [10] between the predicted bounding box offset tuples and the ground-truth offset tuples.
Our YOLLO model can be trained end-to-end by minimising the total loss, which combines the attention loss with the detection losses $L_{cls}$ and $L_{reg}$ through a balancing factor.
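For reference, the standard forms of these losses are written out below as a hedged sketch. The symbols $s_i$ (predicted score), $y_i$ (label), $t_{i,j}$ and $t^{*}_{i,j}$ (predicted and ground-truth offsets), the $1/N$ normalisation, the restriction of the regression term to positive anchors, and the placement of the balancing factor $\lambda$ in the total loss are assumptions; only the smooth $L_1$ definition is taken directly from [10].

```latex
\begin{align}
L_{cls} &= -\frac{1}{N}\sum_{i=1}^{N}\bigl[y_i \log s_i + (1 - y_i)\log(1 - s_i)\bigr], \\
L_{reg} &= \frac{1}{N}\sum_{i=1}^{N} y_i \sum_{j=1}^{4}
           \mathrm{smooth}_{L_1}\bigl(t_{i,j} - t^{*}_{i,j}\bigr), \\
\mathrm{smooth}_{L_1}(z) &=
  \begin{cases}
    0.5\,z^{2} & \text{if } |z| < 1, \\
    |z| - 0.5  & \text{otherwise,}
  \end{cases} \\
L &= L_{att} + \lambda\,(L_{cls} + L_{reg}).
\end{align}
```

With the balancing factor set to 1, alternative placements of $\lambda$ (for instance on $L_{att}$ instead) would give the same total loss.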
In this section, we first evaluate the effectiveness and efficiency of the proposed YOLLO on three benchmark datasets, namely RefCOCO [17], RefCOCO+ [17] and RefCOCOg [25]. To evaluate the contribution of each component of the proposed model, we then conduct ablation studies on the Relation-to-Attention (Rel2Att) module, testing the effect of removing the self-attention and the co-attention separately. Finally, we provide a visualisation analysis of the proposed YOLLO.
Figure 3. The relevant area is highlighted before being sent to the detection module, using an attention mask obtained from our proposed Rel2Att module.
4.1. Datasets
We evaluate the proposed YOLLO on three benchmark datasets based on MS COCO [20]: RefCOCO, RefCOCO+ and RefCOCOg. In RefCOCO and RefCOCO+ [17], the average query length is around 3.6 words, indicating that the queries are mostly short phrases. The difference is that the queries in RefCOCO+ are not supposed to contain any location words, such as “left” or “front”. In RefCOCOg [25], the queries are full sentences with an average length of 8.43 words. Moreover, in RefCOCO(+), the average number of objects of the same type is about 3.9, whilst in RefCOCOg this number is limited to 1.6. Other important statistics are recorded in Table 1.
Table 1. Datasets statistics
Following [41], we split RefCOCO(+) into 40,000 training, 5,000 validation, and 5,000 testing samples, where the testing set is further split into “TestA” and “TestB”. More precisely, images containing multiple people are put into “TestA”, while images containing other objects are put into “TestB”. RefCOCOg is split into 44,822 training and 5,000 validation samples.
4.2. Implementation details
In our basic implementation, we use ResNet-50 as our backbone image feature extractor, where we adopt the feature map from the last layer of the fourth stage, i.e., the C4 feature, to obtain our image feature sequence. We pre-train the backbone CNN on the ImageNet [30] dataset. The input image is resized before being fed into the ResNet-50. To obtain the query feature, we pre-train the word embeddings (512-D) with Word2Vec [26] on the LM-1B [2] corpus. Tokens missing from the pre-trained word embeddings are replaced with a UNK token. Moreover, we pad the query of every training sample to the maximum query length (39 for RefCOCO, 24 for RefCOCO+, 46 for RefCOCOg) with a PAD token. We stack the Rel2Att module 3 times in all experiments. The balancing factor in Equation (9) is set to 1. During training, we use Adam [18] with a learning rate of 5e-5 for all datasets, and fine-tune the backbone ResNet as well as the word embedding layers together with the other parts. The proposed YOLLO is easy to train: we train our model for 30 epochs, which takes 5 hours on 8 TITAN XP GPUs with a total batch size of 256.
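The query preprocessing described above (UNK replacement and PAD padding) can be sketched as follows. Lower-casing, whitespace tokenisation, truncation of over-long queries and the toy vocabulary are assumptions made for the example; only the UNK/PAD behaviour and the maximum lengths come from the text.

```python
def encode_query(query, vocab, max_len, pad="PAD", unk="UNK"):
    """Map a query to a fixed-length list of token ids."""
    tokens = query.lower().split()  # whitespace tokenisation (assumption)
    ids = [vocab.get(tok, vocab[unk]) for tok in tokens][:max_len]
    ids += [vocab[pad]] * (max_len - len(ids))  # pad up to the maximum length
    return ids

# Hypothetical toy vocabulary.
vocab = {"PAD": 0, "UNK": 1, "left": 2, "man": 3}
print(encode_query("Left man zzz", vocab, max_len=5))
```

For the real datasets, `max_len` would be 39, 24 or 46 depending on the dataset, as stated above.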
4.3. Overall Performance
In this section, we report our experimental results on RefCOCO [17], RefCOCO+ [17] and RefCOCOg [25]. Following the common setting in VG, a prediction is considered correct if the IoU between the generated region and the target object is greater than a threshold of 0.5; we denote this metric as ACC@0.5. As shown in Table 2, the proposed YOLLO significantly outperforms all previous baselines in terms of ACC@0.5. Moreover, our method performs consistently on all datasets as well as all test splits, indicating the robustness of the proposed YOLLO. We further plot the training curves for the different datasets. As illustrated in Figure 4, the proposed YOLLO converges quickly on all datasets, which shows that it is easy to optimise.
Generalisation. To evaluate the generalisation ability of YOLLO, we train our model on one dataset and test it on the other two datasets (e.g., trained on RefCOCO and tested on RefCOCO+). From Table 2, we can observe that the proposed YOLLO still yields competitive performance compared with the previous methods in these cases. For example, when we train our method on RefCOCO+ and test it on RefCOCO, we obtain 68.32% ACC@0.5, while the previous state-of-the-art result is 67.44%. This experiment shows the superior generalisation ability of YOLLO.
Different evaluation metrics. We further verify the performance of YOLLO under different evaluation metrics. As shown in Table 3, ACC@0.75 denotes the accuracy when the IoU threshold is set to 0.75, and ACC is calculated by averaging the accuracy values obtained by varying the threshold from 0.5 to 0.95 in steps of 0.05. Moreover, we obtain MIOU by averaging the IoUs between the predicted region proposals and the target objects over the whole dataset. From the table, we can see that YOLLO shows excellent performance under all evaluation metrics. ACC@0.75 is lower mainly because we treat anchors whose IoU exceeds the positive threshold of 0.5 as positive samples when training our target detection network (Section 3.3). We believe that the performance of YOLLO under ACC and ACC@0.75 can be improved by setting this threshold to a suitably larger value, e.g., 0.7, but we leave this to future work.
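The three metrics can be computed from a list of per-sample IoU values, as in the sketch below. The function names and the toy IoU values are illustrative; the strict "greater than" comparison follows the wording of the correctness criterion in the text.

```python
def acc_at(ious, threshold):
    """Fraction of samples whose IoU is strictly greater than the threshold."""
    return sum(1 for value in ious if value > threshold) / len(ious)

def mean_acc(ious):
    """Average accuracy over thresholds 0.50, 0.55, ..., 0.95."""
    thresholds = [(50 + 5 * i) / 100 for i in range(10)]
    return sum(acc_at(ious, t) for t in thresholds) / len(thresholds)

def mean_iou(ious):
    """Average IoU over the whole dataset (MIOU)."""
    return sum(ious) / len(ious)

# Toy IoUs between predicted boxes and targets for four samples.
ious = [0.92, 0.61, 0.30, 0.81]
print(acc_at(ious, 0.5), acc_at(ious, 0.75), mean_acc(ious), mean_iou(ious))
```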
4.4. Ablation studies on the Rel2Att module
In this section, we evaluate the effectiveness of the components in our Relation-to-Attention module. Specifically, we consider the following experimental settings, i.e., YOLLO without self-attention on the image and query, and YOLLO without the co-attention between the image and query, where we simply wipe out the corresponding blocks in the relation map. The results are shown in Table 4.
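The two ablation settings can be sketched as a masking operation on the relation map. Interpreting "wipe out" as setting the corresponding entries to zero is an assumption, as are the function name and the `drop` argument; as noted in the text, the co-attention ablation in the paper additionally ignores the query self-attention, which this sketch does not do.

```python
def wipe_blocks(rel, m, drop):
    """Zero out the self- or co-attention blocks of a k x k relation map.

    The first m rows/columns index image regions, the remaining ones index words.
    `drop` is "self" (remove both self-attention blocks) or "co" (remove co-attention).
    """
    k = len(rel)
    out = [row[:] for row in rel]  # copy so the input stays untouched
    for i in range(k):
        for j in range(k):
            same_modality = (i < m) == (j < m)
            if (drop == "self" and same_modality) or (drop == "co" and not same_modality):
                out[i][j] = 0.0
    return out

rel = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
print(wipe_blocks(rel, m=2, drop="self"))
print(wipe_blocks(rel, m=2, drop="co"))
```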
From the table, we observe that removing the self-attention for the image and query leads to about 30% to 40% performance degradation in all comparisons. This indicates that the self-attention on the image and query is indispensable for our YOLLO model.
Moreover, by removing the co-attention between the image and the query, we force our YOLLO model to predict region proposals for the target object without any awareness of the input query (the self-attention for the query is also not considered). Surprisingly, YOLLO still achieves about 35% ACC@0.5 on all the test splits. This may be due to biases introduced when the human annotators built these datasets: e.g., given an image containing several objects, the annotators may be more likely to pick objects of some specific categories, or objects located at some specific places in the scene, as the target of a visual grounding sample. Such biases may be captured by the image self-attention through minimising the attention loss.
4.5. Inference time evaluation
We now compare the inference speed of the proposed YOLLO with the models proposed in [42]. Note that the compared models directly utilise the pre-computed Faster RCNN proposals provided with the dataset, so they do not actually run the Faster RCNN algorithm during inference; they only run the matching model to calculate a score between each provided proposal and the query, and choose the top-1 as the final prediction. As a consequence, the input of their models is a query, an image and a list of proposal bounding boxes. For fair comparison, we report their inference time by adding the inference time of a Faster RCNN model based on ResNet-50.
Table 2. Comparisons on RefCOCO, RefCOCO+ and RefCOCOg. “-” denotes that the results are not reported. The two “speaker+listener+reinforcer” entries use the speaker or the listener module of the joint model [42], respectively, to perform the visual grounding task.
Table 3. Evaluations under different evaluation metrics.
Figure 4. Training curves on RefCOCO (red), RefCOCO+ (green) and RefCOCOg (blue). Note that YOLLO is able to converge within 5000 iterations.
For YOLLO, the input image size is smaller than the canonical input size for Faster RCNN. For general object detection models, a larger input size may lead to better performance; for YOLLO, however, since we already achieve very strong results without bells and whistles, we do not adopt the canonical input size, in order to reduce computation and memory consumption. As shown in Table 5, YOLLO is significantly faster than the baseline models and is able to run in real time.
4.6. Visualisation
In this section, we provide visualisations of the final results and attention masks of our YOLLO model. Specifically, we obtain the attention mask for the image from the last Rel2Att module. The results are shown in Figure 5. As we can see, the highlighted areas (image attention masks) generated from the relation map closely match the final predicted bounding boxes, conditioned on the query sentences. In many cases, the attended area (and the final predicted bounding box) moves to the right place when the query sentence changes, even for similar objects in the same image. For example, compare “left most toilet” vs. “right urinal” in the second row (middle), and “man blue shirt” vs. “lady in white shirt” in the third row (middle) of Figure 5.
Visual Grounding (VG) can be very useful in our daily life if it can achieve real-time speed and human-level accuracy. However, most of the previous state-of-the-art methods in VG suffer from the low efficiency and accuracy of the conventionally adopted two-stage pipeline. In this paper, we instead propose a fast and accurate one-stage VG method named YOLLO (You Only Look & Listen Once) that generates the bounding box of the target object directly from the input image and query. The proposed model accesses the image and the query only once by using a novel Relation-to-Attention module, and avoids proposing tens or even hundreds of region proposals and then scoring them one by one conditioned on the query. During inference, our approach is faster than previous methods and, remarkably, it achieves absolute performance improvements over the state-of-the-art results on several benchmark datasets. To the best of our knowledge, this is the first model that achieves real-time and 90%+ accuracy in the Visual Grounding task.
Figure 5. Some qualitative results from our model. Attended areas from our Rel2Att module are highlighted and targets detected by our target detection network are labelled with red bounding boxes. Queries are shown under the corresponding images.
Table 4. Ablation studies on RefCOCO, RefCOCO+ and RefCOCOg to verify the contributions of different attention types to the attention masks produced by our Rel2Att module.
Table 5. Inference speed comparison. The values in parentheses denote the additional inference time of stage-i, i.e., the region proposal time when using the Faster RCNN model.
[1] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.
[2] C. Chelba, T. Mikolov, M. Schuster, Q. Ge, T. Brants, P. Koehn, and T. Robinson. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005, 2013.
[3] M.-M. Cheng, Z. Zhang, W.-Y. Lin, and P. Torr. Bing: Binarized normed gradients for objectness estimation at 300fps. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 3286–3293, 2014.
[4] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
[5] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. In Proc. IEEE Int. Conf. Comp. Vis., pages 764–773, 2017.
[6] C. Deng, Q. Wu, Q. Wu, F. Hu, F. Lyu, and M. Tan. Visual grounding via accumulated attention. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2018.
[7] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable object detection using deep neural networks. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 2147–2154, 2014.
[8] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. Proc. Conf. Empirical Methods in Natural Language Processing, pages 457–468, 2016.
[9] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin. Convolutional sequence to sequence learning. In Proc. Int. Conf. Mach. Learn., pages 1243–1252, 2017.
[10] R. Girshick. Fast r-cnn. In Proc. IEEE Int. Conf. Comp. Vis., pages 1440–1448, 2015.
[11] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber. Lstm: A search space odyssey. IEEE Trans. Neural Netw. & Learn. Syst., 28(10):2222–2232, 2017.
[12] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In Proc. IEEE Int. Conf. Comp. Vis., pages 2980–2988. IEEE, 2017.
[13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 770–778, 2016.
[14] H. Hu, J. Gu, Z. Zhang, J. Dai, and Y. Wei. Relation networks for object detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., volume 2, 2018.
[15] R. Hu, M. Rohrbach, J. Andreas, T. Darrell, and K. Saenko. Modeling relationships in referential expressions with compositional modular networks. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2017.
[16] R. Hu, H. Xu, M. Rohrbach, J. Feng, K. Saenko, and T. Darrell. Natural language object retrieval. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 4555–4564, 2016.
[17] S. Kazemzadeh, V. Ordonez, M. Matten, and T. L. Berg. Referitgame: Referring to objects in photographs of natural scenes. In Proc. Conf. Empirical Methods in Natural Language Processing, pages 787–798, 2014.
[18] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Proc. Advances in Neural Inf. Process. Syst., pages 1097–1105, 2012.
[20] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In Proc. Eur. Conf. Comp. Vis., pages 740–755. Springer, 2014.
[21] Z. Lin, M. Feng, C. N. d. Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio. A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130, 2017.
[22] J. Liu, L. Wang, M.-H. Yang, et al. Referring expression generation and comprehension via attributes. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2017.
[23] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In Proc. Eur. Conf. Comp. Vis., pages 21–37. Springer, 2016.
[24] J. Lu, J. Yang, D. Batra, and D. Parikh. Hierarchical question-image co-attention for visual question answering. In Proc. Advances in Neural Inf. Process. Syst., pages 289–297, 2016.
[25] J. Mao, J. Huang, A. Toshev, O. Camburu, A. L. Yuille, and K. Murphy. Generation and comprehension of unambiguous object descriptions. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 11–20, 2016.
[26] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Proc. Advances in Neural Inf. Process. Syst., pages 3111–3119, 2013.
[27] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 779–788, 2016.
[28] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Proc. Advances in Neural Inf. Process. Syst., pages 91–99, 2015.
[29] A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, and B. Schiele. Grounding of textual phrases in images by reconstruction. In Proc. Eur. Conf. Comp. Vis., pages 817–834. Springer, 2016.
[30] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. Int. J. Comput. Vision, 115(3):211–252, 2015.
[31] A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap. A simple neural network module for relational reasoning. In Proc. Advances in Neural Inf. Process. Syst., pages 4967–4976, 2017.
[32] S. Sharma, R. Kiros, and R. Salakhutdinov. Action recognition using visual attention. arXiv preprint arXiv:1511.04119, 2015.
[33] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. Int. J. Comput. Vision, 104(2):154–171, 2013.
[34] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Proc. Advances in Neural Inf. Process. Syst., pages 5998–6008, 2017.
[35] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 3156–3164, 2015.
[36] P. Wang, Q. Wu, C. Shen, A. Dick, and A. van den Hengel. Fvqa: Fact-based visual question answering. IEEE Trans. Pattern Anal. Mach. Intell., 2017.
[37] X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. arXiv preprint arXiv:1711.07971, 2017.
[38] H. Xu and K. Saenko. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In Proc. Eur. Conf. Comp. Vis., pages 451–466. Springer, 2016.
[39] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In Proc. Int. Conf. Mach. Learn., pages 2048–2057, 2015.
[40] L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu, M. Bansal, and T. L. Berg. Mattnet: Modular attention network for referring expression comprehension. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2018.
[41] L. Yu, P. Poirson, S. Yang, A. C. Berg, and T. L. Berg. Modeling context in referring expressions. In Proc. Eur. Conf. Comp. Vis., pages 69–85. Springer, 2016.
[42] L. Yu, H. Tan, M. Bansal, and T. L. Berg. A joint speaker-listener-reinforcer model for referring expressions. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., volume 2, 2017.
[43] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena. Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318, 2018.
[44] H. Zhang, Y. Niu, and S.-F. Chang. Grounding referring expressions in images by variational context. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2018.
[45] B. Zhuang, Q. Wu, C. Shen, I. Reid, and A. van den Hengel. Parallel attention: A unified framework for visual object discovery through dialogs and queries. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2018.