[Anderson et al. 2018] Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; and Zhang, L. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR.
[Cadene et al. 2019] Cadene, R.; Ben-Younes, H.; Thome, N.; and Cord, M. 2019. Murel: Multimodal Relational Reasoning for Visual Question Answering. In CVPR.
[Chen, Kovvuri, and Nevatia 2017] Chen, K.; Kovvuri, R.; and Nevatia, R. 2017. Query-guided regression network with context policy for phrase grounding. In ICCV.
[Chung et al. 2014] Chung, J.; Gulcehre, C.; Cho, K.; and Bengio, Y. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
[Das et al. 2017] Das, A.; Kottur, S.; Gupta, K.; Singh, A.; Yadav, D.; Moura, J. M. F.; Parikh, D.; and Batra, D. 2017. Visual dialog. In CVPR.
[Devlin et al. 2018] Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
[Feng et al. 2019] Feng, Y.; Ma, L.; Liu, W.; and Luo, J. 2019. Unsupervised image captioning. In CVPR.
[Fukui et al. 2016] Fukui, A.; Park, D. H.; Yang, D.; Rohrbach, A.; Darrell, T.; and Rohrbach, M. 2016. Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847.
[He et al. 2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR.
[He et al. 2017] He, K.; Gkioxari, G.; Dollár, P.; and Girshick, R. 2017. Mask r-cnn. In ICCV.
[Hu et al. 2019] Hu, R.; Rohrbach, A.; Darrell, T.; and Saenko, K. 2019. Language-conditioned graph networks for relational reasoning. arXiv preprint arXiv:1905.04405.
[Justin et al. 2015] Justin, J.; Ranjay, K.; Michael, S.; Li-Jia, L.; David, A. S.; and Li, F.-F. 2015. Image retrieval using scene graphs. In CVPR.
[Kottur et al. 2018] Kottur, S.; Moura, J. M. F.; Parikh, D.; Batra, D.; and Rohrbach, M. 2018. Visual coreference resolution in visual dialog using neural module networks. In ECCV.
[Krishna et al. 2017] Krishna, R.; Zhu, Y.; Groth, O.; Johnson, J.; Hata, K.; Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, L.-J.; Shamma, D. A.; et al. 2017. Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV.
[Li et al. 2017] Li, Y.; Ouyang, W.; Zhou, B.; Wang, K.; and Wang, X. 2017. Scene graph generation from objects, phrases and region captions. In ICCV.
[Lili et al. 2016] Lili, M.; Rui, M.; Ge, L.; Yan, X.; Lu, Z.; Rui, Y.; and Zhi, J. 2016. Natural language inference by tree-based convolution and heuristic matching. In ACL.
[Lin et al. 2014] Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft coco: Common objects in context. In ECCV.
[Mogadala, Kalimuthu, and Klakow 2019] Mogadala, A.; Kalimuthu, M.; and Klakow, D. 2019. Trends in integration of vision and language research: A survey of tasks, datasets, and methods. arXiv preprint arXiv:1907.09358.
[Mun et al. 2018] Mun, J.; Lee, K.; Shin, J.; and Han, B. 2018. Learning to specialize with knowledge distillation for visual question answering. In NeurIPS.
[Nagaraja, Morariu, and Davis 2016] Nagaraja, V. K.; Morariu, V. I.; and Davis, L. S. 2016. Modeling context between objects for referring expression understanding. In ECCV.
[Nam et al. 2019] Nam, V.; Lu, J.; Chen, S.; Kevin, M.; Li-Jia, L.; Li, F.-F.; and James, H. 2019. Composing text and image for image retrieval - an empirical odyssey. In CVPR.
[Pelin, Leonid, and Markus 2019] Pelin, D.; Leonid, S.; and Markus, G. 2019. Neural sequential phrase grounding (seqground). In CVPR.
[Peng et al. 2019] Peng, W.; Qi, W.; Jiewei, C.; Chunhua, S.; Lianli, G.; and Anton, v. d. H. 2019. Neighbourhood watch: Referring expression comprehension via language-guided graph attention networks. In CVPR.
[Peters et al. 2018] Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L. 2018. Deep contextualized word representations. In NAACL.
[Plummer et al. 2015] Plummer, B. A.; Wang, L.; Cervantes, C. M.; Caicedo, J. C.; Hockenmaier, J.; and Lazebnik, S. 2015. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV.
[Plummer et al. 2017] Plummer, B. A.; Mallya, A.; Cervantes, C. M.; Hockenmaier, J.; and Lazebnik, S. 2017. Phrase localization and visual relationship detection with comprehensive image-language cues. In ICCV.
[Plummer et al. 2018] Plummer, B. A.; Kordas, P.; Hadi Kiapour, M.; Zheng, S.; Piramuthu, R.; and Lazebnik, S. 2018. Conditional image-text embedding networks. In ECCV.
[Ren et al. 2015] Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. In NeurIPS.
[Rohrbach et al. 2016] Rohrbach, A.; Rohrbach, M.; Hu, R.; Darrell, T.; and Schiele, B. 2016. Grounding of textual phrases in images by reconstruction. In ECCV.
[Schuster et al. 2015] Schuster, S.; Krishna, R.; Chang, A.; Fei-Fei, L.; and Manning, C. D. 2015. Generating semantically precise scene graphs from textual descriptions for improved image retrieval. In Workshop on Vision and Language (VL15). Association for Computational Linguistics.
[Wang et al. 2016] Wang, M.; Azab, M.; Kojima, N.; Mihalcea, R.; and Deng, J. 2016. Structured matching for phrase localization. In ECCV.
[Wang et al. 2018a] Wang, L.; Li, Y.; Huang, J.; and Lazebnik, S. 2018a. Learning two-branch neural networks for image-text matching tasks. IEEE TPAMI.
[Wang et al. 2018b] Wang, Y.-S.; Liu, C.; Zeng, X.; and Yuille, A. 2018b. Scene graph parsing as dependency parsing. In NAACL.
[Wang, Li, and Lazebnik 2016] Wang, L.; Li, Y.; and Lazebnik, S. 2016. Learning deep structure-preserving image-text embeddings. In CVPR.
[Yang et al. 2018] Yang, J.; Lu, J.; Lee, S.; Batra, D.; and Parikh, D. 2018. Graph r-cnn for scene graph generation. In ECCV.
[Yeh et al. 2017] Yeh, R.; Xiong, J.; Hwu, W.-M.; Do, M.; and Schwing, A. 2017. Interpretable and globally optimal prediction for textual grounding using image concepts. In NeurIPS.
[Yu et al. 2018a] Yu, L.; Lin, Z.; Shen, X.; Yang, J.; Lu, X.; Bansal, M.; and Berg, T. L. 2018a. Mattnet: Modular attention network for referring expression comprehension. In CVPR.
[Yu et al. 2018b] Yu, Z.; Yu, J.; Xiang, C.; Zhao, Z.; Tian, Q.; and Tao, D. 2018b. Rethinking diversified and discriminative proposal generation for visual grounding. arXiv preprint arXiv:1805.03508.
[Zhang et al. 2017] Zhang, H.; Kyaw, Z.; Chang, S.-F.; and Chua, T.-S. 2017. Visual translation embedding network for visual relation detection. In CVPR.
Learning Cross-modal Context Graph Networks for Visual Grounding Supplementary Material
{liuyf3, wanbo, hexm}@shanghaitech.edu.cn xiaodan.zhu@queensu.ca
1.1 Spatial Feature of Object
We generate a coordinate map with the same spatial size as the convolutional feature map. The coordinate map consists of two channels, indicating the x, y coordinates of each pixel, normalized by the feature map center. For each object proposal, we crop a coordinate map from it with RoI-Align and embed the crop into a spatial feature vector by multiple fully connected layers.
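The coordinate map construction can be illustrated with a minimal sketch. It assumes a particular normalization (subtract the map center, then divide by it, so values lie in [-1, 1]) and replaces RoI-Align with a plain integer crop; the function names `coordinate_map` and `crop` are hypothetical and not taken from the paper.

```python
def coordinate_map(height, width):
    """Build a two-channel map [x_channel, y_channel], each a height x width grid.

    Assumed normalization: subtract the map center and divide by it,
    so coordinates lie in [-1, 1] with 0 at the center.
    """
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    sx, sy = max(cx, 1.0), max(cy, 1.0)  # avoid dividing by zero on 1-pixel maps
    x_channel = [[(x - cx) / sx for x in range(width)] for _ in range(height)]
    y_channel = [[(y - cy) / sy for _ in range(width)] for y in range(height)]
    return [x_channel, y_channel]


def crop(coord_map, box):
    """Integer crop of both channels; a simple stand-in for RoI-Align."""
    x1, y1, x2, y2 = box  # inclusive pixel indices on the feature map
    return [[row[x1:x2 + 1] for row in channel[y1:y2 + 1]] for channel in coord_map]


cmap = coordinate_map(5, 7)
patch = crop(cmap, (4, 0, 6, 1))  # a proposal in the top-right corner
```

Because the crop keeps the normalized coordinates of the region it covers, the patch carries the proposal's position and extent, which is the information the fully connected layers would then embed.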
1.2 Spatial Feature of Union Region
We generate a two-channel binary mask for the two object proposals of a union region separately: in each channel, locations within the corresponding object proposal are filled with 1 and all other locations with 0. The two-channel binary mask is then resized to a fixed size, and multiple fully connected layers embed it into a geometric feature vector.
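A minimal sketch of the two-channel mask follows. The grid size, the output size `out_size`, and the nearest-neighbor resizing are assumptions made for illustration; the function names are hypothetical.

```python
def box_mask(box, height, width):
    """Binary mask: 1 inside the box (inclusive pixel indices), 0 elsewhere."""
    x1, y1, x2, y2 = box
    return [[1 if x1 <= x <= x2 and y1 <= y <= y2 else 0 for x in range(width)]
            for y in range(height)]


def resize_nearest(mask, out_h, out_w):
    """Nearest-neighbor resize of a 2-D grid to out_h x out_w."""
    in_h, in_w = len(mask), len(mask[0])
    return [[mask[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]


def union_region_mask(box_a, box_b, height, width, out_size):
    """Two-channel mask, one channel per proposal, resized to out_size x out_size."""
    return [resize_nearest(box_mask(b, height, width), out_size, out_size)
            for b in (box_a, box_b)]


# Two proposals in opposite corners of an 8 x 8 grid, resized to 4 x 4.
mask = union_region_mask((0, 0, 3, 3), (4, 4, 7, 7), height=8, width=8, out_size=4)
```

Keeping one channel per proposal preserves which box is which, so the embedding can distinguish the relative placement of the two objects rather than only their union.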
1.3 Scene Graph Parser
For a given sentence, we use a public toolkit to generate a language scene graph, in which nodes encode noun phrases and edges encode the relationships between them. The parser first applies a dependency parser to the input sentence and then employs hand-crafted rules to generate the language scene graph. However, we observe two issues with the off-the-shelf parser: 1) noun phrases in the parses sometimes do not correspond to the given phrases; 2) some phrases and their relationships are missing from the parses.
To address these limitations, we perform additional post-processing on the Flickr30K Entities dataset. First, we take all given phrases as graph nodes. For each given phrase, we pick the noun phrase in the parse that has the maximum word overlap with it. We then assign the parsed relations to these nodes. However, some isolated nodes remain in the resulting graph. We recall some of the missing relations by taking advantage of the coarse categories of the given phrases. Specifically, for an isolated phrase whose type is clothing or bodyparts, we find a phrase of type people as its subject and assign the relationship wear / have to the pair. If the graph nodes contain multiple phrases of type people, we select the one with the minimum word distance to the isolated phrase in the sentence. These rules are motivated by the observation that most clothing / bodyparts phrases are related to a people phrase, and that their relationships are generally wear / have.
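The post-processing rules can be sketched as below. This is an illustration under stated assumptions: phrases are dictionaries with hypothetical keys `text`, `type`, and `position` (the word index where the phrase starts), word distance is the absolute difference of those positions, and clothing maps to wear while bodyparts maps to have.

```python
def match_noun_phrase(given_phrase, parsed_noun_phrases):
    """Pick the parsed noun phrase sharing the most words with the given phrase."""
    given_words = set(given_phrase.lower().split())
    return max(parsed_noun_phrases,
               key=lambda cand: len(given_words & set(cand.lower().split())))


def recall_relations(phrases, relations):
    """Attach isolated clothing / bodyparts phrases to the nearest people phrase.

    relations: list of (subject_index, predicate, object_index) triples.
    """
    connected = {i for s, _, o in relations for i in (s, o)}
    people = [i for i, p in enumerate(phrases) if p["type"] == "people"]
    predicate = {"clothing": "wear", "bodyparts": "have"}  # assumed mapping
    added = []
    for i, p in enumerate(phrases):
        if i in connected or p["type"] not in predicate or not people:
            continue
        # Nearest people phrase by word distance in the sentence.
        subject = min(people, key=lambda j: abs(phrases[j]["position"] - p["position"]))
        added.append((subject, predicate[p["type"]], i))
    return relations + added


# "A man in a red shirt talks to a woman with long hair"
phrases = [
    {"text": "A man", "type": "people", "position": 0},
    {"text": "a red shirt", "type": "clothing", "position": 3},
    {"text": "a woman", "type": "people", "position": 8},
    {"text": "long hair", "type": "bodyparts", "position": 11},
]
graph = recall_relations(phrases, [(0, "talk to", 2)])
```

In this example the shirt is attached to the man and the hair to the woman, since each is closer in the sentence to that people phrase.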
1.4 Solving Structured Prediction
We solve the structured prediction problem by an exhaustive search over all possibilities of s in Eq. 12 with a maximal depth when the number of noun phrases N is less than 6, and by applying only node matching between the phrase graph and the visual scene graph otherwise. This strategy is motivated by the observation that 96.12% of the language scene graphs in the Flickr30K dataset have fewer than 6 nodes. The exhaustive search with a maximal depth is not time-consuming when N is small.
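The two-branch strategy can be sketched as follows. This is only an illustration and not the objective of Eq. 12: it assumes the score of an assignment is a sum of per-phrase matching scores and per-relation scores, and it reads "maximal depth" as keeping the `depth` best-scoring boxes per phrase. Both readings, and all names in the code, are assumptions.

```python
from itertools import product


def ground_phrases(node_scores, edge_scores, relations, depth, max_nodes=6):
    """Pick one box index per phrase.

    node_scores[i][b]: matching score of phrase i with box b.
    edge_scores(i, bi, j, bj): score of relation (i, j) under boxes bi, bj.
    relations: list of (i, j) phrase-index pairs linked in the phrase graph.
    depth: number of top-scoring boxes kept per phrase (assumed meaning).
    """
    n = len(node_scores)
    if n >= max_nodes:
        # Large graphs: node matching only, each phrase takes its best box.
        return [max(range(len(s)), key=s.__getitem__) for s in node_scores]

    # Small graphs: exhaustive search over the top-`depth` boxes of every phrase.
    candidates = [sorted(range(len(s)), key=s.__getitem__, reverse=True)[:depth]
                  for s in node_scores]

    def total(assign):
        unary = sum(node_scores[i][b] for i, b in enumerate(assign))
        pairwise = sum(edge_scores(i, assign[i], j, assign[j]) for i, j in relations)
        return unary + pairwise

    return list(max(product(*candidates), key=total))


scores = [[0.9, 0.8], [0.6, 0.7]]
bonus = lambda i, bi, j, bj: 1.0 if (bi, bj) == (1, 0) else 0.0
joint = ground_phrases(scores, bonus, relations=[(0, 1)], depth=2)
```

Under these assumptions the search visits at most `depth ** N` assignments, which stays small for N below 6. In the example, the relation score overrides the per-phrase preference, so the joint solution differs from independent node matching.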
2.1 Model details with ResNet-50 backbone
We take an off-the-shelf object detector with ResNet-50 as its backbone to generate the initial set of proposals. It is based on Faster R-CNN and pre-trained on the MSCOCO dataset (Lin et al. 2014). Other settings are the same as for the model with the ResNet-101 backbone. During training, we use the SGD optimizer with an initial learning rate of 1e-1, weight decay of 1e-4 and momentum of 0.9. The model is trained for 60k iterations in total with batch size 24, and the learning rate is decayed by a factor of 10 at 20k and 40k iterations.
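The learning rate schedule described above can be written as a small step function. This sketch assumes that the decay is a division by 10 applied exactly when each milestone iteration is reached; the function name and parameter names are hypothetical.

```python
def learning_rate(iteration, base_lr=0.1, milestones=(20_000, 40_000), gamma=0.1):
    """Step schedule: multiply the rate by gamma for each milestone already reached."""
    passed = sum(iteration >= m for m in milestones)
    return base_lr * gamma ** passed


# Rates at the start, after the first decay, and after the second decay.
schedule = [learning_rate(it) for it in (0, 20_000, 59_999)]
```

Over the 60k training iterations this yields rates of roughly 1e-1, 1e-2 and 1e-3 in the three 20k-iteration stages.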
2.2 Ablations with ResNet-50 backbone
Table 1: Ablation studies on Flickr30K val set with ResNet-50 backbone.
To investigate the effectiveness of the individual components of our framework with the ResNet-50 backbone, we also conduct a series of ablation studies. As shown in Tab. 1, the accuracy follows the same growth trend as with the ResNet-101 backbone. In particular, we observe a significant performance improvement when adopting proposal pruning over the baseline model, which improves the accuracy from 60.31% to 66.77%. This indicates that proposal pruning is critical for the visual grounding task when the object detector does not perform well.
2.3 Additional Experiments on VOGN
To validate the effectiveness of VOGN, we conduct additional experiments, as shown in Tab. 2. In the baseline model, we compute the similarity score and regression offset for each phrase-box pair. We then adopt the proposal pruning strategy over the baseline model without PGN, which improves grounding accuracy from 73.46% to 74.60%.
Table 2: Additional Experiments of VOGN with ResNet-101 backbone on Flickr30K val set
Furthermore, we add our VOGN under this setting and observe a significant improvement from 74.60% to 75.59%, which indicates the visual object representation can be more discriminative with its context cues.
Finally, the performance drops from 75.59% to 74.80% when visual relation features are not considered during message passing, which suggests that visual relations play an important role in computing attention among objects.