Given an image/video and a language query, image/video grounding aims to localize a spatial region in the image (Plummer et al., 2015; Yu et al., 2017, 2018) or a specific frame in the video (Zhou et al., 2018) which semantically corresponds to the language query. Grounding has broad applications, such as text-based image retrieval (Chen et al., 2017; Ma et al., 2015), description generation (Wang et al., 2018a; Rohrbach et al., 2017; Wang et al., 2018b), and question answering (Gao et al., 2018; Ma et al., 2016). Recently, promising progress has been made in image grounding (Yu et al., 2018; Chen et al., 2018c; Zhang et al., 2018), which heavily relies on fine-grained annotations in the form of region-sentence pairs. Fine-grained annotations for video grounding are more complicated and labor-intensive, as one may need to annotate a spatio-temporal tube (i.e., label the spatial region in each frame) in a video which semantically corresponds to one language query.
Figure 1: The proposed WSSTG task aims to localize a spatio-temporal tube (i.e., the sequence of green bounding boxes) in the video which semantically corresponds to the given sentence, with no reliance on any spatio-temporal annotations during training. The example sentence is “A brown and white dog is lying on the grass and then it stands up.”
To avoid the intensive labor involved in dense annotations, Huang et al. (2018) and Zhou et al. (2018) considered the problem of weakly-supervised video grounding, where only aligned video-sentence pairs are provided without any fine-grained regional annotations. However, they both ground only a noun or pronoun in a static frame of the video. As illustrated in Fig. 1, it is difficult to distinguish the target dog (denoted by the green box) from other dogs (denoted by the red boxes) if we attempt to ground only the noun “dog” in one single frame of the video. The main reason is that the textual description “dog” is not sufficiently expressive and the visual appearance in one single frame cannot characterize the spatio-temporal dynamics (e.g., the action and movements of the “dog”).
In this paper, we introduce a novel task, referred to as weakly-supervised spatio-temporally grounding sentence in video (WSSTG). Specifically, given a natural sentence and a video, we aim to localize a spatio-temporal tube (i.e., a sequence of bounding boxes), referred to as an instance, in the video which semantically matches the given sentence (see Fig. 1). During training, we do not rely on any fine-grained regional annotations. Compared with existing weakly-supervised video grounding problems (Zhou et al., 2018; Huang et al., 2018), our proposed WSSTG task has the following two advantages and challenges. First, we aim to ground a natural sentence instead of just a noun or pronoun, which is more comprehensive and flexible. As illustrated in Fig. 1, with a detailed description like “A brown and white dog is lying on the grass and then it stands up”, the target dog (denoted by green boxes) can be localized without ambiguity. However, how to comprehensively capture the semantic meaning of a sentence and ground it in a video, especially in a weakly-supervised manner, poses a challenge. Second, compared with one bounding box in a static frame, a spatio-temporal tube (denoted by a sequence of green bounding boxes in Fig. 1) presents the temporal movements of “the dog”, which can characterize its visual dynamics and thereby semantically match the given sentence. However, how to exploit and model the spatio-temporal characteristics of the tubes as well as their complicated relationships with the sentence poses another challenge.
To handle the above challenges, we propose a novel model realized within the multiple instance learning framework (Karpathy and Fei-Fei, 2015; Tang et al., 2017, 2018). First, a set of instance proposals are extracted from a given video. Features of the instance proposals and the sentence are then encoded by a novel attentive interactor that exploits their fine-grained relationships to generate semantic matching behaviors. Finally, we propose a diversity loss, together with a ranking loss, to train the whole model. During testing, the instance proposal which exhibits the strongest semantic matching behavior with the given sentence is selected as the grounding result.
To facilitate our proposed WSSTG task, we contribute a new grounding dataset, called VID-sentence, by providing sentence descriptions for the instances of the ImageNet video object detection dataset (VID) (Russakovsky et al., 2015). Specifically, 7,654 instances of 30 categories from 4,381 videos in VID are extracted. For each instance, annotators are asked to provide a natural sentence describing its content. Please refer to Sec. 4 for more details about the dataset.
Our main contributions can be summarized as follows. 1) We tackle a novel task, namely weakly-supervised spatio-temporally video grounding (WSSTG), which localizes a spatio-temporal tube in a given video that semantically corresponds to a given natural sentence, in a weakly-supervised manner. 2) We propose a novel attentive interactor to exploit fine-grained relationships between instances and the sentence to characterize their matching behaviors. A diversity loss is proposed to strengthen the matching behaviors between reliable instance-sentence pairs and penalize the unreliable ones during training. 3) We contribute a new dataset, named as VID-sentence, to serve as a benchmark for the novel WSSTG task. 4) Extensive experimental results are analyzed, which illustrate the superiority of our proposed method.
Grounding in Images/Videos. Grounding in images has been popular in the research community over the past decade (Kong et al., 2014; Matuszek et al., 2012; Hu et al., 2016; Wang et al., 2016a,b; Li et al., 2017; Cirik et al., 2018; Sadeghi and Farhadi, 2011; Zhang et al., 2017; Xiao et al., 2017; Chen et al., 2019, 2018a). In recent years, researchers have also explored grounding in videos. Yu and Siskind (2015) grounded objects in constrained videos by leveraging weak semantic constraints implied by a sequence of sentences. Vasudevan et al. (2018) grounded objects in the last frame of stereo videos with the help of text, motion cues, human gazes and spatial-temporal context. However, fully supervised grounding requires intensive labor for regional annotations, especially in the case of videos.
Figure 2: The architecture of our model. An instance generator is used to produce spatio-temporal instances. An attentive interactor is proposed to exploit the complicated relationships between instances and the sentence. Multiple instance learning is used to train the model with a ranking loss and a diversity loss.
Huang et al. (2018) grounded nouns/pronouns in specific frames by constructing a visual grounded action graph. The work closest to ours is (Zhou et al., 2018), in which the authors grounded a noun in a specific frame by considering object interactions and loss weighting given one video and one text input. In this work, we also focus on grounding in a video-text pair. However, different from (Zhou et al., 2018), whose text input consists of nouns/pronouns and whose output is a bounding box in a specific frame, we aim to ground a natural sentence and output a spatio-temporal tube in the video.
Given a natural sentence query $q$ and a video $v$, our proposed WSSTG task aims to localize a spatio-temporal tube, referred to as an instance, $p = \{b_t\}_{t=1}^{T}$ in the video sequence, where $b_t$ represents a bounding box in the $t$-th frame and $T$ denotes the total number of frames. The localized instance should semantically correspond to the sentence query $q$. As WSSTG is carried out in a weakly-supervised manner, only aligned video-sentence pairs $\{v, q\}$ are available with no fine-grained regional annotations during training. In this paper, we cast the WSSTG task as a multiple instance learning problem (Karpathy and Fei-Fei, 2015). Given a video $v$, we first generate a set of instance proposals by an instance generator (Gkioxari and Malik, 2015). We then identify which instance semantically matches the natural sentence query $q$.
We propose a novel model for handling the WSSTG task. It consists of two components, namely an instance generator and an attentive interactor (see Fig. 2). The instance generator links bounding boxes detected in each frame into instance proposals (see Sec. 3.1). The attentive interactor exploits the complicated relationships between instance proposals and the given sentence to yield their matching scores (see Sec. 3.2). The proposed model is optimized with a ranking loss $\mathcal{L}_{rank}$ and a novel diversity loss $\mathcal{L}_{div}$ (see Sec. 3.3). Specifically, $\mathcal{L}_{rank}$ aims to distinguish aligned video-sentence pairs from the unaligned ones, while $\mathcal{L}_{div}$ targets strengthening the matching behaviors between reliable instance-sentence pairs and penalizing the unreliable ones from the aligned video-sentence pairs.
3.1 Instance Extraction
Instance Generation. As shown in Fig. 2, the first step of our method is to generate instance proposals. Similar to (Zhou et al., 2018), the region proposal network from Faster-RCNN (Ren et al., 2015) is used to detect frame-level bounding boxes with corresponding confidence scores, which are then linked to produce spatio-temporal tubes.
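Let $b_t$ denote a detected bounding box at time $t$ and $b_{t+1}$ denote another box at time $t + 1$. Following (Gkioxari and Malik, 2015), we define the linking score $s_l$ between $b_t$ and $b_{t+1}$ as
$$ s_l(b_t, b_{t+1}) = s_c(b_t) + s_c(b_{t+1}) + \lambda \cdot \mathrm{IoU}(b_t, b_{t+1}), \tag{1} $$
where $s_c(b)$ is the confidence score of $b$, $\mathrm{IoU}(b_t, b_{t+1})$ is the intersection-over-union (IoU) of $b_t$ and $b_{t+1}$, and $\lambda$ is a balancing scalar which is set to 0.2 in our implementation.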
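As a minimal sketch of the linking score, assuming boxes are given as `(x1, y1, x2, y2)` corner tuples and the score is the sum of the two detection confidences plus a weighted IoU term as in (Gkioxari and Malik, 2015), one could write the following. The function names `iou` and `linking_score` are illustrative and not taken from the paper.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def linking_score(box_t, conf_t, box_t1, conf_t1, lam=0.2):
    """Confidence of both boxes plus lam times their spatial overlap.

    lam=0.2 follows the balancing scalar reported in the text.
    """
    return conf_t + conf_t1 + lam * iou(box_t, box_t1)
```

Two confident, strongly overlapping detections in adjacent frames therefore receive a higher score than two confident detections that are far apart, which is what lets the linking step follow one object over time.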
As such, one instance proposal can be viewed as a path $p = \{b_t\}_{t=1}^{T}$ over the whole video sequence with energy $E(p)$ given by
$$ E(p) = \frac{1}{T-1} \sum_{t=1}^{T-1} s_l(b_t, b_{t+1}). \tag{2} $$
We identify the instance proposal with the maximal energy by the Viterbi algorithm (Gkioxari and Malik, 2015). We keep the identified instance proposal and remove all the bounding boxes associated with it. We then repeat the above process until there is no bounding box left. This results in a set of instance proposals $P = \{p_n\}_{n=1}^{N}$, with $N$ being the total number of proposals.
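The greedy tube extraction can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: each frame is a list of `(box, confidence)` detections, the path energy is the summed linking score (which has the same maximizer as the mean for a fixed number of frames), and extraction stops as soon as any frame runs out of boxes. The helper names `link`, `best_path` and `extract_tubes` are hypothetical.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = w * h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def link(d1, d2, lam=0.2):
    """Linking score between two detections, each a (box, confidence) pair."""
    return d1[1] + d2[1] + lam * iou(d1[0], d2[0])


def best_path(frames):
    """Viterbi search: one detection index per frame, maximizing the summed linking score."""
    score = [0.0] * len(frames[0])
    back = []  # back[t-1][j] = best predecessor (in frame t-1) of detection j in frame t
    for t in range(1, len(frames)):
        new_score, ptr = [], []
        for cur in frames[t]:
            cands = [score[k] + link(prev, cur) for k, prev in enumerate(frames[t - 1])]
            k_best = max(range(len(cands)), key=cands.__getitem__)
            new_score.append(cands[k_best])
            ptr.append(k_best)
        score = new_score
        back.append(ptr)
    idx = max(range(len(score)), key=score.__getitem__)
    path = [idx]
    for ptr in reversed(back):  # follow back-pointers from the last frame to the first
        idx = ptr[idx]
        path.append(idx)
    path.reverse()
    return path


def extract_tubes(frames):
    """Repeatedly take the best path and remove its boxes until a frame is empty."""
    frames = [list(f) for f in frames]  # work on a copy
    tubes = []
    while frames and all(frames):
        path = best_path(frames)
        tubes.append([frames[t][i] for t, i in enumerate(path)])
        for t, i in enumerate(path):
            del frames[t][i]
    return tubes
```

With the same number of detections in every frame, as in the setting of 30 boxes per frame described in the implementation details, this loop yields exactly that many tubes.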
Figure 3: The architecture of the attentive interactor. It consists of two components, namely interaction and matching behavior characterization. A denotes the attention mechanism in Eqs. (4-6). $\phi$ denotes the function in Eq. (7).
Feature Representation. Since an instance proposal consists of bounding boxes in consecutive video frames, we use I3D (Carreira and Zisserman, 2017) and Faster-RCNN to generate the RGB sequence feature I3D-RGB, the flow sequence feature I3D-Flow, and the frame-level RoI pooled feature, respectively. Note that it is not effective to encode each bounding box, as an instance proposal may include thousands of bounding boxes. We therefore evenly divide each instance proposal into $t_p$ segments and average the features within each segment. $t_p$ is set to 20 for all our experiments. We concatenate all three kinds of visual features before feeding them into the following attentive interactor. Taking each segment as a time step, each proposal $p$ is thereby represented as $F_p \in \mathbb{R}^{t_p \times d_p}$, a sequence of $d_p$-dimensional concatenated visual features at each step.
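The segment pooling step can be illustrated with a small sketch. It assumes per-frame feature vectors stored as plain Python lists and integer segment boundaries; how the authors handle proposals shorter than the number of segments is not stated, so reusing frames in that case is an assumption, as is the function name `segment_average`.

```python
def segment_average(frame_feats, num_segments=20):
    """Evenly split per-frame features into segments and average within each one.

    frame_feats: list of equal-length feature vectors, one per frame.
    Returns num_segments vectors. If there are fewer frames than segments,
    some frames are reused (an assumption, not specified in the text).
    """
    n = len(frame_feats)
    dim = len(frame_feats[0])
    pooled = []
    for i in range(num_segments):
        start = min(i * n // num_segments, n - 1)
        end = max(start + 1, (i + 1) * n // num_segments)
        chunk = frame_feats[start:end]
        pooled.append([sum(f[d] for f in chunk) / len(chunk) for d in range(dim)])
    return pooled
```

The result has a fixed number of time steps regardless of how many frames the proposal spans, which keeps the subsequent sequence encoder cheap even for very long tubes.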
3.2 Attentive Interactor
With the instance proposals from the video and the given sentence query, we propose a novel attentive interactor to characterize the matching behaviors between each proposal and the sentence query. Our attentive interactor consists of two coupled components, namely interaction and matching behavior characterization (see Fig. 3).
Before diving into the details of the interactor, we first introduce the representation of the query sentence $q$. We represent each word in $q$ using the 300-dimensional word2vec (Mikolov et al., 2013) and omit words that are not in the dictionary. In this way, each sentence $q$ is represented as $F_q \in \mathbb{R}^{t_q \times d_q}$, where $t_q$ is the total number of words in the sentence and $d_q$ denotes the dimension of the word embedding.
3.2.1 Interaction
Given the sequential visual features $F_p \in \mathbb{R}^{t_p \times d_p}$ of one candidate instance and the sequential textual features $F_q \in \mathbb{R}^{t_q \times d_q}$ of the query sentence, we propose an interaction module to exploit their complicated matching behaviors in a fine-grained manner. First, two long short-term memory networks (LSTMs) (Hochreiter and Schmidhuber, 1997) are utilized to encode the instance proposal and sentence, respectively:
$$ h^p_t = \mathrm{LSTM}_p(f^p_t, h^p_{t-1}), \qquad h^q_t = \mathrm{LSTM}_q(f^q_t, h^q_{t-1}), \tag{3} $$
where $f^p_t$ and $f^q_t$ are the $t$-th row representations in $F_p$ and $F_q$, respectively. Due to the natural characteristics of LSTM, $h^p_t$ and $h^q_t$, as the yielded hidden states, encode and aggregate the contextual information from the sequential representation, and thereby yield more meaningful and informative visual features $H_p = \{h^p_t\}_{t=1}^{t_p}$ and sentence representations $H_q = \{h^q_t\}_{t=1}^{t_q}$.
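Different from (Rohrbach et al., 2016; Zhao et al., 2018), which used only the last hidden state as the feature embedding for the query sentence, we generate visually guided sentence features $H_{qp} = \{h^{qp}_i\}_{i=1}^{t_p}$ by exploiting their fine-grained relationships based on the hidden states $H_q = \{h^q_j\}_{j=1}^{t_q}$ and $H_p = \{h^p_i\}_{i=1}^{t_p}$. Specifically, given the $i$-th visual feature $h^p_i$, an attention mechanism (Xu et al., 2015) is used to adaptively summarize $H_q$ with respect to $h^p_i$:
$$ e_{i,j} = w^\top \tanh(W^q h^q_j + W^p h^p_i + b_1) + b_2, \tag{4} $$
$$ a_{i,j} = \frac{\exp(e_{i,j})}{\sum_{j'=1}^{t_q} \exp(e_{i,j'})}, \tag{5} $$
$$ h^{qp}_i = \sum_{j=1}^{t_q} a_{i,j} h^q_j, \tag{6} $$
where $W^q$, $W^p$ and $b_1$ are the learnable parameters that map visual and sentence features to the same $K$-dimensional space. $w \in \mathbb{R}^{K}$ and the scalar $b_2$ work on the coupled textual and visual features and yield their affinity scores. With respect to the affinity scores $e_{i,j}$ (Eq. (4)), the generated visually guided sentence feature $h^{qp}_i$ pays more attention to the words more correlated with $h^p_i$ by adaptively summarizing $H_q$.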
Owing to the attention mechanism in Eqs. (4-6), our proposed interaction module makes each visual feature interact with all the sentence features and attentively summarize them together. As such, fine-grained relationships between the visual and sentence representations are exploited.
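The interaction step can be sketched in plain Python. The sketch assumes an additive attention in the style of Xu et al. (2015): both hidden states are projected to a shared `K`-dimensional space, passed through `tanh`, scored by a vector `w`, and normalized with a softmax over words. The parameter names `W_p`, `W_q`, `b1`, `w`, `b2` and the function names are illustrative; in practice these would be learned tensors in a deep learning framework.

```python
import math


def matvec(M, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def guided_sentence_features(H_p, H_q, W_p, W_q, b1, w, b2):
    """For every visual step, attend over all word states and summarize them.

    Returns (H_qp, weights): one guided sentence feature and one attention
    distribution over words per visual step.
    """
    proj_q = [matvec(W_q, h) for h in H_q]  # project every word state once
    dim = len(H_q[0])
    H_qp, weights = [], []
    for h_p in H_p:
        proj_p = matvec(W_p, h_p)
        # Affinity between this visual step and each word.
        scores = [
            sum(wk * math.tanh(pq + pp + bk)
                for wk, pq, pp, bk in zip(w, pj, proj_p, b1)) + b2
            for pj in proj_q
        ]
        a = softmax(scores)
        # Attention-weighted sum of the word states.
        H_qp.append([sum(a[j] * H_q[j][d] for j in range(len(H_q))) for d in range(dim)])
        weights.append(a)
    return H_qp, weights
```

Because each visual step gets its own distribution over words, different segments of the same tube can focus on different parts of the sentence, for instance the words describing an earlier or a later action.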
3.2.2 Matching Behavior Characterization
After obtaining a set of visually guided sentence features $H_{qp} = \{h^{qp}_i\}_{i=1}^{t_p}$, we characterize the fine-grained matching behaviors between the visual and sentence features. Specifically, the matching behavior between the $i$-th visual and sentence features is defined as
$$ s_i(h^p_i, h^{qp}_i) = \phi(h^p_i, h^{qp}_i). \tag{7} $$
The instantiation of $\phi$ can be realized by different approaches, such as multi-layer perceptron (MLP), inner-product, or cosine similarity. In this paper, we use cosine similarity between $h^p_i$ and $h^{qp}_i$ for simplicity. Finally, we define the matching behavior between an instance proposal $p$ and the sentence $q$ as
$$ s(q, p) = \frac{1}{t_p} \sum_{i=1}^{t_p} s_i(h^p_i, h^{qp}_i). \tag{8} $$
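A minimal sketch of the matching behavior, assuming the per-step function is cosine similarity and the proposal-level score is the mean over visual steps, could look as follows. The small `eps` guard against zero-norm vectors and the function names are additions for illustration.

```python
import math


def cosine(u, v, eps=1e-8):
    """Cosine similarity between two vectors; eps avoids division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)


def matching_behavior(H_p, H_qp):
    """Average per-step cosine similarity between visual and guided sentence features."""
    steps = [cosine(h_p, h_qp) for h_p, h_qp in zip(H_p, H_qp)]
    return sum(steps) / len(steps)
```

Since cosine similarity is bounded in [-1, 1], the proposal-level score is bounded as well, which keeps scores from different proposals and videos on a comparable scale.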
3.3 Training
For the WSSTG task, since no regional annotations are available during training, we cannot optimize the framework in a fully supervised manner. We therefore resort to MIL to optimize the proposed network based on the obtained matching behaviors of the instance-sentence pairs. Specifically, our objective function is defined as
$$ \mathcal{L} = \mathcal{L}_{rank} + \beta \mathcal{L}_{div}, \tag{9} $$
where $\mathcal{L}_{rank}$ is a ranking loss, aiming at distinguishing aligned video-sentence pairs from the unaligned ones. $\mathcal{L}_{div}$ is a novel diversity loss, which is proposed to strengthen the matching behaviors between reliable instance-sentence pairs and penalize the unreliable ones from the aligned video-sentence pair. $\beta$ is a scalar which is set to 1 in all our experiments.
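Ranking Loss. Assume that $\{v, q\}$ is a semantically aligned video-sentence pair. We define the visual-semantic matching score $S$ between $v$ and $q$ as
$$ S(v, q) = \max_{n = 1, \ldots, N} s(q, p_n), \tag{10} $$
where $p_n$ is the $n$-th proposal generated from the video $v$, $s(q, p_n)$ is the matching behavior computed by Eq. (8), and $N$ is the total number of instance proposals.
Suppose that $v'$ and $q'$ are negative samples that are not semantically correlated with $q$ and $v$, respectively. Inspired by (Karpathy and Fei-Fei, 2015), we define the ranking loss as
$$ \mathcal{L}_{rank} = \sum_{v' \neq v} \sum_{q' \neq q} \big[ \max(0, S(v, q') - S(v, q) + \Delta) + \max(0, S(v', q) - S(v, q) + \Delta) \big], \tag{11} $$
where $\Delta$ is a margin which is set to 1 in all our experiments.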
$\mathcal{L}_{rank}$ directly encourages the matching scores of aligned video-sentence pairs to be larger than those of unaligned pairs.
Diversity Loss. One limitation of the ranking loss defined in Eq. (11) is that it does not consider the matching behaviors between the sentence and different instance proposals extracted from an aligned video. A prior for video grounding is that only a few instance proposals in the paired video are semantically aligned to the query sentence, while most of the other instance proposals are not. Thus, it is desirable to have a diverse distribution of the matching behaviors $\{s(q, p_n)\}_{n=1}^{N}$.
To encourage a diverse distribution of $\{s(q, p_n)\}_{n=1}^{N}$, we propose a diversity loss $\mathcal{L}_{div}$ to strengthen the matching behaviors between reliable instance-sentence pairs and penalize the unreliable ones during training. Specifically, we first normalize $\{s(q, p_n)\}_{n=1}^{N}$ by softmax,
$$ s'(q, p_n) = \frac{\exp(s(q, p_n))}{\sum_{n'=1}^{N} \exp(s(q, p_{n'}))}, \tag{12} $$
and then penalize the entropy of the distribution of $\{s'(q, p_n)\}_{n=1}^{N}$ by defining the diversity loss as
$$ \mathcal{L}_{div} = -\sum_{n=1}^{N} s'(q, p_n) \log s'(q, p_n). \tag{13} $$
Note that the smaller $\mathcal{L}_{div}$ is, the more diverse $\{s(q, p_n)\}_{n=1}^{N}$ will be, which implicitly encourages the matching scores of semantically aligned instance-sentence pairs to be larger than those of the misaligned pairs.
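The training objective can be sketched for a single triplet as below. Several choices here are assumptions made for illustration: the video-level score is taken as the maximum matching behavior over proposals (a common multiple-instance choice), the ranking loss is a two-sided hinge with margin 1 for one negative sentence and one negative video, and the diversity loss is the entropy of the softmax-normalized proposal scores. The function names are hypothetical; summing over all negatives in a batch would wrap `ranking_loss` in a loop.

```python
import math


def video_score(proposal_scores):
    """Video-sentence score from per-proposal matching behaviors (max is assumed)."""
    return max(proposal_scores)


def ranking_loss(s_pos, s_neg_sentence, s_neg_video, margin=1.0):
    """Hinge loss pushing the aligned pair above both kinds of unaligned pairs."""
    return (max(0.0, s_neg_sentence - s_pos + margin)
            + max(0.0, s_neg_video - s_pos + margin))


def diversity_loss(proposal_scores):
    """Entropy of the softmax-normalized scores; low entropy means a peaked distribution."""
    m = max(proposal_scores)
    exps = [math.exp(s - m) for s in proposal_scores]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def total_loss(aligned, neg_sentence, neg_video, beta=1.0, margin=1.0):
    """Ranking loss plus beta times the diversity loss of the aligned pair.

    aligned:      proposal scores of video v against its own sentence q
    neg_sentence: proposal scores of v against an unrelated sentence q'
    neg_video:    proposal scores of an unrelated video v' against q
    """
    l_rank = ranking_loss(video_score(aligned), video_score(neg_sentence),
                          video_score(neg_video), margin)
    return l_rank + beta * diversity_loss(aligned)
```

The entropy is largest when all proposals score equally and shrinks as a few proposals dominate, so minimizing it favors score distributions in which only a small number of proposals match the sentence.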
3.4 Inference
Given a testing video and a query sentence, we extract candidate instance proposals, and characterize the matching behavior between each instance proposal and the sentence by the proposed attentive interactor. The instance with the strongest matching behavior is deemed the result of the WSSTG task.
Figure 4: Samples of the newly constructed VID-sentence dataset. Sentences are shown on the top of images and the associated target instances are enclosed with green bounding boxes. The example sentences are “A brown and white dog is lying on the grass and then standing up” and “A large elephant runs in the water from left to right”.
A main challenge for the WSSTG task is the lack of suitable datasets. Existing datasets like TACoS (Regneri et al., 2013) and YouCook (Das et al., 2013) are unsuitable as they do not provide spatio-temporal annotations for target instances in the videos, which are necessary for evaluating the WSSTG task. To the best of our knowledge, the most suitable existing dataset is the Person-sentence dataset provided by (Yamaguchi et al., 2017), which is used for spatio-temporal person search among videos. However, this dataset is too simple for the WSSTG task since it contains only people in the videos. To this end, we contribute a new dataset by annotating videos in the ImageNet video object detection dataset (VID) (Russakovsky et al., 2015) with sentence descriptions. We choose VID as the visual materials for two primary reasons. First, it is one of the largest video detection datasets containing videos of diverse categories in complicated scenarios. Second, it provides dense bounding-box annotations and instance IDs which help avoid labor-intensive annotations for spatio-temporal regions of the validation/testing set.
VID-sentence Annotation. With 30 categories, VID contains 3,826, 555 and 937 videos for training, validation and testing, respectively. We first divide videos in the training and validation sets into trimmed videos based on the provided instance IDs, and delete videos shorter than 9 frames. As such, there remain 9,029 trimmed videos in total. In each trimmed video, one instance is identified as a sequence of bounding boxes. A group of annotators are asked to provide sentence descriptions for the target instances. Each target instance is annotated with one sentence description. An instance is discarded if it is too difficult to provide a unique and precise description. After annotation, there are 7,654 videos with sentence descriptions. We randomly select 6,582 videos as the training set, and evenly split the remaining videos into the validation and testing sets (i.e., each contains 536 videos). Some examples from the VID-sentence dataset are shown in Fig. 4.
Dataset Statistics. To summarize, the created dataset has 6,582/536/536 spatio-temporal instances with descriptions for training/validation/testing. It covers all 30 categories in VID. The size of the vocabulary is 1,823 and the average length of the descriptions is 13.2. Table 1 shows the statistics of our constructed VID-sentence dataset. Compared with the Person-sentence dataset, our VID-sentence dataset has a similar description length but includes more instances and categories.
It is important to note that, although VID provides regional annotations for the training set, these annotations are not used in any of our experiments since we focus on weakly-supervised spatio-temporal video grounding.
In this section, we first compare our method with different kinds of baseline methods on the created VID-sentence dataset, followed by the ablation study. Finally, we show how well our model generalizes on the Person-sentence dataset.
5.1 Experimental Settings
Table 1: Statistics of the VID-sentence dataset and the previous Person-sentence dataset (Yamaguchi et al., 2017).
Baseline Models. Existing weakly-supervised video grounding methods (Huang et al., 2018; Zhou et al., 2018) are not applicable to the WSSTG task. Huang et al. (2018) require temporal alignment between a sequence of transcription descriptions and the video segments to ground a noun/pronoun in a certain frame, while Zhou et al. (2018) mainly ground nouns/pronouns in specific frames of videos. As such, we develop three baselines based on DVSA (Karpathy and Fei-Fei, 2015), GroundeR (Rohrbach et al., 2016), and a variant frame-level method modified from (Zhou et al., 2018) for performance comparisons. Following recent grounding methods like (Rohrbach et al., 2016; Chen et al., 2018b), we use the last hidden state of an LSTM encoder as the sentence embedding for all the baselines.
Since DVSA and GroundeR are originally proposed for image grounding, in order to adapt them to video, we consider three methods to encode visual features, namely averaging (Avg), NetVLAD (Arandjelovic et al., 2016), and LSTM. For the variant baseline modified from (Zhou et al., 2018), we densely predict each frame to generate a spatio-temporal prediction.
Implementation Details. Similar to (Zhou et al., 2018), we use the region proposal network from Faster-RCNN pretrained on MSCOCO (Lin et al., 2014) to extract frame-level region proposals. For each video, we extract 30 bounding boxes for each frame and link them into 30 spatio-temporal tubes with the method of Gkioxari and Malik (2015). We map the word embedding to 512 dimensions before feeding it to the LSTM encoder. The dimension of the hidden state of all the LSTMs is set to 512. The batch size is 16, i.e., 16 videos with 480 instance proposals in total and 16 corresponding sentences. We construct positive and negative video-sentence pairs for training within a batch for efficiency, i.e., roughly 16 positive pairs and 240 negative pairs for the triplet construction. SGD is used to optimize the models with a learning rate of 0.001 and momentum of 0.9. We train all the models for 30 epochs. Please refer to the supplementary materials for more details.
Evaluation Metric. We use the bounding box localization accuracy for evaluation. An output instance is considered “accurate” if the overlap between the detected instance and the ground-truth is greater than a given threshold. The definition of the overlap is the same as in (Yamaguchi et al., 2017), i.e., the average overlap of the bounding boxes in annotated frames. Several thresholds are used for extensive evaluations.
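The evaluation metric can be sketched as follows, assuming tubes are stored as dictionaries mapping a frame index to an `(x1, y1, x2, y2)` box. Treating annotated frames that are missing from the prediction as zero overlap is an assumption, and the threshold passed to `accuracy` is a free parameter of this illustration rather than a value taken from the text.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = w * h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def tube_overlap(pred, gt):
    """Average IoU over the annotated (ground-truth) frames.

    pred, gt: dicts mapping frame index -> box. Annotated frames without a
    predicted box contribute zero overlap (an assumption).
    """
    if not gt:
        return 0.0
    total = sum(iou(pred[t], box) if t in pred else 0.0 for t, box in gt.items())
    return total / len(gt)


def accuracy(overlaps, threshold):
    """Fraction of test instances whose tube overlap exceeds the threshold."""
    return sum(o > threshold for o in overlaps) / len(overlaps)
```

Calling `accuracy` once per threshold on the same list of overlaps gives one accuracy value per threshold, which can then be averaged for a single summary number.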
5.2 Performance Comparisons
Table 2: Performance comparisons on the proposed VID-sentence dataset. The top entry of all the methods except the upper bound is highlighted in boldface.
Table 2 shows the performance comparisons between our model and the baselines. We additionally show the performance of randomly choosing an instance proposal and the upper bound performance of choosing the instance proposal of the largest overlap with the ground-truth.
The results suggest the following. 1) Models with NetVLAD (Arandjelovic et al., 2016) perform the worst. We suspect that models based on NetVLAD are complicated and the supervision is too weak to optimize them sufficiently well. 2) Models with LSTM embedding perform only comparably to models based on simple averaging. This is mainly because the power of LSTM has not been fully exploited. 3) The variant method of (Zhou et al., 2018) performs better than both DVSA and GroundeR with various kinds of visual encoding techniques, indicating its power for the task. 4) Our model achieves the best results, showing that it is better at characterizing the matching behaviors between the query sentence and the visual instances in the video.
To compare the methods qualitatively, we show an exemplar sample in Fig. 5. Compared with GroundeR+LSTM and DVSA+LSTM, our method identifies a more accurate instance from the candidate instance proposals. Moreover, the instances generated by our method are more temporally consistent compared with the modified frame-level method (Zhou et al., 2018). This can be attributed to the exploitation of the temporal information during instance generation and attentive interactor in our model.
5.3 Ablation Study
Figure 5: An exemplar of the results by different methods. The sentence is shown on the top. Three frames of the detected results and the ground-truth are respectively bounded with blue lines and green dotted lines. IoU scores between the detected instances and the ground-truth are shown below the images. Best viewed on screen.
Table 3: Ablation study of the proposed attentive interactor and diversity loss.
Figure 6: Visualization of the attentive interaction. On the top, we show an instance highlighted in the blue box in three different segments. On the bottom, we show the corresponding distributions of the attention weights. Darker colors mean larger attentive weights. Intuitively, the attention weight matches well with the visual contents such as “puppy” in all three segments and “hand” in the segment with ID 2. Best viewed on screen.
To verify the contributions of the proposed attentive interactor and diversity loss, we perform the following ablation study. To be specific, we compare the full method with three variants, including: 1) removing both the attentive interactor and diversity loss, which is equivalent to the DVSA model using LSTM for encoding both the visual features and sentence features, termed as Base; 2) Base+Div, which is formed by introducing the diversity loss; 3) Base+Int with the attentive interactor module.
Table 3 shows the corresponding results. Compared with Base, both the diversity loss and attentive interactor consistently improve the performance. Moreover, to show the effectiveness of the proposed attentive interactor, we visualize the adaptive weight $a$ in Eq. (5). As shown in Fig. 6, our method adaptively pays more attention to the words that match the instance, such as “puppy” in all three segments and “hand” in the segment with ID 2. To show the effectiveness of the diversity loss, we divide instance proposals in the testing set into 10 groups based on their IoU scores with the ground-truth and then calculate the average matching behaviors of each group, predicted by counterparts with and without the diversity loss. As shown in Fig. 7, the proposed diversity loss penalizes the matching behaviors of the instances of lower IoU with the ground-truth while strengthening those of instances of higher IoU.
Figure 7: Comparison of the distribution of the matching behaviors of instances.
Table 4: Performance comparisons on the Person-sentence dataset (Yamaguchi et al., 2017).
5.4 Experiments on Person-sentence Dataset
We further evaluate our model and the baseline methods on the Person-sentence dataset (Yamaguchi et al., 2017). We ignore the bounding box annotations in the training set and carry out experiments for the proposed WSSTG task. For fair comparisons, all experiments are conducted with the visual feature extractor provided by (Carreira and Zisserman, 2017).
Table 4 shows the results. Similarly, the proposed attentive interactor model (without the diversity loss) outperforms all the baselines. Moreover, the diversity loss further improves the performance. Note that the improvement of our model on this dataset is more significant than that on the VID-sentence dataset. The reason might be that the upper bound performance on Person-sentence is much higher than that on VID-sentence (77.9 for Person-sentence versus 47.6 for VID-sentence on average). This also suggests that the created VID-sentence dataset is more challenging and more suitable as a benchmark dataset.
In this paper, we introduced a new task, namely weakly-supervised spatio-temporally grounding natural sentence in video. It takes a sentence and a video as input and outputs a spatio-temporal tube from the video, which semantically matches the sentence, with no reliance on spatio-temporal annotations during training. We handled this task based on the multiple instance learning framework. An attentive interactor and a diversity loss were proposed to learn the complicated relationships between the instance proposals and the sentence. Extensive experiments showed the effectiveness of our model. Moreover, we contributed a new dataset, named as VID-sentence, which can serve as a benchmark for the proposed task.
Relja Arandjelovic, Petr Gronat, Akihiko Torii, Tomas Pajdla, and Josef Sivic. 2016. Netvlad: Cnn architecture for weakly supervised place recognition. In CVPR, pages 5297–5307.
Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, pages 4724–4733.
Jingyuan Chen, Xinpeng Chen, Lin Ma, Zequn Jie, and Tat-Seng Chua. 2018a. Temporally grounding natural sentence in video. In EMNLP.
Jingyuan Chen, Lin Ma, Xinpeng Chen, Zequn Jie, and Jiebo Luo. 2019. Localizing natural language in videos. In AAAI.
Kan Chen, Trung Bui, Chen Fang, Zhaowen Wang, and Ram Nevatia. 2017. Amc: Attention guided multi-modal correlation learning for image search. In CVPR, pages 6203–6211.
Kan Chen, Jiyang Gao, and Ram Nevatia. 2018b. Knowledge aided consistency for weakly supervised phrase grounding. arXiv preprint arXiv:1803.03879.
Xinpeng Chen, Lin Ma, Jingyuan Chen, Zequn Jie, Wei Liu, and Jiebo Luo. 2018c. Real-time referring expression comprehension by single-stage grounding network. arXiv preprint arXiv:1812.03426.
Volkan Cirik, Taylor Berg-Kirkpatrick, and Louis-Philippe Morency. 2018. Using syntax to ground referring expressions in natural images. In AAAI.
P. Das, C. Xu, R. F. Doell, and J. J. Corso. 2013. A thousand frames in just a few words: Lingual description of videos through latent topics and sparse object stitching. In CVPR.
Jiyang Gao, Runzhou Ge, Kan Chen, and Ram Nevatia. 2018. Motion-appearance co-memory networks for video question answering. arXiv preprint arXiv:1803.10906.
Georgia Gkioxari and Jitendra Malik. 2015. Finding action tubes. In CVPR, pages 759–768.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR, pages 770–778.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
Ronghang Hu, Huazhe Xu, Marcus Rohrbach, Jiashi Feng, Kate Saenko, and Trevor Darrell. 2016. Natural language object retrieval. In CVPR, pages 4555–4564.
De-An Huang, Shyamal Buch, Lucio Dery, Animesh Garg, Li Fei-Fei, and Juan Carlos Niebles. 2018. Finding “it”: Weakly-supervised reference-aware visual grounding in instructional videos. In CVPR.
Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In CVPR, pages 3128–3137.
Chen Kong, Dahua Lin, Mohit Bansal, Raquel Urtasun, and Sanja Fidler. 2014. What are you talking about? text-to-image coreference. In CVPR, pages 3558–3565.
Jianan Li, Yunchao Wei, Xiaodan Liang, Fang Zhao, Jianshu Li, Tingfa Xu, and Jiashi Feng. 2017. Deep attribute-preserving metric learning for natural language object retrieval. In MM, pages 181–189.
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In ECCV, pages 740–755.
Lin Ma, Zhengdong Lu, and Hang Li. 2016. Learning to answer questions from image using convolutional neural network. In AAAI.
Lin Ma, Zhengdong Lu, Lifeng Shang, and Hang Li. 2015. Multimodal convolutional neural networks for matching image and sentence. In ICCV.
Cynthia Matuszek, Nicholas FitzGerald, Luke Zettlemoyer, Liefeng Bo, and Dieter Fox. 2012. A joint model of language and perception for grounded attribute learning. arXiv preprint arXiv:1206.6423.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111–3119.
Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. 2015. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV, pages 2641–2649.
Michaela Regneri, Marcus Rohrbach, Dominikus Wetzel, Stefan Thater, Bernt Schiele, and Manfred Pinkal. 2013. Grounding action descriptions in videos. Transactions of the Association for Computational Linguistics, 1:25–36.
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, pages 91–99.
Anna Rohrbach, Marcus Rohrbach, Ronghang Hu, Trevor Darrell, and Bernt Schiele. 2016. Grounding of textual phrases in images by reconstruction. In ECCV, pages 817–834.
Anna Rohrbach, Marcus Rohrbach, Siyu Tang, Seong Joon Oh, and Bernt Schiele. 2017. Generating descriptions with grounded and co-referenced people. arXiv preprint arXiv:1704.01518, 3.
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. 2015. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252.
Mohammad Amin Sadeghi and Ali Farhadi. 2011. Recognition using visual phrases. In CVPR, pages 1745–1752.
Peng Tang, Xinggang Wang, Song Bai, Wei Shen, Xiang Bai, Wenyu Liu, and Alan Loddon Yuille. 2018. Pcl: Proposal cluster learning for weakly supervised object detection. IEEE transactions on pattern analysis and machine intelligence.
Peng Tang, Xinggang Wang, Xiang Bai, and Wenyu Liu. 2017. Multiple instance detection network with online instance classifier refinement. In CVPR.
Arun Balajee Vasudevan, Dengxin Dai, and Luc Van Gool. 2018. Object referring in videos with language and human gaze. arXiv preprint arXiv:1801.01582.
Bairui Wang, Lin Ma, Wei Zhang, and Wei Liu. 2018a. Reconstruction network for video captioning. In CVPR.
Jingwen Wang, Wenhao Jiang, Lin Ma, Wei Liu, and Yong Xu. 2018b. Bidirectional attentive fusion with context gating for dense video captioning. In CVPR.
Liwei Wang, Yin Li, and Svetlana Lazebnik. 2016a. Learning deep structure-preserving image-text embeddings. In CVPR, pages 5005–5013.
Mingzhe Wang, Mahmoud Azab, Noriyuki Kojima, Rada Mihalcea, and Jia Deng. 2016b. Structured matching for phrase localization. In ECCV, pages 696–711.
Fanyi Xiao, Leonid Sigal, and Yong Jae Lee. 2017. Weakly-supervised visual grounding of phrases with linguistic structures. arXiv preprint arXiv:1705.01371.
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In ICML, pages 2048–2057.
Masataka Yamaguchi, Kuniaki Saito, Yoshitaka Ushiku, and Tatsuya Harada. 2017. Spatio-temporal person retrieval via natural language queries. In ICCV.
Haonan Yu and Jeffrey Mark Siskind. 2015. Sentence directed video object codetection. arXiv preprint arXiv:1506.02059.
Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, and Tamara L Berg. 2018. Mattnet: Modular attention network for referring expression comprehension. In CVPR.
Licheng Yu, Hao Tan, Mohit Bansal, and Tamara L Berg. 2017. A joint speaker-listener-reinforcer model for referring expressions. In CVPR, volume 2.
Christopher Zach, Thomas Pock, and Horst Bischof. 2007. A duality based approach for realtime TV-L1 optical flow. In Joint Pattern Recognition Symposium, pages 214–223.
Hanwang Zhang, Yulei Niu, and Shih-Fu Chang. 2018. Grounding referring expressions in images by variational context. In CVPR, pages 4158–4166.
Yuting Zhang, Luyao Yuan, Yijie Guo, Zhiyuan He, I-An Huang, and Honglak Lee. 2017. Discriminative bimodal networks for visual localization and detection with natural language queries. In CVPR.
Fang Zhao, Jianshu Li, Jian Zhao, and Jiashi Feng. 2018. Weakly supervised phrase localization with multi-scale anchored transformer network. In CVPR, pages 5696–5705.
Luowei Zhou, Nathan Louis, and Jason J Corso. 2018. Weakly-supervised video object grounding from text by loss weighting and object interaction. BMVC.
We provide more descriptions of the baseline methods and implementation details in this supplementary material.
Baseline Details. We consider three methods to encode visual instance features: averaging (Avg), NetVLAD (Arandjelovic et al., 2016), and LSTM. For Avg, we simply average the features of all the segments and forward the result to two fully connected layers. For NetVLAD, we treat the segment features as independent features of equal dimension and use a fully connected layer to obtain an output with the desired dimension. For LSTM, we take the segment features as a sequence with one feature per time step and use the last hidden state of the LSTM as the embedded visual representation.
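As a rough illustration of the Avg encoder, the following minimal sketch averages the per-segment features and passes the result through two fully connected layers. The function names, the plain-list representation, and the ReLU between the two layers are assumptions made for this example; the text does not specify an activation or any weights.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def average_segments(segments: List[Vector]) -> Vector:
    """Average per-segment feature vectors into one vector."""
    if not segments:
        raise ValueError("at least one segment feature is required")
    dim = len(segments[0])
    if any(len(s) != dim for s in segments):
        raise ValueError("all segment features must share one dimension")
    n = len(segments)
    return [sum(s[i] for s in segments) / n for i in range(dim)]


def linear(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Fully connected layer: one output per weight row, plus its bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def relu(x: Vector) -> Vector:
    """Element-wise ReLU (assumed here; not stated in the text)."""
    return [max(0.0, v) for v in x]


def avg_encoder(segments: List[Vector],
                w1: Matrix, b1: Vector,
                w2: Matrix, b2: Vector) -> Vector:
    """Average the segment features, then apply two fully connected layers."""
    pooled = average_segments(segments)
    hidden = relu(linear(pooled, w1, b1))
    return linear(hidden, w2, b2)


# Toy usage: two 2-dimensional segment features and hand-picked weights.
embedding = avg_encoder(
    [[1.0, 2.0], [3.0, 4.0]],
    w1=[[1.0, 0.0], [0.0, 1.0]], b1=[0.0, 0.0],
    w2=[[1.0, 1.0]], b2=[0.5],
)
```

In practice the weights would be learned and the layers would come from a deep learning framework; the sketch only makes the order of operations explicit.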
For models based on DVSA, we evaluate the similarity between the spatio-temporal instance and the query sentence with cosine similarity. For models based on GroundeR, we concatenate the representations from the visual encoder and the sentence encoder as the input to the attention network and the reconstruction network. For the variant of Zhou et al. (2018), we run prediction densely on every frame of the video to generate a spatio-temporal instance. This baseline is carefully implemented by modifying the original method (Zhou et al., 2018) in two respects. First, we replace the noun encoder with an LSTM to encode natural sentences, since we focus on grounding with natural sentences. Second, we remove the frame-wise loss weighting term, as it degrades the performance on the VID-sentence dataset. This loss term was proposed to penalize uncertainty about whether an object is present, which is unnecessary here because each video in our dataset contains the target instance in all frames.
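The cosine-similarity matching used for the DVSA-based models can be sketched as follows. This is a minimal illustration, assuming the instance and sentence embeddings are already available as equal-length lists of floats; the helper names and the choice to return the highest-scoring instance are assumptions for this example, not details given in the text.

```python
import math
from typing import List, Tuple

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_instance(instances: List[Vector],
                  sentence: Vector) -> Tuple[int, List[float]]:
    """Score every instance against the sentence; return the top index and all scores."""
    scores = [cosine_similarity(v, sentence) for v in instances]
    top = max(range(len(scores)), key=scores.__getitem__)
    return top, scores


# Toy usage: three 2-dimensional instance embeddings and one sentence embedding.
index, scores = best_instance([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 2.0])
```

Because cosine similarity ignores vector length, the sentence embedding `[0.0, 2.0]` matches the instance `[0.0, 1.0]` perfectly even though their norms differ.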
The outputs of Avg and NetVLAD (Arandjelovic et al., 2016) are also set to 512 dimensions by a fully connected layer. The number of centers and the dimension of each cluster center for NetVLAD are 32 and 128, respectively.
Implementation Details. We give more details on how to generate instances from videos and extract the corresponding visual feature for each instance. We use the region proposal network from Faster-RCNN (Ren et al., 2015) to extract 30 region proposals for each video frame. The Faster-RCNN model is based on ResNet-101 (He et al., 2016) pretrained on MSCOCO (Lin et al., 2014). For the frame-level RoI pooled feature, we use the 2048-dimensional feature from the last fully connected layer of the same Faster-RCNN model. For the I3D features (Carreira and Zisserman, 2017), we use the model pretrained on Kinetics to extract the RGB sequence features I3D-RGB and the flow sequence features I3D-Flow. For every 64 consecutive frames, we extract a set of eight 1024-dimensional I3D-RGB features and eight 1024-dimensional I3D-Flow features, taking the output of the last average pooling layer with the last temporal pooling operation dropped. We compute optical flow with the TV-L1 algorithm (Zach et al., 2007). We crop the region proposals from the RGB images and flow images and then resize them to a fixed size before feeding them to I3D.
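The feature counts above imply one 1024-dimensional I3D feature per 8 frames in each stream. The sketch below does only this shape bookkeeping. It assumes non-overlapping 64-frame windows and that a trailing partial window is discarded; neither detail is stated in the text, and the function name is hypothetical.

```python
from typing import Tuple

WINDOW_FRAMES = 64        # frames consumed per I3D forward pass (from the text)
FEATURES_PER_WINDOW = 8   # features produced per window (from the text)
I3D_DIM = 1024            # dimension of each I3D feature (from the text)


def i3d_feature_shape(num_frames: int) -> Tuple[int, int]:
    """Return (number of features, feature dimension) for one stream.

    Assumes non-overlapping windows and drops any trailing partial window.
    """
    if num_frames < 0:
        raise ValueError("num_frames must be non-negative")
    windows = num_frames // WINDOW_FRAMES
    return windows * FEATURES_PER_WINDOW, I3D_DIM


# Toy usage: a 128-frame instance spans two full windows.
rgb_shape = i3d_feature_shape(128)
```

The same shape applies to the I3D-RGB and I3D-Flow streams, since both are extracted from the same 64-frame windows.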