Grounding language in visual regions provides a fine-grained perspective towards visual recognition and has become a prominent research problem in the computer vision and natural language processing communities [6, 19, 20, 25]. In this paper, we study the problem of video object grounding, where a video (segment) and an associated sentence are given and the goal is to localize the objects that are mentioned in the sentence in the video. This task is often formulated as a visual-semantic alignment problem [7] and has broad applications including retrieval [7, 8], description generation [20, 26], and human-robot interaction [1, 22].
Like most fine-grained recognition problems [15, 18], grounding can be extremely data intensive, especially in the context of unconstrained video. On the other hand, video-sentence pairs (e.g., videos paired with YouTube Automatic Speech Recognition scripts) are easier to obtain than object region annotations. We focus on the weakly-supervised version of the grounding problem where the only supervision is sentence descriptions; no spatially-aligned object bounding boxes are available for training. Sentence grounding can involve multiple interacting objects, which sets our work apart from the relatively well-studied weakly-supervised object localization problem, where one or more objects are localized independently [10, 17].
Existing work on visual grounding falls into two categories: multiple instance learning [6, 7] and visual attention [19]. In either case, the visual-semantic similarity is first measured between the target object/phrase and all the image-level, i.e. spatial, object region proposals. Then, either a ranking loss or a reconstruction loss (both of which we refer to here as matching losses) measures the quality of the matching. A naive extension of the existing approaches to the video domain is to treat the entire video segment as a bag of spatial object proposals. However, this presents two issues. First, existing methods rely on the assumption that the target object appears in at least one of the proposal regions. This assumption is weak when it comes to video, since a query object might appear sparsely across multiple frames and might not be detected completely. The segment-level supervision, i.e. object labels, could be potentially strengthened if applied to individual frames. Second, a video segment can last up to several minutes. Even with temporal down-sampling, this can bring in tens or hundreds of frames and hence thousands of proposals, which compromises the visual-semantic alignment accuracy.
To address these two issues, we propose a frame-wise loss weighting framework for video grounding. We ground the target objects on a frame-by-frame basis. We face the challenge that the segment-level supervision is not applicable to individual frames where the query object is off-screen, occluded, or just not present in the proposals for that frame. Our solution is to first estimate the likelihood that the query object is present in (a proposal in) each video frame. If the likelihood is high, we judge the matching quality mainly on the matching loss. Otherwise, we down-weight the matching loss while bringing in a penalty loss. The lower the confidence, the higher the penalty. With the conditioned frame-wise grounding framework, the proposed model can avoid being flooded with massive proposals even when the sampling rate is high and only make predictions for applicable frames.
We propose two approaches to estimate frame-wise object likelihood (confidence) scores. The first one is conditioned on both visual and textual inputs, namely, the maximum visual-semantic similarity scores in each frame. The second approach is inspired by the fact that the combination of objects can imply their order of appearance in the video. For example, when a sequence of objects “tomatoes”, “pan” and “plate” appears in the description, the video scene is likely to include a shot of tomatoes being grilled in the pan at the beginning, and a shot of tomatoes being moved to the plate at the end. In the temporal domain, “pan” appears mostly ahead of “plate” while “tomatoes” intersects with both. We implicitly model the object interaction with self-attention [23] and use textual guidance to estimate the frame-wise object likelihood.
For evaluation, due to the lack of existing video grounding benchmarks, we have collected annotations over the large-scale instructional video dataset YouCook2, which provides over 15,000 video segment-description pairs. We sample the validation and testing videos at 1 fps and draw bounding boxes for the 67 most frequent objects when they are present in both the video segment and the description. We compare our methods against competitive baselines on video grounding, and our proposed methods achieve state-of-the-art performance.
Our contributions are twofold: 1) we propose a novel frame-wise loss weighting framework for the video object grounding problem that outperforms competitive baselines; 2) we provide a benchmark dataset for video grounding.
Grounding in Image/Video. Supervised grounding or referring has been intensively studied [15, 16, 28] in the image domain. These methods require dense bounding box annotations for training, which are expensive to obtain. Recently, an increasing amount of attention has shifted towards the weakly-supervised grounding problem [6, 7, 8, 19, 25], where only descriptive phrases, and no explicit target grounding locations, are made accessible during training. Karpathy and Fei-Fei [7] propose to pair image regions to words in a sentence by computing a visual-semantic similarity score, finding the word that best describes the region. Rohrbach et al. [19] ground textual phrases in images by reconstructing the original phrase through visual attention. Yu and Siskind [27] ground objects from text in constrained videos. Huang et al. [6] extend [7] to the video domain and further improve the work by modeling the reference relationships among segments. In this work, we tackle the problem from a novel angle by fully exploiting the visual-semantic relations within each segment, i.e. frame-wise supervision and object interactions.
Weakly-supervised Object Localization. Weakly-supervised object localization has been explored in both the image [2, 4, 5, 14, 21] and the video domain [10, 17]. Unlike object grounding from text, object localization typically involves localizing an object class or a video tag in the visual content. Existing works in the image domain naturally pursue a multiple instance learning (MIL) approach to this problem. Positive instances are images where the label is present, and negative instances are given as images with the label absent. In the video domain, the existing methods [10, 17] approach this problem by taking advantage of motion information and similarity between frames to generate spatio-temporal tubes. Note that these tubes are much more expensive to obtain compared with spatial proposals, hence we only consider the latter option.
Object Interaction. Object interaction was initially proposed to detect fine-grained visual details for action detection, such as the temporal relationships between objects in a scene, to overcome changes in illumination, pose, occlusion, etc. Some works have modeled object interaction using pairwise or higher-order relationships [11, 12, 13]. Ni et al. [13] consolidate object detections at each step by modeling pair-wise object relationships and hence enforce the temporal object consistency in each additional step. Ma et al. [12] implicitly model the higher-order interactions among object region proposals, using groups and subgroups rather than just pairwise interactions. Inspired by recent work [3, 25], where the linguistic structure of the input phrase is leveraged to infer the spatial object locations, we propose to model object interaction from a linguistic perspective as a textual guidance for grounding.
We start this section by introducing some background knowledge. In Sec. 3.2, we describe the video object grounding baseline. We then propose our framework in Sec. 3.3 by extending the segment-level object label supervision to the frame level. Two novel approaches are proposed for judging under what circumstances the frame-level supervision is applicable.
3.1 Background
In this section we provide some background on the visual-semantic alignment framework (grounding by ranking) and self-attention, which are building blocks of our model.
Grounding by Ranking. We start by describing the ranking-based grounding approach from [7]. Given a sentence description including $O$ query objects/phrases and a set of $N$ object region proposals from an image, the goal is to match each referred object in the query to one of the object proposals. Queries and visual region proposals are first encoded in a common $d$-dimensional space. Denote the object query feature vectors as $\{q_k\}$, $k = 1,2,\ldots,O$ and the region proposal feature vectors as $\{r_i\}$, $i = 1,2,\ldots,N$. We pack the feature vectors into matrices $Q = (q_1,\ldots,q_O)$ and $R = (r_1,\ldots,r_N)$. The visual-semantic matching score of the description and the image is formulated as:

$$S(Q, R) = \frac{1}{O}\sum_{k=1}^{O}\max_{i} a_{ik}, \tag{1}$$

where $a_{ik} = q_k^\top r_i$ measures the similarity between query $q_k$ and proposal $r_i$. Defining negative samples $Q'$ and $R'$ as the query and proposal from texts and images that are not paired with $R$ nor $Q$, the grounding by ranking framework minimizes the following margin loss:

$$L_{rank} = \sum_{R' \neq R}\sum_{Q' \neq Q}\left[\max\left(0,\, S(Q, R') - S(Q, R) + \Delta\right) + \max\left(0,\, S(Q', R) - S(Q, R) + \Delta\right)\right], \tag{2}$$

where the first ranking term encourages the correct region proposal matching and the second ranking term encourages the correct sentence matching. $\Delta$ is the ranking margin. During inference, the proposal with the maximal similarity score $a_{ik}$ with each object query is selected.

Self-Attention. We now describe the scaled dot-product attention model. Define a set of queries $q_j \in \mathbb{R}^d$, a set of keys $k_t \in \mathbb{R}^d$ and values $v_t \in \mathbb{R}^d$, where $j = 1,2,\ldots,O$ is the query index and $t = 1,2,\ldots,T$ is the key/value index. Given an arbitrary query $q_j$, scaled dot-product attention computes the output as a weighted sum of values $v_t$, where the weights are determined by the scaled dot-products of query $q_j$ and keys $k_t$, as formulated below:

$$A(q_j, K, V) = \mathrm{Softmax}\left(\frac{q_j^\top K}{\sqrt{d}}\right) V^\top, \tag{3}$$

where we pack $k_t$ and $v_t$ into matrices $K = (k_1,\ldots,k_T)$ and $V = (v_1,\ldots,v_T)$, respectively. Self-attention [23] is a special case of the scaled dot-product attention where the queries, keys and values are all identical. In our case, they are all object encoding vectors and self-attention encodes the semantic relationships among the objects. We adopt a multi-head version of the self-attention layer [23, 30] for modeling object relationships, which deploys multiple parallel self-attention layers.
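As a small illustration of scaled dot-product attention, the following pure-Python sketch computes a softmax over scaled query-key dot products and returns the corresponding weighted sum of values. The function names and the toy vectors are illustrative assumptions, not part of the model described here; it is single-head only, and passing the same list as queries, keys and values gives the self-attention special case.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(query, keys, values):
    """Attend from one query vector to T key/value vectors of size d."""
    d = len(query)
    # Dot product between the query and every key, scaled by 1/sqrt(d).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # The output is the weighted sum of the value vectors.
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
    return out, weights

def self_attention(vectors):
    """Self-attention: queries, keys and values are the same vectors."""
    return [scaled_dot_product_attention(v, vectors, vectors)[0] for v in vectors]

# Toy object encodings (hypothetical values, d = 2).
objects = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
encoded = self_attention(objects)
```

Each row of `encoded` is a mixture of all object encodings, weighted by how strongly the corresponding object attends to the others.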
3.2 Video Object Grounding
We adapt the Grounding by Ranking framework [7] to the video domain, and this adaptation will serve as our baseline. Denote the set of $T$ frames in a video segment as $\{f_t\}$ and the object proposals in frame $t$ as $r^t_i$, $i = 1,2,\ldots,N$. As before, we denote the object queries as $q_k$ and compute the similarity between each query object and all the proposals $\{r^t_i\}$ in a segment. Note that the similarity dot product might grow large in magnitude as $d$ increases [23]. Hence, we scale the dot-product by $\frac{1}{\sqrt{d}}$ and restrict $a^t_{ik}$ to be between 0 and 1 with a Sigmoid function. The similarity function and segment-description matching score are then:

$$a^t_{ik} = \mathrm{Sigmoid}\left(\frac{q_k^\top r^t_i}{\sqrt{d}}\right), \qquad S(Q, R) = \frac{1}{O}\sum_{k=1}^{O}\max_{t,i} a^t_{ik}. \tag{4}$$

Figure 1: An overview of our framework. Inputs to the system are a video segment and a phrase that describes the segment. The objects from the phrase are grounded for each sampled frame t. Object and proposal features are encoded to size d and visual-semantic similarity scores are computed. The ranking loss is weighted by a confidence score, which combined with the penalty forms the final loss. The object relations are further encoded to guide the loss weights (see Sec. 3.4 for details). During inference, the region proposal with the maximum similarity score with the object query is selected for grounding.
This “brute-force” extension of the Grounding by Ranking framework to the video domain presents two issues. First, depending on the video sampling rate, the total number of proposals per segment (T × N) could be extremely large. Hence this solution does not scale well to long frame sequences. Second, an object existing sparsely across multiple frames might not be detected completely, since successfully spotting it in a single frame would trigger a satisfactory match. We explain next how we propagate this weak supervisory signal from the segment level to frames that likely contain the target object.
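The baseline described above can be sketched in a few lines of pure Python. This is a minimal illustration under stated assumptions rather than the actual implementation: the segment is treated as one bag of T × N proposals, the segment score averages the best sigmoid-scaled similarity per query, and the margin loss uses a single negative sentence and a single negative segment. All function names and toy feature values are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def similarity(query, proposal):
    """Sigmoid of the query-proposal dot product scaled by 1/sqrt(d)."""
    d = len(query)
    dot = sum(q * r for q, r in zip(query, proposal))
    return sigmoid(dot / math.sqrt(d))

def segment_score(queries, frames):
    """Average over queries of the best similarity over all proposals in all frames.

    `frames` is a list of T frames, each a list of N proposal feature vectors.
    """
    bag = [p for frame in frames for p in frame]  # all T * N proposals in one bag
    return sum(max(similarity(q, p) for p in bag) for q in queries) / len(queries)

def ranking_loss(queries, frames, neg_queries, neg_frames, margin=0.1):
    """Margin ranking loss with one negative sentence and one negative segment."""
    pos = segment_score(queries, frames)
    loss_region = max(0.0, segment_score(queries, neg_frames) - pos + margin)
    loss_sentence = max(0.0, segment_score(neg_queries, frames) - pos + margin)
    return loss_region + loss_sentence

# Toy features with d = 2 (hypothetical values).
queries = [[1.0, 0.0]]
frames = [[[2.0, 0.0], [-2.0, 0.0]], [[0.0, 1.0]]]
neg_queries = [[0.0, 1.0]]
neg_frames = [[[-2.0, 0.0]]]
loss = ranking_loss(queries, frames, neg_queries, neg_frames)
```

Note that `segment_score` is satisfied as soon as one proposal in any frame matches a query well, which is exactly the weakness the frame-wise formulation is meant to address.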
3.3 Frame-wise Loss Weighting
In our framework, each frame is considered separately to ground the same target objects. Fig. 1 shows an overview of our model. We first estimate the likelihood that the query object is present in each video frame. If the likelihood is high, we judge the matching quality mainly on the matching loss (e.g., ranking loss). Otherwise, we down-weight the matching loss while bringing in a penalty loss. The lower the confidence, the higher the penalty. For clarity, we explain our idea when the matching loss is the ranking loss Lrank but note that this can be generalized to other loss functions.
Let the ranking loss for frame $t$ be $L^t_{rank}$ and the similarity score between query $k$ and proposal $i$ be $a^t_{ik}$. Let $Q = (q_1,\ldots,q_O)$ and $R^t = (r^t_1,\ldots,r^t_N)$. We define the confidence score of the prediction at frame $t$ as the visual-semantic matching score:

$$C^t = S(Q, R^t), \tag{5}$$

where $S(\cdot)$ is defined in Eq. 1. The corresponding penalty is:

$$L^t_{pen} = -\log(C^t), \tag{6}$$

inspired by [9]. The final loss for the segment is a weighted sum of frame-wise ranking losses and penalties:

$$L = \frac{1}{T}\sum_{t=1}^{T}\left[\lambda\, C^t L^t_{rank} + (1-\lambda)\, L^t_{pen}\right], \tag{7}$$

where $\lambda$ is a static coefficient to balance the ranking loss and the penalty and can be validated on the validation set. A low $\lambda$ might cause the system to be over-confident on the prediction.
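The weighting scheme can be illustrated with a short sketch. It assumes that each frame's confidence lies in (0, 1], that the penalty is the negative log of the confidence, and that the balanced terms are averaged over frames; these are assumptions made for illustration, as are the function name and the toy numbers.

```python
import math

def frame_weighted_loss(rank_losses, confidences, lam=0.9):
    """Combine per-frame ranking losses with confidence weights and penalties.

    rank_losses[t] : ranking loss computed on frame t alone
    confidences[t] : confidence in (0, 1] that the query objects appear in frame t
    lam            : balance between the weighted ranking loss and the penalty
    """
    assert len(rank_losses) == len(confidences)
    total = 0.0
    for loss_t, conf_t in zip(rank_losses, confidences):
        penalty_t = -math.log(conf_t)  # the lower the confidence, the higher the penalty
        total += lam * conf_t * loss_t + (1.0 - lam) * penalty_t
    return total / len(rank_losses)

# Two frames with the same ranking loss; the second call distrusts frame 2.
confident = frame_weighted_loss([0.2, 0.2], [1.0, 1.0])
uncertain = frame_weighted_loss([0.2, 0.2], [1.0, 0.1])
```

In the second call the ranking loss of the low-confidence frame is almost switched off, but the penalty term keeps the model from driving every confidence towards zero.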
3.4 Object Interaction
We assume that the object types and their order in the language description can roughly determine when they appear in the video content, as motivated in Sec. 1. We show that this language prior can work as the frame-wise confidence score. To consider the interaction among objects, we further encode each object query feature $q_k$ as:

$$J(q_k) = \mathrm{MA}(q_k, Q, Q), \tag{8}$$

where $\mathrm{MA}(\cdot)$ is the multi-head self-attention layer [23], taking in the (query, key, value) triplet. It represents each query as the combination of all other queries based on their interrelations. The built-in positional encoding layer [23] in multi-head attention captures the order of objects appearing in the description. Note that the formulation is non-autoregressive, i.e., all the objects in the same description can interact with each other. We evenly divide each video segment into $T'$ snippets and predict the confidence score for object $k$ to appear in each snippet based upon the concatenation of $J(q_k)$ and $q_k$. Note that $T'$ is a pre-specified constant that satisfies $T' \le T$. The language-based confidence score $C_{lang} = (C^1_{lang},\ldots,C^{T'}_{lang})$ is predicted from the concatenated feature $[J(q_k); q_k]$, where $[\cdot\,;\cdot]$ indicates the feature concatenation, and $W_{lang}$ and $b_{lang}$ are the embedding weights and biases applied to it. We average the language-based and the similarity-based confidence scores and rewrite Eq. 7 with this averaged score as the frame-wise confidence.
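The averaging of the two confidence scores can be sketched as follows. The mapping from a frame to the snippet that contains it is an assumption made for illustration (equal-length snippets, integer division), as are the function names and the toy scores; the language-based scores are taken as given rather than predicted.

```python
def snippet_index(t, num_frames, num_snippets):
    """Map a 0-based frame index to the 0-based snippet that contains it."""
    return t * num_snippets // num_frames

def combined_confidence(sim_conf, lang_conf):
    """Average frame-wise similarity confidence with snippet-wise language confidence."""
    num_frames, num_snippets = len(sim_conf), len(lang_conf)
    assert num_snippets <= num_frames
    return [
        0.5 * (sim_conf[t] + lang_conf[snippet_index(t, num_frames, num_snippets)])
        for t in range(num_frames)
    ]

sim_conf = [0.9, 0.8, 0.3, 0.2]  # hypothetical per-frame scores, 4 frames
lang_conf = [0.7, 0.1]           # hypothetical per-snippet scores, 2 snippets
avg_conf = combined_confidence(sim_conf, lang_conf)
```

With as many snippets as frames the mapping is the identity, so each frame is simply paired with its own language-based score.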
4.1 Dataset
YouCook2-BoundingBox. YouCook2 [29] consists of 2000 YouTube cooking videos from 89 recipes. Each video has recipe steps temporally annotated (i.e. start timestamp and end timestamp) and each segment is described by a natural language sentence. The average segment duration is 19.6s. Our training set is the same as the YouCook2 training split, where only paired sentences are provided. For each segment-description pair in the validation and testing sets, however, we provide bounding box annotations for the most frequently appearing objects from the dataset, i.e. the top 63 recurring objects along with four referring expressions: it, them, that, they (see Fig. 2). These are used only during evaluation.

Figure 2: Frequency count of each class label (including referring expressions).
From YouCook2, we split each recipe step into a separate segment and sample it at 1 fps. We use Amazon Mechanical Turk workers to draw bounding boxes around the objects in the video segment using the highlighted words in the sentence (from the 67 objects in our vocabulary). All annotations are further verified by the top 30 annotators. Please see the Appendix for more details on annotations and quality control.
4.2 Baselines and Metrics
Baselines. We include two competitive baselines from published work: DVSA [7] and GroundeR [19]. DVSA is the Grounding by Ranking method which we build all our methods upon. For fair comparison, all the approaches take in the same object proposals generated by Faster-RCNN [18] (pre-trained on MSCOCO). Following the convention from [6, 7], we select the top N = 20 proposals per frame and sample T = 5 frames per segment unless otherwise specified. We also evaluate the Baseline Random, which chooses a random proposal as the output.
Metrics. We evaluate the grounding quality by bounding box localization accuracy (denoted as Box Accuracy). The output is positive if the proposed box has over 50% IoU with the ground-truth annotation, otherwise negative. We compute accuracy for each object and average across all the object types.
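The metric can be made concrete with a short sketch: a prediction is positive when its IoU with the ground-truth box exceeds 50%, accuracy is computed per object class, and the per-class accuracies are averaged. The box format `(x1, y1, x2, y2)`, the function names and the toy boxes are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def box_accuracy(predictions, threshold=0.5):
    """Per-class accuracy averaged over classes.

    `predictions` is a list of (class_name, predicted_box, ground_truth_box).
    A prediction counts as positive when its IoU exceeds the threshold.
    """
    hits, counts = {}, {}
    for name, pred, gt in predictions:
        counts[name] = counts.get(name, 0) + 1
        hits[name] = hits.get(name, 0) + (1 if iou(pred, gt) > threshold else 0)
    per_class = [hits[name] / counts[name] for name in counts]
    return sum(per_class) / len(per_class)

# Toy predictions (hypothetical boxes).
predictions = [
    ("pan", (0, 0, 10, 10), (0, 0, 10, 10)),      # perfect overlap
    ("pan", (0, 0, 10, 10), (20, 20, 30, 30)),    # no overlap
    ("plate", (0, 0, 10, 10), (0, 0, 10, 8)),     # IoU of 0.8
]
accuracy = box_accuracy(predictions)
```

Averaging over classes rather than over instances keeps frequent objects from dominating the score.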
4.3 Implementation Details
The number of snippets $T'$ in Sec. 3.4 is set to 5. The encoding size d is 128 for all the methods. Object labels are represented as one-hot vectors, which are encoded by a linear layer without the bias term. The loss factor $\lambda$ is cross-validated on the validation set and is set to 0.9. The ranking margin $\Delta$ is set to 0.1. For training, we use stochastic gradient descent (SGD) with Nesterov momentum. The learning rate is set at 0.05 and the momentum is 0.9. We implement the model in PyTorch and train it using either a single Titan Xp GPU with SGD or 4 GPUs with synchronous SGD, depending on the validation accuracy. The model typically takes 30 epochs, i.e. 4 hours, to converge. More details are in the Appendix.
Table 1: Evaluation on localizing objects from the ground-truth captions.
Figure 3: Top 10 accuracy increases & decreases by object category. (Left) Improvements of our Loss Weighting model over DVSA. (Right) Improvements of our Full Model over DVSA.
4.4 Results on Object Grounding
The quantitative results on object grounding are shown in Tab. 1. The model with the highest score on the validation set is evaluated on the test split. We compute the upper bound as the accuracy when proposing all 20 proposals, to see how far the methods are from the performance limit. Note that the upper bound reported here is lower than that in [19]. This is largely due to the domain shift from general scenes to cooking scenes and the large variance in our object states, e.g. zoom-in and zoom-out views, or onions vs. fried onion rings.
We show results on our proposed models, where the “Loss Weighting” model computes the confidence score with visual-semantic matching and the “Object Interaction” model computes the confidence score with textual guidance (Sec. 3.4). Our full model averages these two scores as the final confidence score (Eq. 11). The proposed methods demonstrate a steady improvement over the DVSA baseline, with a relative 1.40% boost from loss weighting and another 1.62% from combining object interaction, for a total improvement of 3.02%. On the other hand, the baseline has a higher validation score, which indicates model overfitting. Note that text guidance alone (“Object Interaction”) works slightly worse than the baseline, showing that both visual and textual information are critical for inferring the frame-wise loss weights. Our methods also outperform the other compared methods, GroundeR and Baseline Random, by a large margin.

Figure 4: Visualization of localization output from baseline DVSA and our proposed methods. Red boxes indicate ground-truths and green boxes indicate proposed regions. The first two rows show examples where our methods perform better than DVSA. The last row displays a negative example where all methods perform poorly. Better viewed in color.

Analysis. We show in Fig. 3 the top 10 accuracy increases and decreases of our methods over the DVSA baseline, by object category. Our methods make better predictions on static objects such as “squid”, “beef”, and “noodle” and worse predictions on cookware, such as “wok”, “pan”, and “oven”, which involve more state changes, such as containing/not containing food or different camera perspectives. Our hypothesis is that our loss weighting framework favors objects that are consistent across frames, due to the shared frame-wise supervision.

Impact of Sampling Rate. We investigate the impact of a high video sampling rate on grounding accuracy by increasing the total number of frames per segment (T) from 5 to 20. The accuracy of DVSA drops from 30.80% to 29.90% and the accuracy of our Loss Weighted model drops from 31.23% to 30.93%. These drops are expected given the excessive number of object proposals. However, our loss weighted method loses only 0.96% of its accuracy in relative terms, while the accuracy of DVSA drops by a relative 2.92%, showing that our method is less sensitive to a high sampling rate and predicts better on long frame sequences.
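The 0.96% and 2.92% figures are relative drops with respect to each model's accuracy at T = 5, not differences in percentage points. A minimal check of that arithmetic, using only the accuracies quoted above (the helper name is an illustrative choice):

```python
def relative_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0

# Accuracy at T = 5 versus T = 20, as quoted in the text.
dvsa_drop = -relative_change(30.80, 29.90)
ours_drop = -relative_change(31.23, 30.93)

# The same helper gives the relative gain of loss weighting over DVSA at T = 5.
loss_weighting_gain = relative_change(30.80, 31.23)
```

Rounded to two decimals, these reproduce the 2.92% and 0.96% drops and the 1.40% boost reported for loss weighting.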
Qualitative Results. Fig. 4 visualizes the grounded objects with DVSA and our proposed methods. The first two rows show some positive examples. In Fig. 4 (a), we see that with the DVSA baseline the "plate" object is grounded to incorrect regions in the frames. However, our methods correctly select regions with a large IoU with the ground truth box. In Fig. 4 (b) the labels "bacon" and "it" refer to the same target object. Per our annotation requirements, there is only one ground truth box instead of two. The full model correctly associates "bacon" and "it" and grounds them to the same region proposal. The last row shows an example where all methods fail to ground the target objects adequately. This may be a result of errors in the top object proposals, since the scene is rather complicated. An additional explanation may be bias in the dataset, where during training the "bowl" object typically occupies the majority of the frame.
Limitations. There are two limitations in our method we hope to address in our future work. First, even though the frame-wise loss can to some degree enforce the temporal consistency between frames, we do not explicitly model the relation between frames, for instance motion information. The transition between object states across frames, e.g., raw meat to cooked meat, should be further studied. Second, our grounding performance is upper-bounded by the object proposal accuracy and we have no control over the errors from the proposals. An end-to-end version of the proposed method that solves both the proposing and the grounding problem can potentially improve the grounding accuracy.
We propose a frame-wise loss weighted grounding model for video object grounding. Our model applies segment-level labels to the frames in each segment, while being robust to inconsistencies between the segment-level label and each individual frame. We also leverage object interaction as textual guidance for grounding. We evaluate the effectiveness of our models on the newly-collected video grounding dataset YouCook2-BoundingBox. Our proposed methods outperform competitive baseline methods by a large margin. Future directions include incorporating the video motion information and exploring an end-to-end solution for video object grounding.
This work has been supported by DARPA FA8750-17-2-0112. This article solely reflects the opinions and conclusions of its authors but not DARPA. We thank Tianhang Gao, Ryan Szeto and Mohamed El Banani for their helpful discussions.
[1] Muhannad Al-Omari, Paul Duckworth, David C Hogg, and Anthony G Cohn. Natural language acquisition and grounding for embodied robotic systems. In AAAI, pages 4349–4356, 2017.
[2] Ramazan Gokberk Cinbis, Jakob Verbeek, and Cordelia Schmid. Multi-fold mil training for weakly supervised object localization. In CVPR, pages 2409–2416. IEEE, 2014.
[3] Volkan Cirik, Taylor Berg-Kirkpatrick, and Louis-Philippe Morency. Using syntax to ground referring expressions in natural images. AAAI, 2018.
[4] Thomas Deselaers, Bogdan Alexe, and Vittorio Ferrari. Weakly supervised localization and learning with generic knowledge. IJCV, 100(3):275–293, 2012.
[5] Santosh K Divvala, Ali Farhadi, and Carlos Guestrin. Learning everything about anything: Webly-supervised visual concept learning. In CVPR, pages 3270–3277, 2014.
[6] De-An Huang, Shyamal Buch, Lucio Dery, Animesh Garg, Li Fei-Fei, and Juan Carlos Niebles. Finding “it”: Weakly-supervised reference-aware visual grounding in instructional video. To appear in CVPR, 2018.
[7] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, pages 3128–3137, 2015.
[8] Andrej Karpathy, Armand Joulin, and Li F Fei-Fei. Deep fragment embeddings for bidirectional image sentence mapping. In NIPS, pages 1889–1897, 2014.
[9] Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. arXiv preprint arXiv:1705.07115, 2017.
[10] Suha Kwak, Minsu Cho, Ivan Laptev, Jean Ponce, and Cordelia Schmid. Unsupervised object discovery and tracking in video collections. In ICCV, pages 3173–3181. IEEE, 2015.
[11] Colin Lea, Austin Reiter, René Vidal, and Gregory D Hager. Segmental spatiotemporal cnns for fine-grained action segmentation. In ECCV, pages 36–52. Springer, 2016.
[12] Chih-Yao Ma, Asim Kadav, Iain Melvin, Zsolt Kira, Ghassan AlRegib, and Hans Peter Graf. Attend and interact: Higher-order object interactions for video understanding. arXiv preprint arXiv:1711.06330, 2017.
[13] Bingbing Ni, Xiaokang Yang, and Shenghua Gao. Progressively parsing interactional objects for fine grained action detection. In CVPR, pages 1020–1028, 2016.
[14] Maxime Oquab, Léon Bottou, Ivan Laptev, and Josef Sivic. Is object localization for free?-weakly-supervised learning with convolutional neural networks. In CVPR, pages 685–694, 2015.
[15] Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV, pages 2641–2649. IEEE, 2015.
[16] Bryan A Plummer, Paige Kordas, M Hadi Kiapour, Shuai Zheng, Robinson Piramuthu, and Svetlana Lazebnik. Conditional image-text embedding networks. arXiv preprint arXiv:1711.08389, 2017.
[17] Alessandro Prest, Christian Leistner, Javier Civera, Cordelia Schmid, and Vittorio Ferrari. Learning object class detectors from weakly annotated video. In CVPR, pages 3282–3289. IEEE, 2012.
[18] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: towards realtime object detection with region proposal networks. TPAMI, 39(6):1137–1149, 2017.
[19] Anna Rohrbach, Marcus Rohrbach, Ronghang Hu, Trevor Darrell, and Bernt Schiele. Grounding of textual phrases in images by reconstruction. In ECCV, pages 817–834. Springer, 2016.
[20] Anna Rohrbach, Marcus Rohrbach, Siyu Tang, Seong Joon Oh, and Bernt Schiele. Generating descriptions with grounded and co-referenced people. arXiv preprint arXiv:1704.01518, 2017.
[21] Hyun Oh Song, Ross Girshick, Stefanie Jegelka, Julien Mairal, Zaid Harchaoui, and Trevor Darrell. On learning to localize objects with minimal supervision. arXiv preprint arXiv:1403.1024, 2014.
[22] Jesse Thomason, Aishwarya Padmakumar, Jivko Sinapov, Justin Hart, Peter Stone, and Raymond J Mooney. Opportunistic active learning for grounding natural language descriptions. In Conference on Robot Learning, pages 67–76, 2017.
[23] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, pages 6000–6010, 2017.
[24] Carl Vondrick, Donald Patterson, and Deva Ramanan. Efficiently scaling up crowdsourced video annotation. IJCV, pages 184–204, 2013.
[25] Fanyi Xiao, Leonid Sigal, and Yong Jae Lee. Weakly-supervised visual grounding of phrases with linguistic structures. arXiv preprint arXiv:1705.01371, 2017.
[26] Haonan Yu and Jeffrey Mark Siskind. Grounded language learning from video described with sentences. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 53–63, 2013.
[27] Haonan Yu and Jeffrey Mark Siskind. Sentence directed video object codiscovery. International Journal of Computer Vision, 124(3):312–334, 2017.
[28] Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, and Tamara L Berg. Mattnet: Modular attention network for referring expression comprehension. arXiv preprint arXiv:1801.08186, 2018.
[29] Luowei Zhou, Chenliang Xu, and Jason J Corso. Towards automatic learning of procedures from web instructional videos. AAAI, 2018.
[30] Luowei Zhou, Yingbo Zhou, Jason J Corso, Richard Socher, and Caiming Xiong. End-to-end dense video captioning with masked transformer. arXiv preprint arXiv:1804.00819, 2018.
More on Implementation Details
When sampling frames from a segment, we evenly divide the segment into T clips and randomly sample one frame from each clip as temporal data augmentation. The negative sample sentence $Q'$ is randomly sampled from all available sentences, but we exclude sentences that share objects with the positive sample Q. For self-attention, we use a 2-layer 6-head multi-head attention module with the hidden size set to 256 and the dropout ratio set to 0.2.
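The two sampling steps above can be sketched as follows. The helper names, the representation of sentences as `(id, object set)` pairs, and the use of integer clip boundaries are assumptions made for illustration.

```python
import random

def sample_frames(num_frames_in_segment, T, rng=random):
    """Split the segment into T equal clips and pick one random frame index per clip."""
    assert 0 < T <= num_frames_in_segment
    indices = []
    for c in range(T):
        start = c * num_frames_in_segment // T
        end = (c + 1) * num_frames_in_segment // T  # exclusive clip boundary
        indices.append(rng.randrange(start, end))
    return indices

def sample_negative_sentence(positive_objects, all_sentences, rng=random):
    """Pick a random sentence whose object set does not overlap the positive one.

    `all_sentences` is a list of (sentence_id, set_of_objects) pairs.
    """
    positive = set(positive_objects)
    candidates = [sid for sid, objs in all_sentences if not (objs & positive)]
    if not candidates:
        raise ValueError("no sentence without overlapping objects")
    return rng.choice(candidates)

rng = random.Random(0)  # fixed seed for a repeatable example
frame_ids = sample_frames(20, 5, rng)
sentences = [("s1", {"pan", "oil"}), ("s2", {"plate"}), ("s3", {"tomato", "pan"})]
negative_id = sample_negative_sentence({"pan"}, sentences, rng)
```

Sampling one frame per clip keeps the temporal coverage of the segment fixed while still varying the exact frames seen across epochs.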
For fair comparison, all the approaches take in the same object proposals generated by Faster-RCNN [18]. The model is based upon ResNet-101 and pre-trained on MSCOCO for the object detection task. We take the 2048-dimension output after the RoI pooling as the region feature. We reduce the size of the region feature from 2048 to 128 with two linear layers, followed by dropout (p = 0.2) and ReLU.
More on Data Annotation
Quality Control. We use VATIC [24] as our annotation tool and Amazon Mechanical Turk (MTurk) as the crowdsourcing marketplace. To maintain quality control, a worker must annotate a gold-standard training video before being allowed to annotate the dataset. A gold-standard training video is an already annotated video segment that new workers are tested against. Vondrick et al. [24] introduced these videos to eliminate bad workers and limit annotation correction efforts. A worker is not aware that they are completing a training video, but they are given unlimited attempts until it is successfully completed. All of the gold-standard training videos consist of three objects to be annotated, and the worker must achieve an IoU of at least 50% within every frame, with one allowable mistake. The video segments were uploaded in batches, and with each new batch all workers were required to complete a different training video in order to continue annotating. A total of 94 annotators completed the annotation tasks. The top 30 annotators (with the most accepted video segments) were selected to perform verification on the annotations.

Dataset Statistics. From the validation and testing segments, we have a total of 4,325 annotated segments, with 2,962 validation and 1,363 testing segments, respectively. These segments were extracted from 647 videos that contain words from our vocabulary list.
Fig. 5 shows the distribution of the segment durations from YouCook2, with mean and standard deviation of 19.6s and 18.2s across all splits. Fig. 6 displays the number of target objects from the annotated YouCook2-BoundingBox segments. The mean number of target objects per sentence is 2.05 with a standard deviation of 1.49. The target objects are words that belong to our vocabulary list of 67 objects.
When completing the annotations, the workers were given the option to mark an object as "outside of view frame", "occluded", or both. We define an object as visible when it is in view in the current frame with no occlusion. From our collected annotations, Fig. 7 shows each object’s visibility duration in the validation and testing splits. In the validation split objects are visible 60.72% of the time, and in the testing split 60.58% of the time. Note that in Fig. 7 there is a spike in objects with 100% duration, which we attribute to the shorter segments in our collected data. It is perfectly reasonable for an object to be visible for the entire duration of shorter segments, some of which are as short as 2 seconds.
Figure 5: Distribution of segment durations for train/val/test splits.
Figure 6: Distribution of number of target objects within each segment for train/val/test splits. Target objects belong in our vocabulary of 67 words.
Figure 7: Span of object duration in each segment for annotated val/test splits.
Figure 8: Annotations completed by MTurk workers. The images on the left show correct annotations and the images on the right show incorrect annotations. Each image is a frame from the video segment accompanied by its descriptive phrase. Better viewed in color.
After releasing the original version of the results, we discovered an error in the calculation of the evaluation metric (i.e., a scaling issue in the object proposal coordinates). This later version fixes that error. For completeness, we include the tables from both cases here for comparison (Tab. 2 for the initial results and Tab. 3 for the updated results). We note that the performance ordering does not change, that all methods see a significant rise with respect to the baseline, and that the relative performance improvement decreases.
Table 2: (Initial.) Evaluation on localizing objects from the ground-truth captions.
Table 3: (Updated.) Evaluation on localizing objects from the ground-truth captions.