Generating Descriptions with Grounded and Co-Referenced People

Learning how to generate descriptions of images or videos has received major interest in both the Computer Vision and Natural Language Processing communities. While a few works have proposed to learn a grounding during the generation process in an unsupervised way (via an attention mechanism), it remains unclear how good the quality of the grounding is and whether it benefits the description quality. In this work we propose a movie description model which learns to generate descriptions and jointly ground (localize) the mentioned characters as well as do visual co-reference resolution between pairs of consecutive sentences/clips. We also propose to use weak localization supervision through character mentions provided in movie descriptions to learn the character grounding. At training time, we first learn how to localize characters by relating their visual appearance to mentions in the descriptions via a semi-supervised approach. We then feed this (noisy) supervision into our description model, which greatly improves its performance. Our proposed description model improves over prior work w.r.t. generated description quality and additionally provides grounding and local co-reference resolution. We evaluate it on the MPII Movie Description dataset using automatic and human evaluation measures and using our newly collected grounding and co-reference data for characters.

When humans talk about what they see, they not only use common objects and terms, but typically refer to reappearing entities, most commonly using names (“John”) and referential words such as pronouns (“he”, “it”). To correctly generate descriptions with reappearing entities, one needs to understand and link them across sentences and visual appearances (images/frames). Current image/video captioning datasets essentially ignore this aspect as they ask to independently describe each image/clip with a single sentence. At the same time, e.g. visual storytelling [18] and movie description [38] ultimately require solving this problem. However, the first approaches on visual storytelling [18] so far have not taken it into account, and current movie description challenges and approaches [37, 50] abstract from it by looking at a single clip at a time and replacing all the character mentions with e.g. “Someone”.

Figure 1: Bring in the color: our task is to generate grounded and co-referenced descriptions for the current clip using pronouns and new or reappearing character IDs, which are grounded, i.e. localized in the current clip (boxes and lines) and visually co-referenced to the previous clip (dashed lines). The visual grounding allows for co-reference to the previous clip/sentence, which enables us to use the pronoun “she” to refer to the first ID (Sophia).

In this work we address grounded co-reference resolution, with application to movie description. The most prominent entities in movies are the people or characters. In fact, there is a long line of work which aims to link character mentions in movie or TV scripts with their visual tracks [5, 8, 43, 47, 29, 3, 33]. However, all these works are already given the description for all movies where they want to predict the linking. In contrast, we want to generate a description, while jointly linking it with the currently and previously depicted character’s visual presence. Specifically, the task we address in this work is to generate descriptions for movies and at the same time localize or ground the characters, recognize their gender and refer to them consistently, i.e. co-reference them across sentences, as visualized in Figure 1. Importantly, rather than trying to obtain consistent ids in the entire movie, we focus on robust local co-reference resolution on two consecutive sentences/clips. We argue that local co-reference resolution is an important problem in itself. On the one hand there are many characters without proper names and/or with only a few occurrences, which can and should be resolved locally, e.g. “The priest takes their vows. He declares them wife and husband”. On the other hand, there are many hard decisions which have to be made locally, e.g. which character to describe and whether a character should be referenced by proper name or pronoun. To clarify, we do not generate the true proper names of the characters, but only identities with gender. We use a predefined set of names in our examples (e.g. Sophia). In future work we believe the true names could be extracted either from dialog, or from one/a few annotations per character.

Approaching the joint description and grounding task requires three main ingredients: we need to localize the characters, we need to decide which character(s) to pay attention to, and we need to co-reference visual characters’ appearances in neighboring sentences/clips. In Section 4 we detail how we approach character localization using head detection and tracking via a two-stage clustering approach. While generating the sentence, we advocate jointly deciding which character to pay attention to and if and how to co-reference it to the previously grounded characters. In Section 5, we propose to adapt the attention mechanism [1, 55] for this and extend it to attend jointly over both problems: grounding (i.e. track selection) and co-reference (i.e. track linking). A key insight is that this cannot be learned purely from sentence supervision for generation. Instead, we supervise the joint-attention mechanism with automatically obtained linking of character mentions and tracks (Section 5.2). We note that at test time this supervision is not available and the system has learned how to jointly ground, co-reference, and describe.

The contributions of our paper include: a) a new task of movie description with grounded and co-referenced characters; to foster research in this direction we will share our newly collected co-reference annotations and grounding of character mentions in the MPII-MD dataset (Section 3); b) a novel approach which addresses this problem by jointly learning to ground the described characters and perform local co-reference resolution between the neighboring clips; c) a robust automatic way of obtaining linking between character mentions in text and visual tracks in video, which we use to supervise our description approach and which we show is essential for the co-reference resolution task.

Our work aims to do three tasks jointly: generating video descriptions, grounding, and co-reference resolution. We review related work in these three directions with a focus on works which attempt multiple tasks at once. As we focus on people grounding and co-reference, we also discuss the related work on person re-identification and track naming.

Description generation. Generating natural language about visual content has received large interest since the emergence of recurrent networks. Typically the focus is to generate a single sentence about a single image [7, 20, 28, 52, 55], video [7, 13, 35, 39, 48], or, closest to this work, a movie clip [36, 51]. Several works also produce grounding while generating the description: [55] propose an attention mechanism to ground each word to spatial CNN image features, [57] extend this to bounding boxes, [56] to video frames, and [61] to spatial-temporal proposals. [25] look into evaluating attention correctness for image captioning. [19] take a different direction and build a model which describes the entire image by jointly predicting a large number of bounding boxes and a corresponding short phrase for each box. [24] parse the visual 3D scene and generate coherent multi-sentence descriptions where the objects are grounded in 3D cuboids. Multi-sentence image/video description has also been explored in e.g. [18, 35, 41, 59].

Grounding objects in images/video. Grounding nouns as well as complex natural language expressions in images [17, 21, 27, 32, 34, 54, 60] and video [23, 58] has recently received increased interest. The focus in our work is to localize people in a video while mentioning them in a generated sentence. For example, when mentioning a character who is jogging in a park, we want to localize this person in the video. Additionally we are interested in obtaining visual tracks for character mentions in text, for which we rely on the semi-supervised grounding approach from [34].

Co-reference resolution. Co-reference resolution is a task defined in the linguistics community [2], where the goal is to establish correct links between named entities and references to them, e.g. pronouns. [33] address co-reference resolution in TV show descriptions with a bidirectional optimization using character visual appearance and linguistic co-reference resolution features.

Person re-identification. Person re-identification from face/head images is a well studied problem and recently many deep learning based approaches have been proposed to address it [22, 30, 40, 44, 45, 64]. Our work is related to this line of work as we aim to re-identify characters between two video clips while generating a video description.

Linking tracks to names. Related works [5, 8, 33, 43, 47] propose datasets for character identification targeting TV shows, which rely on alignment of video to TV scripts. The goal is to track faces in the video and assign names to them. Typically the tracks include background characters. [3] attack the problem of learning a joint model of actors and actions in movies using weak supervision provided by scripts. [29] propose a multiple instance learning based approach which focuses on recognizing background characters, and show significant improvement over prior work. There are two differences between ours and these prior works. First, we aim to re-identify characters locally, without ever seeing them before. Second, when obtaining the matching between names and tracks, our goal is to predict the grounding for a given character, not to name all the tracks.

One of the goals in this work is to learn the visual co-reference resolution. To address and evaluate this task we collected annotations both on language and visual sides. On the language side we want to know when different mentions actually refer to the same person. On the visual side we require grounding of names to visual appearances. Towards these goals we collect new annotations for character co-reference resolution and grounding for the MPII Movie Description (MPII-MD) dataset [37].

Co-reference annotations for character mentions. In the first step, we aim to label all the character mentions in the movie descriptions of the MPII-MD. The standard version of the descriptions consists of sentences with all character names replaced with “Someone” and multiple names (e.g. “Ann and Bob”) with “people”. Along with the transformed descriptions, the MPII-MD dataset provides the original descriptions with all the character names preserved. We rely on these and run the Stanford Named Entity Recognizer (NER) [9] and obtain our initial name list. We perform manual cleaning and filter out non-human related entities. We also manually check for names missed by NER and add them to our list. With the final name list we label the names in the entire dataset which includes many instances missed by the original NER pass. As the second step, we annotate names and co-references for each movie. E.g. there might be different ways of referring to the same character (“Mary Jane” as “MJ”), so we link them together under one “alias”. Additionally, we annotate the gender of all the characters. As the last step, we annotate pronouns “he” and “she” in all descriptions. When possible we link them to one of the existing names (with some exceptions for rare characters which were not named). In total we label 45,325 name mentions and 17,839 pronouns, see Table 1. With this information we create our corpus MPII-MD Co-ref+Gender, where we transform the original MPII-MD descriptions so that every character mention, which appears in a previous sentence, is replaced with “MaleCoref”/“FemaleCoref”, otherwise with “MaleName”/“FemaleName”. We emphasize that this is the only difference to the standard MPII-MD, i.e. the video clips and splits are identical.

Grounded character annotations. To evaluate the correctness of character grounding we annotate some characters with bounding boxes in video frames.
For a subset of movies from MPII-MD Training, Validation and Test set we randomly select sentences and annotate all the mentioned characters. Specifically, whenever the character is mentioned in the sentence and is visible in the corresponding clip, we annotate a few frames of the clip with his/her head bounding boxes. As we also want to evaluate the co-reference correctness, we additionally annotate pairs of consecutive sentences/clips from the Test set. In total we label 2,649 bounding boxes with names, see Table 1.

Table 1: Left: number of annotated mentions, right: number of named bounding boxes, on MPII-MD [37].

In this section our goal is to localize individual characters in video and extract visual representations informative of their appearance and context. Towards this goal we first detect, track, and extract localized representations for individual characters (Section 4.1), and then extract global representations which capture the scene and context not captured in localized representations (Section 4.2).

4.1. Character tracks and representations

To localize the characters in movies we focus on localizing their heads as most of the time the head of a character is shown, but frequently not the full body. In contrast to prior work [33] we do not only focus on frontal faces but also allow for more challenging views, e.g. back view. We detect the heads (Section 4.1.1) and track them with a two-step clustering approach, which is able to track across shot boundaries (Section 4.1.2). We extract visual representations on the tracks, informative for estimating characters’ identity, activity, gender, and importance (Section 4.1.3).

4.1.1 Head detection

We first detect all person instances in our videos using a head detector. Unlike conventional face detectors, our head detector can reliably detect profile faces and even back view heads. This is desirable because movies contain a large variety of view angles on heads. Our detector is based on the Faster R-CNN [10]. For training our head detector we collect head bounding box annotations over the PASCAL VOC 2010 trainval set. The dataset consists of 10,103 images with 7,372 head instances. 6,659 images do not contain people, but we retain them as a source of negatives. We make two modifications to the original Faster R-CNN configuration to make it more suitable for our head detection task. First, we account for small heads by adding smaller scale “anchor boxes”. Anchor boxes refer to a default set of sliding window proposals from which Faster R-CNN regresses detection bounding boxes. Second, instead of doing hard negative mining by only considering proposals with ground truth overlap > 0 and ≤ 0.5 as negatives, we include any proposals with overlap ≤ 0.5. This greatly improves the quality of our head detector by increasing the diversity of negative head training samples. We run our detector on every frame of MPII-MD. We keep all the head detections with scores ≥ 0.5 and both dimensions ≥ 40 pixels.
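The two thresholding rules above can be written down compactly. The following is a minimal sketch, assuming a hypothetical `Detection` record and helper names that are not part of the original pipeline; it only illustrates the stated thresholds (score ≥ 0.5, both dimensions ≥ 40 pixels) and the two alternative negative-selection rules.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical container for one head detection (not from the paper)."""
    x: float      # left edge in pixels
    y: float      # top edge in pixels
    w: float      # box width in pixels
    h: float      # box height in pixels
    score: float  # detector confidence


def keep_detection(det, min_score=0.5, min_size=40.0):
    """Keep detections with score >= 0.5 and both dimensions >= 40 pixels."""
    return det.score >= min_score and det.w >= min_size and det.h >= min_size


def is_negative(max_gt_overlap, include_zero_overlap=True):
    """Decide whether a proposal is used as a negative training sample.

    include_zero_overlap=True  : any proposal with overlap <= 0.5 (modified rule).
    include_zero_overlap=False : only 0 < overlap <= 0.5 (rule being replaced).
    """
    if include_zero_overlap:
        return max_gt_overlap <= 0.5
    return 0.0 < max_gt_overlap <= 0.5
```

With the modified rule, proposals that do not touch any annotated head (overlap 0) also count as negatives, which is where the additional diversity of negative samples comes from.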

4.1.2 Head tracking

After obtaining the head detections we aim to track them within the video clip. More specifically, we want to group all detections corresponding to the same person together. We need to take into account that the movies have shot boundaries (rapid changes in a camera viewpoint/angle). Thus the motion of a person cannot be the only cue for tracking and we require an appearance cue to group together different views of the same character. This motivates our two-step approach, where we first group head detections within individual shots based on their motion and then further group the obtained tracks based on their appearance.

To detect a shot boundary between two frames we rely on two features. First, we obtain color histograms on both frames and compute the Manhattan distance between the two. Second, we run the Kanade-Lucas-Tomasi (KLT) point tracker [26, 49], initialized in the first frame with corner points from the minimum eigenvalue algorithm. We compute the ratio of points that are reliably tracked in the second frame. Based on these two characteristics we estimate the thresholds which allow us to detect shot boundaries and achieve high recall on a small set of manually annotated frame pairs w.r.t. being a shot boundary. We select the parameters on a set of annotated frames and obtain an F-score of 0.98. We try to detect all boundaries if possible without producing too many false positives (wrong boundaries). Our tracking approach can deal with some false positives by clustering different tracks together based on appearance.
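As an illustration of how the two cues could be combined, here is a minimal sketch. The threshold values, the number of histogram bins, and the rule that both cues must agree are assumptions made for this example; the text only states that thresholds are estimated on annotated frame pairs. The ratio of reliably tracked KLT points is taken as an input rather than computed.

```python
def normalized_histogram(values, bins=8, max_value=256):
    """Histogram of integer intensity values, normalized to sum to 1."""
    counts = [0] * bins
    for v in values:
        counts[min(v * bins // max_value, bins - 1)] += 1
    total = len(values) or 1
    return [c / total for c in counts]


def manhattan_distance(hist_a, hist_b):
    """L1 distance between two histograms of equal length."""
    return sum(abs(a - b) for a, b in zip(hist_a, hist_b))


def is_shot_boundary(pixels_a, pixels_b, tracked_ratio,
                     hist_threshold=0.5, ratio_threshold=0.3):
    """Flag a boundary when the histograms differ strongly AND few points survive.

    Both thresholds and the AND combination are hypothetical choices.
    """
    distance = manhattan_distance(normalized_histogram(pixels_a),
                                  normalized_histogram(pixels_b))
    return distance > hist_threshold and tracked_ratio < ratio_threshold
```

For example, `is_shot_boundary([10] * 100, [250] * 100, tracked_ratio=0.05)` flags a cut, while two identical frames with most points tracked do not.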

Our tracking framework is based on [46], a multicut [4, 12] tracker for pedestrians in street scene videos. The idea is to build a graph based on person detections in video, and then obtain the tracks by partitioning the graph into an optimal number of connected components, based on attractive and repulsive pairwise terms between pairs of detections. It is essentially a clustering based tracking formulation, which produces robust tracking results. We adapt the multicut tracker to generate tracks for person heads in video clips. We cast our task as a two-level clustering problem. At the first level, we generate tracks from detections that are obtained on the consecutive frames within shots. To generate tracks from detections, we employ simple geometric features between detection bounding boxes. Given two bounding boxes $b$ and $b'$, where each has spatial-temporal location $(x, y, t)$, scale $h$ and a corresponding image region $B$, we define the following variables:

$$\bar{h} = \frac{h_b + h_{b'}}{2}, \quad \Delta x = \frac{|x_b - x_{b'}|}{\bar{h}}, \quad \Delta y = \frac{|y_b - y_{b'}|}{\bar{h}}, \quad \Delta h = \frac{|h_b - h_{b'}|}{\bar{h}}, \quad \mathrm{IOU} = \frac{|B_b \cap B_{b'}|}{|B_b \cup B_{b'}|},$$

where IOU is the intersection over union of the two detection bounding boxes. The pairwise feature is defined as $(\Delta x, \Delta y, \Delta h, \mathrm{IOU})$. Additionally, we add the quadratic terms of each feature to form a nonlinear mapping from feature space to the pairwise potentials.
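The geometric pairwise feature can be computed directly from two boxes. The sketch below assumes that (x, y) is the box center, that the scale h is the box height, and that "quadratic terms" means the square of each of the four features; the `Head` class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Head:
    """Hypothetical head detection: center (x, y), scale h, and box corners."""
    x: float
    y: float
    h: float
    box: tuple  # (x_min, y_min, x_max, y_max)


def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def pairwise_feature(b, b2):
    """Return (dx, dy, dh, IOU) followed by the square of each term."""
    h_bar = (b.h + b2.h) / 2.0
    base = [abs(b.x - b2.x) / h_bar,
            abs(b.y - b2.y) / h_bar,
            abs(b.h - b2.h) / h_bar,
            iou(b.box, b2.box)]
    return base + [f * f for f in base]
```

Dividing the offsets by the mean scale makes the feature independent of how large the head appears in the frame, so the same potentials can apply to close-ups and distant shots.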

Next, we cluster the obtained tracks, selecting the ones that are at least 5 frames long for computational efficiency. For this we rely on the visual appearance features. For each track we mean pool the FaceVGG [30] fc7 representations on the head crops. We then compute the cosine distance between pairs of tracks and use 1 − distance as pairwise potentials in the second clustering step.
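A small sketch of the appearance potential follows, assuming each head crop is already represented by a feature vector (plain Python lists stand in for the fc7 features) and that cosine distance is defined as one minus cosine similarity. Function names are illustrative only.

```python
import math


def mean_pool(vectors):
    """Average a list of equally sized feature vectors element-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def appearance_potential(crops_a, crops_b):
    """Pairwise potential 1 - cosine distance between two mean-pooled tracks."""
    distance = 1.0 - cosine_similarity(mean_pool(crops_a), mean_pool(crops_b))
    return 1.0 - distance


def long_enough(track_length, min_length=5):
    """Only tracks of at least 5 frames enter the second clustering step."""
    return track_length >= min_length
```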

4.1.3 Track representations

The representations extracted from the tracks should allow us to (re-)identify the characters, predict their activity and gender, and estimate if they should be described.

For re-identification of characters we again rely on the FaceVGG [30] fc7 representation, referred to as $v_{head}$ in the following. We mean pool the track $t$ representation over all head crops clustered in this track and refer to it as $v_{head}(t)$. We discuss in Section 5 how we estimate the similarity of two tracks for character re-identification in our pipeline. We include the person body context which could be useful to e.g. predict the person’s activity. We extract the body region w.r.t. the head bounding box: 3 times wider and 6 times taller. We experiment with two visual features on the body region. First is a VGG [42] fc7 representation fine-tuned for 393 activities from the MPII human pose activity dataset [31], provided by [11]. We only use the body crop ignoring the additional context features as they would be similar across tracks and thus likely not help too much to distinguish tracks, but would significantly increase computation. Another feature we compute is ResNet [14] (pool5), trained on ImageNet [6] for object classification. We mean pool both visual representations over all body crops in a track and refer to this as $v_{body}(t)$. In the experiments we specify if/which feature is being used. We find, as also noted in [29], that the described characters are frequently in the front, center, and large compared to characters not described (background characters). Rather than manually defining a good function we provide the following track statistics $v_{stat}(t)$ and allow our approach to learn from this data: track length, mean and standard deviation of head width/height/center/detection score.
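The body region and the track statistics can be sketched as follows. The placement of the body box (horizontally centered on the head, starting at the top of the head) and the use of the population standard deviation are assumptions of this example; the text only fixes the 3x width and 6x height and lists which statistics are collected.

```python
from statistics import mean, pstdev


def body_region(head_box):
    """Box 3x wider and 6x taller than the head (x_min, y_min, x_max, y_max).

    Assumed placement: centered on the head horizontally, top-aligned with it.
    """
    x_min, y_min, x_max, y_max = head_box
    w, h = x_max - x_min, y_max - y_min
    cx = (x_min + x_max) / 2.0
    return (cx - 1.5 * w, y_min, cx + 1.5 * w, y_min + 6 * h)


def track_statistics(detections):
    """Track length plus mean/std of head width, height, center x/y and score.

    detections: list of (x_min, y_min, x_max, y_max, score), one per frame.
    """
    widths = [d[2] - d[0] for d in detections]
    heights = [d[3] - d[1] for d in detections]
    centers_x = [(d[0] + d[2]) / 2.0 for d in detections]
    centers_y = [(d[1] + d[3]) / 2.0 for d in detections]
    scores = [d[4] for d in detections]
    stats = [float(len(detections))]
    for series in (widths, heights, centers_x, centers_y, scores):
        stats += [mean(series), pstdev(series)]
    return stats
```

Under these assumptions the statistics vector has 11 entries: one for the track length and two for each of the five per-frame quantities.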

We do not extract designated gender features, as we find that $v_{head}$ and $v_{body}$ carry strong information about this aspect. It is straightforward to include even more targeted representation as part of future work. All the computed representations are normalized element-wise by first mean centering and then dividing by the standard deviation to improve learning subsequent functions with deep learning.
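The element-wise normalization amounts to standardizing each feature dimension. A minimal sketch, assuming the statistics are computed over whatever set of feature vectors is passed in (the text does not say which split is used) and adding a small `eps` to avoid division by zero:

```python
from statistics import mean, pstdev


def standardize(features, eps=1e-8):
    """Mean-center each dimension, then divide by its standard deviation.

    features: list of equally sized feature vectors (one per track).
    eps is a hypothetical safeguard for constant dimensions.
    """
    columns = list(zip(*features))
    means = [mean(column) for column in columns]
    stds = [pstdev(column) for column in columns]
    return [[(x - m) / (s + eps) for x, m, s in zip(row, means, stds)]
            for row in features]
```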

4.2. Holistic video representations

In the previous section we discussed how and which localized features we extract for characters. To additionally capture context, objects, and scene information, important for movie description, we rely on global representations provided by [36] for the MPII-MD dataset. We briefly review them in the following: 1) scores from 146 activity classifiers trained with Dense Trajectory features [53]; 2) scores from 99 object classifiers trained with LSDA [16] responses; 3) scores from 18 scene classifiers trained with PLACES-CNN [63] responses. All the classifiers were trained in [36] using the words from descriptions as labels. The provided visual feature $v_{global}$ is a 263-dimensional concatenation of all three groups of scores.

As discussed in the introduction, we focus on character grounding and local co-reference resolution, while generating the description. More specifically, we aim to predict the character grounding and do co-reference resolution given the previous sentence grounding. At test time this allows us to e.g. process the movie sequentially from start to end. In the following we rely on our transformed description corpus, MPII-MD Co-ref+Gender, described in Section 3.

The key ideas of our approach are to predict grounding and co-reference resolution jointly while generating the sentence (Section 5.1) and to learn grounding and co-reference with noisy supervision at training time obtained automatically by linking character mentions and tracks (Section 5.2). Figure 2 provides an overview of our model.

5.1. Predicting grounding and co-reference during sentence generation

For generating sentences we rely on a recurrent LSTM [15] network as defined in [62]. To predict the hidden state at step $\tau$ of the sentence, we provide it with the previous word $w_{\tau-1}$ and hidden state $h_{\tau-1}$, as well as the current visual representation $v_\tau$: $h_\tau = f^{LSTM}([w_{\tau-1}, v_\tau], h_{\tau-1})$, where $[\cdot, \cdot]$ denotes concatenation. The $f^{LSTM}$ has an additional hidden state or memory cell $c_\tau$ which is not exposed. The word is then predicted as $w_\tau = f^{pred}(h_\tau) = \mathrm{Softmax}(W^{pred} h_\tau + b^{pred})$, which can be supervised with the ground truth word $\hat{w}_\tau$. Note that our vocabulary $V$ (with $w \in V$) does not contain any character names, but only $V^{person} = \{\text{MaleCoref}, \text{FemaleCoref}, \text{MaleName}, \text{FemaleName}\} \subset V$.

In the following we discuss how we obtain a $v_\tau$ which allows us to predict the correct word and at the same time solve the grounding and co-reference problem. We formulate the problem in terms of tracks which are the result of the head tracking in Section 4.1.2. We have tracks $t_c \in T^c$ in the current clip ($C = |T^c|$), and tracks $t_p \in T^p$ in the previous clip ($P = |T^p|$). We always assume the sentences in the previous clip are already grounded to tracks and only consider those tracks which correspond to mentions of characters in the sentence. Whenever we generate a word $w_\tau$ which refers to a person, $w_\tau \in V^{person}$, the task is to also select which track $t_{\hat{c}}$ it corresponds to in the current clip and which track $t_{\hat{p}}$ in the previous clip. To account for the case when the person was not mentioned in the previous sentence we include $t_0$ in $T^p$, which represents a null track that has to be selected to indicate that we describe a new name. As we are only modeling two consecutive clips at a time, this means that if $t_{\hat{p}} = t_0$ we want to generate MaleName or FemaleName, and MaleCoref or FemaleCoref otherwise.
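The relation between the selected previous track and the person token can be stated as a simple consistency rule. In the model the word itself is predicted by the LSTM; the hypothetical helper below only spells out which token is consistent with a given link, using index 0 for the null track.

```python
NULL_TRACK = 0  # index of the null track t_0 among the previous-clip tracks


def person_token(prev_track_index, gender):
    """Return the person token that matches a selected previous track.

    gender is "Male" or "Female"; linking to the null track means a new name.
    """
    if gender not in ("Male", "Female"):
        raise ValueError("gender must be 'Male' or 'Female'")
    suffix = "Name" if prev_track_index == NULL_TRACK else "Coref"
    return gender + suffix
```

For instance, `person_token(0, "Female")` gives `"FemaleName"`, while a link to any real previous track, such as `person_token(3, "Male")`, gives `"MaleCoref"`.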

Track re-identification for visual co-reference. To estimate similarity of two tracks $t_p$ and $t_c$ we learn a weighting after element-wise multiplication¹:


For $p = 0$, which indicates that no similar track exists, we set $v_{id}(t_0, t_c) = -1$. In preliminary experiments we found that this works better than 0, as values $v_{id}$ are close to 0.
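One possible reading of "a weighting after element-wise multiplication" is a learned weighted sum over the element-wise product of the two head representations. The sketch below encodes that reading as an assumption; the exact functional form (for example, whether a bias is used) is not recoverable from the text here. The fixed score of -1 for the null track is taken from the text.

```python
def reid_score(v_head_prev, v_head_cur, weights):
    """Similarity score between a previous-clip and a current-clip track.

    v_head_prev: head feature of the previous track, or None for the null track.
    v_head_cur : head feature of the current track.
    weights    : learned per-dimension weights (assumed form, no bias term).
    """
    if v_head_prev is None:
        return -1.0  # null track t_0: no similar previous track exists
    return sum(w * a * b for w, a, b in zip(weights, v_head_prev, v_head_cur))
```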

Learning grounding and co-reference jointly. The goal of our approach is to select a track $t_{\hat{c}}$ and the corresponding previous track $t_{\hat{p}}$ which matches the person we are describing with the current word at time $\tau$; in other words, we ground this person in $t_{\hat{c}}$ and link it to $t_{\hat{p}}$. As noted above, if $t_{\hat{p}} = t_0$ there is no previous track with the same identity as $t_{\hat{c}}$. We propose to jointly predict $\hat{c}$ and $\hat{p}$ using an attention mechanism which takes into account the re-identification and visual representations as well as the hidden state $h_{\tau-1}$ of the recurrent LSTM network generating the description.

The visual features are jointly embedded in the same space as the embedding learned for the hidden state:


Afterwards the visual and hidden state representations are element-wise multiplied and we learn a function to predict the attention $\alpha$. This is inspired by [55], who combine convolutional visual features and the recurrent hidden state in the same way to predict spatial attention. Conceptually different, we predict two aspects jointly, the grounding $t_c$ and linking $t_p$ of tracks from different clips.



Figure 2: Our model. Some components are omitted for clarity, e.g. we omit the body and statistic representations.

with the tanh non-linearity $\phi(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$. The attention is normalized with softmax and then we use the predicted $\alpha$ in a weighted sum to get the new local visual representation:


We use this together with the global/holistic video representation $v_{global}$ (see Section 4.2) and the previous word $w_{\tau-1}$ to predict the next hidden state of the recurrent LSTM network as discussed above: $h_\tau = f^{LSTM}([v_{grounded}, v_{global}, w_{\tau-1}], h_{\tau-1})$.
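To make the joint attention concrete, the following sketch normalizes a P x C grid of scores with a single softmax over all (previous track, current track) pairs and forms the weighted sum. The scores and the per-pair feature vectors are hypothetical inputs: how they are computed from the embedded visual features and the hidden state is not reproduced here.

```python
import math


def joint_softmax(scores):
    """Softmax over all entries of scores[p][c] (one distribution over pairs)."""
    max_score = max(s for row in scores for s in row)
    exps = [[math.exp(s - max_score) for s in row] for row in scores]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]


def attend(alpha, pair_features):
    """Weighted sum of pair_features[p][c] (lists) with weights alpha[p][c]."""
    dim = len(pair_features[0][0])
    out = [0.0] * dim
    for p, row in enumerate(alpha):
        for c, weight in enumerate(row):
            for i in range(dim):
                out[i] += weight * pair_features[p][c][i]
    return out


def best_pair(alpha):
    """Indices (p, c) of the most attended pair: link p and grounding c."""
    pairs = [(p, c) for p in range(len(alpha)) for c in range(len(alpha[0]))]
    return max(pairs, key=lambda pc: alpha[pc[0]][pc[1]])
```

Because the softmax runs over pairs rather than over each clip separately, choosing a grounding and choosing a link are a single decision, which is the point of predicting the two aspects jointly.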

Supervising grounding and co-reference. While this system can be trained by only providing reference sentences as supervision, it is difficult to jointly correctly learn the grounding and co-reference resolution. We thus discuss in the next section how to obtain supervision for $\alpha_{p,c,\tau}$. Instead of annotating all character mentions with tracks, we try to automatically predict the correct track $t$ for each character mention $w_\tau$ in the sentence. As we have ground truth co-reference on the text side for the entire training data (Section 3), we can construct the joint ground truth $\hat{\alpha}_{p,c,\tau}$ from the groundings per clip $\hat{\alpha}_{p,\tau}$, $\hat{\alpha}_{c,\tau}$. For all non-character words $w_\tau \notin V^{person}$, no supervision and thus no loss is provided. The losses from sentence supervision and grounding/co-reference supervision are weighted equally.
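A sketch of how a joint target could be assembled from the two per-clip groundings follows. It assumes each per-clip grounding is a single track index, so the joint target is one-hot over (p, c) pairs, and it assumes a cross-entropy loss on the attention; the text states only that the two losses are weighted equally and that non-person words contribute no attention loss.

```python
import math


def joint_target(p_index, c_index, num_prev, num_cur):
    """One-hot target over (previous track, current track) pairs."""
    return [[1.0 if (p == p_index and c == c_index) else 0.0
             for c in range(num_cur)]
            for p in range(num_prev)]


def attention_loss(alpha, target, eps=1e-12):
    """Cross-entropy between predicted attention and target (assumed loss)."""
    return -sum(t * math.log(a + eps)
                for alpha_row, target_row in zip(alpha, target)
                for a, t in zip(alpha_row, target_row))


def total_loss(sentence_loss, attention_losses):
    """Equal weighting; attention_losses holds entries for person words only."""
    return sentence_loss + sum(attention_losses)
```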

5.2. Obtaining automatic supervision: linking character mentions and tracks

In this section we discuss how to ground or link a character mention with id $m_\tau$ in text at position $\tau$ to a corresponding visual track $t_c$ in the video to provide the ground truth $\hat{\alpha}_{c,\tau}$ used above. In contrast to sentence generation, here we explicitly use the character mentions $m$ (e.g. “Harry”) which appear in the text. In other words we want to robustly choose the correct track for all character mentions. Note that this is a slightly different task than in e.g. [29], who aim to link all the visual tracks to correct names. To link the name mentions in text to tracks we adapt the recently proposed semi-supervised approach GroundeR [34]. This approach was initially proposed for the task of localizing text phrases within an image without localization supervision, i.e. where the phrase is located. The main idea is to learn to attend to the right bounding box out of a set of proposals, by trying to reconstruct the phrase. We adapt this to our scenario by learning to localize a character $m_{\tau,k}$ in the set of tracks $T_k$ from clip $k$, where character $m$ is mentioned in the sentence $k$ at position $\tau$. We represent tracks with $v_{head}(t_{c,k})$ and encode character names $m$ together with an identifier of the $\mathrm{gender}(m) \in \{M, F\}$ as a separate word in an LSTM. Adding the gender allows the model to exploit correlations with different visual appearance of male versus female people and thus helps selecting the right track. In the special case when the sentence $k$ only contains a single name and the clip $k$ contains a single track, i.e. $|T_k| = 1$, we assume that grounding is correct and this information is used as additional supervision, thus enabling the semi-supervised setting of [34]. To train the model we use pairs $([\mathrm{gender}(m_{\tau,k}), m_{\tau,k}], \{v_{head}(t_{c,k})\}_{c \in \{1..C\}})$ and predict the grounding as the track with maximum attention from all the tracks in the clip.

            Training   Validation   Test
Detection   82.00      65.78        84.73
Tracking    78.53      61.65        81.41

GroundeR    78.12      84.46        80.35

Table 2: (left) Detection and tracking recall on the annotated character heads. (right) GroundeR accuracy on the annotated names/bounding boxes (evaluated on the boxes covered by the tracks). In %.
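The prediction rule and the special case used as extra supervision are easy to state in code. This is a minimal sketch with hypothetical function names; the attention values themselves would come from the adapted GroundeR model, which is not reproduced here.

```python
def predict_grounding(attention):
    """Index of the track with maximum attention among the clip's tracks."""
    return max(range(len(attention)), key=lambda c: attention[c])


def trivial_supervision(names_in_sentence, tracks_in_clip):
    """Return {name: track} when exactly one name and one track are present.

    In that case the grounding is assumed correct and can serve as a label;
    otherwise no direct supervision is available for this sentence/clip pair.
    """
    if len(names_in_sentence) == 1 and len(tracks_in_clip) == 1:
        return {names_in_sentence[0]: tracks_in_clip[0]}
    return {}
```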

We start with evaluating the quality of our person head detection and tracking. Then we look at the quality of automatic linking between character names and tracks, obtained in Section 5.2. Finally, we evaluate our complete pipeline for grounded movie description. We break down the evaluation in two parts: description quality and grounding quality.

6.1. Head detection and tracking

We evaluate our head detections and tracks on the collected bounding box annotations from Section 3. Given the annotated bounding boxes we compute detection recall by looking whether there is a head detection in a given frame that has an Intersection Over Union (IOU) ≥ 0.5 with the annotated head box. The track recall is computed similarly, based on the presence of the track that goes through the given frame while overlapping with the annotated box with IOU ≥ 0.5. Table 2 (left) shows recall on the Training, Validation and Test parts of the annotations.
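The recall computation can be sketched as below. The data layout (a list of annotated (frame, box) pairs and a dictionary from frame index to predicted boxes) is an assumption for illustration; for track recall the same function applies with the boxes of the tracks passing through each frame.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def recall(annotations, predictions, threshold=0.5):
    """Percentage of annotated boxes matched by a prediction with IOU >= 0.5.

    annotations: list of (frame_index, box).
    predictions: dict mapping frame_index to a list of predicted boxes.
    """
    if not annotations:
        return 0.0
    hits = sum(1 for frame, box in annotations
               if any(iou(box, pred) >= threshold
                      for pred in predictions.get(frame, [])))
    return 100.0 * hits / len(annotations)
```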

We analyze the missing recall of our head detector on the Training annotations. We find that there are multiple failure modes, such as motion blur, occlusion and head size (both small and large) contributing to the missing recall. On the well visible heads we achieve 93.2% recall. The tracking recall is slightly lower than the detection recall, due to the short track rejection (see Section 4.1.2). In particular, tracking can be hard when the head is observed from an unusual angle. Overall, we find that our annotations are rather challenging but the obtained performance is reasonable. We also note that our approach already works with just one good track for each character.

6.2. Linking characters to tracks

For every clip we restrict the number of tracks to at most 50. If more than 50 tracks are available we sort them by length and keep the longest, otherwise we zero-complete the missing tracks. For the previous track we consider at most 7 candidate tracks in addition to the “null” track (no match among the previous tracks). Thus there are 8 × 50 possible choices to predict the character grounding and co-reference during sentence generation. We first train the GroundeR [34] approach on Training movies only in order to estimate the hyperparameters. Next we combine the Training, Validation and Test movies and train GroundeR on this joint set. We evaluate the accuracy of the obtained predictions on the annotated pairs name/bounding box presented in Section 3. For a given name we choose the top scoring track as the grounding prediction. For this track we then check whether it contains the annotated frame and overlaps with the annotated box by IOU ≥ 0.5. Table 2 (right) shows that GroundeR is able to quite robustly predict the correct track for a given character name.
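The fixed-size track selection can be sketched as follows, assuming each track is given as a (length, feature vector) pair; the function names are hypothetical.

```python
def select_tracks(tracks, max_tracks=50):
    """Keep the longest tracks and zero-pad up to max_tracks feature vectors.

    tracks: non-empty list of (track_length, feature_vector) pairs.
    """
    if not tracks:
        raise ValueError("at least one track is needed to infer the feature size")
    dim = len(tracks[0][1])
    longest = sorted(tracks, key=lambda t: t[0], reverse=True)[:max_tracks]
    features = [list(feature) for _, feature in longest]
    features += [[0.0] * dim for _ in range(max_tracks - len(features))]
    return features


def num_joint_choices(max_prev=7, max_cur=50):
    """Previous candidates plus the null track, times current-clip tracks."""
    return (max_prev + 1) * max_cur
```

With the defaults, `num_joint_choices()` gives the 8 × 50 = 400 joint choices mentioned above.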

6.3. Evaluating description quality

We evaluate our approach in terms of description quality and compare it to a few baselines as well as prior work via both automatic and human evaluation. We report all the standard automatic measures in Table 3. For the human evaluation, judges were provided with pairs of a reference sentence and a predicted sentence, and asked to compare them w.r.t. being helpful for a blind person to follow the events in the video [38]. The judges can decide that one sentence is better than the other or that both are similar. Each pair is evaluated by three human judges. Afterwards, for every system we compute the percentage of pairs for which at least 2 out of 3 judges decided that the predicted sentence is similar to or better than the reference. The last column of Table 3 presents the results of the human evaluation.
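The human evaluation score can be illustrated with a small sketch. The vote labels and the function name below are assumptions made for the example; only the "at least 2 out of 3 judges" rule comes from the text.

```python
from typing import List

# Hypothetical vote labels: the judge's verdict on the predicted sentence
# relative to the reference sentence.
BETTER, SIMILAR, WORSE = "better", "similar", "worse"


def human_score(votes_per_pair: List[List[str]], min_agree: int = 2) -> float:
    """Percentage of pairs where at least min_agree judges rate the
    predicted sentence as similar to or better than the reference."""
    if not votes_per_pair:
        return 0.0
    favourable = sum(
        1 for votes in votes_per_pair
        if sum(v in (BETTER, SIMILAR) for v in votes) >= min_agree
    )
    return 100.0 * favourable / len(votes_per_pair)
```

For instance, a pair with votes `[BETTER, WORSE, SIMILAR]` counts towards the score, while `[WORSE, WORSE, SIMILAR]` does not.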

The top part of the table contains the reference numbers from prior works on the standard version of the corpus. We cannot use attention supervision or evaluate grounding on standard MPII-MD, which are our core contributions. It is encouraging that our reduced model “Our w/o $\alpha$” achieves similar scores to prior work.

The middle and bottom parts of the table present results on MPII-MD Co-ref+Gender. The numbers of the two settings are therefore not directly comparable, as the references changed, which strongly affects the automatic evaluation measures. To address this we evaluate the Visual-Labels approach [36] on the transformed corpus. Unlike [36], we do not ensemble multiple models. For a fair comparison with Visual-Labels in the middle part of Table 3, we provide ablations that do not have access to the previous clip character grounding but instead select the 7 biggest previous tracks, sorted by track length multiplied by average track area. We compare a variant of our approach without the body context features (“Our”), one with body features (“Our + Activity”) as described in Section 4.1.3, and one which removes the attention mechanism but uses the activity feature and encodes it jointly with the holistic feature (“Our + Activity w/o attention & co-reference”). In the bottom part of Table 3 we use the automatically obtained previous clip grounding (via Section 5.2, which has access to the previous ground-truth sentence), so that the different variants of our approach are comparable, as they obtain the same previous information. Here we compare “Our” and two variants of our approach with body features (“Our + Activity”, “Our + ResNet”). We also ablate the impact of the grounding and co-reference supervision (“Our w/o $\hat{\alpha}$”) and the statistic features (“Our w/o statistic features”).

Table 3: Left: automatic / right: human evaluation of description generation on the test set of MPII-MD; for discussion see Section 6.3.

From Table 3 we see that: a) the systems “Our” / “Our + Activity” without previous clip character grounding achieve similar or better sentence quality than the Visual-Labels baseline; b) the variant with extra body context but without attention mechanism gets a lower human score than our full system (11.0 vs. 15.0); c) providing grounding and co-reference supervision $\hat{\alpha}$ benefits the sentence quality; d) overall, body context features improve the scores, while the statistic features do not have a significant impact; e) the best result, according to human evaluation, is achieved by the variant of our approach “Our + Activity” without previous clip character grounding. A possible explanation for this is as follows. In the automatically obtained character grounding of the previous clip we might: (i) link the characters to tracks correctly; (ii) link them incorrectly; (iii) miss some links if names are absent. In case (i) we follow the storyline of the movie. If we instead use the largest tracks of the previous clip, we bias the description of the current clip in a different way, e.g. focus on the most salient characters. Thus, in some cases the obtained descriptions are ranked higher by the humans, as they only see the current clip in isolation (no storyline). In cases (ii) and (iii) it is naturally more difficult to obtain a correct description of the current clip. See Figure 3 for an example.

Figure 3: Supported by a visual co-reference to the previous clip, (2) correctly refers to a receptionist as ‘her’, rather than ‘Jacob’ (1).

6.4. Evaluating grounding quality

In this section we evaluate the correctness of the predicted grounding, co-reference and the generated character specific word $w_\tau \in \{\text{MaleCoref}, \text{FemaleCoref}, \text{MaleName}, \text{FemaleName}\}$. We evaluate our predictions with respect to the manually obtained ground-truth (Section 3) or the automatically obtained ground-truth (Section 5.2). For each of the named bounding boxes we obtain the track which overlaps with it most; in this way, for every character mention we obtain one or more associated ground-truth tracks. In total we obtain a set of 186 sentences with manually obtained grounding and co-reference. For the automatic annotations we evaluate on the complete MPII-MD Test set (6,578 sentences).

We break down the evaluation into three parts: Grounding, Grounding + Co-Reference, and Grounding + Co-Reference + $w_\tau$ (generated word). We compute precision and recall for each of these tasks and report the F1 score. Precision is computed as the percentage of predictions $\{\alpha_{p,c,\tau}, w_\tau\}$ that are present in the ground-truth. For the grounding task we only check whether the track $t_c$ is present among the ground-truth tracks. For co-reference it also has to be correctly linked to the track $t_p$ from the previous clip. For the final task the predicted word $w_\tau$, together with the track $t_c$ and the predicted co-reference $t_p$, has to be present in the ground-truth. Recall is computed in the reverse direction: for every ground-truth pair $\{\hat{\alpha}_{p,c,\tau}, \hat{w}_\tau\}$ we check whether it is among the predictions.
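The three-level evaluation above can be sketched with simple set operations. This is a simplified illustration under stated assumptions: each prediction or ground-truth item is represented as a hypothetical tuple (sentence id, current track, previous track or `None`, word), and each mention has exactly one ground-truth track, whereas the text allows one or more.

```python
from typing import Dict, Hashable, Iterable, Optional, Set, Tuple

# Hypothetical item layout: (sentence id, t_c, t_p or None, w_tau).
Item = Tuple[str, str, Optional[str], str]
Scores = Tuple[float, float, float]  # (precision, recall, F1)


def prf1(pred: Set[Hashable], gold: Set[Hashable]) -> Scores:
    """Precision, recall and F1 of a predicted set against a gold set."""
    matched = len(pred & gold)
    precision = matched / len(pred) if pred else 0.0
    recall = matched / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


def evaluate(pred: Iterable[Item], gold: Iterable[Item]) -> Dict[str, Scores]:
    """Score the three tasks by comparing progressively longer tuple prefixes."""
    pred, gold = list(pred), list(gold)
    # Prefix length 2: track only; 3: plus previous track; 4: plus word.
    tasks = {"grounding": 2, "grounding+coref": 3, "grounding+coref+word": 4}
    return {
        name: prf1({p[:k] for p in pred}, {g[:k] for g in gold})
        for name, k in tasks.items()
    }
```

Because each task compares a longer prefix of the same tuples, the scores can only stay equal or decrease from Grounding to the final task for a fixed set of predictions.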

Table 4: Grounding evaluation on test set. For discussion see Section 6.4.

The top part of Table 4 shows a set of baselines where we aim to obtain the grounding and co-reference resolution as a post-processing step after the sentence was generated. We use Visual-Labels [36] as a sentence generation baseline. We consider multiple heuristics to select the track: central position (Center), length times average area (LxA). Additionally we use a simple co-reference resolution method: if there are tracks in the previous clip, we pick the one which is most similar to the selected track as a co-reference (LxA,Sim). The similarity is estimated as $1 - \mathrm{cosine}(v_{head}(t_c), v_{head}(t_p))$. The bottom part of the table lists the variants of our approach introduced earlier.
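The LxA and similarity heuristics described above can be illustrated with the following sketch. It assumes that "cosine" in the similarity formula denotes the cosine distance, so that the similarity equals the cosine similarity of the two head feature vectors; this reading, the dictionary keys and the function names are assumptions made for the example.

```python
import math
from typing import Dict, List, Optional, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity, i.e. 1 - cosine distance (assumed reading of the text)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def pick_lxa(tracks: List[Dict]) -> Dict:
    """LxA heuristic: the track with the largest length x average area."""
    return max(tracks, key=lambda t: t["length"] * t["avg_area"])


def pick_coref(selected: Dict, previous: List[Dict]) -> Optional[Dict]:
    """Sim heuristic: the previous-clip track most similar to the selected one.

    Returns None when the previous clip has no tracks.
    """
    if not previous:
        return None
    return max(previous,
               key=lambda t: cosine_similarity(selected["head"], t["head"]))
```

Here each track is a dictionary with hypothetical keys `length`, `avg_area` and `head` (the head feature vector); the Center heuristic would replace the LxA key with a distance to the frame center.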

Table 4 (left) presents the evaluation with the manually obtained ground-truth. As we can see: a) the baselines are rather competitive in the grounding task, however they fall far below our approach in the co-reference task; b) grounding and co-reference supervision $\hat{\alpha}$ is very important to learn the co-reference prediction; c) statistics features, although they did not impact the description quality significantly, benefit the co-reference resolution; d) our approach is doing quite well in the final task, meaning that the language model correctly learns when to use co-references and recognizes the gender information.

In the last line of Table 4 we evaluate the quality of automatic ground-truth predictions from Section 5.2 with respect to our tasks. As we can see, the predictions are overall quite reliable. Encouraged by that, we perform the evaluation on this automatic ground-truth for the complete Test set, Table 4 (right). We note that the manually annotated set covers only 2.8% of the full test set, so the results on the full test set are more stable. We make the following observations: a) an ablation w/o statistic features again slightly drops in performance; b) all the baselines fall below our best approaches in all three tasks; this can be attributed to a more challenging data distribution: the complete test set contains sentences/clips where characters are absent and that has to be recognized correctly, while the manually annotated set always contains characters and is biased towards co-references; c) on this larger and more challenging test set we see that “Our + Activity” and “Our + ResNet” benefit from additional body features and achieve better performance than the basic variant “Our”; one observation we make is that these two variants are more accurate with respect to presence/absence of people in the sentence/video, which impacts the precision and thus the F1 score. In Figure 4 we provide some qualitative examples with the predictions from our approach.

Figure 4: Qualitative results of our approach on the grounded movie description task. Given a previous grounding we predict a sentence, grounding and co-reference.

In this work we address a novel task: generating descriptions with joint grounding and co-reference resolution of person mentions. We have proposed a novel approach, which relies on an attention mechanism that jointly learns to solve the grounding and co-reference resolution while learning to describe the video clip. Using an automatically learned linking between names and tracks, we can provide supervision to our approach, which significantly improves its ability to perform co-reference resolution. We demonstrate encouraging results on the complex task of grounded movie description and achieve improvements over multiple baselines. Our approach generates sentences of better quality than the baselines, as shown by automatic and human evaluation. Overall, our approach can describe video, reason about persons' identities, recognize their genders and localize them in video. We believe that this work is a first step towards fully coupling generation and grounding while performing image/video description. We will release the annotations and extracted tracks and hope that this will benefit other researchers who work on linguistic and/or visual co-reference resolution, movie question answering, visual storytelling, and multi-sentence video description.

We would like to thank Trevor Darrell for helpful discussions. Marcus Rohrbach was supported by the Berkeley Artificial Intelligence Research (BAIR) Lab.

[1] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations (ICLR), 2015. 2

[2] S. Bergsma and D. Lin. Bootstrapping path-based pronoun resolution. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 33–40. Association for Computational Linguistics, 2006. 2

[3] P. Bojanowski, F. Bach, I. Laptev, J. Ponce, C. Schmid, and J. Sivic. Finding actors and actions in movies. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2013. 1, 2

[4] S. Chopra and M. Rao. The partition problem. Mathematical Programming, 59(1–3):87–115, 1993. 4

[5] T. Cour, B. Sapp, C. Jordan, and B. Taskar. Learning from ambiguously labeled images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009. 1, 2

[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009. 4

[7] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. 2

[8] M. Everingham, J. Sivic, and A. Zisserman. “hello! my name is... buffy” - automatic naming of characters in tv video. In Proceedings of the British Machine Vision Conference (BMVC), 2006. 1, 2

[9] J. R. Finkel, T. Grenager, and C. Manning. Incorporating non-local information into information extraction systems by gibbs sampling. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 363–370, 2005. 3

[10] R. Girshick. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1440–1448, 2015. 3

[11] G. Gkioxari, R. Girshick, and J. Malik. Contextual action recognition with r* cnn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1080–1088, 2015. 4

[12] M. Grötschel and Y. Wakabayashi. A cutting plane algorithm for a clustering problem. Mathematical Programming, 45(1):59–96, 1989. 4

[13] S. Guadarrama, N. Krishnamoorthy, G. Malkarnenkar, S. Venugopalan, R. Mooney, T. Darrell, and K. Saenko. Youtube2text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2013. 2

[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. 4

[15] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997. 5

[16] J. Hoffman, S. Guadarrama, E. Tzeng, J. Donahue, R. Girshick, T. Darrell, and K. Saenko. LSDA: Large scale detection through adaptation. In Advances in Neural Information Processing Systems (NIPS), 2014. 5

[17] R. Hu, H. Xu, M. Rohrbach, J. Feng, K. Saenko, and T. Darrell. Natural language object retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. 2

[18] T.-H. Huang, F. Ferraro, N. Mostafazadeh, I. Misra, A. Agrawal, J. Devlin, R. Girshick, X. He, P. Kohli, D. Batra, C. L. Zitnick, D. Parikh, L. Vanderwende, M. Galley, and M. Mitchell. Visual storytelling. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2016. 1, 2

[19] J. Johnson, A. Karpathy, and L. Fei-Fei. Densecap: Fully convolutional localization networks for dense captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. 2

[20] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. 2

[21] C. Kong, D. Lin, M. Bansal, R. Urtasun, and S. Fidler. What are you talking about? text-to-image coreference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014. 2

[22] W. Li, R. Zhao, T. Xiao, and X. Wang. Deepreid: Deep filter pairing neural network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014. 2

[23] D. Lin, S. Fidler, C. Kong, and R. Urtasun. Visual semantic search: Retrieving videos via complex textual queries. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2657–2664. IEEE, 2014. 2

[24] D. Lin, S. Fidler, C. Kong, and R. Urtasun. Generating multi-sentence natural language descriptions of indoor scenes. In Proceedings of the British Machine Vision Conference (BMVC), 2015. 2

[25] C. Liu, J. Mao, F. Sha, and A. Yuille. Attention correctness in neural image captioning. In Proceedings of the Conference on Artificial Intelligence (AAAI), 2017. 2

[26] B. D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pages 674–679, 1981. 4

[27] J. Mao, J. Huang, A. Toshev, O. Camburu, A. Yuille, and K. Murphy. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. 2

[28] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille. Deep captioning with multimodal recurrent neural networks (m-rnn). In Proceedings of the International Conference on Learning Representations (ICLR), 2015. 2

[29] O. M. Parkhi, E. Rahtu, and A. Zisserman. It’s in the bag: Stronger supervision for automated face labelling. In Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCV Workshops), 2015. 1, 2, 4, 6

[30] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In Proceedings of the British Machine Vision Conference (BMVC), 2015. 2, 4

[31] L. Pishchulin, M. Andriluka, and B. Schiele. Fine-grained activity recognition with holistic and pose based features. In Proceedings of the German Conference on Pattern Recognition (GCPR), pages 678–689, 2014. 4

[32] B. Plummer, L. Wang, C. Cervantes, J. Caicedo, J. Hockenmaier, and S. Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015. 2

[33] V. Ramanathan, A. Joulin, P. Liang, and L. Fei-Fei. Linking people in videos with “their” names using coreference resolution. In Proceedings of the European Conference on Computer Vision (ECCV), 2014. 1, 2, 3

[34] A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, and B. Schiele. Grounding of textual phrases in images by reconstruction. In Proceedings of the European Conference on Computer Vision (ECCV), 2016. 2, 6, 7

[35] A. Rohrbach, M. Rohrbach, W. Qiu, A. Friedrich, M. Pinkal, and B. Schiele. Coherent multi-sentence video description with variable level of detail. In Proceedings of the German Conference on Pattern Recognition (GCPR), 2014. 2

[36] A. Rohrbach, M. Rohrbach, and B. Schiele. The long-short story of movie description. In Proceedings of the German Conference on Pattern Recognition (GCPR), 2015. 2, 5, 7, 8, 9

[37] A. Rohrbach, M. Rohrbach, N. Tandon, and B. Schiele. A dataset for movie description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. 1, 3, 8

[38] A. Rohrbach, A. Torabi, M. Rohrbach, N. Tandon, C. Pal, H. Larochelle, A. Courville, and B. Schiele. Movie description. International Journal of Computer Vision (IJCV), 2017. 1, 7

[39] M. Rohrbach, W. Qiu, I. Titov, S. Thater, M. Pinkal, and B. Schiele. Translating video content to natural language descriptions. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2013. 2

[40] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. 2

[41] A. Shin, K. Ohnishi, and T. Harada. Beyond caption to narrative: Video captioning with multiple sentences. In Proceedings of the IEEE International Conference on Image Processing (ICIP), 2016. 2

[42] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations (ICLR), 2015. 4

[43] J. Sivic, M. Everingham, and A. Zisserman. “who are you?” - learning person specific classifiers from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009. 1, 2

[44] Y. Sun, X. Wang, and X. Tang. Deeply learned face representations are sparse, selective, and robust. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. 2

[45] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Deepface: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014. 2

[46] S. Tang, B. Andres, M. Andriluka, and B. Schiele. Subgraph decomposition for multi-target tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5033–5041, 2015. 4

[47] M. Tapaswi, M. Baeuml, and R. Stiefelhagen. “knock! knock! who is it?” probabilistic person identification in tv series. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012. 1, 2

[48] J. Thomason, S. Venugopalan, S. Guadarrama, K. Saenko, and R. J. Mooney. Integrating language and vision to generate natural language descriptions of videos in the wild. In Proceedings of the International Conference on Computational Linguistics (COLING), 2014. 2

[49] C. Tomasi and T. Kanade. Detection and tracking of feature points. Technical Report CMU-CS-91-132, Carnegie Mellon University, 1991. 4

[50] A. Torabi, C. Pal, H. Larochelle, and A. Courville. Using descriptive video services to create a large data source for video annotation research. arXiv:1503.01070v1, 2015. 1

[51] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko. Sequence to sequence – video to text. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015. 2, 8

[52] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. 2

[53] H. Wang and C. Schmid. Action recognition with improved trajectories. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2013. 5

[54] L. Wang, Y. Li, and S. Lazebnik. Learning deep structure-preserving image-text embeddings. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. 2

[55] K. Xu, J. Ba, R. Kiros, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning (ICML), 2015. 2, 5

[56] L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville. Describing videos by exploiting temporal structure. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015. 2

[57] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo. Image captioning with semantic attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. 2

[58] H. Yu and J. M. Siskind. Grounded language learning from videos described with sentences. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2013. 2

[59] H. Yu, J. Wang, Z. Huang, Y. Yang, and W. Xu. Video paragraph captioning using hierarchical recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. 2

[60] L. Yu, P. Poirson, S. Yang, A. C. Berg, and T. L. Berg. Modeling context in referring expressions. In Proceedings of the European Conference on Computer Vision (ECCV), pages 69–85. Springer, 2016. 2

[61] M. Zanfir, E. Marinoiu, and C. Sminchisescu. Spatiotemporal attention models for grounded video captioning. In Proceedings of the Asian Conference on Computer Vision (ACCV), 2016. 2

[62] W. Zaremba and I. Sutskever. Learning to execute. arXiv preprint arXiv:1410.4615, 2014. 5

[63] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using places database. In Advances in Neural Information Processing Systems (NIPS), 2014. 5

[64] E. Zhou, Z. Cao, and Q. Yin. Naive-deep face recognition: Touching the limit of lfw benchmark or not? arXiv:1501.04690, 2015. 2
