Recent advances in the visual language field, enabled by deep learning techniques, have succeeded in bridging the gap between vision and language in a variety of tasks, ranging from describing the image [15, 7, 28, 29] to answering questions about the image [2, 5]. Such achievements were possible under the premise that there exists a set of ground truth references that are universally applicable regardless of the target, scope, or context. In real-world settings, however, image descriptions are subject to a virtually unlimited range of variability, as different viewers may pay attention to different aspects of the image in different contexts, resulting in a variety of descriptions or interpretations. Because it is subjective in nature, such diversity is difficult to obtain with conventional image description techniques.
In this paper, we propose a customized image narrative generation task, in which we attempt to actively engage the users in the description generation process by asking questions and directly obtaining their answers, thus learning and reflecting their interest in the description. We use the term image narrative to differentiate our image description from the conventional one, in which the objective is fixed as depicting factual aspects of global elements. In contrast, image narratives in our model cover a much wider range of topics, including subjective, local, or inferential elements.
Figure 1: Example of conventional image description (top) and customized image narrative (bottom).
We first describe a model for automatic image narrative generation from a single image without user interaction. We develop a self Q&A model to take advantage of the wide array of contents available in the visual question answering (VQA) task, and demonstrate that our model can generate image descriptions that are richer in content than previous models. We then apply the model to an interactive environment by directly obtaining the answers to the questions from the users. Through a wide range of experiments, we demonstrate that such interaction enables us not only to customize the image description by reflecting the user’s choice in the current image of interest, but also to automatically apply the learned preference to new images (Figure 1).
Visual Language: The workflow of extracting image features with convolutional neural network (CNN) and generating captions with long short-term memory (LSTM) [11] has been consolidated as a standard for image captioning task. [15] generated region-level descriptions by implementing alignment model of region-level CNN and bidirectional recurrent neural network (RNN). [13] proposed DenseCap that generates multiple captions from an image at region-level. [12] built SIND dataset whose image descriptions display a more casual and natural tone, involving aspects that are not factual and visually apparent. While this work resembles the motivation of our research, it requires a sequence of images to fully construct a narrative.
Visual question answering (VQA) has taken the interaction of language and vision to a new stage, by enabling a machine to answer a variety of questions about the image, not just describe certain aspects of the image. A number of different approaches have been proposed to tackle the VQA task, but the classification approach has been shown to outperform the generative approach [1, 14]. [8] proposed multimodal compact bilinear pooling to compactly combine the visual and textual features. [24] proposed an attention-based model to select a region from the image based on a text query. [19] introduced a co-attention model, which employs not only visual attention, but also question attention.
User Interaction: Incorporating interaction with users into the system has rapidly become a research interest. Visual Dialog [5] actively involves user interaction, which in turn affects the responses generated by the system. Its core mechanism, however, functions in an inverse direction from our model, as the users ask the questions about the image, and the system answers them. Thus, the focus is on extending the VQA system to a more context-dependent, and interactive direction. On the other hand, our model’s focus is on generating customized image descriptions, and user interaction is employed to learn the user’s interest, whereas Visual Dialog is not concerned about the users themselves.
[6] introduces an interactive game, in which the system attempts to localize the object that the user is paying attention to by asking relevant questions that narrow down the potential candidates, and obtaining answers from the users. This work is highly relevant to our work in that user’s answers directly influence the performance of the task, but our focus is on contents generation instead of object localization or gaming. Also, our model not only utilizes user’s answer for current image, but further attempts to apply it to new images. Recent works in reinforcement learning (RL) have also employed interactive environment by allowing the agents to be taught by non-expert humans [4]. However, its main purpose is to assist the training of RL agents, while our goal is to learn the user’s interest specifically.
Figure 2: Example of regions extracted from the image, and the questions generated from each region.
We first describe a model to generate image narrative that covers a wide range of topics without user interaction. We propose a self Q&A model where questions are generated from multiple regions, and VQA is applied to answer the questions, thereby generating image-relevant contents.
Region Extraction: Following [9], we first extract region candidates from the feature map of an image by applying a linear SVM trained on annotated bounding boxes at multiple scales, followed by non-maximum suppression. The region candidates then go through an inverse cascade from the upper, fine layer to the lower, coarser layers of the CNN, in order to better localize the detected objects. This results in region proposals that are more contents-oriented than selective search [26] or Edge Boxes [17]. We first extracted the top 10 regions per image. Figure 2 shows an example of the regions extracted in this way. In the experiments to follow, we set the number of region proposals K to 5, since the region proposals beyond the top 5 tended to be less congruent, thus generating less relevant questions.
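The suppression and top-K selection steps can be illustrated with a minimal sketch. This is not the DeepProposal implementation of [9]: the box format `(x0, y0, x1, y1)`, the function names, and the IoU threshold of 0.5 are assumptions made for illustration only.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def top_k_regions(boxes, scores, k=5, iou_threshold=0.5):
    """Greedy non-maximum suppression, then keep the k best survivors.

    iou_threshold is a hypothetical value; the paper does not state one.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
        if len(kept) == k:
            break
    return [boxes[i] for i in kept]
```

With `k=5`, the function returns at most five mutually non-overlapping proposals in descending score order, which corresponds to the setting K = 5 used in the experiments.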
Visual Question Generation: In image captioning task, it is conventional to train an LSTM with human-written captions as ground truth annotations. On the other hand, in VQA task, questions are frequently inserted to LSTM in series with fixed image features, and the answers to the questions become the ground truth labels to be classified. Instead, we replace the human-written captions with human-written questions, so that LSTM is trained to predict the question, rather than caption.
Given an image $I$ and a question $Q = (q_0, \dots, q_N)$, the training proceeds as in [28]:
$$x_{-1} = \mathrm{CNN}(I), \qquad x_t = W_e q_t, \qquad p_{t+1} = \mathrm{LSTM}(x_t), \qquad t \in \{0, \dots, N-1\}$$
where $W_e$ is a word embedding, $x_t$ is the input feature to the LSTM at $t$, and $p_{t+1}$ is the resulting probability distribution over the entire dictionary at $t$. In the actual generation of questions, this is performed over all region proposals $r_0, \dots, r_N \in I$:
$$x_{-1} = \mathrm{CNN}(r_i), \qquad x_t = W_e q_t, \qquad p_{t+1} = \mathrm{LSTM}(x_t)$$
for $q_0, \dots, q_N \in Q_{r_i}$. Figure 2 shows examples of questions generated from each region including the entire image. As shown in the figure, by focusing on different regions and extracting different image features, we can generate multiple image-relevant questions from a single image.
Table 1: Examples of questions generated using non-visual questions in VQG dataset.
So far, we were concerned with generating “visual” questions. We also seek to generate “non-visual” questions. [21] generated questions that a human may naturally ask and that require common sense and inference. We examined whether we can train a network to ask multiple questions of this type from visual cues. We replicated the image captioning process described above, with 10,000 images from the MS COCO and Flickr segments of the VQG dataset, with 5 questions per image as the annotations. Examples of questions generated by training the network solely with non-visual questions are shown in Table 1.
Visual Question Answering: We now seek to answer the questions generated. We train the question answering system with VQA dataset [2]. Question words are sequentially encoded by LSTM as one-hot vector. Hyperbolic tangent non-linearity activation was employed, and elementwise multiplication was used to fuse the image and word features, from which softmax classifies the final label as the answer for visual question. We set the number of possible answers as 1,250.
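The fusion step described above can be sketched as follows. This is only an illustration under stated assumptions: the projection matrices `W_img`, `W_q`, and `W_out` are hypothetical placeholders for learned weights, and applying tanh to each projected modality before the element-wise product is one plausible reading of the description, not a confirmed detail of the model.

```python
import math


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]


def answer_distribution(image_feat, question_feat, W_img, W_q, W_out):
    """Fuse image and question features and classify over answer labels.

    W_img, W_q, W_out are hypothetical learned weights (assumption).
    """
    img = [math.tanh(x) for x in matvec(W_img, image_feat)]
    qst = [math.tanh(x) for x in matvec(W_q, question_feat)]
    fused = [a * b for a, b in zip(img, qst)]  # element-wise multiplication
    return softmax(matvec(W_out, fused))
```

In the setting described above, `W_out` would have 1,250 rows, one per candidate answer, and the predicted answer is the index with the highest probability.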
As we augmented the training data with “non-visual” questions, we also need to train the network to “answer” those non-visual questions. Since [21] provides the questions only, we collected the answers to these questions on Amazon Mechanical Turk. Since many of these questions cannot be answered without specific knowledge beyond what is seen in the image (e.g. “what is the name of the dog?”), we encouraged the workers to use their imagination, but required them to come up with answers that an average person might also think of. For example, people frequently answered the question “what is the name of the man?” with “John” or “Tom.” Such non-visual elements add vividness and story-like characteristics to the narrative as long as they are compatible with the image, even if not entirely verifiable.
Figure 3: Example of question and answer converted to a declarative sentence by conversion rule.
Natural Language Processing: We are now given multiple pairs of questions and answers about the image. By design of the VQA dataset, which mostly comprises simple questions regarding only one aspect with the answers mostly being single words, the grammatical structure of most questions and answers can be reduced to a manageable pool of patterns. Exploiting these design characteristics, we combine the obtained pairs of questions and answers to a declarative sentence by application of rule-based transformations, as in [23, 25].
We first rephrase the question as a declarative sentence by switching word positions, and then insert the answer at its appropriate position, mostly replacing wh-words. For example, a question “What is the man holding?” is first converted to a declarative statement “The man is holding what” and the corresponding answer “frisbee” replaces “what” to make “The man is holding frisbee.” Part-of-speech tags, with limited usage of the parse tree, were used to guide the process, particularly conjugation according to tense and plurality. Figure 3 illustrates the workflow of converting a question and answer to a declarative sentence. See Supplemental Material for specific conversion rules. Part-of-speech tag notation follows the Penn Treebank I tags [20].
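As a rough illustration of the idea, the sketch below handles a single question pattern, "What <auxiliary> <determiner> <noun> <rest>?". It is not the rule set from the Supplemental Material: the function name is hypothetical, and the subject is assumed to be exactly two tokens, whereas the actual rules rely on part-of-speech tags.

```python
def to_declarative(question, answer):
    """Convert 'What <aux> <det> <noun> <rest>?' plus an answer to a statement.

    Simplifying assumption: the subject is the two tokens after the auxiliary.
    """
    tokens = question.rstrip("?").split()
    if len(tokens) < 4 or tokens[0].lower() != "what":
        raise ValueError("question pattern not covered by this sketch")
    aux, subject, rest = tokens[1], tokens[2:4], tokens[4:]
    # Move the auxiliary after the subject; the answer takes the wh-word's slot.
    words = subject + [aux] + rest + [answer]
    sentence = " ".join(words)
    return sentence[0].upper() + sentence[1:] + "."
```

For the example in the text, `to_declarative("What is the man holding?", "frisbee")` yields "The man is holding frisbee." Questions outside this one pattern raise an error rather than produce a wrong sentence.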
We now extend the automatic image narrative generation model described in Section 3 to interactive environment, in which users participate in the process by answering questions about the image, so that generated narrative varies depending on the user input provided.
4.1. Applying Interaction within the Same Images
4.1.1 Question with Multiple Possible Answers
Figure 4: Questions that allow for multiple responses are generated to reflect user’s interest and corresponding regions proceed to image narrative generation process.
As discussed earlier, we attempt to reflect the user’s interest by asking questions that provide visual context. The foremost prerequisite for the interactive questions to perform that function is the possibility of various answers or interpretations. In other words, a question whose answer is so obvious that everyone would answer it in an identical way would not be valid as an interactive question. In order to make sure that each generated question allows for multiple possible answers, we internally utilize the VQA module. The question generated by the VQG module is passed on to the VQA module, where the probability distribution $p$ over all candidate answers $C$ is determined. If the most likely candidate $c^* = \arg\max_{c \in C} p(c)$, where $c^* \in C$, has a probability of being the answer over a certain threshold $\alpha$, then the question is considered to have a single obvious answer, and is thus considered ineligible. The next question generated by VQG is passed on to VQA to repeat the same process until the following requirement is met:
$$\max_{c \in C} p(c) < \alpha$$
In our experiments, we set $\alpha$ to 0.33. We also excluded yes/no questions. Figure 4 illustrates an example of a question where the most likely answer had a probability over the threshold (and is thus ineligible), and another question whose probability distribution over the candidate answers was more evenly distributed (and thus proceeds to the narrative generation stage).
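The eligibility check can be sketched as below. The function and parameter names are hypothetical, `vqa` stands in for any callable that maps a question to a dictionary of answer probabilities, and treating a question as yes/no when its most likely answer is "yes" or "no" is an assumption, since the text does not say how such questions were detected.

```python
def is_interactive_question(answer_probs, threshold=0.33, yes_no=("yes", "no")):
    """Return True if no single answer is obvious.

    answer_probs maps each candidate answer to its VQA probability.
    """
    best_answer = max(answer_probs, key=answer_probs.get)
    if best_answer in yes_no:  # assumption: proxy for a yes/no question
        return False
    return answer_probs[best_answer] < threshold


def first_eligible(questions, vqa):
    """Return the first generated question that passes the check, else None."""
    for question in questions:
        if is_interactive_question(vqa(question)):
            return question
    return None
```

A question whose top answer has probability 0.9 is rejected, while one whose top answer has probability 0.30 passes with the threshold of 0.33.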
4.1.2 Region Extraction
Once the visual question that allows for multiple responses is generated, a user inputs their answer to the question, which is assumed to reflect their interest. We then need to extract a region within the image that corresponds to the user’s response. We slightly modify the attention networks introduced in [30] in order to obtain the coordinates of the region that corresponds to the user response. In [30], the question itself was fed into the network, so that the region necessary to answer that question is “attended to.” In our case, however, we are already given the answer to the question by the user. We take advantage of this by making a simple yet efficient modification, in which we replace the wh- question terms with the response provided by the user. For example, a question “what is on the table?” with a user response “pizza” will be converted to a phrase “pizza is on the table,” which is fed into the attention network. This is similar to the rule-based NLP conversion in Section 3. We obtain the coordinates of the region from the second attention layer, by taking the minimum and maximum values along the x-axis and y-axis over which the attention layer reacts to the input phrase. Since the regions are likely to contain the objects of interest at a very tight scale, we extracted the regions at slightly larger sizes than the coordinates. A region of size $(w, h)$ with coordinates $(x_{\min}, y_{\min}, x_{\max}, y_{\max})$ for image $I$ of size $(W, H)$ is extracted with a magnifying factor $\beta$ (set as 0.25):
$$\big(\max(0,\, x_{\min} - \beta w),\ \max(0,\, y_{\min} - \beta h),\ \min(W,\, x_{\max} + \beta w),\ \min(H,\, y_{\max} + \beta h)\big)$$
Given the region and its features, we can now apply the image narrative generation process described in Section 3 with minor modifications in setting. Regions are further extracted, visual questions are generated and answered, and rule-based natural language processing techniques are applied to organize them. Figure 4 shows an overall workflow of our model.
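The region extraction in Section 4.1.2 can be sketched in two steps: take the bounding box of the attended cells, then enlarge it. Both functions are illustrative assumptions. The cutoff `min_weight` that decides where the attention layer "reacts" is hypothetical, the mapping from attention-grid cells to pixel coordinates is omitted, and enlarging each side by the magnifying factor times the region size, clipped to the image, is one plausible reading of the description rather than the confirmed formula.

```python
def attended_box(attention, min_weight):
    """Bounding box (x0, y0, x1, y1) of grid cells with attention above min_weight.

    attention is a 2D list indexed as attention[y][x].
    """
    cells = [(x, y) for y, row in enumerate(attention)
             for x, weight in enumerate(row) if weight > min_weight]
    if not cells:
        return None
    xs = [c[0] for c in cells]
    ys = [c[1] for c in cells]
    return min(xs), min(ys), max(xs), max(ys)


def expand_box(box, image_size, factor=0.25):
    """Enlarge a box by factor * its size on every side, clipped to the image."""
    x0, y0, x1, y1 = box
    width, height = image_size
    dx, dy = factor * (x1 - x0), factor * (y1 - y0)
    return (max(0, x0 - dx), max(0, y0 - dy),
            min(width, x1 + dx), min(height, y1 + dy))
```

For a box `(40, 40, 80, 60)` in a 100 x 100 image, the enlarged region is `(30.0, 35.0, 90.0, 65.0)`. Boxes near the border are clipped so the region never leaves the image.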
4.2. Applying Interaction to New Images
We represent each instance of image, question, and user choice as a triplet consisting of image feature, question feature, and the label vector for the user’s answer. In addition, collecting multiple choices from identical users enables us to represent any two instances by the same user as a pair of triplets, assuming source-target relation. With these pairs of triplets, we can train the system to predict a user’s choice on a new image and a new question, given the same user’s choice on the previous image and its associated question. User’s choice is represented as one-hot vector where the size of the vector is equal to the number of possible choices. We refer to the fused feature representation of this triplet consisting of image, question, and the user’s choice as choice vector.
Figure 5: Training with pair of choices made by the same user. Given the choice vector for image 1 and new image feature and question feature for image 2, it is trained to predict the answer for the question on image 2.
We now project the image feature $x_I'$ and question feature $x_Q'$ for the second triplet onto the same embedding space as the choice vector. We can now train a softmax classification task in which the feature from the common embedding space predicts the user’s choice $a'$ on the new question. In short, we postulate that the answer with index $u$, which maximizes the probability calculated by LSTM, is to be chosen as $a'$ by the user who chose $a$, upon seeing a tuple $(I', Q')$ of new image and new question:
$$u = \arg\max_{v} P(v;\, c_a, x_I', x_Q')$$
where $P$ is a probability distribution determined by softmax over the space of possible choices, and $c_a$ is the choice vector corresponding to $a$. This overall procedure and structure are essentially identical to those of the VQA task, except that we augment the feature space to include the choice vector. Figure 5 shows the overall workflow for training.
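The prediction step can be illustrated with a minimal sketch. It assumes the choice vector and the projected image and question features are already available as plain lists, combines them by concatenation, and applies one hypothetical weight matrix `W` before the softmax. The concatenation and the single linear layer are assumptions for illustration; the text only states that the feature space of the VQA classifier is augmented with the choice vector.

```python
import math


def predict_choice(choice_vec, image_feat, question_feat, W):
    """Return (u, probs): the most likely choice index and the distribution.

    W is a hypothetical learned weight matrix with one row per possible choice.
    """
    features = choice_vec + image_feat + question_feat  # concatenation (assumption)
    logits = [sum(w * x for w, x in zip(row, features)) for row in W]
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    u = max(range(len(probs)), key=probs.__getitem__)  # arg max over choices
    return u, probs
```

In the experiments reported later, the output layer would have 705 rows, one per possible choice.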
5.1. Automatic Image Narrative Generation
5.1.1 Setting
We applied the model described in Section 3 to 40,775 images in the test 2014 split of MS COCO [18]. We compare our proposed model to three baselines as follows:
Baseline 1 (COCO): general captioning trained on MS COCO applied to both images in their entireties and the region proposals
Baseline 2 (SIND): captions with model trained on MS SIND dataset [12], applied to both images in their entireties and the region proposals
Baseline 3 (DenseCap): captions generated by DenseCap [13] at both the whole images and regions with top 5 scores using their own region extraction implementation.
5.1.2 Evaluation
Automatic Evaluation: It is naturally of interest to us how humans would actually write image narratives. Human-written narratives not only serve as references for automatic evaluation, but also show which characteristics appear in actual image narratives. We collected image narratives for a subset of the MS COCO dataset. We asked the workers to write a 5-sentence narrative about the image in a story-like way. We made it clear that the description can involve not only factual description of the main event, but also local elements, sentiments, inference, imagination, etc., provided that it can relate to the visual elements shown in the image. Table 2 shows examples of actual human-written image narratives collected, and they display a number of intriguing remarks. On top of the elements and styles we asked for, the participants actively employed many other elements encompassing humor, question, suggestion, etc. in a highly creative way. It is also clear that conventional captioning alone will not be able to capture or mimic the semantic diversity present in them.
Table 2: Examples of human-written image narratives collected on Amazon Mechanical Turk.
Table 3: Performances of the generated image narratives with human-written image narratives as ground truth.
Table 4: Each model’s performance on DIANE.
Table 5: Our model against each baseline model, with $\chi^2$ values on 2 degrees of freedom and one-sided p-values from binomial probability.
We performed automatic evaluation with BLEU [22] with collected image narratives as ground truth annotations. Table 3 shows the results. While resemblance to human-written image narratives may not necessarily guarantee better qualities, our model, along with DenseCap, showed the highest resemblance to human-written image narratives. As we will see in human evaluation, such tendency turns out to be consistent, suggesting that resemblance to human-written image narratives may indeed provide a meaningful reference.
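For readers unfamiliar with BLEU [22], the sketch below computes only its unigram variant: clipped unigram precision multiplied by the brevity penalty. It is a simplified illustration, not the evaluation script used for Table 3. Whitespace tokenization, lowercasing, and the restriction to unigrams are assumptions.

```python
import math
from collections import Counter


def bleu1(candidate, references):
    """Unigram BLEU: clipped precision times brevity penalty."""
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    if not cand:
        return 0.0
    # A candidate word is credited at most as often as it appears in any one reference.
    max_ref = Counter()
    for ref in refs:
        for word, count in Counter(ref).items():
            max_ref[word] = max(max_ref[word], count)
    clipped = sum(min(count, max_ref[word]) for word, count in Counter(cand).items())
    precision = clipped / len(cand)
    # Brevity penalty uses the reference length closest to the candidate length.
    ref_len = min((len(r) for r in refs), key=lambda n: (abs(n - len(cand)), n))
    penalty = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return penalty * precision
```

The clipping step is what prevents a degenerate narrative such as "the the the the" from scoring well against a reference that contains "the" only once.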
Human Evaluation: We asked the workers to rate each model’s narrative with 5 metrics that we find essential in evaluating narratives: Diversity, Interestingness, Accuracy, Naturalness, and Expressivity (DIANE). Evaluation was performed for 5,000 images with 2 workers per image, and all metrics were rated on a scale of 1 to 5, with 5 being the best performance in each metric. We asked each worker to rate all 4 models for the image on all metrics.
Table 6 shows example narratives from each model. Table 4 shows the performance of each model on the evaluation metrics, along with the percentage of each model receiving the highest score for a given image, including ties with other models. Our model obtained the highest score on Diversity, Interestingness and Expressivity, along with the highest overall score and the highest percentage of receiving best scores. In all other metrics, our model was the second highest, closely trailing the models with the highest scores. Table 5 shows our model’s performance against each baseline model, in terms of the counts of wins, losses, and pars. $\chi^2$ values on 2 degrees of freedom are evaluated against the null hypothesis that all models are equally preferred. The rightmost column in Table 5 corresponds to the one-sided p-values obtained from binomial probability against the same null hypothesis. Both significance tests provide evidence that our model is clearly preferred over the others.
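The two significance tests can be sketched as follows. The sketch assumes that, under the null hypothesis, wins, losses, and pars are each expected one third of the time (three categories, hence 2 degrees of freedom), and that the binomial test ignores pars and treats each remaining comparison as a fair coin flip. Both readings are assumptions, and the counts in the usage note are made up for illustration.

```python
import math


def chi_square_2dof(wins, losses, pars):
    """Chi-square statistic and p-value against equal expected counts."""
    total = wins + losses + pars
    expected = total / 3
    stat = sum((obs - expected) ** 2 / expected for obs in (wins, losses, pars))
    # For 2 degrees of freedom the survival function is exactly exp(-x / 2).
    return stat, math.exp(-stat / 2)


def binomial_one_sided(wins, losses):
    """P(at least `wins` successes in wins + losses fair trials)."""
    n = wins + losses
    return sum(math.comb(n, k) for k in range(wins, n + 1)) / 2 ** n
```

With hypothetical counts of 20 wins, 5 losses, and 5 pars, the statistic is 15.0, and 8 wins against 2 losses gives a one-sided p-value of 56/1024.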
Discussion: General image captioning trained on MS COCO shows weaknesses in accuracy and expressivity. Lower score in accuracy is presumably due to quick diversion from the image contents as it generates captions directly from regions. Since it is restricted by an objective of describing the entire image, it frequently generates irrelevant description on images whose characteristics differ from typical COCO images, such as regions within an image as in our case. Story-like captioning trained on MS SIND obtained the lowest scores in all metrics. In fact, examples in Table 6 display that the narratives from this model are almost completely irrelevant to the corresponding images, since the correlation between single particular image and assigned caption is very low. DenseCap turns out to be the most competitive among the baseline models. It demonstrates the highest accuracy among all models, but shows weaknesses in interestingness and expressivity, due to their invariant tone and design objective of factual description. Our model, highly ranked in all metrics, demonstrates superiority in many indispensable aspects of narrative, while not sacrificing the descriptive accuracy.
5.2. Interactive Image Narrative Generation
5.2.1 Setting
We first need to obtain data that reflect personal tendencies of different users. Thus, we not only need to collect data from multiple users so that individual differences exist, but also to collect multiple responses from each user so that individual tendency of each user can be learned.
We generated 10,000 questions that allow for multiple responses following the procedure described in Section 4. We grouped every 10 questions into one task, and allowed 3 workers per task so that up to 3,000 workers can participate. Since multiple people are participating for the same group of images, we end up obtaining different sets of responses that reflect each individual’s tendency.
Each user answers 10 questions, which gives P(10, 2) = 90 ordered pairs of triplets per user, adding up to 270,000 pairs of training data. Note that we are assuming a source-to-target relation within the pair, so the order within the pair does matter. We randomly split these data into 250,000 and 20,000 for training and validation splits, and performed 5-fold validation with the training procedure described in Section 4. With 705 labels as possible choices, we had an average accuracy of 68.72 in predicting the choice on a new image, given the previous choice by the same user. Randomly matching the pairs with choices from different users drops the average score to 45.17, confirming that the consistency in user choices is a key point in learning preference.
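The pair count can be checked with a few lines. The triplet names below are placeholders; only the arithmetic (10 instances per user, ordered pairs, 3,000 users) comes from the text.

```python
from itertools import permutations


def ordered_pairs(instances):
    """All (source, target) pairs of distinct instances; order matters."""
    return list(permutations(instances, 2))


# Ten placeholder triplets answered by one user.
triplets = [f"triplet_{i}" for i in range(10)]
pairs_per_user = len(ordered_pairs(triplets))
total_pairs = pairs_per_user * 3000
```

Because the relation is source-to-target, `("triplet_0", "triplet_1")` and `("triplet_1", "triplet_0")` are counted as two different training pairs.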
5.2.2 Evaluation
Table 6: Examples of image narratives. See Supplemental Material for many more examples.
Table 7: Evaluation results on whether the generated questions allow for multiple responses.
Table 8: Examples of generated questions using our proposed model and VQG respectively.
Question Generation: For question generation, our interest is whether our model can generate questions that allow for various responses, rather than a single fixed response. We asked the workers on Amazon Mechanical Turk to decide whether the question can be answered in various ways or has multiple answers, given an image. 1,000 questions were generated with our proposed model using both VQG and VQA, and another 1,000 questions were generated using VQG only.
Table 7 shows the number of votes for each model. The questions generated from our proposed model of parallel VQG and VQA outperformed by far the questions generated from VQG only. This is to be expected, in the sense that the VQG module was trained with human-written questions that were intended to train the VQA module, i.e. with questions that mostly have clear answers. On the other hand, our model deliberately chose the questions from VQG that have evenly distributed probabilities for answer labels, thus permitting multiple possible responses. Table 8 shows examples of visual questions generated from our model and from VQG only, respectively. In questions generated from our model, different responses are possible, whereas the questions generated from VQG only are restricted to a single obvious answer.
Reflection of User’s Choice on the Same Image: Our next experiment is on the user-dependent image narrative generation. We presented the workers with 3,000 images and associated questions, with 3 possible choices as a response to each question. Each worker freely chooses one of the choices, and is asked to rate the image narrative that corresponds to the answer they chose, considering how well it reflects their answer choice. As a baseline model, we examined a model where the question is absent in the learning and representation, so that only the image and the user input are provided. Rating was performed on a scale of 1 to 5, with 5 indicating highly reflective of their choice. Table 11 shows the result. The agreement score among the workers was calculated based on [3]. The agreement score for our model falls into the range of ‘moderate’ agreement, whereas, for the baseline model, it is at the lower range of ‘fair’ agreement, as defined by [16], demonstrating that the users more frequently agreed upon the reliability of the image narratives for our model. Our model clearly has an advantage over using image features only, with a margin considerably over the standard deviation. Table 9 shows examples of images, generated questions, and image narratives generated depending on the choice made for each question.
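The agreement statistic of [3] is commonly written as S = (k * P_o - 1) / (k - 1), where P_o is the observed proportion of agreement and k is the number of rating categories. The sketch below computes it for two raters and maps the score to the verbal ranges of [16]. Using k = 5 for the 1-to-5 rating scale and comparing two raters per item are assumptions about how the score was applied here.

```python
def bennett_s(ratings_a, ratings_b, num_categories=5):
    """Bennett, Alpert and Goldstein's S for two raters over the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equally long, non-empty rating lists")
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / len(ratings_a)
    k = num_categories
    return (k * observed - 1) / (k - 1)


def landis_koch_label(score):
    """Verbal strength of agreement following Landis and Koch."""
    if score < 0:
        return "poor"
    if score <= 0.20:
        return "slight"
    if score <= 0.40:
        return "fair"
    if score <= 0.60:
        return "moderate"
    if score <= 0.80:
        return "substantial"
    return "almost perfect"
```

Two raters who agree on 3 of 5 items on a 5-point scale obtain S = 0.5, which falls in the ‘moderate’ range.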
Table 9: Examples of image narratives generated depending on the user choices.
Table 11: Evaluation results on how well the generated image narrative reflects the choices they made.
Table 12: Evaluation results on how well the generated image narrative for new images reflects their interest.
Reflection of User’s Choice on New Images: Finally, we experiment with applying the user’s interest to new images. As in the previous experiment, each worker is presented with an image and a question, with 3 possible choices as an answer to the question. After they choose an answer, they are presented with a new image and a new image narrative. Their task is to determine whether the newly presented image narrative reflects their choice and interest. As a baseline, we again examined a model where the question is absent in the learning and representation stages. In addition, we performed an experiment in which we trained the preference learning module with randomly matched choices. This allows us to examine whether there exists a consistency in user choices that enables us to apply the learned preferences to new image narratives.
Table 12 shows the result. As in the previous experiment, our model clearly has an advantage over using image features only. The inter-rater agreement score is also more stable for our model. Training the preference learning module with randomly matched pairs of choices resulted in a score below our proposed model, but above using the image features only. This may imply that, even with randomly matched pairs, it is better to train with actual choices made by the users with regard to specific questions, rather than with conspicuous objects only. Overall, the result confirms that it is highly important to provide a context, in our case by generating visual questions, for the system to learn and reflect the user’s specific preferences. It also shows that it is important to train with consistent choices made by identical users. Table 10 shows examples of image narratives generated for new images, depending on the choice the users made for the original image, given the respective questions.
We proposed a customized image narrative generation task, along with a model that engages the users in image description generation by directly asking them questions and collecting their answers. Experimental results demonstrate that our model can successfully diversify the image description by reflecting the user’s choice, and that the learned user interest can be further applied to new images.
This work was partially funded by the ImPACT Program of the Council for Science, Technology, and Innovation (Cabinet Office, Government of Japan), and was partially supported by CREST, JST.
[1] A. Agrawal, D. Batra, and D. Parikh. Analyzing the behavior of visual question answering models. In EMNLP, 2016.
[2] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh. VQA: Visual Question Answering. In ICCV, 2015.
[3] E. Bennett, R. Alpert, and A. Goldstein. Communications through limited-response questioning. Public Opinion Quarterly, 18:303–308, 1954.
[4] P. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg, and D. Amodei. Deep reinforcement learning from human preferences. https://arxiv.org/abs/1706.03741, 2017.
[5] A. Das, S. Kottur, A. Singh, D. Yadav, J. M. Moura, D. Parikh, and D. Batra. Visual dialog. In CVPR, 2017.
[6] H. de Vries, F. Strub, S. Chandar, O. Pietquin, H. Larochelle, and A. C. Courville. GuessWhat?! Visual object discovery through multi-modal dialogue. In CVPR, 2017.
[7] H. Fang, S. Gupta, F. N. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt, C. L. Zitnick, and G. Zweig. From captions to visual concepts and back. In CVPR, 2015.
[8] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. In EMNLP, 2016.
[9] A. Ghodrati, A. Diba, M. Pedersoli, T. Tuytelaars, and L. J. V. Gool. Deepproposal: Hunting objects by cascading deep convolutional layers. In ICCV, 2015.
[10] R. Girshick. Fast r-cnn. In ICCV, 2015.
[11] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[12] T.-H. Huang, F. Ferraro, N. Mostafazadeh, I. Misra, A. Agrawal, J. Devlin, R. Girshick, X. He, P. Kohli, D. Batra, C. L. Zitnick, D. Parikh, L. Vanderwende, M. Galley, and M. Mitchell. Visual storytelling. In NAACL, 2016.
[13] J. Johnson, A. Karpathy, and L. Fei-Fei. DenseCap: Fully Convolutional Localization Networks for Dense Captioning. In CVPR, 2016.
[14] K. Kafle and C. Kanan. Visual question answering: Datasets, algorithms, and future challenges. https://arxiv.org/abs/1610.01465, 2016.
[15] A. Karpathy and F.-F. Li. Deep Visual-Semantic Alignments for Generating Image Descriptions. In CVPR, 2015.
[16] J. R. Landis and G. G. Koch. The measurement of observer agreement for categorical data. Biometrics, 33(1), 1977.
[17] C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In ECCV, 2014.
[18] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common Objects in Context. In ECCV, 2014.
[19] J. Lu, J. Yang, D. Batra, and D. Parikh. Hierarchical question-image co-attention for visual question answering. In NIPS, 2016.
[20] M. Marcus, G. Kim, M. A. Marcinkiewicz, R. MacIntyre, A. Bies, M. Ferguson, K. Katz, and B. Schasberger. The Penn Treebank: Annotating predicate argument structure. In HLT, 1994.
[21] N. Mostafazadeh, I. Misra, J. Devlin, M. Mitchell, X. He, and L. Vanderwende. Generating natural questions about an image. In ACL, 2016.
[22] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: A method for automatic evaluation of machine translation. In ACL, 2002.
[23] M. Ren, R. Kiros, and R. Zemel. Exploring Models and Data for Image Question Answering. In NIPS, 2015.
[24] K. Shih, S. Singh, and D. Hoiem. Where to look: Focus regions for visual question answering. In CVPR, 2016.
[25] A. Shin, Y. Ushiku, and T. Harada. The color of the cat is gray: 1 million full-sentences visual question answering (FSVQA). arXiv:1609.06657, 2016.
[26] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. IJCV, 104:154–171, 2013.
[27] R. Vedantam, C. L. Zitnick, and D. Parikh. Cider: Consensus-based image description evaluation. In CVPR, 2015.
[28] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and Tell: A Neural Image Caption Generator. In CVPR, 2015.
[29] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In ICML, 2015.
[30] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked attention networks for image question answering. In CVPR, 2016.
Table 13: Examples of captions and questions for the same image. While captions essentially describe the same contents, questions widely vary in terms of the topics.
Figure 6: Viewer’s attention varies depending on the context provided.
A. Why generate questions?
A question may arise as to why we do not simply ask users to select the region or part of the image that stands out the most to them. In that case, there would be no need to generate a question for each image, as the single question ‘what stands out the most?’ would suffice for all images. This, however, would be equivalent to a simple saliency annotation task and would not allow any meaningful customization or optimization per user. Thus, as discussed above, generating a question for each image is intended to provide a context in which each user can apply their own specific interest. Figure 6 shows how providing context via questions can diversify people’s attention. Beyond generating diverse image narratives from user input, many potential applications are conceivable. For example, when a thorough description of an entire scene would contain redundant information, in both quality and quantity, our model can be applied to describe only the aspects that match the user’s learned interest.
Table 14: Statistics from the crowd-sourcing task on collecting answers to non-visual questions.
Table 15: Examples of answers collected on VQG.
B. Clarification of DIANE
Few works have tackled the task of narrative evaluation, and those that have hardly take visual information into consideration. Although we could not find an authoritative work on the topic of narrative evaluation, DIANE was our best attempt at reflecting not only precision and recall but also various aspects contributing to the integrity of the image narrative. Diversity deals with the coverage of diction and contents in the narrative, roughly corresponding to recall. Interestingness measures the extent to which the contents of the narrative grasp the user’s attention. Accuracy measures the degree to which the description is relevant to the image, corresponding to precision. Contents that are not visually verifiable are considered accurate only if they are compatible with salient parts of the image. Naturalness refers to the narrative’s overall resemblance to human-written text or human-spoken dialogue. Expressivity deals with the range of syntax and tones in the narrative.
Figure 7: Illustration of the overall workflow for each task.
Table 16: Examples of questions generated using non-visual questions in VQG dataset.
Table 17: Examples of human-written image narratives collected on Amazon Mechanical Turk.
C. Additional Experiments
We also performed an experiment in which we generated image narratives by following the conventional image captioning procedure with human-written image narratives collected on Amazon Mechanical Turk. In other words, we trained an LSTM with CNN features of images as input and human-written image narratives as ground-truth captions. If such a setting turned out to be successful, our model would not have much comparative merit.
Table 18: Statistics for human-written image narratives collected on Amazon Mechanical Turk.
Table 19: Examples of image narratives generated by training with human-written image narratives.
Table 20: Examples of generated questions for user interaction using our proposed model and VQG only respectively.
We trained an LSTM with the image narratives collected for the training split of MS COCO. We kept the experimental conditions identical to those of the previous experiments and trained for 50 epochs. Table 19 shows example narratives generated. Not only does the model utterly fail to learn the structure of image narratives, but it hardly generates text longer than one sentence, and even then its descriptive accuracy is very poor. Since the LSTM now has to adjust its memory cells’ dependency over much longer text, it struggles even to form a complete sentence, let alone describe the image accurately. This tells us that simply training with human-written image narratives does not result in reliable outcomes.
With human-written image narratives as references, we further performed CIDEr [27] evaluation, as shown in Table 25.
D. Discussion
The experiments above showed that there is a certain consistency in the choices made by the same user, and that it is thus beneficial to train with the choices made by the same user. Yet we also need to investigate whether such consistency exists across different categories of images. We ran Fast-RCNN [10] on the images used in our experiment, and assigned the classes with probability over 0.7 as the labels for each image. We then define any two images to be in the same category if at least one of their assigned labels overlaps. Of the 3,000 pairs of images used in the experiment, 952 pairs had at least one overlapping label. Our proposed model had an average human evaluation score of 4.35 for pairs with overlapping labels and 2.98 for pairs without overlapping labels. The baseline model with image features only scored 2.57 for pairs with overlapping labels and 2.10 for pairs without overlapping labels. Thus, a large portion of the superior performance of our model comes from the user’s consistency on images of the same category, which is an intuitively correct conclusion.
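The category analysis above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the detector output format (a list of `(label, probability)` tuples per image), the function names, and the toy data are hypothetical and are not taken from our implementation. Only the 0.7 threshold and the "at least one shared label" rule come from the text.

```python
from statistics import mean

THRESHOLD = 0.7  # probability cut-off stated in the text


def image_labels(detections, threshold=THRESHOLD):
    """Keep the class labels whose detection probability exceeds the threshold."""
    return {label for label, prob in detections if prob > threshold}


def same_category(labels_a, labels_b):
    """Two images share a category if at least one assigned label overlaps."""
    return bool(labels_a & labels_b)


def split_scores(pairs, detections_by_image):
    """Split human evaluation scores into overlapping / non-overlapping pairs."""
    overlapping, disjoint = [], []
    for img_a, img_b, score in pairs:
        labels_a = image_labels(detections_by_image[img_a])
        labels_b = image_labels(detections_by_image[img_b])
        target = overlapping if same_category(labels_a, labels_b) else disjoint
        target.append(score)
    return overlapping, disjoint


# Toy data (hypothetical): detector outputs and per-pair human scores.
detections_by_image = {
    "img1": [("car", 0.92), ("person", 0.40)],
    "img2": [("car", 0.81), ("dog", 0.75)],
    "img3": [("apple", 0.88)],
}
pairs = [("img1", "img2", 5), ("img1", "img3", 3), ("img2", "img3", 2)]

overlapping, disjoint = split_scores(pairs, detections_by_image)
print(mean(overlapping), mean(disjoint))
```

Averaging the two lists separately yields the kind of per-group scores reported above for pairs with and without overlapping labels.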
However, our model also outperforms the baseline model for pairs without overlapping labels. This may seem more difficult to explain intuitively, as it is hard to see any explicit correlation between, for example, a car and an apple, other than saying that it is somebody’s preference. We manually examined a set of such examples, and frequently found a pattern in which the chosen objects had the same color; for example, a red car and a red apple. It is difficult to attribute this to a specific cause, but it is likely that there is some degree of consistency in user choices across different categories, although to a lesser extent than for images in the same category. It is also confirmed once again that it is better to train with actual user choices made on specific questions than simply with the most conspicuous objects.
E. Additional Figures & Tables
Table 13 shows the contrast between the semantic diversity of captions and that of questions. Figure 7 shows the overall architecture of the image captioning, visual question answering, and visual question generation tasks. Table 14 shows statistics for the crowd-sourcing task on collecting answers to non-visual questions in the VQG dataset. Table 15 shows examples of answers to VQG questions collected via crowd-sourcing. Table 16 shows examples of questions generated using the VQG dataset. Table 17 shows examples of human-written image narratives. Table 18 shows statistics for the collection of human-written image narratives. Table 21 shows the conversion rules for the natural language processing stage of the narrative generation process, as used in Section 3. Table 22 to Table 24 show more examples of image narratives. Table 20 shows examples of questions for user interaction that were generated using our proposed model of combining VQG and VQA, and the baseline of using VQG only. Table 26 shows another example of customized image narratives generated depending on the choices made by the user in response to the question. Table 27 shows examples of how the choices made by the user in response to the question were reflected in new images.
Table 21: Conversion rules for transforming question and answer pairs to declarative sentences.
F. Additional Clarifications
Why were yes/no questions excluded? Yes/no questions are less likely to induce multiple answers. The number of possible choices is limited to 2 in most cases, and the answers rarely correspond well to particular regions.
Failure cases for rule-based conversion: Since both questions and answers are human-written, our conversion rule frequently fails with typos, abridgments, words with multiple POS tags, and grammatically incorrect questions. We either manually modified them or left them as they are.
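To illustrate why such inputs break a rule-based conversion, the following is a minimal sketch of pattern-based rewriting of a question and answer pair into a declarative sentence. The three patterns and the function name are hypothetical and chosen only for illustration; they are not the rules listed in Table 21.

```python
import re

# Hypothetical rules: (question pattern, declarative template).
# These are illustrative only and do not reproduce Table 21.
RULES = [
    (re.compile(r"^what color is the (?P<subj>.+)\?$", re.I),
     "The {subj} is {ans}."),
    (re.compile(r"^where is the (?P<subj>.+)\?$", re.I),
     "The {subj} is {ans}."),
    (re.compile(r"^how many (?P<subj>.+) are there\?$", re.I),
     "There are {ans} {subj}."),
]


def to_declarative(question, answer):
    """Return a declarative sentence, or None if no rule matches."""
    q = question.strip()
    for pattern, template in RULES:
        match = pattern.match(q)
        if match:
            return template.format(subj=match.group("subj"), ans=answer.strip())
    return None  # e.g. typos or ungrammatical questions fall through


print(to_declarative("What color is the cat?", "gray"))
print(to_declarative("Wat color is the cat?", "gray"))
```

In this sketch, a single typo in the question ("Wat") is enough for every pattern to miss, which mirrors the failure mode described above.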
Table 22: More examples of image narratives.
Experiments with different VQA models. The performances of most well-known VQA models currently fall within a relatively tight range. In fact, we tried [8], the state of the art at the time of the experiment, but did not see any noticeable improvement.
Is the attention network retrained to handle sentences? No, but we found that an attention network trained for questions works surprisingly well for sentences. This makes sense, since the key words that provide attention-wise clues are likely limited in number and are hardly ever the inquisitive words.
Table 23: More examples of image narratives.
Why not train with “I don’t know”? We were concerned that answers like “I don’t know” would likely cause overfitting. They would also undermine the creative aspect of the image narrative without adding much to its functional aspect.
Table 24: More examples of image narratives.
Table 25: Each model’s performance on CIDEr with human-written image narratives as ground truths.
Table 26: Examples of image narratives generated depending on the user choices.
Table 27: Examples of image narratives generated on new images, depending on the choices made.