Progress in machine perception and language understanding (e.g., Krizhevsky et al., 2012; Liang et al., 2013) has inspired researchers to work on holistic tasks that interlink both modalities in a complex chain of perception, representation and inference. Examples include grounding (Krishnamurthy and Kollar, 2013), language generation (Karpathy and Fei-Fei, 2014; Donahue et al., 2014), retrieval (Karpathy et al., 2014; Malinowski and Fritz, 2014b), and question answering about images (Malinowski and Fritz, 2014a,c).
Recently, Malinowski and Fritz (2014a) have presented an approach for question answering about images that resembles the famous Turing Test (Turing, 1950), while Malinowski and Fritz (2014c) further discuss some of the associated challenges and issues. In the following, we elaborate on data acquisition, contrast this challenge with other tasks such as grounding and language generation, and highlight properties such as robustness to over-interpretation, which make it hard to cheat on such a test.
Architectures working on a holistic task such as question answering based on images need to deal with a wide range of challenges. In this section, we distill a few prominent ones that require joint reasoning over language and visual inputs. We also argue that holistic architectures can benefit from common sense knowledge. Finally, we discuss challenges in data acquisition and show how the task differs from other well-known tasks.
Vision and language. Scalability: Vision and language systems ground any internal representation in an external world that serves as a common reference point for machines and humans. Human conceptualization divides percepts of this world into different instances and categories as well as spatio-temporal concepts. Architectures that aim at reproducing this space of human concepts need to capture the same diversity and therefore scale up to thousands of concepts.
Concept ambiguity: As the number of categories grows, the semantic boundaries become fuzzier, and hence ambiguities and gradual memberships are inherently introduced. For instance, the difference between ‘night stand’ and ‘cabinet’, or between ‘armchair’, ‘chair’ and ‘sofa’, can be blurry. Such ambiguities are challenging in at least two ways. First, methods need to distinguish fine-grained differences between these objects when appropriate. Second, objective functions and evaluation metrics need to penalize methods gradually, in proportion to the severity of their mistakes.
Ambiguity in reference resolution: The quality of an answer depends on how well ambiguous and latent notions of reference frames and intentions are understood (Malinowski and Fritz, 2014a). Depending on cultural bias and context, we may use object-centric, observer-centric or world-centric frames of reference (Levinson, 2003). Moreover, there is no unified notion of what ‘with’, ‘beneath’ or ‘over’ mean.
Common sense knowledge. Interestingly, some questions can be answered quite reliably with access to common sense knowledge. For instance, the question “Which object on the table is used for cutting?” already narrows down the likely options significantly. Such an example suggests that question-answering architectures would benefit significantly from common sense knowledge.
An ‘object for cutting’ is not a directly visual notion but concerns the affordance of the object, and is therefore a challenging concept to acquire from images alone. On the other hand, co-occurrences in visual data can represent a kind of visual common sense knowledge: very mundane facts or probabilistic relations that are rarely found in common sense knowledge bases.
Annotations. We argue that despite the aforementioned challenges, “question answering about images” has unique advantages over other tasks in terms of data acquisition and task evaluation. In contrast to grounding (Krishnamurthy and Kollar, 2013), annotating images with question-answer pairs does not require detailed annotation of whole scenes in terms of predicates representing objects and their relations. The task is also agnostic to the internal representation of a method. In contrast to language generation (Karpathy and Fei-Fei, 2014; Donahue et al., 2014), the output space of a question answering task is more restricted, and hence the evaluation of different architectures on the task is easier to formulate. In contrast to typical computer vision tasks like object detection (Everingham et al., 2010), architectures are judged solely on the correctness of their answers, not on an internal representation. In contrast to the traditional Turing Test (Turing, 1950), “answering questions about images” is less prone to over-interpretation, in which the human interrogator attributes meaning to machine answers. Hence, a method can be forced to answer to the point rather than “cheating” by giving generic answers or output that is open to interpretation.
Measuring progress on holistic tasks requires identifying their goals. For instance, a suitable metric for “question answering about images” should evaluate architectures based on the answers they produce, not on intermediate results such as detections or logical forms. For a Visual Turing Challenge, we seek a metric that satisfies several properties. The most important are:
Automation: Evaluating answers on tasks as complex as question answering requires a fairly deep understanding of natural language, the concepts involved, and the hidden intentions of the questioner. The ideal but impractical metric would be to manually judge every single answer of every architecture individually. Therefore, we seek an automatic approximation so that we can evaluate different holistic architectures at scale. Malinowski and Fritz (2014a) proposed to restrict the answer space in order to achieve this goal, while leaving the questions unconstrained.
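To make the idea of a restricted answer space concrete, the following is a minimal sketch of an automatic metric, assuming that answers are comma-separated lists of words drawn from a fixed vocabulary and that an answer counts as correct only if it matches the ground truth as a set. The vocabulary `ANSWER_VOCAB` and the functions `normalize` and `accuracy` are hypothetical names introduced for illustration; they are not taken from Malinowski and Fritz (2014a).

```python
from typing import FrozenSet, List

# Hypothetical closed answer vocabulary; the questions themselves stay free-form.
ANSWER_VOCAB = frozenset({"chair", "table", "sofa", "knife", "2", "3", "red"})


def normalize(answer: str) -> FrozenSet[str]:
    """Turn a comma-separated answer string into a set of vocabulary words."""
    words = frozenset(w.strip().lower() for w in answer.split(",") if w.strip())
    unknown = words - ANSWER_VOCAB
    if unknown:
        raise ValueError(f"answer words outside the vocabulary: {sorted(unknown)}")
    return words


def accuracy(predicted: List[str], truth: List[str]) -> float:
    """Fraction of questions whose predicted answer set equals the ground truth."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    if not truth:
        return 0.0
    hits = sum(normalize(p) == normalize(t) for p, t in zip(predicted, truth))
    return hits / len(truth)


if __name__ == "__main__":
    # The first answer matches as a set (order is ignored); the second does not.
    print(accuracy(["chair, table", "knife"], ["table, chair", "sofa"]))
```

Because the answer space is closed, such a comparison needs no human judge. Its weakness is that every mistake is penalized equally, which is what the similarity-based scores discussed under social consensus aim to soften.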
Social consensus: The complex tasks that we are interested in are inherently ambiguous. The ambiguities stem from many factors such as cultural bias, different frames of reference and fine-grained categorization. This implies that multiple interpretations of a question are possible. To deal with different interpretations of words, Malinowski and Fritz (2014a) define WUPS scores using a lexical database (Miller, 1995) with Wu-Palmer similarity (Wu and Palmer, 1994). To deal with different interpretations of a question, Malinowski and Fritz (2014c) suggest that the quality of answers should be measured according to social consensus, where the answers are evaluated against multiple ground truths. Interestingly, such a metric also naturally quantifies the social agreement on an answer and serves as a practical approximation of tedious manual evaluation.
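The two ingredients above can be sketched together under explicit assumptions. The toy taxonomy `PARENT` stands in for a lexical database such as WordNet, `wup` implements the standard Wu-Palmer formula 2 * depth(lcs) / (depth(a) + depth(b)) with the root at depth 1, and `soft_set_score` is one possible set-level soft score in the spirit of WUPS; the exact definition in Malinowski and Fritz (2014a) may differ, for example by applying a threshold. The function `consensus_score` and its two aggregation modes are likewise illustrative assumptions rather than the definitions in Malinowski and Fritz (2014c).

```python
from math import prod
from typing import Dict, List, Optional, Set

# Toy taxonomy (child -> parent); a stand-in for a lexical database such as WordNet.
PARENT: Dict[str, Optional[str]] = {
    "entity": None,
    "furniture": "entity",
    "seat": "furniture",
    "chair": "seat",
    "armchair": "chair",
    "sofa": "seat",
    "cabinet": "furniture",
    "tool": "entity",
    "knife": "tool",
}


def path_to_root(word: str) -> List[str]:
    """Return the chain word, parent, ..., root; its length is the depth of word."""
    path = [word]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path


def wup(a: str, b: str) -> float:
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b)), root depth 1."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    ancestors_b = set(path_b)
    # The first node on a's path that also lies on b's path is the deepest common one.
    lcs = next(node for node in path_a if node in ancestors_b)
    return 2 * len(path_to_root(lcs)) / (len(path_a) + len(path_b))


def soft_set_score(answer: Set[str], truth: Set[str]) -> float:
    """Soft match between two word sets (assumed form, see the lead-in)."""
    if not answer or not truth:
        return 0.0
    forward = prod(max(wup(a, t) for t in truth) for a in answer)
    backward = prod(max(wup(a, t) for a in answer) for t in truth)
    return min(forward, backward)


def consensus_score(answer: Set[str], truths: List[Set[str]], mode: str = "average") -> float:
    """Score one answer against several ground truths (hypothetical aggregation)."""
    if not truths:
        raise ValueError("at least one ground truth is required")
    scores = [soft_set_score(answer, t) for t in truths]
    if mode == "max":
        return max(scores)  # agreement with at least one annotator
    return sum(scores) / len(scores)  # average agreement over annotators


if __name__ == "__main__":
    print(wup("chair", "sofa"))
    print(consensus_score({"chair"}, [{"chair"}, {"sofa"}]))
```

In this toy taxonomy, answering ‘sofa’ where one annotator said ‘chair’ receives partial credit rather than zero, and averaging over annotators rewards answers on which more people agree.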
Experimental scenarios. In many cases, success on challenging learning problems has been accelerated by the use of external data in training. We believe that a Visual Turing Challenge should consist of a sub-task in which the use of auxiliary data is prohibited, in order to understand how holistic learners generalize from limited and challenging data in a more
Donahue, J., Hendricks, L. A., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., and Darrell, T. (2014). Long-term recurrent convolutional networks for visual recognition and description. arXiv:1411.4389.
Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., and Zisserman, A. (2010). The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338.
Karpathy, A. and Fei-Fei, L. (2014). Deep visual-semantic alignments for generating image descriptions. arXiv:1412.2306.
Karpathy, A., Joulin, A., and Fei-Fei, L. (2014). Deep fragment embeddings for bidirectional image sentence mapping. In NIPS.
Krishnamurthy, J. and Kollar, T. (2013). Jointly learning to parse and perceive: Connecting natural language to the physical world. TACL.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In NIPS.
Levinson, S. C. (2003). Space in language and cognition: Explorations in cognitive diversity, volume 5. Cambridge University Press.
Liang, P., Jordan, M. I., and Klein, D. (2013). Learning dependency-based compositional semantics. Computational Linguistics.
Malinowski, M. and Fritz, M. (2014a). A multi-world approach to question answering about real-world scenes based on uncertain input. In NIPS.
Malinowski, M. and Fritz, M. (2014b). A pooling approach to modelling spatial relations for image retrieval and annotation. arXiv:1411.5190.
Malinowski, M. and Fritz, M. (2014c). Towards a visual Turing challenge. In Learning Semantics (NIPS workshop).
Miller, G. A. (1995). WordNet: a lexical database for English. CACM.
Turing, A. M. (1950). Computing machinery and intelligence. Mind, pages 433–460.
Wu, Z. and Palmer, M. (1994). Verbs semantics and lexical selection. In ACL.