The words-as-classifiers (WAC) model of lexical semantics has shown promise as a way to acquire grounded word meanings with limited training data in interactive settings. The model was introduced in Kennington and Schlangen (2015) for grounding words to visual aspects of objects, and Schlangen et al. (2016) showed that it could be generalized to work with any object representation (e.g., a layer in a convolutional neural network) with “real” objects depicted in photographs. The WAC model builds on prior work (Larsson, 2015) treating formal predicates as classifiers to effectively learn and determine class membership of entities (e.g., an object x denoted as “red” in an utterance belongs to the predicate red(x) class). The WAC model has been used to ground into modalities beyond just vision, including simulated robotic hand muscle activations (Moro and Kennington, 2018), and WAC has been used for language understanding in a fluid human-robot interaction task (Hough and Schlangen, 2016). Beyond comprehension tasks, the WAC model has also been used for referring expression generation (Zarrieß and Schlangen, 2017). The WAC classifiers can use any set of features from any modality, and the model is interpretable because each word has its own classifier. Moreover, the WAC model allows for incremental, word-by-word composition, which has implications for interactive dialogue: human users of incremental spoken dialogue systems perceive them as being more natural than non-incremental systems (Aist et al., 2006; Skantze and Schlangen, 2009; Asri et al., 2014).
Though the WAC model has pleasing theoretical properties and practical implications for interactive dialogue and robotic tasks, Li and Boyer (2016) and Emerson and Copestake (2017) point out that WAC treats all words independently, thereby ignoring distributional relations and meaning representations. Moreover, composition using the WAC model has generally amounted to averaging over applications of WAC to objects, a purely intersectional approach to semantic composition, which has been shown to fail in many cases (Kamp, 1975). As is well known, linguistic structures are compositional: simple elements can be functionally combined into more complex elements (Frege, 1892). Composition, which is an important aspect of grounding into visual or other modalities, is itself a process that is a function of how a particular model of lexical semantics is represented.1
Figure 1: Left: connotational compose-then-apply; Right: denotational apply-then-compose.
The goal of this paper is (1) to explore and understand possible approaches to composition for WAC, and (2) having determined the best WAC model, to extend WAC by using its classifier coefficients as embedding vectors in a semantic similarity task. At the outset, following Schlangen et al. (2016), we make an important distinction by identifying two composition types. In any task, composition can be handled at the level of connotation, where words are composed then applied in the task (e.g., word embeddings are generally composed by tensor operations such as summation before they are used in a task, such as sentiment classification), or at the level of denotation, where words are applied then composed in a task (e.g., WAC classifiers are applied to object representations, then the results of those applications are composed, such as in reference resolution to visual objects). These differences in the composition process are depicted in Figure 1, which distinguishes between application to a task and composition.
More specifically, we alter the WAC model to leverage several classifier types, namely logistic regression, multi-layer perceptrons, and decision trees, which allow us to use both connotative and denotative composition strategies and to benefit from the different classifiers, for example by grafting decision trees together and by making use of the hidden layers in the multi-layer perceptrons (explained in Section 4). Our evaluations show that the choice of classifier and the composition strategies that those classifiers afford yield varied, yet comparable, results in a visual reference resolution task (Section 5). Our analyses of the multi-layer perceptron model show that the coefficients of the neurons in the hidden layers have properties that are similar to distributional embedding approaches to lexical semantics (Section 5.3), which we evaluate in Experiment 3. Finally, we conclude and explain our plans for future work.
Emerson and Copestake (2017) offer a comprehensive review and discussion on different approaches to compositionality including formal frameworks, tensor-based composition, and syntactic dependencies, to which we refer the reader. Their own model’s application of composition uses distributional semantics and a probabilistic graphical model.
Recursive neural networks can arguably encode compositional processes directly, but it has been shown recently in Lake and Baroni (2017) that they fail in a proof-of-concept neural machine translation task. The authors suggest a lack of systematicity in how these networks learn the composition process. Moreover, Socher et al. (2014) reported successful application of composition using recursive neural models, but their approach made use of syntactic information from dependency parses; i.e., the composition was likely guided by the syntactic representation rather than the neural network. Yu et al. (2018) introduce MAttNet, which leverages recursive networks to obtain high results on a referring expression task using the same data we use for our experiments. In contrast with this work, our primary purpose is not to achieve state-of-the-art results, but rather a systematic check of composition strategies.
Comparable to their work and ours here is Paperno et al. (2014), which attempted distributional composition using linguistic information similar to dependency parses. These approaches and tasks generally apply a connotative strategy of composition, where the final task is applied after combining meaning representations of the parts to first arrive at a single meaning representation of a sentence. Our work also compares to Wu et al. (2019), which proposed a model that incorporates grounded and distributional information, which we explore in Experiment 3.
We extend prior work to mitigate WAC’s shortcomings by modeling several different denotative and connotative strategies made possible by the machinery of our chosen classifiers, and we use the resulting classifier coefficients as a possible way to bring WAC into a distributional space.2
Figure 2: Example of refCOCO image with three referring expressions (one on each line) made to an image region (see Footnote 3 for source).
To give context for understanding our model and composition approaches, we first explain the data that we use in our experiments. We use the same dataset as described in Schlangen et al. (2016), the “Microsoft Common Objects in Context” (MSCOCO) collection (Lin et al., 2014), which contains over 300k images with object segmentations, object labels, and image captions, augmented by Mao et al. (2016) to add English referring expressions to image regions (i.e., refCOCO). The average length of the referring expressions is 8.3 tokens and the average number of image regions (which we treat as reference candidates in our experiments) is 8. We used the data access interface provided by Kazemzadeh et al. (2014), which included a defined training/validation/test split of the data.3 An example image, image region, and three referring expressions to that image region are depicted in Figure 2.
The WAC approach to lexical semantics is essentially a task-independent approach to predicting semantic appropriateness of words in physical contexts. The WAC model pairs each word w in its vocabulary V with a classifier that maps the real-valued features x of an entity ent to a semantic appropriateness (i.e., class membership) score.
For example, to learn the connotative meaning of the word red, the low-level features (e.g., visual) of all objects referred to with the word red in a corpus of referring expressions are given as positive instances to a supervised learning classifier. Negative instances are randomly sampled from the complementary set of referring expressions (i.e., not containing the word red). This results in a trained classifier for red that can be applied to an object to determine class membership.
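The training scheme just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in our experiments: it assumes scikit-learn's `LogisticRegression` as the classifier, uses synthetic 4-dimensional vectors in place of real object features, and the function name `train_word_classifier`, the `corpus` layout, and the `neg_per_pos` argument are all hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def train_word_classifier(word, corpus, neg_per_pos=1, seed=0):
    """Train one binary classifier for `word`.

    `corpus` is a list of (tokens, features) pairs: the tokens of a referring
    expression and the feature vector of the object it refers to.
    """
    rng = np.random.default_rng(seed)
    # Positive instances: objects referred to with the word.
    positives = [x for tokens, x in corpus if word in tokens]
    # Negative instances: sampled from expressions that do not contain it.
    candidates = [x for tokens, x in corpus if word not in tokens]
    n_neg = min(len(candidates), neg_per_pos * len(positives))
    picked = rng.choice(len(candidates), size=n_neg, replace=False)
    negatives = [candidates[i] for i in picked]
    X = np.vstack(positives + negatives)
    y = np.array([1] * len(positives) + [0] * len(negatives))
    return LogisticRegression(max_iter=1000).fit(X, y)


# Synthetic stand-in data (an assumption): "red" objects have a large first feature.
rng = np.random.default_rng(1)
shift = np.array([3.0, 0.0, 0.0, 0.0])
corpus = []
for _ in range(50):
    corpus.append((["red", "ball"], rng.normal(size=4) + shift))
    corpus.append((["blue", "ball"], rng.normal(size=4) - shift))

red = train_word_classifier("red", corpus)
# Column 1 of predict_proba is the probability of the positive class.
p_red = red.predict_proba(np.array([[3.0, 0.0, 0.0, 0.0]]))[0, 1]
p_not = red.predict_proba(np.array([[-3.0, 0.0, 0.0, 0.0]]))[0, 1]
```

Here `p_red` is the fitness of a red-like object for the word red and `p_not` that of a non-red object; the same function would be called once per vocabulary word.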
4.1 Approaches to Composition with WAC
Traditionally, the WAC model has been applied using independent linear classifiers, such as logistic regression. In this paper, we expand upon the previous work by conducting experiments with a variety of classifiers, such as multi-layer perceptrons (MLP) and decision trees (DT), in addition to logistic regression (LR). We use these classifiers so as to not make assumptions of linearity (i.e., MLP) and to use the available machinery that MLPs and DTs afford us for exploring composition strategies. We explain below each approach and the methods of composition for each. It is important to note that in our explanations and discussions of composition, we are focusing on application of the classifiers; all classifiers are trained as explained above (i.e., by pairing words with positive and negative examples). We leave more specific explanation of training to Section 5.
4.1.1 Logistic Regression
LR summed-predictions The traditional approach to composition using WAC uses a denotative strategy where each WAC classifier in a referring expression is applied to an entity (e.g., a visually present object in a scene), which yields a probability (i.e., a score of fitness) for each word applied to each object. Following Schlangen et al. (2016), the resulting probabilities can be combined in several ways including summing, averaging, or multiplication (we opt for summing in our experiments) to produce a single overall expression-level fitness score for each object. This operation constitutes the composition of the referring expression into a single distribution (hence, denotative: each word classifier is applied to the objects, then the resulting probabilities are composed into a single distribution over candidate objects). The object with the highest score is the hypothesized referred target.
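A minimal sketch of this apply-then-compose procedure is given below. The two word classifiers are hypothetical hand-set linear models over two toy features (redness, leftness), standing in for trained WAC classifiers; the function name `resolve_by_summed_predictions` is likewise only illustrative.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def resolve_by_summed_predictions(expression, classifiers, objects):
    """Apply each known word's classifier to every object, then sum.

    `classifiers` maps a word to a function from an (n_objects, n_features)
    array to n_objects probabilities; words without a classifier are skipped.
    Returns the index of the best object and the per-object scores.
    """
    scores = np.zeros(len(objects))
    for word in expression.split():
        if word in classifiers:
            scores += classifiers[word](objects)
    return int(np.argmax(scores)), scores


# Hypothetical classifiers over toy features [redness, leftness].
classifiers = {
    "red": lambda X: sigmoid(X @ np.array([4.0, 0.0]) - 2.0),
    "left": lambda X: sigmoid(X @ np.array([0.0, 4.0]) - 2.0),
}
objects = np.array([
    [0.9, 0.1],  # red, on the right
    [0.9, 0.9],  # red, on the left
    [0.1, 0.9],  # not red, on the left
])
target, scores = resolve_by_summed_predictions(
    "the red one on the left", classifiers, objects)
```

Swapping `scores +=` for a running product or dividing by the number of applied words would give the multiplication and averaging variants mentioned above.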
4.1.2 Multi-layer Perceptrons
We explain in this section how we leverage MLPs for three different approaches to composition. Unlike LR, the MLP does not need to assume linearity, and its structure allows us greater flexibility for compositional techniques.
MLP summed-predictions For completeness and direct comparison to LR, we leverage MLPs as we did with the LR denotative approach by simply using an MLP (i.e., using a single hidden layer of 3 neurons) in place of an LR classifier for each word. Composition is applied after application by summing the resulting probabilities.
Adjectives and Nouns Following Baroni and Zamparelli (2010), we looked at adjective-noun pairs, which are two-word compositions that can identify how an adjective qualifies a noun. For example, instead of treating large, green and tree separately in the phrase the large green tree next to the lake, we use the adjectives large and green to modify the noun tree. We explore two approaches to composing adj-noun pairs using MLPs:
MLP adj-noun extended hidden layers: if there exist multiple adjectives for one noun, a separate adj-noun pair is created for each adjective preceding the noun. When predicting, the MLP classifiers for each word in every adj-noun pair in the referring expression are merged together into one classifier by extending the hidden layer to include neurons from both the adjective and the noun in a single MLP. The coefficients of the top layer (i.e., a binary sigmoid) of the original adj and noun MLPs are averaged together to produce a single probability. The rest of the phrase is then composed normally using the traditional WAC methodology. This approach effectively leverages a connotative strategy to compose the adj-noun pairs, then a denotative strategy to compose the rest of the expression.4
Figure 3: Example of decision tree graft of brown with dog: the root node of dog is grafted into the leaf nodes that would output a “true” classification for brown. This composes the noun-phrase “brown dog” (the dog classifier augments the brown classifier).
MLP adj-noun warm start: MLP machinery also affords the use of the warm-start feature for the MLPs. In this approach, we take the noun’s classifier in the adj-noun pair, which has already been trained, and then continue to train the classifier (i.e., using the warm-start functionality) with data that was used to train the adjective’s classifier. This results in a single classifier that theoretically represents the entire adj-noun pair. This approach keeps the classifiers the same size, while leveraging a transfer-style learning approach to produce an adj-noun pair classifier that is composed of its constituent words.
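The warm-start idea can be illustrated with scikit-learn's `MLPClassifier`, whose `warm_start=True` option makes a second call to `fit` continue from the current weights. The sketch below is an assumption-laden toy: the data are synthetic, the helper `toy_data` is hypothetical, and `max_iter` and `n_iter_no_change` are set only so that both fits run for the same fixed number of epochs.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

# Both fits deliberately run to max_iter, so silence the resulting warning.
warnings.filterwarnings("ignore", category=ConvergenceWarning)

rng = np.random.default_rng(0)


def toy_data(feature_index, n=200, dim=5):
    """Synthetic examples whose label depends on one feature only."""
    X = rng.normal(size=(n, dim))
    y = (X[:, feature_index] > 0).astype(int)
    return X, y


X_noun, y_noun = toy_data(feature_index=0)  # stands in for the noun's data
X_adj, y_adj = toy_data(feature_index=1)    # stands in for the adjective's data

# warm_start=True: a later fit() reuses the learned weights as its starting
# point. The large n_iter_no_change keeps each fit running for max_iter epochs.
adj_noun = MLPClassifier(hidden_layer_sizes=(3,), activation="tanh",
                         solver="adam", max_iter=200, n_iter_no_change=10_000,
                         warm_start=True, random_state=0)
adj_noun.fit(X_noun, y_noun)                 # train the noun classifier
noun_weights = [w.copy() for w in adj_noun.coefs_]

adj_noun.fit(X_adj, y_adj)                   # continue on the adjective's data
```

After the second call, `adj_noun` has the same architecture as the original noun classifier but weights that have been shaped by both training sets.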
MLP extended hidden layers For this approach, we generalize the MLP adj-noun extended hidden layers approach and apply it to the entire expression by concatenating the neurons in each word MLP’s hidden layer and averaging the coefficients of the top layer to form a single, composed MLP that has a single hidden layer with nodes that can determine fitness between objects and all words in an expression.
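One possible reading of this construction is sketched below in NumPy. The per-word MLPs are random stand-ins (the helper `random_word_mlp` is hypothetical), and interpreting "averaging the coefficients of the top layer" as scaling each word's output weights and bias by one over the number of words is an assumption; under that reading the composed MLP outputs the sigmoid of the mean of the per-word logits.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def random_word_mlp(rng, n_features, n_hidden=3):
    """Hypothetical stand-in for a trained single-hidden-layer word MLP."""
    return {
        "W1": rng.normal(size=(n_features, n_hidden)),
        "b1": rng.normal(size=n_hidden),
        "w2": rng.normal(size=n_hidden),
        "b2": rng.normal(),
    }


def forward(mlp, X):
    """tanh hidden layer followed by a binary sigmoid output."""
    hidden = np.tanh(X @ mlp["W1"] + mlp["b1"])
    return sigmoid(hidden @ mlp["w2"] + mlp["b2"])


def extend_hidden_layers(mlps):
    """Compose word MLPs into one MLP with a wider hidden layer.

    Hidden neurons are concatenated unchanged; the output weights and bias
    are averaged by scaling each word's share by 1 / number of words.
    """
    k = len(mlps)
    return {
        "W1": np.hstack([m["W1"] for m in mlps]),
        "b1": np.concatenate([m["b1"] for m in mlps]),
        "w2": np.concatenate([m["w2"] for m in mlps]) / k,
        "b2": sum(m["b2"] for m in mlps) / k,
    }


rng = np.random.default_rng(0)
words = [random_word_mlp(rng, n_features=6) for _ in range(3)]
composed = extend_hidden_layers(words)     # one MLP with 9 hidden neurons
objects = rng.normal(size=(4, 6))
scores = forward(composed, objects)        # one fitness score per object
```

Because composition happens on the parameters before any object is seen, this is a compose-then-apply strategy: the composed MLP is applied once per candidate object.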
4.1.3 Decision Trees
We opted to apply DTs because of their readable internal structure and how the branching system lends itself to intuitive composition strategies.
DT summed-predictions For completeness and direct comparison, we leverage DTs as we did with the LR and MLP denotative approaches by simply using a DT classifier for each word. Composition happens after application by summing the resulting predicted probabilities.
DT grafting In this approach to composition, we leverage the DT representation by training the classifier as usual (i.e., independently), but then to compose the classifiers into a single classifier, we “graft” the tree of each classifier in a referring expression onto the two most probable “true” leaf nodes of the classifier for the preceding word, beginning with the classifier for the first word as the starting tree with the root node. This grafting process of two words is depicted in Figure 3, where the root node of the decision tree for the word dog is grafted into the true leaf nodes of the decision tree classifier for the word brown, thus allowing each word to make a contribution to the final decision of expression-level fitness to objects. This grafting process is repeated for each word in the referring expression, thus allowing us to apply then compose an entire phrase.
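The grafting operation can be sketched with a small hand-rolled tree structure. Everything here is illustrative: the `Node` class, the `graft` function, and the toy thresholds are hypothetical, the base tree is assumed to have at least one split, and the toy example grafts onto a single leaf because its depth-1 tree has only one "true" leaf.

```python
import copy
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Binary decision-tree node; a leaf has no children and stores p_true."""
    feature: Optional[int] = None
    threshold: float = 0.0
    left: Optional["Node"] = None    # taken when x[feature] <= threshold
    right: Optional["Node"] = None   # taken when x[feature] > threshold
    p_true: float = 0.0


def is_leaf(node):
    return node.left is None and node.right is None


def predict(node, x):
    """Walk from the root to a leaf and return its probability of 'true'."""
    while not is_leaf(node):
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.p_true


def collect_leaf_slots(node, slots):
    """Record (parent, side, leaf) for every leaf below `node`."""
    for side in ("left", "right"):
        child = getattr(node, side)
        if is_leaf(child):
            slots.append((node, side, child))
        else:
            collect_leaf_slots(child, slots)
    return slots


def graft(base, scion, n_leaves=2):
    """Copy `base` and replace its most probable 'true' leaves with `scion`."""
    tree = copy.deepcopy(base)
    slots = collect_leaf_slots(tree, [])
    slots.sort(key=lambda slot: slot[2].p_true, reverse=True)
    for parent, side, _ in slots[:n_leaves]:
        setattr(parent, side, copy.deepcopy(scion))
    return tree


# Toy trees over features [brownness, dogness]; thresholds are made up.
brown = Node(feature=0, threshold=0.5,
             left=Node(p_true=0.1), right=Node(p_true=0.9))
dog = Node(feature=1, threshold=0.5,
           left=Node(p_true=0.2), right=Node(p_true=0.8))
brown_dog = graft(brown, dog, n_leaves=1)
```

In the grafted tree, an object that fails the brown test keeps brown's low score, while an object that passes it is handed on to the dog tree, so both words contribute to the final decision.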
4.2 Composing Relational Expressions
For this final model, we follow and extend Kennington and Schlangen (2015)’s approach to relational expressions (e.g., containing prepositions such as next to or above) by expressing those relational phrases (r) as a fitness for pairs of objects, whereas for non-relational words WAC only learns fitness scores for single objects.
However, in prior work, the two candidate objects were known. In our model, we only assume that the final referred target object is known, but the relative object is not. To handle this, we interpret relational phrases r containing prepositional words as “transitions” from one noun phrase (the target noun phrase) to the relative noun phrase. For example, in the phrase the woman to the right of the tree, we first identify the relational phrase(s) (in this case right of), and we use this relational phrase to make the most likely transition from the target noun phrase (the woman) to the relative noun phrase (the tree). We do this by using a learned WAC model (i.e., trained on single objects using any of the approaches explained above) and applying the relative noun phrase to all candidate objects in the scene. This produces a distribution over all objects (i.e., a partially observable relative object); we use the resulting argmax of that distribution to arrive at the object that was referred to by the relative noun phrase. We then take the difference in features (i.e., simple vector subtraction) between the feature vectors for the target object and the relative object, resulting in a feature vector for r that is the same size as for all other WAC classifiers. During application, we find the most likely pair by applying both noun phrases to all objects, then r to all pairs of objects, and force identity on the target noun phrase distribution and the candidate target object, as well as the relative noun phrase distribution and the candidate relative object, forming a trellis-like structure. The combined (i.e., product of) probabilities for the target noun phrase, r, and the relative noun phrase result in a final distribution; we take the argmax probability and the resulting target object as the target referred object.
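The application step for relational expressions can be sketched as a search over ordered object pairs. The sketch below is a toy under stated assumptions: the three classifiers are hand-set functions over made-up features (womanness, treeness, x position), the name `resolve_relational` is hypothetical, and excluding pairs of an object with itself is our illustrative choice rather than something specified above.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def resolve_relational(p_target_np, p_relation, p_relative_np, objects):
    """Score every ordered pair (target, relative) and return the best pair.

    p_target_np / p_relative_np map an (n, d) feature array to n probabilities;
    p_relation maps an (m, d) array of feature differences to m probabilities.
    """
    n = len(objects)
    target_scores = p_target_np(objects)        # shape (n,)
    relative_scores = p_relative_np(objects)    # shape (n,)
    # diffs[i, j] = objects[i] - objects[j] for every ordered pair.
    diffs = objects[:, None, :] - objects[None, :, :]
    relation_scores = p_relation(diffs.reshape(n * n, -1)).reshape(n, n)
    # Product of the three probabilities for each candidate pair.
    joint = target_scores[:, None] * relation_scores * relative_scores[None, :]
    np.fill_diagonal(joint, 0.0)                # assumption: no self-pairs
    target, relative = np.unravel_index(np.argmax(joint), joint.shape)
    return int(target), int(relative), joint


# Toy features: [womanness, treeness, x position]; all weights are made up.
objects = np.array([
    [0.9, 0.0, 0.8],   # woman on the right
    [0.9, 0.0, 0.1],   # woman on the left
    [0.0, 0.9, 0.5],   # tree in the middle
])
woman = lambda X: sigmoid(6.0 * X[:, 0] - 3.0)
tree = lambda X: sigmoid(6.0 * X[:, 1] - 3.0)
right_of = lambda D: sigmoid(8.0 * D[:, 2])     # positive x difference
target, relative, joint = resolve_relational(woman, right_of, tree, objects)
```

For the woman to the right of the tree, the pair with the highest joint probability is (object 0, object 2), so object 0 is returned as the referred target.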
5.1 Experiment 1: Simple Expressions
In this experiment, we evaluate the performance of our approaches to composition of simple (i.e., containing no relational words) referring expression resolution using the refCOCO data.
Task & Procedure The task is reference resolution to objects depicted in static images. For example, the referring expression the woman in red sitting on the left would require composition of each word (with the exception of the quantifier the which signals that the referring expression should only identify a single entity). We task the WAC model (and varying compositional approaches, as explained above) to produce a “distribution” (i.e., ranked scores) of each object in an image. We only considered referring expressions that did not contain words included in this list: below, above, between, not, behind, under, underneath, front of, right of, left of, ontop of, next to, middle of (as was done in Schlangen et al. (2016)) resulting in simpler referring expressions. This resulted in 106,336 training instances and 9,304 test instances (i.e., individual referring expressions and the corresponding image with multiple object regions). In Experiment 2 we consider all referring expressions, including those with relations.
Training the WAC model follows prior work: the classifier for each word in the refCOCO training data is trained on all instances where that word was used to refer to an object and on randomly chosen negative examples sampled from the training data where that particular word was not used in a referring expression, at a 5-to-1 negative-to-positive example ratio. For each object in the images, following Schlangen et al. (2016), we used the image region information from the annotated data and extracted that region as a separate image that we then passed through GoogLeNet (Szegedy et al., 2015), a convolutional neural network that was trained on data from the Large Scale Visual Recognition Challenge 2014 (ILSVRC2014) from the ImageNet corpus (Deng et al., 2009). That is, GoogLeNet is optimized to recognize single objects within an image, which it can do effectively with our extracted image regions. This results in a vector representation of the object (i.e., region) with 1024 dimensions (i.e., we use the layer directly below the predictions layer).5 We concatenate to this vector 7 additional features that give information about the image region, resulting in a vector of 1031 dimensions: the coordinates of two corners (relative to the full image from which the region was extracted), its (relative) area, distance to the center, and orientation of the image.
To ensure that our WAC models were made up of reliable classifiers, we discarded words that had 4 or fewer positive training examples, resulting in an English vocabulary of 2,349 words. For training the LR WAC model, we used L1 normalization. For training the DT WAC model, we used the Gini splitting criterion with a maximum depth of 2. For training the MLP WAC model, we used a multi-layer perceptron with a single hidden layer of 3 neurons using the tanh activation function and a binary sigmoid top layer, trained using the adam solver (alpha value of 0.1) for a maximum of 2000 epochs. In all cases, the best hyperparameters were found using the development data.
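For readers who want to reproduce a comparable setup, the settings above could be expressed in scikit-learn roughly as follows. This is a sketch, not a record of our code: the factory name `make_word_classifier` is hypothetical, reading the LR setting as an L1 penalty (with the `liblinear` solver, which supports it) is an assumption, and all unlisted parameters are left at library defaults.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier


def make_word_classifier(kind):
    """Return an untrained per-word classifier of the requested kind."""
    if kind == "lr":
        # Assumption: the L1 setting is modelled as an L1 penalty.
        return LogisticRegression(penalty="l1", solver="liblinear")
    if kind == "dt":
        # Gini splitting criterion, trees limited to depth 2.
        return DecisionTreeClassifier(criterion="gini", max_depth=2)
    if kind == "mlp":
        # One hidden layer of 3 tanh neurons; for the adam solver, max_iter
        # is the maximum number of epochs. The output unit for a binary
        # problem is a logistic sigmoid.
        return MLPClassifier(hidden_layer_sizes=(3,), activation="tanh",
                             solver="adam", alpha=0.1, max_iter=2000)
    raise ValueError(f"unknown classifier kind: {kind}")


lr, dt, mlp = (make_word_classifier(k) for k in ("lr", "dt", "mlp"))
```

One such object would be created and fitted per vocabulary word, using that word's positive and sampled negative examples.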
Metrics The metric we use for this task is accuracy: the proportion of referring expressions for which the highest scoring object among the candidate objects in an image, as ranked by our model, matches the annotated target object referred to by the referring expression. For example, in Figure 2, there are several annotated objects including the two people, the trees, car, ground, etc., but the referring expression woman on the right in white shirt should uniquely identify the target object in the annotated image region. Our target is 0.64, the result reported in Schlangen et al. (2016).
Results The results for this experiment are in the Exp. 1 acc column of Table 1.6 We see that in all cases, the denotative summed-predictions for all three classifiers yield respectable performance (the first row verifies results reported in Schlangen et al. (2016) for this task). When considering connotative approaches, the results are more nuanced: for the MLP adj-noun approaches, we see similar results when the hidden layers are extended and when we use warm-start. This is a positive result in that composing adj-noun pairs together using warm-start is theoretically appealing because a single classifier can perform the work of two, though it needs additional training data to perform that function, whereas extending the hidden layers requires no additional retraining, and the compose-then-apply nature of it makes it more appealing than simply summing the predictions of each word applied to the objects, as has been done in prior work. Unfortunately, when extending the hidden layers to contain the neurons from the MLP classifiers for all words in the expression, the results take a hit when compared to the denotative summed predictions approach. The DT classifiers did not perform well in this particular task (both the denotative summed predictions and the grafting), which is somewhat surprising, though as can be seen in Figure 3, by limiting the depth, the DT classifiers were unable to make use of the nuance in the individual features.
Table 1: Experiment 1 & 2 results: accuracy scores for the models and composition approaches using the refCOCO data.
5.2 Experiment 2: Referring Expressions with Relations
For this experiment, we apply our approaches of composition to all of the refCOCO data, including cases where relations are present.
Task & Procedure The task and procedure for this experiment are similar to those of Experiment 1, with the important difference that we consider all of the training and test instances (120,266 and 9,914, respectively); we do not ignore referring expressions that have relations, for example the woman in red sitting on the left next to the tree.
Metrics The metric for this experiment is the same as in Experiment 1: accuracy, i.e., how often the highest scoring object as ranked by our model matches the annotated target object referred to by the referring expression.
Results Table 1, column Exp. 2 acc contains the results for this experiment. As expected, nearly all approaches took a hit because they were unable to handle the composition of multiple noun phrases and learn the semantics of relational words. The best performing model, as expected, is the MLP relational model, which applies the extended hidden layer to individual noun phrases but composes them through the relational word using summed predictions (the results are statistically significant compared to the LR summed-preds row result for Exp 2). The final relational model, which uses the extended hidden layers approach for each noun phrase, applies more principled connotative composition and works with relational expressions without requiring annotation of the relative object.
5.3 Analyses
We end this experiment by offering some analysis on what the WAC classifiers are learning and compare some of the composition strategies with each other. We put focus on the MLP classifier since it produced the best results and allows more possible composition strategies.
Individual Classifiers Similar to Kennington and Schlangen (2015), we looked at classifiers that, we assume, learned features relating to colors. To check this, we iterated over a wide range of colors, producing an image for each color, and passed those through GoogLeNet to arrive at a representation for each color that our model could use. Our analyses show that the classifiers learned prototypical colors for objects as well as color terms. For example, we passed all colors through the water classifier and plotted the resulting probabilities, resulting in Figure 4. This shows that individual classifiers can pick up on color information for words whose referents have color as an important visual property (e.g., water), though that information is not apparent in the underlying representation (i.e., the GoogLeNet vector).
Figure 4: Probabilities returned by passing color images through the water classifier from the red to the violet range.
This is an important result with useful implications: this kind of transfer learning using GoogLeNet (or any other model) trained on the ImageNet data extends the original limited ability of those networks by using the entire vocabulary to identify not only object types, but also attributes (e.g., colors, sizes, spatial placements, etc.). Importantly for interactive dialogue with robots, the WAC approach allows new vocabulary to be added word-by-word without requiring retraining of the entire underlying network.
Figure 5: Cluster proposals of MLP classifier hidden layer node coefficients.
Semantic Clusters To see if WAC could also yield semantic clusters, we applied a t-distributed Stochastic Neighbor Embedding (TSNE) (van der Maaten and Hinton, 2008) to the hidden layers of the MLP classifiers (i.e., 3021 dimensions, as each hidden layer had 3 neurons, each neuron had coefficients for 1007 features), mapping the coefficient vectors to 2 dimensions, then applied the Density-based Spatial Clustering of Applications with Noise (dbscan; eps = 0.7) (Schubert et al., 2017) to cluster the TSNE result, which is depicted in Figure 5.7 The figure shows a large, central cluster with more informative clusters closer to the edges. The following are several noteworthy clusters:
• yellow, red, green, blue, light, board
• area, between, of, above, edge, next to, right
• door, oven, stove, dishwasher, knobs
• trailer, semi, rig, car, vehicle, van, ups, suv, taxi
• cat, dog, horse, cow, sheep, animal
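The projection-and-clustering pipeline behind such clusters can be sketched with scikit-learn's `TSNE` and `DBSCAN`. This is a minimal illustration under assumptions: the coefficient vectors are random stand-ins for flattened hidden-layer coefficients, the function name `cluster_word_vectors` is hypothetical, and the t-SNE settings (`perplexity`, `init`, `random_state`) are chosen only to make the toy run; only `eps = 0.7` comes from the text.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.manifold import TSNE


def cluster_word_vectors(vectors, eps=0.7, perplexity=5.0, seed=0):
    """Project coefficient vectors to 2-D with t-SNE, then cluster with DBSCAN.

    Returns the 2-D points and one cluster label per word (-1 marks noise).
    """
    points = TSNE(n_components=2, perplexity=perplexity, init="random",
                  random_state=seed).fit_transform(vectors)
    labels = DBSCAN(eps=eps).fit_predict(points)
    return points, labels


# Random stand-ins for flattened hidden-layer coefficients
# (3 neurons x 1007 features = 3021 dimensions per word).
rng = np.random.default_rng(0)
words = [f"word{i}" for i in range(30)]
vectors = rng.normal(size=(len(words), 3 * 1007))
points, labels = cluster_word_vectors(vectors)
```

With trained scikit-learn MLPs, each word's vector could be obtained by flattening the first entry of the classifier's `coefs_` attribute; words sharing a non-negative label would then form one proposed cluster.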
This analysis informs us that using the MLP classifier is not only more satisfying for composition approaches, but it can also yield coefficients that can be used as vectors for word embeddings. Our final experiment tests this hypothesis.
5.4 Experiment 3: WAC Coefficients for Semantic Similarity
In this experiment, we evaluate the vectors that are derived from the coefficients of the MLP variant of WAC in a semantic similarity task. We follow a similar approach to Kottur et al. (2016) by creating visually grounded word embeddings, but we make use of the WAC model and we create our own dataset for this experiment.
Task & Procedure We evaluated the WAC coefficient vectors using the semantic similarity tasks WordSim-353 (Finkelstein et al., 2002) and SimLex-999 (Hill et al., 2015) by comparing the WAC vectors with GloVe embeddings (Pennington et al., 2014) (6B, 200d), and a concatenation of the vectors for each word in the two models. To arrive at a model for WAC with a large enough vocabulary, we retrieved 100 images (i.e., using Google Image Search) for each of 30,000 common English words and trained the MLP classifiers as described in Experiment 1 above using 100 images for each word.
Metrics For semantic similarity, we report a Spearman Correlation between the cosine similarities of the WAC coefficient vector, the GloVe embeddings, and the combination of the two models. High numbers denote better scores.
Results The results of the Spearman Correlation are shown in Table 2. The results show that, when combined with embeddings trained on large amounts of text such as GloVe, the WAC coefficient vectors can contribute useful information derived from the visual features that the GloVe embeddings do not have. We conjecture that WAC does not work well on its own because the words are trained independently, and the WAC model assumes that words trained on visual representations are concrete, whereas many words are abstract, the semantics of which is learned in lexical context, as in distributional embeddings. These results show that WAC is a potential model for bridging grounded and distributional approaches to lexical semantics, which we will explore further in future work.
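The evaluation procedure can be sketched as follows, assuming SciPy's `spearmanr` for the rank correlation. The vectors, word pairs, and human ratings below are tiny made-up stand-ins for the WAC coefficients, GloVe embeddings, and a similarity benchmark; the function name `evaluate_similarity` is hypothetical, and plain concatenation without rescaling is an assumption.

```python
import numpy as np
from scipy.stats import spearmanr


def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def evaluate_similarity(pairs, human_scores, embed):
    """Spearman correlation between model cosine similarities and human ratings."""
    model_scores = [cosine(embed(a), embed(b)) for a, b in pairs]
    rho, _ = spearmanr(model_scores, human_scores)
    return rho


# Made-up vectors standing in for WAC coefficients and GloVe embeddings.
wac = {"cat": np.array([1.0, 0.0]), "dog": np.array([0.9, 0.1]),
       "car": np.array([0.0, 1.0]), "van": np.array([0.1, 0.9])}
glove = {"cat": np.array([0.2, 1.0, 0.0]), "dog": np.array([0.3, 0.9, 0.1]),
         "car": np.array([1.0, 0.0, 0.2]), "van": np.array([0.9, 0.1, 0.3])}
pairs = [("cat", "dog"), ("car", "van"), ("cat", "car"), ("dog", "van")]
human = [9.0, 8.5, 1.0, 2.0]   # made-up similarity ratings


def combined(word):
    """Concatenate the two vectors for a word into one embedding."""
    return np.concatenate([wac[word], glove[word]])


rho_wac = evaluate_similarity(pairs, human, lambda w: wac[w])
rho_glove = evaluate_similarity(pairs, human, lambda w: glove[w])
rho_combined = evaluate_similarity(pairs, human, combined)
```

The three correlations correspond to the three conditions compared in Table 2; only word pairs covered by both vocabularies can be scored in the combined condition.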
Table 2: Experiment 3 results: Spearman Correlation between the cosine similarities of WAC, GloVe, and the two combined.
In this paper, we explored using LR, MLP, and DT for composing the words-as-classifiers approach to grounded lexical semantics. We evaluated several methods, including denotative composition, where the classifiers are applied to objects and the resulting probabilities are then summed, and connotative composition, where the composition is applied first (i.e., either by extending hidden layers or using warm start in MLP, or by grafting DT). Our results show that the choice of classifier affects results, and the kinds of composition that can be accomplished vary depending on the classifier. We concur with Baroni (2019) that more exploration needs to be done in this area to learn the systematic functional applications of composition of language; our results are an important step in that direction.
Furthermore, we showed that the WAC model has properties that lend themselves to a straightforward embedding by using the classifier coefficients. We showed in Experiment 3 that the embeddings show promise in a simple word similarity task. We are in the process of evaluating the WAC embeddings on other common tasks and analyzing the efficacy of WAC embeddings by combining them with other approaches to semantics.
Combined with prior work, the results reported here have implications for tasks and settings for speech interaction: WAC can be trained with minimal training data, can be used in dialogue systems where words ground to multiple modalities (e.g., vision, proprioception, predicted emotions, etc.), and can be composed incrementally, and the coefficients of the classifiers can be unified with other distributional representations.
For future work, we will explore how WAC can be coupled with formal and distributional semantic representations to better exploit and integrate knowledge from multiple modalities for a more holistic representation of semantics. We are also evaluating WAC in an interactive language learning task with two different robot platforms.
Gregory Aist, James Allen, Ellen Campana, Lucian Galescu, Carlos Gallo, Scott Stoness, Mary Swift, and Michael Tanenhaus. 2006. Software architectures for incremental understanding of human speech. In Proceedings of CSLP, pages 1922–1925.
Layla El Asri, Romain Laroche, Olivier Pietquin, and Hatim Khouzaimi. 2014. NASTIA: Negotiating Appointment Setting Interface. In Proceedings of LREC, pages 266–271.
Marco Baroni. 2019. Linguistic generalization and compositionality in modern artificial neural networks. arXiv.
Marco Baroni and Roberto Zamparelli. 2010. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Proceedings of the 2010 Conference on ... , October, pages 1183–1193, Cambridge, MA. Association for Computational Linguistics.
Lawrence W. Barsalou. 2015. Cognitively Plausible Theories of Concept Composition. In Compositionality and concepts in linguistics and psychology, pages 1–31. Springer, Cham.
Guy Emerson and Ann Copestake. 2017. Semantic Composition via Probabilistic Model Theory. Proceedings of IWCS.
Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2002. Placing search in context: the concept revisited. ACM Transactions on Information Systems, 20(1):116–131.
Gottlob Frege. 1892. Über Sinn und Bedeutung. Erkenntnis, 100(1):1–15.
Felix Hill, Roi Reichart, and Anna Korhonen. 2015. Simlex-999: Evaluating semantic models with (Genuine) similarity estimation. Computational Linguistics.
Julian Hough and David Schlangen. 2016. Investigating Fluidity for Human-Robot Interaction with Real-time, Real-world Grounding Strategies. In Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 288–298, Los Angeles. Association for Computational Linguistics.
Jia Deng, Wei Dong, R. Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition.
Hans Kamp. 1975. Two Theories about Adjectives. In Formal Semantics of Natural Language. Cambridge University Press.
Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara L Berg. 2014. ReferItGame: Referring to Objects in Photographs of Natural Scenes. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP’14), pages 787–798.
C. Kennington and D. Schlangen. 2015. Simple learning and compositional application of perceptually grounded word meanings for incremental reference resolution. In ACL-IJCNLP 2015 - 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, Proceedings of the Conference, volume 1.
Satwik Kottur, Ramakrishna Vedantam, José MF Moura, and Devi Parikh. 2016. Visual word2vec (vis-w2v): Learning visually grounded word embeddings using abstract scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4985–4994.
Brenden M. Lake and Marco Baroni. 2017. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. arXiv.
Staffan Larsson. 2015. Formal semantics for perceptual classification. Journal of Logic and Computation, 25(2):335–369.
Xiaolong Li and Kristy Elizabeth Boyer. 2016. Reference Resolution in Situated Dialogue with Learned Semantics. In Proceedings of Sigdial, August. Association for Computational Linguistics.
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. In Computer Vision – ECCV 2014, pages 740–755, Cham. Springer International Publishing.
Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing Data using t-SNE. Journal of Machine Learning Research.
Junhua Mao, Xu Wei, Yi Yang, Jiang Wang, Zhiheng Huang, and Alan L. Yuille. 2016. Learning like a child: Fast novel visual concept learning from sentence descriptions of images. In Proceedings of the IEEE International Conference on Computer Vision, volume 11-18-Dece.
Jeff Mitchell and Mirella Lapata. 2010. Composition in distributional models of semantics. Cognitive science, 34(8):1388–1429.
Richard Montague. 1970. Universal grammar. Theoria, 36(3):373–398.
Richard Montague. 1973. The proper treatment of quantification in ordinary English. In Approaches to natural language, pages 221–242. Springer.
Daniele Moro and Casey Kennington. 2018. Multimodal Visual and Simulated Muscle Activations for Grounded Semantics of Hand-related Descriptions. In Proceedings of the 22nd Workshop on the Semantics and Pragmatics of Dialogue.
Denis Paperno, Nghia The Pham, and Marco Baroni. 2014. A practical and linguistically-motivated approach to compositional distributional semantics. In Proceedings of ACL, pages 90–99. Association for Computational Linguistics.
Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
David Schlangen, Sina Zarriess, and Casey Kennington. 2016. Resolving References to Objects in Photographs using the Words-As-Classifiers Model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 1213–1223.
Erich Schubert, Martin Ester, Xiaowei Xu, Hans Peter Kriegel, and Jörg Sander. 2017. DBSCAN Revisited, Revisited: Why and How You Should (Still) Use DBSCAN. ACM Transactions on Database Systems.
Gabriel Skantze and David Schlangen. 2009. Incremental dialogue processing in a micro-domain. Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics on EACL 09, (April):745–753.
Richard Socher, Andrej Karpathy, Quoc V Le, Christopher D Manning, and Andrew Y Ng. 2014. Grounded Compositional Semantics for Finding and Describing Images with Sentences. TACL, 2(April):207–218.
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition.
Peter D Turney and Patrick Pantel. 2010. From Frequency to Meaning: Vector Space Models of Semantics. Artificial Intelligence, 37(1):141–188.
Hao Wu, Jiayuan Mao, Yufeng Zhang, Yuning Jiang, Lei Li, Weiwei Sun, and Wei-Ying Ma. 2019. Unified Visual-Semantic Embeddings: Bridging Vision and Language with Structured Meaning Representations. In Proceedings of the Combined Workshop on Spatial Language Understanding (SpLU) and Grounded Communication for Robotics (RoboNLP).
Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, and Tamara L Berg. 2018. Mattnet: Modular attention network for referring expression comprehension. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1307–1315.
Sina Zarrieß and David Schlangen. 2017. Is this a child, a girl or a car? Exploring the contribution of distributional similarity to learning referential word meanings. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, volume 2, pages 86–91.