In recent years, we have witnessed the emergence of a new type of automatic conversational system: social chatbots, such as SimSimi, Microsoft XiaoIce, and Replika. Unlike traditional task-oriented bots [12], social chatbots are designed to "communicate" and build "emotional bonds" with users [17]. Social chatbots bring users closer and better engage them in human-computer conversations. As an illustrative example, Microsoft XiaoIce has been an extremely popular social chatbot since it was released in 2014. XiaoIce has accumulated 660 million users worldwide, and on average users interact with XiaoIce 60 times a month.
Figure 1: Two examples of our social chatbot using metaphors in conversation with users. (a) demonstrates a one-round conversation in which the chatbot directly says the whole metaphor sentence. (b) demonstrates a two-round conversation in which the chatbot first offers a novel comparison to engage the user, followed by the explanation in the second round.
Meanwhile, the new purposes of social chatbots also introduce a new challenge: speaking more like a "virtual friend" to users. Thus, social chatbots should be capable of handling more casual and open-domain conversations. Although much work has been done on chatbots [11, 12, 19], this work has mostly focused on task-oriented chatbots and on making chatbots talk "correctly" instead of "casually". To enrich the expressions of social chatbots, a natural approach is to introduce more human-like and advanced linguistic features. Figurative language is frequently used in human communication [10]. Previous studies [5, 9] suggest that figurative language such as metaphor and sarcasm is key to interesting and engaging conversations. Furthermore, Roberts et al. [16] examined the specific goals of people using figurative language in conversations and reported that most people view metaphors as making conversations more interesting. Therefore, in this work, we develop a new social chatbot that conducts conversations with users using automatically generated metaphors.
Our framework starts from a randomly selected target-source pair, such as "love" and "math". The system then quantitatively finds proper connections between source and target. For example, "(being) complex" is considered a feature shared by both "love" and "math". Based on the target-source pair and the discovered connection, the framework generates metaphorical sentences (e.g., "Love is as complex as math.").
We validated our system from two perspectives: 1) the quality of generated metaphors in terms of properness and creativity; and, more importantly, 2) how users react to these metaphors in real human-computer conversations. For the first evaluation, human annotators were asked to label the quality of the generated metaphors. The results show that our framework is capable of generating novel and proper metaphors. For the second evaluation, we studied user experiences with the generated metaphors. We focused on metrics including but not limited to friendliness (i.e., how much the chatbot speaks like a friend) and follow-up rate (i.e., the desire to respond to the chatbot). Test results indicate that users are more interested and are significantly more willing to respond when a chatbot uses metaphors. Metaphors also increased the perceived friendliness of chatbots, although this effect was only marginally significant.
The contributions of this study are threefold:
1) We propose an automatic metaphor generation system for social chatbots. To the best of our knowledge, this is the first work that considers generating metaphors for conversational systems.
2) We conduct user studies to evaluate the quality of generated metaphors. Results show that the system is able to generate novel and proper metaphors.
3) We systematically evaluate the effect of using metaphors in human-computer conversations. The results reveal that metaphors make users feel more interested and more willing to respond.
Metaphors (e.g., Love is like chocolates, sweet and bitter at the same time) are a figure of speech involving the comparison of one thing with another thing of a different kind and are used to make a description more emphatic or vivid [15].
Previous studies [5, 9] suggest that the use of figurative language such as metaphor and sarcasm is important for creating interesting and engaging conversations. Roberts et al. [16] report that among all major figurative language types, metaphors are most able to make conversations more interesting. Specifically, 71% of participants indicated that they use metaphors to add interest to conversations and 12% use metaphors to get attention from their conversational partner.
Early work [11, 12, 19] on human-computer conversation systems mainly focused on task completion, such as customer service, making recommendations, and answering questions. Research on task-oriented systems mainly focuses on addressing users’ queries and generating informative answers. In recent years, more and more attention has been paid to non-task-oriented chatbots [17], which aim to hold casual and engaging conversations with users in open domains. A number of studies [7, 8] have sought to meet users’ emotional needs and make conversation systems more engaging.
Despite the pervasiveness of figurative language in human conversations, little attention has been paid to integrating figurative language with chatbots. Inspired by Roberts et al.’s study [16], we propose a metaphor generation system that is capable of generating metaphors and enhancing users’ engagement with chatbot systems.
Table 1: Top 10 most frequent abstract concepts and concrete concepts in our chatbot conversation log. Conc. R. is the abbreviation for concreteness rating.
Previous cognitive linguistic studies show that targets and sources are usually different types of concepts: targets usually come from abstract domains, while sources come from concrete domains [10]. In other words, by utilizing metaphors, people manage to explain and express less-understood and abstract concepts (i.e., targets) using well-understood and concrete concepts (i.e., sources) [3]. Therefore, to select suitable targets and sources, we applied two different approaches in our system.
Figure 2: An illustration of the connecting words (in blue) for target love and source lottery (in red) by different part-of-speech (POS) tags. Plots (a), (b), and (c) respectively show adjectives, verbs, and nouns in the underlying word vector space. Numbers on the dotted lines represent the semantic distance (defined as 1 ) between a pair of words.
To select targets, we first followed previous linguistic studies and collected 122 poetic themes [2, 4]. Note that poetic themes are usually abstract concepts, which makes them ideal candidates for targets. We then extended the candidate set by adding the five closest concepts to each poetic theme. To ensure that the concepts are actually used in human-computer conversations, we further analyzed the frequency of each concept in our chatbot conversation log. We filtered out the concepts that are rarely used (frequency lower than 0.001%) and obtained 96 concepts, which were used as target candidates in our system. These concepts span many diverse topics, such as romance (e.g., "love", "heart"), history (e.g., "war", "peace"), and nature (e.g., "earth", "spring"). Table 1 shows the top ten most popular concepts, as well as their frequencies.
To select sources, we considered two factors of a concept: popularity in human-computer conversations and concreteness. We first extracted the 10,000 most frequently used nouns from our chatbot conversation log. We then looked up the concreteness scores for these words in the concreteness database introduced by Brysbaert et al. [1]. The database assigns concreteness ratings to 40 thousand English words; each rating reflects the degree to which the concept denoted by a word refers to a perceptible entity [1]. We took the 3,000 most concrete nouns as source candidates for our system. Table 1 also shows the top ten most concrete concepts and their scores.
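The two-stage source selection can be sketched as follows. This is a minimal illustration rather than the system's actual code: the function name `select_sources`, the token list, and the toy ratings are all hypothetical, and the sketch assumes that nouns missing from the concreteness database are simply skipped.

```python
from collections import Counter


def select_sources(noun_tokens, concreteness, top_frequent=10_000, top_concrete=3_000):
    """Keep the most frequent nouns, then return the most concrete ones among them.

    noun_tokens  : iterable of noun occurrences from a conversation log
    concreteness : dict mapping a word to its concreteness rating
    """
    counts = Counter(noun_tokens)
    # Stage 1: popularity in the conversation log.
    frequent = [word for word, _ in counts.most_common(top_frequent)]
    # Assumption: words without a rating are dropped.
    rated = [word for word in frequent if word in concreteness]
    # Stage 2: concreteness, highest rating first.
    rated.sort(key=lambda word: concreteness[word], reverse=True)
    return rated[:top_concrete]


# Toy data with made-up ratings, for illustration only.
tokens = ["apple", "apple", "idea", "idea", "idea", "table", "hope", "table", "apple", "river"]
ratings = {"apple": 5.0, "table": 4.9, "river": 4.89, "idea": 1.61, "hope": 1.25}
print(select_sources(tokens, ratings, top_frequent=4, top_concrete=2))
```

Applying the frequency cut first keeps rare but highly concrete nouns (here "river") out of the candidate set, which matches the stated goal of using concepts that actually occur in conversations.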
Besides a target and a source, a metaphor also requires a connection between these two concepts. The connection is usually an expression, which is not only semantically close to both the target and source, but also maintains a balanced semantic distance to the two words [2]. In our framework, we quantitatively discovered words linking targets and sources semantically, and refer to these words as connecting words.
We first located targets and sources in a word embedding space. Since distance in this space reflects the semantic similarity of words [13], we can quantify how well a word connects a target and a source from two perspectives: 1) connectivity: the semantic distances from a connecting word to the target and to the source should be smaller than the semantic distance between the target and the source; and 2) balance: a connecting word should maintain a balanced distance to the target and the source, thus drawing the two together. These two aspects can be clearly visualized in Figure 2. For example, in Figure 2 (a), lucky is a connecting word that demonstrates both connectivity and balance between target love and source lottery. Combining these two aspects, we designed a connecting score. Formally, given a target T and a source V, the connecting score of a word X for T and V is defined as:
The lower the connecting score, the better a word links the two concepts. For each target-source pair, we ranked all words by their connecting score in ascending order and chose the top 5 words as the connecting words.
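As a rough illustration of the connectivity and balance criteria, the sketch below ranks candidate words with an assumed score: the sum of the two distances plus their absolute difference. This form is an assumption chosen only because it is lower for words that are both close to and balanced between the target and the source; it is not necessarily the exact form of equation (1). Cosine distance and the two-dimensional toy vectors are likewise assumptions.

```python
import math


def cosine_distance(u, v):
    """Semantic distance, assumed here to be 1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def connecting_score(x, t, v):
    """Assumed score (lower is better): closeness to both words plus imbalance."""
    d_t = cosine_distance(x, t)
    d_v = cosine_distance(x, v)
    return d_t + d_v + abs(d_t - d_v)


def top_connecting_words(vectors, target, source, k=5):
    """Return the k best-scoring words that satisfy the connectivity criterion."""
    t, v = vectors[target], vectors[source]
    d_tv = cosine_distance(t, v)
    candidates = []
    for word, x in vectors.items():
        if word in (target, source):
            continue
        # Connectivity: closer to both endpoints than the endpoints are to each other.
        if cosine_distance(x, t) < d_tv and cosine_distance(x, v) < d_tv:
            candidates.append((connecting_score(x, t, v), word))
    return [word for _, word in sorted(candidates)[:k]]


# Hypothetical 2-D embeddings, for illustration only.
toy_vectors = {
    "love": (1.0, 0.0),
    "lottery": (0.0, 1.0),
    "lucky": (1.0, 1.0),     # close to both and balanced
    "romance": (1.0, 0.1),   # close to "love" only
    "ticket": (0.1, 1.0),    # close to "lottery" only
    "banana": (-1.0, -1.0),  # far from both
}
print(top_connecting_words(toy_vectors, "love", "lottery", k=1))
```

In this toy space "lucky" ranks first because it is equidistant from both words, while "romance" and "ticket" are penalized for imbalance and "banana" fails the connectivity check.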
Figure 3: The properness score distribution of connecting words by different POS categories. X-axis shows the properness scores, ranging from 0 (not proper) to 2.0 (very proper). Y-axis shows the percentage.
Identify Similarities by Different POS
As connecting words should convey enough information, we considered content words (i.e., adjectives, verbs, and nouns) as candidates. Connecting words semantically link a target and a source, but connecting words of different parts of speech (POS) link the two concepts in different ways. We summarize the most representative case in each category: 1) The connecting word is an adjective and it is a common attribute or property shared by the target and the source. For example, "complex" is a proper connecting word for target "love" and source "math". 2) The connecting word is a verb and it can modify both the target and the source. For example, "scream" is a proper connecting word for target "soul" and source "football fans". 3) The connecting word is a noun and its relationship with the target is the same as its relationship with the source. For example, "gamble" is a proper connecting word for target "love" and source "lottery".
We designed different methods for generating metaphors from connecting words of different POS. Table 2 reports example metaphors generated from (target, source, connecting word) triplets, and also shows the POS tags of the connecting words. In the next three sections, we report our approach for each of the three categories.
Generate Metaphors with ADJ Connecting Words
Ortony et al. [14] argue that metaphors project high-salience properties of a descriptive source term (the source) onto a target term (the target) for which those properties are not already salient. In other words, a proper connecting word for a target-source pair can be 1) used to describe the target, and 2) a salient attribute of the source. Note that condition 2 is more restrictive than condition 1: both "sour" and "sweet" can be used to describe apples, but only "sweet" is a salient attribute. We validated adjective connecting words based on these two conditions by checking if people have used the adjective to describe the target and the source before.
Specifically, to validate condition 1, we sent two queries, "adjective T" (e.g., "sweet love") and "T is adjective" (e.g., "love is sweet"), to a web search engine and recorded the total number of returned web pages. Similarly, to validate condition 2, we queried "as adjective as (a|an) V" (e.g., "as sweet as apples") and recorded the number of returned web pages. We considered an adjective proper if both conditions were satisfied, i.e., both numbers were larger than certain thresholds. To generate complete metaphor sentences, we then manually constructed a few templates: "T is adjective, just like V.", "T is as adjective as (a|an) V.", and "T is like (a|an) adjective V."
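The validation and template steps above can be sketched as follows. Everything named here is hypothetical: `hit_count` stands in for whatever search-engine interface returns a page count for a quoted query, the threshold values are placeholders (the text only says "certain thresholds"), and the `use_article` flag is an assumed way to handle the optional "(a|an)".

```python
def noun_phrase(noun, modifier="", use_article=True):
    """Build 'a shining diamond', 'an apple', or a bare 'math'."""
    phrase = f"{modifier} {noun}".strip()
    if not use_article:
        return phrase
    article = "an" if phrase[0].lower() in "aeiou" else "a"
    return f"{article} {phrase}"


def is_proper_adjective(adj, target, source, hit_count,
                        min_target_hits=1000, min_source_hits=100, use_article=True):
    """Check both conditions against page counts; thresholds are placeholders."""
    # Condition 1: the adjective has been used to describe the target.
    target_hits = hit_count(f'"{adj} {target}"') + hit_count(f'"{target} is {adj}"')
    # Condition 2: the adjective is a salient attribute of the source.
    source_hits = hit_count(f'"as {adj} as {noun_phrase(source, use_article=use_article)}"')
    return target_hits > min_target_hits and source_hits > min_source_hits


def fill_templates(adj, target, source, use_article=True):
    """Instantiate the three manually constructed templates."""
    t = target.capitalize()
    v = noun_phrase(source, use_article=use_article)
    return [
        f"{t} is {adj}, just like {v}.",
        f"{t} is as {adj} as {v}.",
        f"{t} is like {noun_phrase(source, adj, use_article)}.",
    ]


# Made-up page counts standing in for a real search engine.
fake_counts = {'"complex love"': 900, '"love is complex"': 400, '"as complex as math"': 250}


def fake_hit_count(query):
    return fake_counts.get(query, 0)


if is_proper_adjective("complex", "love", "math", fake_hit_count, use_article=False):
    for sentence in fill_templates("complex", "love", "math", use_article=False):
        print(sentence)
```

Passing the page-count lookup in as a function keeps the sketch independent of any particular search API and makes the thresholds easy to tune.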
A key observation is that verb and noun connecting words tend to exhibit diverse relationships with targets and sources. Therefore, we do not identify and handle all possible relationships, but rather handle the most representative case: subject and verb associations. Subject-verb is the most fundamental sentence structure. It is also an effective feature in metaphor detection [18] and metaphor generation [6]. Thus, to validate whether a verb exhibits the same relation with a target and source pair, we verified if target-verb and source-verb each demonstrate subject-verb relations.
Starting from a verb connecting word and a pair of target T and source V, we sent two queries, "verb + T" and "verb + V", to a search engine and retrieved the top 10,000 web snippets for each query. After removing duplicates, we had 615.3 web snippets on average for each keyword pair. We then filtered out invalid sentences (e.g., broken sentences, advertisements, and sentences that do not contain both keywords), which left 4.6 sentences on average for each keyword pair. We analyzed the syntactic dependency structure of each sentence and filtered out those sentences in which the target-verb or source-verb relation was not subject-verb. We ranked all target sentences by their semantic distance to the source word, where the distance was calculated as the average distance of every word (excluding stopwords) in the sentence to the source word. We used the sentence with the smallest distance as the explanation and generated metaphors of the form "T is like V, [explanation]."
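The final ranking step can be sketched as below, assuming cosine distance over word vectors, a small hand-written stopword list, and whitespace tokenization; all three are simplifications, and the toy vectors and sentences are invented. Snippet retrieval and dependency parsing are outside the scope of the sketch.

```python
import math

STOPWORDS = {"a", "an", "the", "is", "of", "in", "to", "and"}  # illustrative subset


def cosine_distance(u, v):
    """Semantic distance, assumed here to be 1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def sentence_distance(sentence, source, vectors):
    """Average distance from each non-stopword in the sentence to the source word."""
    words = [w.strip(".,!?") for w in sentence.lower().split()]
    words = [w for w in words if w not in STOPWORDS and w in vectors]
    if not words:
        return math.inf  # nothing to compare, so rank the sentence last
    return sum(cosine_distance(vectors[w], vectors[source]) for w in words) / len(words)


def pick_explanation(sentences, source, vectors):
    """Choose the candidate sentence that is semantically closest to the source."""
    return min(sentences, key=lambda s: sentence_distance(s, source, vectors))


def build_metaphor(target, source, explanation):
    """Fill the 'T is like V, [explanation].' pattern."""
    return f"{target.capitalize()} is like {source}, {explanation.rstrip('.')}."


# Hypothetical 2-D embeddings and candidate sentences, for illustration only.
toy_vectors = {
    "fans": (1.0, 0.0), "soul": (0.5, 0.5), "screams": (1.0, 0.2),
    "loudly": (0.8, 0.3), "rests": (0.0, 1.0), "quietly": (0.1, 1.0),
}
candidates = ["the soul screams loudly", "the soul rests quietly"]
best = pick_explanation(candidates, "fans", toy_vectors)
print(build_metaphor("soul", "fans", best))
```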
Similar to the verb case, noun connecting words exhibit diverse relations with targets and sources. Therefore, we identified and handled the most representative relation: subject-predicate-object patterns. The idea is that if there exists a certain predicate such that both target-predicate-noun and source-predicate-noun frequently occur as subject-predicate-object in a large text corpus, then we know predicate + noun is a phrase that can modify both target and source.
Figure 4: Plots (a), (b), and (c) respectively show the smoothness score, properness score, and novelty score distribution of metaphor sentences by different categories of connecting words: adjective, verb, and noun.
We followed the same procedure to collect sentences for targets and sources from the search engine and to filter the sentences. On average, we collected 612.4 web snippets and 5.3 valid sentences for each keyword pair. We identified the subject-predicate-object structure in each sentence via dependency parsing. We then followed the same approach to generate metaphors of the form "T is like V, [explanation]."
From all 96 × 3,000 pairs of target and source, we randomly sampled 500 pairs and used equation (1) to retrieve the top 5 adjective connecting words, verb connecting words, and noun connecting words, respectively. Each of the 500 × 15 samples was annotated on a 3-point scale: 0 (not proper), 1 (proper), and 2 (very proper). Each sample was labeled by 3 human judges, and the average of their ratings was used as the final rating.
Figure 3 shows the properness score distribution of connecting words by different POS categories. If we consider a connecting word with score >= 1 as proper, then overall we have 1965 (26.2%) proper connecting words, consisting of 847 adjectives, 597 verbs, and 521 nouns. An important observation is that adjective connecting words achieve higher scores than verb connecting words and verb connecting words achieve higher scores than noun connecting words. This result aligns with our previous analysis that verb and noun connecting words exhibit more diverse relations with targets and sources. In the next section, we evaluate the metaphors generated using the 1965 proper connecting words.
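As a quick consistency check on these counts, assuming every sampled pair contributes five candidates for each of the three POS categories:

```latex
\[
500 \times (5 + 5 + 5) = 7500, \qquad
847 + 597 + 521 = 1965, \qquad
\frac{1965}{7500} = 0.262 = 26.2\%.
\]
```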
Generated Metaphors Evaluation
Metaphor generation was evaluated with the 1965 proper (target, source, connecting word) triplets. We were able to generate a total of 461 metaphor sentences: 351 with adjectives, 63 with verbs, and 47 with nouns. The main reason there were fewer metaphors with verbs and nouns than with adjectives is that we only handled the subject-verb and subject-predicate-object patterns, though other possible relations exist.
Each generated sentence was evaluated by three human annotators from the following perspectives: 1) smoothness: whether a sentence is clear and grammatically correct; 2) properness: whether the comparison and explanation are understandable and make sense; and 3) novelty: whether the comparison or explanation is fresh, novel, or surprising. Smoothness is a binary metric: 0 (not smooth) or 1 (smooth); properness and novelty are annotated on a 3-point scale: 0 (not at all), 1 (moderately), and 2 (very strongly). These three metrics are progressive, i.e., a metaphor cannot be proper unless the sentence is clear, and the novelty score is only meaningful if the comparison is proper. We use the average of the three annotators' ratings as the final score. Table 2 reports the assigned scores of these three metrics for the example metaphors. A sentence with a smoothness score of at least 0.7 is considered clear, and a metaphor with a properness score of at least 1 is considered proper. We generated 330, 45, and 36 smooth sentences and 242, 29, and 24 proper sentences with adjectives, verbs, and nouns, respectively.
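The score aggregation can be sketched as follows. The function names and dictionary layout are hypothetical, and both the "at least" reading of the two cut-offs and the way the progressive relationship is enforced (a sentence is counted as proper only if it is also smooth) are assumptions.

```python
from statistics import mean


def final_scores(annotations):
    """Average the three annotators' ratings for each metric."""
    return {metric: mean(ratings) for metric, ratings in annotations.items()}


def classify(annotations, smooth_min=0.7, proper_min=1.0):
    """Label a generated sentence as smooth and/or proper from its averaged scores."""
    scores = final_scores(annotations)
    smooth = scores["smoothness"] >= smooth_min
    # Progressive relationship: properness only counts for smooth sentences.
    proper = smooth and scores["properness"] >= proper_min
    return {"scores": scores, "smooth": smooth, "proper": proper}


# Made-up ratings from three annotators, for illustration only.
good = {"smoothness": [1, 1, 1], "properness": [2, 1, 1], "novelty": [1, 2, 2]}
unclear = {"smoothness": [1, 1, 0], "properness": [2, 2, 1], "novelty": [2, 2, 2]}
print(classify(good)["proper"], classify(unclear)["proper"])
```

With three binary smoothness ratings the average can only be 0, 1/3, 2/3, or 1, so a 0.7 cut-off effectively requires all three annotators to agree that the sentence is smooth.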
Figure 4 shows the score distribution of each metric by different categories of connecting words. From plot (a) we can see that 92.02% of metaphors generated with adjectives are smooth while only 71.43% and 76.6% of the metaphors generated with verbs and nouns, respectively, are smooth.
Table 2: Examples of generated metaphors in decreasing order of the smoothness, properness, and novelty scores. Targets (in red), sources (in orange), and connecting words (underlined) are highlighted.
This result is not surprising since the adjective approach applies a far more restrictive template than the verb and noun approaches. Plot (b) shows similar distributions for the different word categories. However, it is still worth noting that 20% more of the metaphors generated with verbs are inappropriate compared to the metaphors generated with adjectives and nouns. This is probably because subject+verb is a relatively loose constraint. Finally, plot (c) reveals interesting differences in the distributions: the novelty scores of adjective metaphors are evenly distributed between 1.00 and 2.00, while the novelty distribution of noun metaphors increases continuously from 1.00 to 2.00. There are 11.3% and 18.5% more strongly novel metaphors generated with verbs and nouns, respectively, than with adjectives. This is because noun and verb sentences are longer and tend to provide richer explanations.
Although several metrics have been proposed in previous work to evaluate the performance of chatbots [11, 19], these metrics are mostly designed for task-oriented conversations. The main concerns of social chatbots, such as user experience in casual conversations, are usually ignored by these existing metrics. We therefore propose to align the evaluation metrics with the goals of social chatbots: to build emotional bonds with users and become their friend. Accordingly, we designed the following metrics to evaluate our system: (1) dialogue quality: do users think that the chatbot generates meaningful and informative content?; (2) friendliness: does the content generated by the chatbot make users feel that the chatbot is personable or engaging in the way a friend would be?; (3) follow-up rate: does the content generated by the chatbot make users want to respond and keep the conversation going?
Our data consists of 52 randomly sampled metaphor sentences with a properness score of at least 1. We implemented two different approaches to integrate our metaphor generation system into a chatbot. In the first approach, the chatbot directly says the complete metaphor sentence, e.g., "Heart is shining like a diamond." In the second approach, the chatbot first offers a comparison, e.g., "I heard that heart is like a diamond. Do you know why?", and then follows with the explanation in the second round, e.g., "Because both are shining." Both approaches were compared to a baseline in which the chatbot simply says the literal sentence, e.g., "Heart is shining." Therefore, for each metaphor sentence, we generated three different types of expressions: one-round metaphor, two-round metaphor, and literal sentence.
We recruited three annotators and asked them to rate all 50 sentences for all three types of expressions. Each metric was rated on a 5-point scale, from strongly disagree (-2) to strongly agree (2). To account for the dependency in our within-subject experiment design, we first used a repeated measures ANOVA to analyze the differences among group means. We then performed Tukey post-hoc tests to compare all the group means in pairs and reported the significance of the pairwise differences.
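For readers unfamiliar with the test, the F-statistic of a one-way repeated measures ANOVA can be computed as in the sketch below. It is written in plain Python with invented ratings, treats each row as one rated item observed under all three expression types (an assumption about how the data were arranged), and omits both the p-value, which requires the F distribution, and the Tukey post-hoc comparisons.

```python
def repeated_measures_anova(rows):
    """One-way repeated measures ANOVA.

    rows: list of equal-length lists; each row holds one unit's score under
          every condition (columns = conditions).
    Returns (F, df_conditions, df_error).
    """
    n, k = len(rows), len(rows[0])
    grand_mean = sum(sum(row) for row in rows) / (n * k)
    condition_means = [sum(row[j] for row in rows) / n for j in range(k)]
    row_means = [sum(row) / k for row in rows]

    ss_conditions = n * sum((m - grand_mean) ** 2 for m in condition_means)
    ss_rows = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_total = sum((x - grand_mean) ** 2 for row in rows for x in row)
    # Removing between-row variation is what distinguishes this from a plain ANOVA.
    ss_error = ss_total - ss_conditions - ss_rows

    df_conditions = k - 1
    df_error = (k - 1) * (n - 1)
    f_stat = (ss_conditions / df_conditions) / (ss_error / df_error)
    return f_stat, df_conditions, df_error


# Made-up ratings on the -2..2 scale; columns: literal, one-round, two-round.
toy_ratings = [
    [0, 1, 0],
    [1, 2, 1],
    [0, 2, 1],
    [-1, 1, 0],
]
f_stat, df1, df2 = repeated_measures_anova(toy_ratings)
print(f"F({df1}, {df2}) = {f_stat:.2f}")
```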
The results of our statistical tests are reported in Table 3. There is a statistically significant effect of expression type on dialogue quality and follow-up rate as determined by a one-way repeated measures ANOVA. There is also a marginally significant difference in the friendliness score. The results suggest that integrating metaphors with a conversation system helps to attract users and makes conversations more interesting. This result aligns with prior studies on human-human conversations [9, 16].
A Tukey post-hoc test revealed that dialogue quality (p-value = 0.013) and follow-up rate (p-value = 0.001) were significantly higher when a chatbot directly stated the metaphor as compared to a literal sentence. There was also a marginally significant increase in friendliness score if a chatbot said a metaphor instead of a literal sentence (p-value = 0.09). However, there were no statistically significant differences between saying a metaphor in two-round conversations and saying a literal expression for dialogue quality, friendliness, and follow-up rate (p-value = 0.36, 0.42, and 0.18, respectively). One possible explanation is that user study subjects might feel that two-round conversations are unnecessary.
Table 3: Means and standard deviations of dialogue quality, friendliness, and follow-up rate scores for each type of expression in our user study. The F-statistic from a repeated measures ANOVA is also shown for each metric.
To more robustly evaluate the effect of our system, we integrated metaphors with an existing social chatbot and analyzed how different expressions affect real chatbot users’ follow-up rate. As users were unaware of the ongoing test, we were able to eliminate any potential biases.
When integrating metaphors with social chatbot systems, we sought to make the integration context-aware and fit the conversation flow, i.e., the metaphor is relevant to the conversation topic and matches the user's input. For example, if a user is talking about their boyfriend or girlfriend, a metaphor for love or marriage could be a good fit in the conversation. We used question-answer relevance, keyword matching, and topic similarity as input features and trained a classifier to predict whether a metaphor should be triggered. When a metaphor was a good fit in the conversation flow, we randomly triggered one of the three expressions.
We tested our system on 924 users within a 3-week period. Users’ follow-up rates were 22%, 27%, and 41% for literal sentences, one-round metaphors, and two-round metaphors, respectively. Overall, the results show that both metaphor expressions achieve more follow-ups than literal expressions. Importantly, we found that the follow-up rate for two-round metaphors was the highest among all three expressions, which contradicts the findings from our user study. One possible explanation is that in the user study, a two-round conversation might seem odd to annotators because there is no conversation context, and thus annotators assign lower scores. In real human-computer conversations, however, users prefer more interaction with a chatbot. Sample dialogues are presented in Table 4.
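A minimal sketch of how such a live test could be tallied is shown below. The log format, the expression labels, and the function names are hypothetical, and the toy log is invented; it is not the deployed system's logging code.

```python
import random
from collections import defaultdict

EXPRESSIONS = ("literal", "one_round", "two_round")


def choose_expression(rng=random):
    """Randomly pick one of the three expression types when a metaphor fits."""
    return rng.choice(EXPRESSIONS)


def follow_up_rates(log):
    """Compute the share of triggered expressions that received a user reply.

    log: iterable of (expression_type, user_replied) pairs.
    """
    shown = defaultdict(int)
    followed = defaultdict(int)
    for expression, user_replied in log:
        shown[expression] += 1
        followed[expression] += int(user_replied)
    return {expression: followed[expression] / shown[expression] for expression in shown}


# Made-up interaction log, for illustration only.
toy_log = [
    ("literal", False), ("literal", True), ("literal", False), ("literal", False),
    ("one_round", True), ("one_round", False),
    ("two_round", True), ("two_round", True), ("two_round", False), ("two_round", True),
]
print(follow_up_rates(toy_log))
```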
In this paper, we propose computational approaches to generate metaphors and report the effect of our system in the context of human-computer conversations. According to human evaluation results, our system is able to generate proper and novel metaphors. User study evaluations show that people feel that metaphorical expressions are more meaningful and interesting as compared to literal expressions. More importantly, integrating metaphors with an existing social chatbot increased users’ follow-up rates.
There are many interesting and valuable directions for future work, including studying how the properness and novelty of metaphors affect users’ experiences and engagement in human-computer interactions. In the meantime, it will also be important to study possible improvements to our proposed metaphor generation model, such as increasing the percentage of proper metaphors.
Table 4: Sample dialogues with two-round metaphors.
[1] Marc Brysbaert, Amy Beth Warriner, and Victor Kuperman. 2014. Concreteness ratings for 40 thousand generally known English word lemmas. Behavior research methods 46, 3 (2014), 904–911.
[2] Andrea Gagliano, Emily Paul, Kyle Booten, and Marti A Hearst. 2016. Intersecting word vectors to take figurative language to new heights. In Proceedings of the Fifth Workshop on Computational Linguistics for Literature. 20–31.
[3] Dedre Gentner, Brian Bowdle, Phillip Wolff, and Consuelo Boronat. 2001. Metaphor is like analogy. The analogical mind: Perspectives from cognitive science (2001), 199–253.
[4] Katy Gero and Lydia Chilton. 2018. Challenges in Finding Metaphorical Connections. In Proceedings of the Workshop on Figurative Language Processing. 1–6.
[5] Sam Glucksberg. 1989. Metaphors in Conversation: How Are They Understood? Why Are They Used? Metaphor and Symbolic Activity 4, 3 (1989), 125–143. https://doi.org/10.1207/s15327868ms0403_2
[6] Sarah Harmon. 2015. FIGURE8: A Novel System for Generating and Evaluating Figurative Language. In ICCC. 71–77.
[7] Tianran Hu, Anbang Xu, Zhe Liu, Quanzeng You, Yufan Guo, Vibha Sinha, Jiebo Luo, and Rama Akkiraju. 2018. Touch Your Heart: A Tone-aware Chatbot for Customer Care on Social Media. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. ACM, 415.
[8] Bernd Huber, Daniel McDuff, Chris Brockett, Michel Galley, and Bill Dolan. 2018. Emotional Dialogue Generation using Image-Grounded Language Models. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. ACM, 277.
[9] Anna Albertha Kaal. 2012. Metaphor in conversation. (2012).
[10] George Lakoff and Mark Johnson. 2008. Metaphors we live by. University of Chicago press.
[11] Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Jurafsky. 2016. Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541 (2016).
[12] Gustavo López, Luis Quesada, and Luis A Guerrero. 2017. Alexa vs. Siri vs. Cortana vs. Google Assistant: a comparison of speech-based natural user interfaces. In International Conference on Applied Human Factors and Ergonomics. Springer, 241–250.
[13] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 746–751.
[14] Andrew Ortony. 1979. The role of similarity in similes and metaphors. Metaphor and thought (1979).
[15] IA Richards. 1965. The Philosophy of Rhetoric, A Galaxy Book. New York: Oxford University Press.
[16] Richard M. Roberts and Roger J. Kreuz. 1994. Why Do People Use Figurative Language? Psychological Science 5, 3 (1994), 159–163. https://doi.org/10.1111/j.1467-9280.1994.tb00653.x
[17] Heung-Yeung Shum, Xiaodong He, and Di Li. 2018. From Eliza to XiaoIce: Challenges and Opportunities with Social Chatbots. CoRR abs/1801.01957 (2018). arXiv:1801.01957 http://arxiv.org/abs/1801.01957
[18] Ekaterina Shutova, Lin Sun, and Anna Korhonen. 2010. Metaphor identification using verb and noun clustering. In Proceedings of the 23rd International Conference on Computational Linguistics. Association for Computational Linguistics, 1002–1010.
[19] Anbang Xu, Zhe Liu, Yufan Guo, Vibha Sinha, and Rama Akkiraju. 2017. A new chatbot for customer service on social media. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. ACM, 3506–3510.