Metaphors have long posed significant problems to researchers across a wide variety of fields. While humans seem capable of easily understanding even complex metaphors, it remains difficult to devise a formal analysis that captures the depth and breadth of meanings produced by novel metaphors. We typically think of metaphors within the Conceptual Metaphor framework (Lakoff and Johnson, 1980; Lakoff, 1993), in which metaphors are based in conceptual mappings between different domains: we have cognitive concepts that can be used to represent and understand other concepts, and these mappings can be expressed linguistically to form concrete metaphoric expressions.
While there are many different computational approaches to metaphoric language, the field remains challenging and, in some areas, relatively unexplored. The variety of meanings captured by creative metaphors poses numerous problems to natural language processing researchers, as they rely on lexical diversity and conceptual knowledge. The bulk of work in metaphor has gone to identifying metaphoric expressions or generating interpretations for them (Shutova, 2015; Veale et al., 2016). Whereas previous approaches focus on classification, we instead focus on generation: how can we create novel, interesting, and valid metaphoric expressions?
This task has many possible applications, including creative writing assistance, where users can employ metaphor generation to develop more interesting, persuasive writing. Lakoff and Johnson (1980) suggest that not only can metaphors capture similarity between domains, they actually can generate the similarity, allowing us to view concepts in new ways; optimistically, metaphor generation may allow us to discover new metaphoric ideas to foster understanding and growth in scientific areas. This is particularly true in the domain of education, where new metaphors can be instructive both for teachers and students (Marshall, 1990). Metaphors are also critical for proper interaction between humans and computational agents: humans produce metaphors easily, and to have natural communication with computational models will require them to be able to do the same (Zhang, 2008; Wallington et al., 2011).
In contrast to previous work generating novel metaphors (§2.2), we are the first to tackle metaphor paraphrase generation, and we hope our work can function as a jumping-off point for this challenging and interesting task. The task is particularly difficult for a variety of reasons. First, metaphors have the potential to be enormously creative, deviating greatly from "standard" language, which means normal language models may have difficulty in producing good metaphors. Second, traditional paraphrasing systems attempt to keep the sentences relatively similar, whereas we need sentences that vary substantially in order to enforce metaphor production.
This variation brings problems of its own: there are countless possible metaphor paraphrases for any given utterance, and there are numerous possible metaphoric mappings that can be evoked, yielding slightly different semantic connotations. Consider the following example:
1. The company was losing money rapidly.
This sentence has innumerable possible metaphoric paraphrases, evoking many different metaphors:
2. The company was hemorrhaging money.
3. The company’s finances were circling the drain.
4. The business fell off of a cliff.
5. Profits collapsed.
In 2 and 3, "money" is conceptualized as blood and water respectively, and from conceptual metaphor theory we see that this evokes the MONEY IS A LIQUID mapping. In 4, "finances" is conceptualized as a physical entity, and further, one that can experience harm, perhaps evoking the ECONOMIC HARM IS PHYSICAL INJURY mapping. In 5, the company’s profits are conceptualized as a building, evoking the frequent metaphor of social and economic constructs being conceptualized as physical constructions, in this case specifically FINANCES ARE BUILDINGS.
Note that there is a seemingly endless variety of metaphoric expressions that can fairly consistently capture the same general meaning, with a wide variety of lexical variation. This makes metaphoric paraphrases extremely difficult to evaluate automatically. Traditional metrics for generation (such as BLEU (Papineni et al., 2002) and METEOR (Banerjee and Lavie, 2005)) rely heavily on word overlap, which is counterproductive for metaphoric paraphrasing: we would like our generated phrases to have less word overlap, as interesting metaphors are likely to share little lexical overlap with the original inputs. For this reason we rely on crowdsourcing, evaluating metaphoricity, fluency, and paraphrase quality.
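The effect of word overlap can be illustrated with a small sketch. The function below is a simplified clipped unigram precision, not BLEU or any other published metric, and the near-copy sentence is an invented example; the other sentences are taken from the examples above.

```python
from collections import Counter


def unigram_precision(candidate: str, reference: str) -> float:
    """Share of candidate tokens that also occur in the reference (clipped counts)."""
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(cand_tokens)
    # Each candidate token is credited at most as often as it occurs in the reference.
    matched = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    return matched / len(cand_tokens)


literal = "the company was losing money rapidly"
near_copy = "the company was losing money quickly"   # invented, barely changed
metaphoric = "profits collapsed"                     # example 5 above

overlap_near_copy = unigram_precision(near_copy, literal)
overlap_metaphoric = unigram_precision(metaphoric, literal)
print(overlap_near_copy, overlap_metaphoric)
```

An overlap-based score rewards the near copy and gives the metaphoric paraphrase no credit at all, which is the opposite of the behaviour wanted for this task.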
We approach the problem of metaphoric paraphrase generation from a variety of backgrounds, each with their own positives and negatives. First, we will consider the problem one of lexical replacement, in which we identify the important words in the literal utterance and replace them with metaphoric counterparts. This yields coherent utterances, but limits the flexibility of the output. Second, we will consider this a sequence to sequence (seq2seq) problem, and employ a novel generation technique dubbed "metaphor masking" to hide important words in the input during training and evaluation, forcing the seq2seq model to learn the appropriate contexts for metaphoric and literal words. This also requires knowledge of the key words before paraphrasing, but allows for substantially more flexibility in generation.
Our contribution is thus threefold:
• We formalize the task of metaphor generation, elucidating the datasets and experimental setup necessary.
• We implement a lexical replacement-based baseline, as well as a novel seq2seq architecture based on "metaphor masking".
• We perform analysis of generated metaphors, identifying strengths and weaknesses for each method.
While our task is new, it bears similarity to a variety of better known NLP benchmarks. In the metaphor community, most of the efforts are focused on identification and interpretation of metaphors. We will instead focus on our two key components, paraphrasing and generation, as they relate to metaphors.
2.1 Literal Paraphrasing
Previous work investigates paraphrasing from metaphoric utterances to literal ones with the goal of providing interpretations (Mao et al., 2018; Shutova, 2010). Shutova et al. (2010) treat identification and interpretation jointly, and generate literal paraphrases for metaphoric adjective-noun phrases. Vector space models have also been employed successfully for generating literal paraphrases. Shutova et al. (2012) identify a set of candidate paraphrases based on context and word vectors, and then use a model of selectional preferences to pick the most literal paraphrase. They require no training data, and achieve promising results for unsupervised literal paraphrasing.
Similarly, Mao et al. (2018) build a metaphor identification system using word vectors, and also use it to generate paraphrases for metaphoric sentences. This is done by replacing the verbs that are identified as metaphoric with the most likely literal candidates. They use Word2Vec embeddings (Mikolov et al., 2013) combined with WordNet to identify relations between literal and metaphoric lexemes. This allows for replacement of rarer, more metaphoric senses with concrete literal ones, but doesn’t provide a solution for transitioning from a literal sense to an appropriate metaphoric one. Thus their work is effective at metaphoric to literal paraphrasing, but functions only in this direction; we will restructure their algorithm for the metaphoric direction as a lexical baseline in §4.1.
2.2 Metaphor Generation
With regard to metaphor generation, most efforts have been to generate metaphors at the lexical or phrase level, using template- and heuristic-based methods. Early work in computational metaphor generation involves generating simple "A is like B" expressions, based on probabilistic relationships between words (Abe et al., 2006; Terai and Nakagawa, 2010). These methods are effective to a degree, but lack the flexibility necessary to instantiate natural language metaphors.
Other early approaches to metaphor generation are rooted in knowledge bases. Hervas et al. (2007) build a metaphor generation system by identifying metaphoric domains, building mappings between the source and target, and replacing appropriate references with the built metaphors. They show the difficulty of determining appropriate target domains for metaphors in context. Others use WordNet, building knowledge representations through semantic information from definitions (Veale and Hao, 2008).
Other works seek to generate conceptual metaphors, rather than open linguistic expressions. These approaches, designed to generate conceptual metaphor mappings such as MONEY IS A LIQUID, range from WordNet- and selectional preference-based methods (Mason, 2004), to clustering over WordNet senses (Gandy et al., 2013), to using proposition databases built from syntactic relations (Ovchinnikova et al., 2014). While this task is interesting and useful, particularly for doing proper reasoning from metaphoric mappings, our goal is instead to generate natural linguistic metaphors, rather than metaphoric mappings.
Word embedding approaches have been popular and effective for lexical metaphor tasks. In addition to Mao et al. and Shutova et al.’s paraphrasing work, Gagliano et al. (2016) build off of Word2Vec, using the generated vectors to identify poetic relationships between words, developing a vector-based interpretation of conceptual blends (Fauconnier and Turner, 1996). They identify "connector words" between concepts, allowing for the creation of linguistic metaphors that accurately capture these conceptual metaphoric mappings.
More recently there have been efforts using deep learning methods to generate metaphoric expressions more freely, using sequence-to-sequence models. Most notable is Yu and Wan (2019), who use neural models to generate metaphoric expressions in an unsupervised manner. They identify source and target verbs automatically from corpora, and use these to train a neural language model. Our work is similar: they encode both literal and metaphoric pairs and produce metaphoric outputs based on verbs, but their generation task is free. We are instead working on the more constrained task of generating specific paraphrases from literal utterances.
This is the experimental paradigm we will be following: given a literal phrase, we generate a metaphoric paraphrase that should capture the same meaning. Unlike previous work, our methods are broadly applicable to free text: we are not limited to paraphrasing individual words or phrases, but rather use deep learning models for full natural language generation, which can then freely create metaphoric paraphrases. To our knowledge, our work is the first to attempt to explicitly generate metaphoric paraphrases.
Our goal is to generate metaphoric paraphrases for given literal phrases. Data for this task is extremely sparse: there aren’t any large scale parallel corpora containing literal and metaphoric paraphrases. The most useful resource is the dataset of Mohammad et al. (2016). Their dataset includes multiple parts; importantly, it contains 171 metaphoric sentences extracted from WordNet, with manually generated literal paraphrases. These are high quality annotations, and we will use this dataset for evaluation. While originally built from the side of generating literal paraphrases for metaphoric utterances, it is easy enough to reverse the direction, using their literal paraphrases as input and attempting to generate metaphoric outputs.
Note that there are some discrepancies between the original usage and our intended paraphrase usage. Notably, the dataset was originally built around verbs: the authors replaced the key verbs in each metaphoric sentence to yield a more literal output. This ignores cases where the metaphoric meaning of the sentence is captured by components other than the verb:
1. The painting seems to capture the essence of Spring.
2. These events could fracture the balance of power.
3. The new moon reflected back at itself from the lake’s surface.
In these examples, the verb that was replaced to make a paraphrase is in bold, while the italic phrases could also be construed as metaphoric. In particular, 3 is likely to be considered metaphoric regardless of the bolded verb, due to the poetic reflexive construction "back at itself". This means that the resulting "literal" paraphrases contain literal verbs, but the sentences themselves may still contain metaphors. This isn’t prevalent in the data and doesn’t impact the experiments, as we are only trying to generate more metaphoric output sentences from more literal inputs, but it is important to be aware that our paraphrasing task differs somewhat from the design of the original dataset.
The size of this dataset is small: 171 instances is not enough to train viable deep learning models, and large scale parallel corpora for this task don’t exist. For this reason, we will use methods that are either unsupervised, or don’t rely on parallel data, and can be developed using non-parallel corpora. The lexical replacement model is the former, requiring no training data. The metaphor masking seq2seq model uses external training data, but does not require the data to be parallel. We use a masking procedure to generate artificial sentence pairs for seq2seq training, allowing the model to function using non-parallel datasets.
We propose two different models for metaphoric paraphrase generation. First, we implement a lexical replacement baseline, based on that of Mao et al. (2018). Second, we develop a novel seq2seq framework that masks metaphoric words to better learn how to generate metaphoric outputs.
Figure 1: Lexical Replacement Baseline
4.1 Lexical Replacement Baseline
Metaphors often hinge on verbs. This intuition has fueled many identification and interpretation projects, including the verb-specific identification track of the metaphor detection shared task (Leong et al., 2018). We implement a lexical replacement baseline that takes the literal verb and replaces it with a more metaphoric counterpart. This is based on the work of Mao et al. (2018), who employ this strategy in the other direction: they take metaphoric sentences and replace the metaphoric verbs with literal ones.
We implement this algorithm for metaphor generation by reversing their candidate selection. For an overview of the process, see Figure 1. We begin with a literal sentence with a marked verb (a). (b) We use the WordNet sense hierarchy to find related words to the input word which will then be "candidates" to replace it, but rather than searching "up" the hierarchy for hypernyms, we search "down" the hierarchy for troponyms: more specific verbs (in bold). We believe that in the lexical replacement task, replacement with more specific verbs is likely to yield more metaphoric expressions, as these specific verbs require specific contexts to be understood literally. When placed in an unfamiliar context, they adopt metaphoric meanings via a coercion-like process (Steedman and Moens, 1988). (c) We follow their algorithm for picking the best candidate: we take the mean output embedding of the context (based on the Google News Word2Vec vectors (Mikolov et al., 2013)), and select the candidate word that best matches that mean by way of cosine similarity. (d) This yields the (more specific) word that best fits the context, generating a more metaphoric expression.
Figure 2: Metaphor masking for the seq2seq model.
This method, then, takes as input a sentence with a known literal verb, generates possible metaphoric candidates to replace that verb, and chooses the best fitting option. It requires no external training data, but relies on WordNet, and is restricted to only generating metaphoric verbs.
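The selection step can be sketched roughly as follows. This is an illustrative sketch only: the `TROPONYMS` table and the three-dimensional `VECTORS` are invented stand-ins for WordNet lookups and pretrained Word2Vec vectors, the function names are hypothetical, and inflection of the chosen verb is ignored.

```python
import math

# Hypothetical toy resources. A real system would query WordNet for troponyms
# and load pretrained Word2Vec vectors; these values are made up for illustration.
TROPONYMS = {"lose": ["hemorrhage", "misplace", "forfeit"]}
VECTORS = {
    "company": [0.9, 0.1, 0.0],
    "money": [0.8, 0.3, 0.1],
    "rapidly": [0.2, 0.9, 0.1],
    "hemorrhage": [0.6, 0.7, 0.1],
    "misplace": [0.1, 0.1, 0.9],
    "forfeit": [0.7, 0.0, 0.6],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(words):
    """Mean of the vectors of all words that have an embedding."""
    vecs = [VECTORS[w] for w in words if w in VECTORS]
    if not vecs:
        raise ValueError("no context word has an embedding")
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]


def replace_verb(tokens, verb_index):
    """Swap the marked verb for the troponym closest to the mean context vector."""
    verb = tokens[verb_index]
    context = tokens[:verb_index] + tokens[verb_index + 1:]
    candidates = [c for c in TROPONYMS.get(verb, []) if c in VECTORS]
    if not candidates:
        return list(tokens)  # nothing to replace with; keep the input unchanged
    centre = mean_vector(context)
    best = max(candidates, key=lambda c: cosine(VECTORS[c], centre))
    return tokens[:verb_index] + [best] + tokens[verb_index + 1:]


sentence = ["the", "company", "will", "lose", "money", "rapidly"]
print(" ".join(replace_verb(sentence, 3)))
```

Because only one token changes, the output stays close to the input, which mirrors the restriction noted above: the method can only ever produce a different verb.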
4.2 Metaphor masking model
Sequence to sequence (seq2seq) learning paradigms are vital for a variety of NLP applications: machine translation, style transfer, natural language generation, and more (Chen et al., 2018; Mueller et al., 2017; Dušek et al., 2020). These methods rely on encoding input sentences into vectors, and then applying decoders to generate some output from that input vector. They are often trained on parallel corpora (as in the case of machine translation), with the model learning to output some text based on the vector encoded from the input.
Seq2seq models have been used to generate metaphoric text (Yu and Wan, 2019), but here we are focused on paraphrase generation. In order to apply seq2seq models to this task, we develop a new framework dubbed "metaphor masking". In this framework, we replace metaphoric words in the input texts with metaphor masks (unique "metaphor" tokens), hiding the lexical item. This creates artificial parallel training data: the input is the masked text, with the hidden metaphorical word, and the output is the original text. Through this learning paradigm, the model learns that it needs to generate metaphoric words when it encounters the metaphor mask token. At test time, we provide the model with the literal input, mask the verb, and the model produces an output conditioned on the metaphor masking training. An overview of the process is shown in Figure 2.
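A minimal sketch of how such training pairs might be built is given below. The literal form of the mask token (`<metaphor>`) and the function names are assumptions made for illustration; the text above only specifies that a unique "metaphor" token replaces the metaphoric word.

```python
MASK = "<metaphor>"  # assumed spelling of the mask token, not given in the text


def make_training_pair(tokens, verb_index, is_metaphoric):
    """Build one (source, target) pair for seq2seq training.

    Metaphoric verbs are hidden behind the mask in the source; literal
    instances are left as-is, so the source simply mirrors the target.
    """
    target = list(tokens)
    source = list(tokens)
    if is_metaphoric:
        source[verb_index] = MASK
    return source, target


def mask_for_inference(tokens, verb_index):
    """At test time the literal verb is masked so the model must fill the slot."""
    source = list(tokens)
    source[verb_index] = MASK
    return source


src, tgt = make_training_pair(["the", "company", "hemorrhaged", "money"], 2, True)
print(src, tgt)
print(mask_for_inference(["the", "company", "lost", "money"], 2))
```

The asymmetry is the point of the design: during training the mask only ever stands in for metaphoric words, so at test time a masked literal verb is a request for a metaphoric replacement.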
This procedure requires additional annotated data to generate the parallel inputs for training. For this, we employ a number of available metaphor corpora: the VUAMC dataset (Steen et al., 2010), another partition of the Mohammad et al. dataset that contains individual sentences labelled as literal or metaphoric (Mohammad et al., 2016), the TroFi dataset (Birke and Sarkar, 2006), and the additional data collected by Stowe et al. (2018). Each of these datasets contains annotations of metaphoric verbs, although the annotation schemas differ, so we expect some variety and noise in the model. Combining these datasets yields 35,415 verbs, of which 11,593 are metaphoric.
Our final goal is to generate short metaphoric utterances based on the Mohammad et al. (2016) dataset. In order to match this, we trim our training data around the verbs: each verb is treated as a separate training instance, along with 7 words of context on each side. We use all 35,415 sentences as input to the model: non-metaphoric sentences are left as-is, with the input mirroring the output. Metaphoric data is masked during training, replacing the input verb with a metaphor mask and using the original as output. This yields 35,415 pairs for training, 11,593 of which contain metaphoric masks. We hypothesize that using both literal and metaphoric datasets will allow the model to better distinguish between sentences with a metaphor mask and those without, generating stronger metaphoric outputs. We use a transformer architecture (Vaswani et al., 2017) with 6 layers in each of the encoder and decoder. The model uses 8 attention heads to learn different attention distributions, whose outputs are concatenated. The hidden size for encoder and decoder is 512. We use normalization per token, with a vocabulary size of 30K. The model was trained using the Adam optimiser, with an initial learning rate of 0.5.
The approaches to evaluating metaphoric and literal sentences using crowdsourcing include evaluating hand-generated sentences for metaphoricity (Mohammad et al., 2016; Bizzoni and Lappin, 2018), evaluation of the output of automatic metaphor generation systems (Yu and Wan, 2019; Veale, 2016), and evaluation of novelty in verbal metaphors (Do Dinh et al., 2018). Uniquely focusing on metaphor evaluation, Miyazawa and Miyao (2017) highlight the importance of effective evaluation. They use four key metrics: metaphoricity, novelty, comprehensibility, and overall evaluation, to measure the success of metaphor generation in Japanese.
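The trimming step could look something like the sketch below, assuming whitespace-tokenised input and a known verb position; the function name and its return convention are hypothetical.

```python
def trim_around_verb(tokens, verb_index, window=7):
    """Keep the verb plus up to `window` tokens of context on each side.

    Returns the trimmed token list and the verb's index within it, so that
    later steps can still locate the verb after trimming.
    """
    start = max(0, verb_index - window)
    end = min(len(tokens), verb_index + window + 1)
    return tokens[start:end], verb_index - start


tokens = [f"w{i}" for i in range(20)]
trimmed, new_index = trim_around_verb(tokens, 10)
print(len(trimmed), trimmed[new_index])
```

Verbs near a sentence boundary simply get less context on that side, so instances vary in length up to a maximum of 15 tokens.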
We will rely on two components that are typical of metaphor generation. First, we evaluate metaphoricity, with the goal of producing coherent and interesting metaphors, rather than conventional, common language. Second, we evaluate fluency, attempting to capture the syntactic viability of the generated output. Additionally, as we are attempting to generate paraphrases, we also include crowdsourced evaluation of paraphrase quality.
Annotators were thus asked to rate sentences with regard to three different factors: metaphoricity, fluency, and paraphrase quality. Each sentence was rated by five separate workers on a Likert scale from 1 to 4. We filtered out results of users who failed test sentences and those who only completed 1 task, aiming to keep results from consistent and knowledgeable workers.
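The filtering and averaging described here might be implemented along the following lines. The tuple layout of the ratings, the worker identifiers, and the function name are assumptions for illustration, not the format of the actual crowdsourcing output.

```python
from collections import Counter, defaultdict
from statistics import mean


def aggregate_ratings(ratings, failed_workers):
    """Average Likert scores per sentence after dropping unreliable workers.

    `ratings` is a list of (worker_id, sentence_id, score) tuples. Workers who
    failed the test sentences, or who completed only one task, are excluded.
    """
    tasks_per_worker = Counter(worker for worker, _, _ in ratings)
    kept = defaultdict(list)
    for worker, sentence_id, score in ratings:
        if worker in failed_workers or tasks_per_worker[worker] <= 1:
            continue
        kept[sentence_id].append(score)
    return {sentence_id: mean(scores) for sentence_id, scores in kept.items()}


# Invented example data: worker "c" did a single task, worker "d" failed the tests.
ratings = [
    ("a", "s1", 4), ("a", "s2", 2),
    ("b", "s1", 2), ("b", "s2", 3),
    ("c", "s1", 1),
    ("d", "s1", 1), ("d", "s2", 1),
]
print(aggregate_ratings(ratings, failed_workers={"d"}))
```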
Fluency judgments were relatively simple. For this, we asked annotators to rate the sentences based on how fluent (from incomprehensible to fluent English) a sentence is.
For paraphrase judgments, we used two different setups. We have access to three components: x, the original literal input; y, the original metaphoric paraphrase of x; and ŷ, the generated metaphoric paraphrase of x. We first evaluate y, ŷ paraphrasing, comparing the generated metaphoric outputs with the gold metaphors from the test data, allowing us to compare the system output to the gold data. We also experimented with comparing generated paraphrases to the literal inputs, as these should also be valid paraphrases. This represents our x, ŷ evaluation, comparing the resulting paraphrases with the original literal inputs. For each, we presented the worker with a gold input (either literal for x, ŷ or metaphoric for y, ŷ) and the generated output, and asked them how good of a paraphrase the output was, from "completely unrelated" to "strong paraphrase".
Figure 3: Evaluation of each model via crowdsourcing.
Metaphor evaluation is more difficult, and we attempt to follow previous crowdsourcing approaches for metaphor rating. Based on the schema from Do Dinh et al. (2018) and Yu and Wan (2019), we provided basic definitions of metaphoricity for crowdworkers, allowing them to use their intuitions about what to consider metaphoric. We found in a pilot study that providing longer, more complex descriptions of metaphoricity increased the difficulty of the task, so we chose to keep the definition simple.
Our crowdsourcing setup was repeated for three outputs: the gold metaphors of Mohammad et al. (2016), which also contain hand-crafted literal paraphrases; the lexical replacement baseline; and the output of our experimental system, sentences generated via seq2seq with metaphor masking.
The mean scores for the crowdsourced evaluations for each system are shown in Figure 3. The sentences generated by lexical replacement bear closer resemblance to the literal inputs: they have lower metaphoricity scores and higher paraphrase rankings. This is expected, as the change to the input is only a single word. The
metaphor masking model shows more similarity to the metaphoric outputs: its sentences have low paraphrase similarity when compared with the literal inputs, but better paraphrase scores when compared with the gold metaphors, and high metaphoricity. The fact that the metaphor masking model produces metaphoricity scores on par with the gold standard metaphors, and is consistent with the lexical model in terms of fluency and paraphrase quality against the gold metaphors, shows that this method is very effective at generating metaphoric sentences. The quality of the paraphrases is still relatively low, averaging 1.3 points below the gold x, y paraphrases, but this is an important first step for this task.
Table 1: Samples with the highest scores for LexRep
In order to understand how each of these models performed, we do qualitative analysis over the results. We examined the results of each model: what does the lexical replacement baseline do well, and what benefits can we gain from employing our metaphor masking model?
6.1 Lexical Replacement
Table 1 shows the best lexical replacement outputs, based on their improvement over the metaphor masking model. The replacement model performs well in fluency and paraphrase quality, particularly because it copies most of the input, only replacing a single word. In some cases, the "best fit" candidate is the original input word. These perform exceedingly well in fluency and paraphrase quality, as they match the input sentence, but understandably lack metaphoricity (1). However, in many cases the model makes novel and metaphoric word choices, indicating the validity of this approach for metaphor generation (2-4).
This baseline has numerous theoretical advantages and disadvantages. It yields output sentences that are very similar to the inputs, as we are only replacing a single word. This can be beneficial, as the outputs will be necessarily syntactically and semantically coherent except for the replaced word, but also severely restricts the creativity and novelty of the output.
A first downside is that this method requires knowledge of the target verb. Our data has the target verb in the literal and metaphoric paraphrases annotated, but sometimes these verbs contain particles (such as "start on" and "use up"), which make lexical replacement difficult. Just replacing the verb and maintaining the particle sometimes yields good results ("I [started] on the problem" → "I [fell] on the problem"), while replacing both verb and particle can also be correct ("they [used up] their food" → "they [demolished] their food"). Second, if we apply this method to unseen data, we will first need to identify the target verbs, making it more reliant on external knowledge and prone to error. Finally, it is dependent on WordNet, which restricts the power and flexibility with regard to creativity.
Table 2: Samples with the highest scores for Metaphor Masking
While the above examples highlight the strength of the lexical replacement baseline, they also show the weaknesses of the metaphor masking approach. Due to the free nature of generation, we often see words in the generated output that bear little to no relation to the original input ("impishly" in (1) and "DMZ" in (2)). These kinds of errors elucidate how the more constrained lexical replacement model tends to yield better paraphrases.
6.2 Metaphor Masking
The metaphor masking model tends to generate more metaphoric sentences with similar fluency, although they are often not valid paraphrases of the original input. Table 2 shows examples for which the metaphor masking model performs best in comparison with the lexical replacement model.
Metaphor masking tends to produce fairly consistent outputs, which are syntactically regular. Hiding the metaphoric word causes the model to make a prediction, yielding varied outputs, and these are more metaphoric than their inputs.
This model is complementary to the lexical replacement model: as it is based on a sequence-to-sequence transformer model, it is relatively free in its generation. It frequently generates words not in the original input, which leads to more creative, metaphoric outputs. Examples like 5 and 6 show the power of the metaphor masking model: it is capable of generating a wider variety of words that yield better metaphors. As the model isn’t constrained to a particular resource, it has more power with regard to lexical choice. Example 6 shows another benefit with regard to metaphoricity: the model can generate multiple words not present in the input ("hailed", "airborne"), yielding more creative utterances, although these are often worse paraphrases.
While the model often generates strong metaphors, there are also cases where the model predicts a word for the masked metaphoric word that is extremely literal (7 and 8), which yields sentences that are fluent and good paraphrases but lacking in metaphoricity. This deficit is due to the lack of information for the model about the metaphoric class. As our dataset is limited, the model doesn’t have enough signal to fully distinguish what goes into a metaphoric gap. More data (both metaphoric and overtly literal) should help the model generate more surprising and metaphoric outputs.
We can also see from Table 2 the weaknesses of the lexical replacement baseline. As the candidates are generated with diverse syntactic endings, they often exhibit disagreement with their arguments (5, 7, 8). Additionally, it doesn’t always make metaphoric predictions: in 5, the output matches the input verb, yielding an extremely literal paraphrase.
Table 3: Sentences for which both LexRep and MM models performed poorly.
6.3 Consistent Errors
The sentences that confound both of our models tend to be idiomatic (Table 3). These are cases where the "metaphoric" meaning of the sentence isn’t captured explicitly by the verb, but rather spans the entire utterance. For example, in 10, the communication metaphor is present regardless of the verb used: the literal verb "communicate" may be less metaphoric as a verb than the gold "talk", but the metaphor of the sentence persists. This causes difficulties for our systems, which require metaphoricity to be focused on the verb.
The lexical replacement model often makes lexical choices that either don’t match the original meaning (9), or don’t maintain any metaphoricity (13). As WordNet is a finite resource, the number of candidate replacement verbs is often small, and this restricts the system from finding truly novel metaphoric expressions. It also may be the case that finding the "best fit" word from output vectors is actually counterproductive: Mao et al. (2018) use this procedure for finding the best literal paraphrases, and although we alter their approach to identify more metaphoric candidates, the model might still prefer the most literal option.
A possible solution left for future work is to select the "worst fit" from the candidates: the word whose vector is least likely to match the context. This would ensure contrast between domains, but in preliminary studies led to the model invariably picking syntactically incomprehensible or semantically incoherent choices. For future work, we believe better limitations on the candidate selection, enforcing syntactic constraints while allowing a wider variety of domains, will allow us to implement the "worst fit" approach more effectively with the potential to generate much more interesting metaphoric replacements.
The metaphor masking model struggles with short sentences: it often generates words that don’t fit the context, yielding unparseable expressions (see 9-12). The relatively idiomatic nature of these expressions also hinders the model’s performance: as the metaphoricity isn’t focused singularly on the verb, the model is unable to make accurate predictions about the masked token.
A possible solution here is to expand the masking to other parts of speech, or even to phrases. This would allow the model to generate over more complex metaphoric expressions. Additionally, if our seq2seq model can accurately pick up on masked metaphor tasks, this gives us both flexibility and control over metaphor generation: we will be able to choose which parts of utterances we would like to be metaphoric, allowing for much more powerful generation systems.
One consistent problem in this process is the difficulty of keeping annotation categories independent. We find that generated sentences that are incoherent syntactically also tend to be considered bad paraphrases (Spearman correlation of .559, p < .01). This is likely because if a sentence is difficult to syntactically parse, it is more difficult to assess its meaning, making judgments of semantic similarity difficult. Additionally, metaphoricity ratings correlate negatively, to a lesser degree, with paraphrase quality (-.112). Strong metaphoric paraphrases likely add additional meaning or de-emphasize some of the original literal meaning, making their paraphrase quality lower. Interestingly, fluency and metaphor ratings did not significantly correlate, indicating that disfluent sentences were neither more nor less metaphoric than their fluent counterparts.
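For reference, a Spearman correlation between two sets of ratings can be computed as the Pearson correlation of their ranks, with tied values sharing an average rank. The sketch below is a plain-Python illustration on invented ratings; it does not compute p-values, and it is not the analysis code used for the figures reported above.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented mean ratings for five sentences.
fluency = [3.8, 2.0, 3.2, 1.4, 2.6]
paraphrase = [3.0, 1.8, 3.4, 1.2, 2.0]
print(spearman(fluency, paraphrase))
```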
It is important to note the variety of possible generated expressions that are considered good. Different generated metaphors can even maintain some of the original literal meaning, while highlighting different aspects, as good novel metaphors are known for. Consider the generated example "This idea harmonizes up with the other one", intended to paraphrase "This idea matches up with the other one". This captures in many senses the original input of "matches up", but also provides something more: not only do the ideas go together, but perhaps they also improve upon one another. Because of the variety of acceptable outputs, automatic generation of metaphoric paraphrases is exceedingly difficult. For this reason, we present an automatic metric for evaluating metaphoric paraphrases.
We have established a new task for natural language generation: the creation of metaphoric paraphrases for literal sentences. We explore two possible models for accomplishing this task: an adapted lexical replacement baseline model that relies on WordNet to find candidate verbs and the output vectors of word embeddings to match their contexts, and a seq2seq transformer-based model that masks metaphoric verbs to encourage generation of metaphoric outputs. Crowdsourced evaluations show that both models are successful at different aspects of the task: the lexical replacement baseline yields consistent paraphrases that lack metaphoricity, while the metaphor masking model yields extremely metaphoric outputs that often do not accurately paraphrase the input.
Future work in this area is hindered by the lack of available data. In order to improve these methods, we need better datasets. This couples with the problem of evaluation: standard evaluation metrics for language generation are often misleading with regard to metaphors. Better datasets would allow for the development of better metrics for evaluation, and in turn better evaluation metrics may allow us to build better systems for automatically identifying metaphoric paraphrases, allowing us to build better corpora.
Another possible direction to explore is the incorporation of knowledge representations. Our lexical replacement method relies heavily on WordNet, and can make only local changes based on a small number of candidate verbs. Our metaphor masking model is relatively unconstrained, but neither model contains any knowledge of the metaphors in use.
To truly be able to generate metaphors based on actual metaphoric mappings, we need to incorporate some knowledge of the source and target domains involved. This could involve leveraging FrameNet (Baker et al., 1998) or MetaNet (Dodge et al., 2015), developing a novel metaphor knowledge base, or learning domain knowledge in an unsupervised fashion. Developing metaphor knowledge bases that capture relations between domains in a usable way will not only allow for better metaphor generation, but also better reasoning and understanding of texts that make use of more complicated metaphoric expressions. However the ordeal is undertaken, generation of coherent metaphors will inevitably require better representation of the interaction between the domains evoked.
Keiga Abe, Sakamoto Kayo, and Masanori Nakagawa. 2006. A computational model of the metaphor generation process. In Proceedings of the 28th Annual Meeting of the Cognitive Science Society, pages 937–942.
C. F. Baker, C.J. Fillmore, and J.B. Lowe. 1998. The Berkeley FrameNet project. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and the 17th International Conference on Computational Linguistics, pages 86–90, Montreal, Canada. Association for Computational Linguistics.
Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan. Association for Computational Linguistics.
Julia Birke and Anoop Sarkar. 2006. A clustering approach for nearly unsupervised recognition of nonliteral language. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics, pages 329–336, Trento, Italy. Association for Computational Linguistics.
Yuri Bizzoni and Shalom Lappin. 2018. Predicting human metaphor paraphrase judgments with deep neural networks. In Proceedings of the Workshop on Figurative Language Processing, pages 45–55, New Orleans, Louisiana. Association for Computational Linguistics.
Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Niki Parmar, Mike Schuster, Zhifeng Chen, Yonghui Wu, and Macduff Hughes. 2018. The best of both worlds: Combining recent advances in neural machine translation. cs.CL/1804.09849v2.
Erik-Lân Do Dinh, Hannah Wieland, and Iryna Gurevych. 2018. Weeding out conventionalized metaphors: A corpus of novel metaphor annotations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1412–1424, Brussels, Belgium. Association for Computational Linguistics.
Ellen Dodge, Jisup Hong, and Elise Stickles. 2015. Metanet: Deep semantic automatic metaphor analysis. In Proceedings of the Third Workshop on Metaphor in NLP, pages 40–49, Denver, Colorado. Association for Computational Linguistics.
Ondřej Dušek, Jekaterina Novikova, and Verena Rieser. 2020. Evaluating the state-of-the-art of end-to-end natural language generation: The E2E NLG challenge. Computer Speech & Language, 59:123–156.
Gilles Fauconnier and Mark Turner. 1996. Blending as a central process of grammar. In Adele Goldberg, editor, Conceptual Structure, Discourse, and Language. Cambridge University Press.
Andrea Gagliano, Emily Paul, Kyle Booten, and Marti A. Hearst. 2016. Intersecting word vectors to take figurative language to new heights. In Proceedings of the Fifth Workshop on Computational Linguistics for Literature, pages 20–31, San Diego, California, USA. Association for Computational Linguistics.
Lisa Gandy, Nadji Allan, Mark Atallah, Ophir Frieder, Newton Howard, Sergey Kanareykin, Moshe Koppel, Mark Last, Yair Neuman, and Shlomo Argamon. 2013. Automatic identification of conceptual metaphors with limited knowledge. In Proceedings of the 27th AAAI Conference on Artificial Intelligence, pages 328–334, Bellevue, Washington. AAAI Press.
Raquel Hervás, Rui P. Costa, Hugo Costa, Pablo Gervás, and Francisco C. Pereira. 2007. Enrichment of automatically generated texts using metaphor. In Proceedings of the Sixth Mexican International Conference on Artificial Intelligence, pages 944–954, Aguascalientes, Mexico. Springer.
George Lakoff. 1993. The contemporary theory of metaphor. In Andrew Ortony, editor, Metaphor and Thought, pages 202–251. Cambridge University Press.
George Lakoff and Mark Johnson. 1980. Metaphors We Live By. University of Chicago Press, Chicago.
Chee Wee (Ben) Leong, Beata Beigman Klebanov, and Ekaterina Shutova. 2018. A report on the 2018 VUA metaphor detection shared task. In Proceedings of the Workshop on Figurative Language Processing, pages 56–66, New Orleans, Louisiana. Association for Computational Linguistics.
Rui Mao, Chenghua Lin, and Frank Guerin. 2018. Word embedding and WordNet based metaphor identification and interpretation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 1222–1231, Melbourne, Australia. Association for Computational Linguistics.
Hermine H. Marshall. 1990. This issue: Metaphors we learn by. Theory Into Practice, 29(2):70–70.
Zachary J. Mason. 2004. CorMet: A computational, corpus-based conventional metaphor extraction system. Computational Linguistics, 30(1):23–44.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.
Akira Miyazawa and Yusuke Miyao. 2017. Evaluation metrics for automatically generated metaphorical expressions. In The 12th International Conference on Computational Semantics, Montpellier, France. Association for Computational Linguistics.
Saif Mohammad, Ekaterina Shutova, and Peter Turney. 2016. Metaphor as a medium for emotion: An empirical study. In Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics, pages 23–33, Berlin, Germany. Association for Computational Linguistics.
Jonas Mueller, David Gifford, and Tommi Jaakkola. 2017. Sequence to better sequence: Continuous revision of combinatorial structures. In Proceedings of the 34th International Conference on Machine Learning, pages 2536–2544, Sydney, Australia. PMLR.
Ekaterina Ovchinnikova, Vladimir Zaytsev, Suzanne Wertheim, and Ross Israel. 2014. Generating conceptual metaphors from proposition stores. cs.CL/1409.7619.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania. Association for Computational Linguistics.
Ekaterina Shutova. 2010. Automatic Metaphor Interpretation as a Paraphrasing Task. In The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 1029–1037, Los Angeles, California. Association for Computational Linguistics.
Ekaterina Shutova. 2015. Design and Evaluation of Metaphor Processing Systems. Computational Linguistics, 41:579–623.
Ekaterina Shutova, Tim Van de Cruys, and Anna Korhonen. 2012. Unsupervised metaphor paraphrasing using a vector space model. In Proceedings of the 24th International Conference on Computational Linguistics, pages 1121–1130, Mumbai, India. COLING 2012 Organizing Committee.
Mark Steedman and Marc Moens. 1988. Temporal ontology and temporal reference. Computational Linguistics, 2(14):15–28.
G.J. Steen, A.G. Dorst, J.B. Herrmann, A.A. Kaal, T. Krennmayr, and T. Pasma. 2010. A method for linguistic metaphor identification. From MIP to MIPVU. Converging Evidence in Language and Communication Research. John Benjamins.
Kevin Stowe and Martha Palmer. 2018. Leveraging syntactic constructions for metaphor processing. In Workshop on Figurative Language Processing, pages 17–26, New Orleans, Louisiana. Association for Computational Linguistics.
Asuka Terai and Masanori Nakagawa. 2010. A computational system of metaphor generation with evaluation mechanism. In International Conference on Artificial Neural Networks, pages 142–147, Thessaloniki, Greece. Springer.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In 31st Conference on Neural Information Processing Systems, pages 5998–6008, Long Beach, California. Curran Associates, Inc.
Tony Veale. 2016. Round up the usual suspects: Knowledge-based metaphor generation. In Proceedings of the Fourth Workshop on Metaphor in NLP, pages 34–41, San Diego, California. Association for Computational Linguistics.
Tony Veale and Yanfen Hao. 2008. A fluid knowledge representation for understanding and generating creative metaphors. In Proceedings of the 22nd International Conference on Computational Linguistics, pages 945–952, Manchester, UK. COLING 2008 Organizing Committee.
Tony Veale, Ekaterina Shutova, and Beata Beigman Klebanov. 2016. Metaphor: A computational perspective. Synthesis Lectures on Human Language Technologies, 9(1):1–160.
Alan Wallington, Rodrigo Agerri, John Barnden, Mark Lee, and Tim Rumbell. 2011. Affect transfer by metaphor for an intelligent conversational agent. In Affective computing and sentiment analysis. Emotion, metaphor and terminology, volume 45, pages 53–66.
Zhiwei Yu and Xiaojun Wan. 2019. How to avoid sentences spelling boring? Towards a neural approach to unsupervised metaphor generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 861–871, Minneapolis, Minnesota. Association for Computational Linguistics.
Li Zhang. 2008. Metaphorical affect sensing in an intelligent conversational agent. In Proceedings of the Fifth International Conference on Advances in Computer Entertainment Technology, pages 100–106, Yokohama, Japan.