Question answering (QA) has received consistent attention from the natural language processing community. Recently, research on QA systems has reached the stage of generating free-form answers, called GenQA, beyond extracting the answer to a given question from the context (Yin et al., 2016; Song et al., 2017; Bauer et al., 2018; Nishida et al., 2019; Bi et al., 2019, 2020). However, as a bottleneck in developing GenQA models, there are no proper automatic metrics to evaluate generated answers (Chen et al., 2019).
In evaluating a GenQA model, it is essential to consider whether a generated response correctly contains the vital information needed to answer the question. There exist several n-gram similarity metrics, such as BLEU (Papineni et al., 2002) and ROUGE-L (Lin, 2004), that measure the word overlap between the generated response and the reference answer; however, these metrics are insufficient to evaluate a GenQA system (Yang et al., 2018a; Chen et al., 2019).
Figure 1: An example from MS-MARCO (Bajaj et al., 2016) where widely used n-gram similarity metrics do not align with human judgments of correctness. On the other hand, our KPQA-metrics focus on the key information and give low scores to incorrect answers, similar to humans.
For instance, in the example in Figure 1 from MS-MARCO (Bajaj et al., 2016), the generated answer receives a high score on BLEU-1 (0.778) and ROUGE-L (0.713) because many of its words overlap with those in the reference. However, humans assign a low score of 0.063 on a scale from 0 to 1 due to the mismatch of critical information. As in this example, we find that existing metrics often fail to capture the correctness of the generated answer with respect to the key information for the question.
To overcome this shortcoming of the existing metrics, we propose a new metric called KPQA-metric for evaluating GenQA systems. To derive the metric, we first develop Keyphrase Predictor for Question Answering (KPQA). KPQA computes the importance weight of each word in both the generated answer and the reference answer by considering the question. By integrating the output from KPQA, we compute the KPQA-metric in two steps: (1) Given a {question, generated answer, reference answer}, we compute importance weights for each question-answer pair {question, generated answer} and {question, reference answer} using KPQA; (2) We then compute a weighted similarity score by integrating the importance weights into existing metrics. Our approach can be easily integrated into most existing metrics, including n-gram similarity metrics and the recently proposed BERTScore (Zhang et al., 2020).
Additionally, we newly create two datasets for assessing automatic evaluation metrics with regard to the correctness in the GenQA domain. We first generate answers using state-of-the-art GenQA models on MS-MARCO and AVSD (Alamri et al., 2019) where the target answers are natural sentences rather than short phrases. We then collect human judgements of correctness over the 1k generated answers for each dataset.
In experiments on the human-evaluation datasets, we show that our KPQA-metrics have significantly higher correlations with human judgments than the previous metrics. For example, BERTScore-KPQA, one of our KPQA-integrated metrics, obtains a Pearson correlation coefficient of 0.673 on MS-MARCO whereas the original BERTScore obtains 0.463. Further analyses demonstrate that our KPQA-metrics are robust to the question type and domain shift. Overall, our main contributions can be summarized as follows:
• We propose KPQA-metric, an importance-weighting-based evaluation metric for GenQA.
• We collect high-quality human judgments of correctness for the model-generated answers on MS-MARCO and AVSD, two GenQA datasets that aim to generate sentence-level answers. We show that our proposed metric has a dramatically higher correlation with human judgments than the previous metrics for these datasets.
• We verify the robustness of our metric in various aspects such as question type and domain effect.
• We release the human-annotated benchmark dataset and pre-trained models to compute the KPQA-metric to the research community.
We briefly review the current automated text evaluation metrics that have been used to evaluate GenQA systems.
BLEU is a popular evaluation metric for generated text based on n-gram precision. BLEU scores a candidate by counting how many of the candidate's n-grams are also present in the reference. In general, n varies from 1 to 4, and the scores for varying n are aggregated with a geometric mean.
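As a rough illustration of the n-gram precision and geometric-mean aggregation described above, the following Python sketch computes a BLEU-like score for a single candidate-reference pair. It is a simplified sketch rather than a reference implementation: the function names are hypothetical, the clipping of n-gram counts follows the modified precision of Papineni et al. (2002), and the brevity penalty and corpus-level aggregation of the full metric are deliberately omitted.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference.

    Each candidate n-gram is credited at most as many times as it
    occurs in the reference (count clipping).
    """
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    return matched / total


def bleu_like(candidate, reference, max_n=4):
    """Geometric mean of clipped n-gram precisions for n = 1..max_n.

    Simplification (assumption): no brevity penalty, single reference.
    """
    precisions = [clipped_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    return math.exp(sum(math.log(p) for p in precisions) / max_n)
```

Because the score is driven purely by surface overlap, a candidate that copies most of the reference but changes the one decisive word still scores highly, which is the failure mode discussed in the introduction.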
ROUGE is a set of evaluation metrics used for automatic text generation such as summarization and machine translation. Typically, most studies use ROUGE-L, which is an F-measure based on the longest common subsequence between a candidate and the reference.
METEOR (Banerjee and Lavie, 2005) is an F1 score of a set of unigram alignments. METEOR has a unique property that it considers stemmed words, synonyms, and paraphrases, as well as the standard exact word matches.
CIDEr (Vedantam et al., 2015) is a consensus-based evaluation metric that is designed for a high correlation with human judgment in the image captioning problem. CIDEr uses Term Frequency-Inverse Document Frequency (TF-IDF) weights for human-like evaluation.
BERTScore is a recently proposed text evaluation metric that uses pre-trained representations from BERT (Devlin et al., 2019). BERTScore first computes the contextual embeddings for given references and candidates independently with BERT, and then computes pairwise cosine similarity scores. When computing similarity, BERTScore adopts Inverse Document Frequency (IDF) to apply importance weighting.
To build a better metric for GenQA, we first propose KPQA. By considering the question, the KPQA assigns different weights to each token in the answer sentence such that salient tokens receive a high value. We then integrate the KPQA into existing metrics to make them evaluate correctness as well.
3.1 KPQA
Figure 2: Overall flow of KPQA-metric. Importance weights are computed by pre-trained KPQA for each question-answer pair. These weights are then integrated into existing metrics to compute weighted similarity.
Figure 3: Overall architecture and an output example of KPQA. KPQA classifies whether each word in the answer sentences is in the answer span for a given question. We use the output probability KPW as an importance weight to be integrated into KPQA-metric.
For GenQA, we observe that each word has a different level of importance when assessing a generated answer. As shown in Figure 1, there exist keywords or keyphrases that are considered significant when evaluating the correctness of the answer. Additionally, some words, such as function words, are mostly irrelevant to the correctness of the answer. Inspired by this observation, we introduce KPQA, which can predict the importance of each word when evaluating GenQA systems. As shown in Figure 3, KPQA is a BERT-based (Devlin et al., 2019) classifier that predicts salient tokens in the answer sentences depending on the question. We regard it as a multi-class classification task where each token is a single class. To train KPQA, we first prepare extractive QA datasets such as SQuAD (Rajpurkar et al., 2016), which consist of {passage, question, answer-span}. We transform these datasets into pairs of {answer-sentences, question, answer-span}. We extract the answer-sentences that contain the answer-span in the passage since these sentences are short summaries for the given question. Specifically, for a single-hop QA dataset such as SQuAD, we pick a single sentence that includes the answer-span as the answer sentence. For the answers in a multi-hop QA dataset such as HotpotQA (Yang et al., 2018b), there are multiple supporting sentences for the single answer span. For these cases, we use SpanBERT (Joshi et al., 2020) to resolve the coreferences in the paragraphs and extract all of the supporting sentences to compose the answer sentences. The {question, [SEP], answer-sentences} is then fed into KPQA to classify the answer-span, which is a set of salient tokens, in the given answer-sentences considering the question.
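The transformation of an extractive QA sample into a KPQA training example can be sketched as follows. This is a minimal illustration under several assumptions that are not taken from the paper: whitespace tokenization with lowercasing stands in for the BERT tokenizer, the span is located by exact token match, and the function names and dictionary keys are hypothetical.

```python
def label_answer_span(sentence_tokens, span_tokens):
    """Mark with 1 the sentence tokens covered by the first exact
    occurrence of the answer span, and with 0 all other tokens."""
    labels = [0] * len(sentence_tokens)
    k = len(span_tokens)
    if k == 0:
        return labels
    for start in range(len(sentence_tokens) - k + 1):
        if sentence_tokens[start:start + k] == span_tokens:
            for i in range(start, start + k):
                labels[i] = 1
            break
    return labels


def build_example(question, answer_sentence, answer_span):
    """Build one {question, [SEP], answer-sentence} input together with
    per-token salience labels for the answer-sentence part only."""
    q_tokens = question.lower().split()
    s_tokens = answer_sentence.lower().split()
    a_tokens = answer_span.lower().split()
    return {
        "input_tokens": q_tokens + ["[SEP]"] + s_tokens,
        "sentence_tokens": s_tokens,
        "sentence_labels": label_answer_span(s_tokens, a_tokens),
    }


example = build_example(
    "Who wrote Hamlet ?",
    "Hamlet was written by William Shakespeare .",
    "William Shakespeare",
)
```

In this toy example only the tokens "william" and "shakespeare" are labeled as salient, which is the kind of supervision that lets the predictor assign high weights to keyphrases and low weights to function words.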
3.2 KPQA Metric
Since KPQA’s training process allows KPQA to find the essential words in the answer sentences for a given question, we use a pre-trained KPQA to get the importance weights that are useful for evaluating the correctness of generated answers in GenQA. The overall flow of our KPQA-metric is described in Figure 2. We describe how we combine these weights with existing metrics to derive the KPQA-metric.
We first compute the importance weights for a given question $Q = (q_1, ..., q_l)$, reference answer $X = (x_1, ..., x_n)$ and generated answer $\hat{X} = (\hat{x}_1, ..., \hat{x}_m)$ using pre-trained KPQA. We provide each pair {question, generated answer} and {question, reference answer} to pre-trained KPQA and get the output of the softmax layer. We define these parts as KeyPhrase Weight (KPW) as shown in Figure 3. We note that $\mathrm{KPW}^{(Q,\hat{X})}$ is the importance weight of the generated answer $\hat{X}$ for a given question $Q$. These weights reflect the importance of each token for evaluating the correctness.
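We then compute KPQA-metric by incorporating the KPW into several existing metrics, modifying the precision and recall to compute the weighted similarity.
BLEU-1-KPQA: We derive BLEU-1-KPQA, which is a weighted unigram precision ($P^{KPQA}_{Unigram}$), as follows:
$$P^{KPQA}_{Unigram} = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} \mathrm{KPW}_i^{(Q,\hat{X})} \cdot I(i,j)}{\sum_{i=1}^{m} \mathrm{KPW}_i^{(Q,\hat{X})}},$$
where $\mathrm{KPW}_i^{(Q,\hat{X})}$ is the weight of the $i$-th token of the generated answer and $I(i, j)$ is an indicator function assigned the value of 1 if the generated token $\hat{x}_i$ is the same as the reference token $x_j$ and 0 otherwise.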
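The weighted unigram precision can be sketched in a few lines of Python. This is an illustrative sketch, not the released implementation: the function name is hypothetical, the weights are passed in as plain lists (in practice they would come from the KPQA softmax output), and a generated token is counted as matched if it occurs anywhere in the reference, which is an assumption about how repeated reference tokens are handled.

```python
def bleu1_kpqa(generated, reference, kpw_generated):
    """Keyphrase-weighted unigram precision.

    generated      -- list of tokens of the generated answer
    reference      -- list of tokens of the reference answer
    kpw_generated  -- importance weight of each generated token
    """
    total = sum(kpw_generated)
    if total == 0:
        return 0.0
    reference_vocab = set(reference)
    # Sum the weights of generated tokens that also appear in the reference.
    matched = sum(w for tok, w in zip(generated, kpw_generated)
                  if tok in reference_vocab)
    return matched / total


generated = ["it", "is", "in", "paris"]
reference = ["it", "is", "in", "london"]
# Hypothetical weights: the location carries most of the importance.
weights = [0.05, 0.05, 0.1, 0.8]
score = bleu1_kpqa(generated, reference, weights)
```

With uniform weights three of the four tokens match, so the plain unigram precision would be 0.75; with the keyphrase weights above the score is about 0.2, because the only mismatched token is the one that matters for correctness.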
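ROUGE-L-KPQA: We also derive ROUGE-L-KPQA, which is a modified version of ROUGE-L using KPW to compute weighted precision ($P^{KPQA}_{LCS}$), recall ($R^{KPQA}_{LCS}$) and F1 ($F^{KPQA}_{LCS}$), as follows:
$$P^{KPQA}_{LCS} = \frac{LCS^{KPQA}(X, \hat{X})}{\sum_{i=1}^{m} \mathrm{KPW}_i^{(Q,\hat{X})}}, \quad R^{KPQA}_{LCS} = \frac{LCS^{KPQA}(X, \hat{X})}{\sum_{i=1}^{n} \mathrm{KPW}_i^{(Q,X)}}, \quad F^{KPQA}_{LCS} = \frac{(1+\beta^2)\, R^{KPQA}_{LCS}\, P^{KPQA}_{LCS}}{R^{KPQA}_{LCS} + \beta^2 P^{KPQA}_{LCS}},$$
where LCS is the Longest Common Subsequence between a generated answer and a reference answer. The $LCS^{KPQA}(X, \hat{X})$ is defined as follows:
$$LCS^{KPQA}(X, \hat{X}) = \sum_{i=1}^{m} I_i \cdot \mathrm{KPW}_i^{(Q,\hat{X})},$$
where $I_i$ is an indicator function which is 1 if the $i$-th word is in the LCS and 0 otherwise. $\beta$ is defined in (Lin, 2004).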
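A possible way to compute a keyphrase-weighted ROUGE-L is sketched below. The sketch makes explicit assumptions: the helper names are hypothetical, one LCS is recovered by standard dynamic programming with backtracking, the weighted recall uses the reference-side weights of the tokens on that LCS, and `beta` defaults to 1.0 purely for illustration.

```python
def lcs_index_pairs(a, b):
    """Return index pairs (i, j) of one longest common subsequence of a and b."""
    m, n = len(a), len(b)
    # dp[i][j] = LCS length of the suffixes a[i:] and b[j:]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def rouge_l_kpqa(generated, reference, kpw_generated, kpw_reference, beta=1.0):
    """Return (precision, recall, F) of a keyphrase-weighted ROUGE-L."""
    pairs = lcs_index_pairs(generated, reference)
    gen_mass = sum(kpw_generated[i] for i, _ in pairs)
    ref_mass = sum(kpw_reference[j] for _, j in pairs)
    total_gen, total_ref = sum(kpw_generated), sum(kpw_reference)
    precision = gen_mass / total_gen if total_gen > 0 else 0.0
    recall = ref_mass / total_ref if total_ref > 0 else 0.0
    if precision == 0.0 or recall == 0.0:
        return precision, recall, 0.0
    f = (1 + beta ** 2) * recall * precision / (recall + beta ** 2 * precision)
    return precision, recall, f
```

For example, with the generated answer `["the", "answer", "is", "paris"]`, the reference `["the", "answer", "is", "london"]` and weights `[0.1, 0.1, 0.1, 0.7]` on both sides, the LCS covers the first three tokens, so precision, recall and F are all about 0.3 instead of the unweighted 0.75.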
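BERTScore-KPQA: Similar to ROUGE-L-KPQA, we compute BERTScore-KPQA using KPW. We first compute the contextual embeddings $\hat{\mathbf{x}}$ for the generated answer $\hat{X}$ and $\mathbf{x}$ for the reference $X$ using the BERT model. Then, we compute weighted precision ($P^{KPQA}_{BERT}$), recall ($R^{KPQA}_{BERT}$) and F1 ($F1^{KPQA}_{BERT}$) with the contextual embeddings and the KPW of each token as follows:
$$P^{KPQA}_{BERT} = \frac{\sum_{i=1}^{m} \mathrm{KPW}_i^{(Q,\hat{X})} \cdot \max_{\mathbf{x}_j \in \mathbf{x}} \hat{\mathbf{x}}_i^{\top} \mathbf{x}_j}{\sum_{i=1}^{m} \mathrm{KPW}_i^{(Q,\hat{X})}}, \quad R^{KPQA}_{BERT} = \frac{\sum_{i=1}^{n} \mathrm{KPW}_i^{(Q,X)} \cdot \max_{\hat{\mathbf{x}}_j \in \hat{\mathbf{x}}} \mathbf{x}_i^{\top} \hat{\mathbf{x}}_j}{\sum_{i=1}^{n} \mathrm{KPW}_i^{(Q,X)}}, \quad F1^{KPQA}_{BERT} = \frac{2\, P^{KPQA}_{BERT}\, R^{KPQA}_{BERT}}{P^{KPQA}_{BERT} + R^{KPQA}_{BERT}}.$$
The formulas for BLEU-1-KPQA and BERTScore-KPQA, which integrate KPW in the same way as ROUGE-L-KPQA, are also provided in the Appendix.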
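The embedding-based variant can be illustrated with plain Python lists standing in for BERT embeddings. This is a sketch under stated assumptions: the function names are hypothetical, token embeddings are given as small toy vectors rather than produced by BERT, similarity is the cosine between L2-normalized vectors with greedy (maximum) matching as in BERTScore, and the weights are supplied directly.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def _weighted_greedy_match(source, target, weights):
    """Weighted average, over source tokens, of the best cosine
    similarity to any target token."""
    total = sum(weights)
    if total == 0 or not target:
        return 0.0
    score = 0.0
    for vec, w in zip(source, weights):
        best = max(sum(a * b for a, b in zip(vec, t)) for t in target)
        score += w * best
    return score / total


def bertscore_kpqa(gen_emb, ref_emb, kpw_generated, kpw_reference):
    """Return (precision, recall, F1) of a keyphrase-weighted BERTScore."""
    gen = [_normalize(v) for v in gen_emb]
    ref = [_normalize(v) for v in ref_emb]
    precision = _weighted_greedy_match(gen, ref, kpw_generated)
    recall = _weighted_greedy_match(ref, gen, kpw_reference)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1
```

The only change with respect to an IDF-weighted score is the source of the weights: they depend on the question through the keyphrase predictor instead of on corpus-level word frequency.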
4.1 Generating Answers
Table 1: Statistics of the generative question answering dataset.
GenQA Datasets: To evaluate GenQA metrics, it is necessary to measure the correlation between human judgments and automated text evaluation metrics on model-generated answers. Recently, Chen et al. (2019) released human judgments of correctness for two GenQA datasets, NarrativeQA (Kočiský et al., 2018) and SemEval-2018 Task 11 (SemEval) (Ostermann et al., 2018). However, we find that the average lengths of the answer sentences are 4.7 and 2.5 for NarrativeQA and SemEval, respectively, as shown in Table 1. These answers are often short phrases and cannot be representative of GenQA, because the answers could be long and may deliver complex meaning. We argue that evaluating long and abstractive answers is more challenging and more suitable for studying metrics for the general form of GenQA. To fill this gap, we collect human judgments of correctness for model-generated answers on two other GenQA datasets, MS-MARCO and AVSD, which have longer answers than NarrativeQA and SemEval, as shown in Table 1. For MS-MARCO, we use the Natural Language Generation (NLG) subset, which has more abstractive and longer answers than the Q&A subset.
GenQA Models: For each of the two datasets, we first generate answers for questions on the validation sets using two trained GenQA models: UniLM (Dong et al., 2019) and MHPGM (Bauer et al., 2018) for MS-MARCO, and MTN (Le et al., 2019) and AMF (Alamri et al., 2018; Hori et al., 2017) for AVSD. Details on these QA models are in the Appendix. After training, we select 1k samples for each dataset from the validation set. Specifically, we first randomly pick 500 questions in the validation set of each dataset and collect the corresponding generated answers from each model, so that we have two generated answers for each question. Therefore, we collect a total of 1k samples, two different answers for each of 500 questions, for each dataset. During sampling, we also discard a sample if either of the two GenQA models generates exactly the ground-truth answer, since human evaluation is uninformative in that case.
4.2 Collecting Human Judgments of Answer Correctness
We hire workers from the Amazon Mechanical Turk (MTurk) to rate the correctness of the generated answers from the models we trained. We assign ten workers for each sample to get reliable data. We ask the workers to annotate correctness using a 5-point Likert scale (Likert, 1932), where 1 means completely wrong, and 5 means completely correct. We provide the full instruction in Appendix.
Table 2: Inter-annotator agreement measured by Krippendorff’s alpha (α) and the average number of annotators for each dataset.
Filtering Noisy Workers: Some workers did not follow the instructions, producing poor-quality judgments. To solve this problem, we filter noisy ratings using the z-score, as in (Jung and Lease, 2011). We first compute the z-score among the ten responses for each sample. Then, we consider the responses whose z-score is higher than 1 to be noise and remove up to five of them in descending order of z-score. The average number of annotators after filtering is shown in Table 2. We use the average score of the remaining annotators for each sample as a ground-truth evaluation score to assess the quality of the evaluation metric.
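The filtering rule can be sketched as follows. The sketch rests on assumptions that the text does not settle: the z-score is taken in absolute value, the population standard deviation is used, and the function name and default arguments are hypothetical.

```python
import statistics


def filter_ratings(ratings, threshold=1.0, max_removed=5):
    """Drop up to `max_removed` ratings whose absolute z-score exceeds
    `threshold`, removing the most extreme ones first."""
    mean = statistics.mean(ratings)
    std = statistics.pstdev(ratings)
    if std == 0:
        # All workers agree; nothing can be flagged as noise.
        return list(ratings)
    z = [abs(r - mean) / std for r in ratings]
    flagged = [i for i in range(len(ratings)) if z[i] > threshold]
    flagged.sort(key=lambda i: z[i], reverse=True)
    drop = set(flagged[:max_removed])
    return [r for i, r in enumerate(ratings) if i not in drop]


ratings = [5, 5, 5, 5, 5, 5, 5, 5, 5, 1]  # ten workers, one outlier
kept = filter_ratings(ratings)
ground_truth = statistics.mean(kept)
```

In this toy sample the single rating of 1 has a z-score of 3 and is removed, so the ground-truth score becomes the mean of the nine remaining ratings.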
Inter-Annotator Agreement: The final dataset is further validated with Krippendorff’s alpha (Krippendorff, 1970, 2011), a statistical measure of inter-rater agreement for multiple annotators. We observe that Krippendorff’s α is higher than 0.6 for both datasets and models after filtering, as shown in Table 2. These coefficients indicate “substantial” agreement according to one of the general guidelines (Landis and Koch, 1977) for kappa-like measures.
5.1 Implementation Details
We choose three datasets SQuAD v1.1 (Rajpurkar et al., 2016), HotpotQA (Yang et al., 2018b) and MS-MARCO Q&A subset to train KPQA. We combine the training set of the three datasets and use a 9:1 split to construct the training and development set of KPQA. For HotpotQA, we exclude yes/no type questions where the answers are not in the passage.
For model parameters, we choose bert-base-uncased variants for the BERT model and use one fully-connected layer with softmax layer after it. We train 5 epochs and choose the model that shows the minimum evaluation loss. We provide more details in Appendix.
5.2 Results
Table 3: Pearson correlation (r) and Spearman’s correlation (ρ) between various automatic metrics and human judgments of correctness. All of the results are statistically significant (p-value < 0.01).
Evaluation Methods for Metrics: To compare the performance of various existing metrics and our metric, we use the Pearson coefficient and the Spearman coefficient. We compute these correlation coefficients with human judgments of correctness. We test using MS-MARCO and AVSD, for which we collected human judgments, and NarrativeQA and SemEval from (Chen et al., 2019).
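For reference, the two correlation coefficients can be computed as in the sketch below, which uses only the standard library. The function names are illustrative, Spearman's coefficient is obtained as the Pearson coefficient of average ranks (the usual treatment of ties), and inputs with zero variance are not handled.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


human = [1.0, 2.0, 3.0, 4.0]   # toy human correctness scores
metric = [0.1, 0.2, 0.4, 0.8]  # toy metric scores for the same answers
r, rho = pearson(human, metric), spearman(human, metric)
```

In the toy data the metric orders the answers exactly as humans do, so the Spearman coefficient is 1 while the Pearson coefficient is slightly lower because the relationship is not linear.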
Performance Comparison: We present the correlation scores for the baseline metrics and the KPQA-augmented ones for multiple datasets in Table 3. The correlations between human judgment and most of the existing metrics such as BLEU or ROUGE-L are very low, which shows that those widely used metrics are not adequate for GenQA. Moreover, the performance of existing metrics is especially low for MS-MARCO, which has longer and more abstractive answers than the other three datasets.
We observe a significantly higher correlation score for our proposed KPQA-metric compared to existing metrics, especially for MS-MARCO and AVSD, where the answers are full sentences rather than short phrases. For NarrativeQA, where existing metrics also have higher correlations, the gap in performance between KPQA-metric and existing metrics is small. We attribute this to the fact that the answers in NarrativeQA are often a single word or a short phrase that is already a keyphrase.
Table 4: Ablation studies for our proposed metrics on domain effect and using the question context.
Comparison with IDF: The next best metric after our proposed metric is the original BERTScore, which uses contextual embeddings and adopts IDF-based importance weighting. Since IDF depends on the word frequency among the documents, it can assign a lower weight to words that are important for evaluating correctness if they occur frequently in the corpus, as shown in Table 5. On the other hand, our KPQA-integrated metric assigns weights to words in the answer sentence using the context of the question. This approach provides dynamic weights for each word, which leads to a better correlation with human evaluation, as shown in Table 3.
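The limitation of frequency-based weighting can be seen in a small sketch. The smoothed IDF formula, the function name, and the toy corpus below are illustrative assumptions and do not reproduce the exact weighting used by BERTScore.

```python
import math
from collections import Counter


def idf_weights(corpus):
    """Smoothed inverse document frequency for every token in a corpus
    of tokenized sentences: log((M + 1) / (df + 1))."""
    num_docs = len(corpus)
    doc_freq = Counter(tok for sent in corpus for tok in set(sent))
    return {tok: math.log((num_docs + 1) / (df + 1))
            for tok, df in doc_freq.items()}


# Toy reference corpus in which a unit word appears in every answer.
corpus = [
    "the population is 2 million".split(),
    "it costs 5 million dollars".split(),
    "about 3 million people live in paris".split(),
]
idf = idf_weights(corpus)
```

Here "million" occurs in every reference, so its IDF weight is zero regardless of the question, whereas "paris" receives a high weight simply because it is rare. A question-conditioned predictor can instead raise or lower the weight of the same word depending on what is being asked.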
5.3 Ablation Study
Domain Effect: Our KPQA-metric computes importance weights using a supervised model; thus, our proposed method may suffer from a domain shift problem. Although our metric is evaluated on out-of-domain datasets except for MS-MARCO, we further examine the effect of the domain difference by changing the training set of KPQA. Since we train KPQA with the combination of SQuAD, HotpotQA and MS-MARCO Q&A, the original KPQA is in-domain for MS-MARCO. To measure the negative domain effect, we exclude MS-MARCO Q&A from the training set of KPQA and measure the performance of KPQA-metric on MS-MARCO. We denote this variant “-KPQA/MARCO” and report the results in Table 4. The resulting drop in correlation shows the effect of the negative domain shift on our KPQA-metric. However, “-KPQA/MARCO” is still much higher than all of the previous metrics.
Figure 4: Pearson correlation coefficient among question types on the MS-MARCO dataset.
Using the Question Context: Our KPQA uses the question as an additional context to predict the keyphrases in the sentence, as shown in Figure 3. To examine the benefit of utilizing the question information for the keyphrase predictor, we remove the question part from the dataset and train the keyphrase prediction model. With the newly trained model, we compute the importance weights for words in the target sentence and apply them to BLEU-1, ROUGE-L, and BERTScore. We call this metric “-KP” and report the results in Table 4. We observe that the “-KPQA” metric is better than the “-KP” metric for all three variants. These results show that training a keyphrase predictor to find the short answer candidate in the sentence is effective for capturing the key information in the generated answer, but it is more effective when the question information is integrated.
5.4 Analysis
Correlation Among Question Type: Since MS-MARCO provides the question type information (PERSON, NUMERIC, DESCRIPTION, LOCATION, ENTITY) for each {question, answer} pair, we evaluate the various metrics by question type. We split the dataset into these five question types and measure the performance of various metrics with Pearson correlation coefficients. As shown in Figure 4, our KPQA-metric variants outperform their original versions for all of the question types. KPQA-metric is especially effective for the NUMERIC question type, whose answer sentence often has a short keyphrase such as a number. For the ENTITY and PERSON question types, the gap between the KPQA-integrated metric and the original metric is lower for BERTScore. We speculate that this is because the original BERTScore uses IDF-based importance weighting, unlike the other metrics.
Figure 5: An example from MS-MARCO where the answers are composed of multiple sentences.
Multiple Sentence Answers: Most of the answers in MS-MARCO and AVSD consist of a single sentence, but the answers for GenQA can consist of multiple sentences, as in Fan et al. (2019). To verify our KPQA-metric on multiple-sentence answers, we collect an additional 100 human judgments for generated answers in MS-MARCO whose reference answers consist of multiple sentences, like the example in Figure 5, and evaluate the various metrics on this dataset. As shown in Table 6, our KPQA-integrated metrics still show higher correlations than the other metrics. We observe that the gap between KPQA-integrated metrics and existing metrics is smaller than that in Table 3. We speculate this is because many of the multiple-sentence answers are DESCRIPTION-type answers whose keyphrases are sometimes vague, similar to the results in Figure 4.
Error Analysis: Among the 1k samples from MS-MARCO, we pick the 100 error cases with the largest differences in rank between human judgments and BERTScore-KPQA. The importance weights have no ground-truth data; thus, we manually visualize the weights as shown in Table 5 and analyze the error cases.
From the analysis, we observe some obvious reasons for the different judgments between humans and BERTScore-KPQA. We first classify the error cases by question type and observe that 51 cases belong to NUMERIC and 31 cases belong to DESCRIPTION. We further analyze the NUMERIC question type and find that many of the errors are due to higher weights on units such as “million” or “years.” There exist a total of ten error cases of this type, and we believe that there is room for improvement with regard to these errors through post-processing. In the case of the DESCRIPTION question type, 17 out of 31 cases are due to inappropriate importance weights. We speculate that this is because the keyphrases for the answers to DESCRIPTION-type questions are sometimes vague; thus, the entire answer needs to be considered when it is evaluated.
Table 5: An example of the scores given by humans, BERTScore and BERTScore-KPQA for samples from the MS-MARCO dataset. BERTScore uses IDF and BERTScore-KPQA uses KPW as importance weights to compute the score. The heat map shows IDF and KPW, which are normalized between 0 and 1.
Table 6: Correlation coefficients between various automatic metrics and human judgments of correctness for evaluating multiple-sentence answers in MS-MARCO (Bajaj et al., 2016).
Table 7: The percentage of matches between human judgments and various metrics when ranking the outputs of two models.
Rank-Pair: One practical use of a text evaluation metric is ranking the outputs of multiple models. Using the collected human judgments of correctness for the same 500 {question, reference answer} pairs for two models on MS-MARCO and AVSD, we can compare the outputs of the two models through the human-annotated scores. To see how well the ranking ability of the various metrics aligns with that of human judges, we conduct a “win-lose match” experiment, counting the number of times that a metric ranks the outputs of the two models in the same order as the human judges. To prepare test samples, we choose only those for which the gap between the human judgment scores of the two models is greater than 2. Finally, we obtain 93 and 193 samples for MS-MARCO and AVSD, respectively. Considering that the range of scores is 1-5, this approach ensures that the outputs of the two models have a clear quality difference. Table 7 shows the percentage of rank-pair matches for each metric with human judgments of correctness on the two datasets. Our KPQA-metric shows more matches than previous metrics on both datasets; thus, it is more useful for comparing the generated answers from different models.
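The win-lose match procedure can be sketched as below. The function name and the treatment of metric ties (counted as a mismatch) are assumptions made for illustration; the inputs are per-question scores for the two models.

```python
def rank_pair_agreement(human_a, human_b, metric_a, metric_b, min_gap=2.0):
    """Return (match_rate, num_pairs) over questions whose human scores
    for the two models differ by more than `min_gap`."""
    matches = 0
    total = 0
    for h_a, h_b, m_a, m_b in zip(human_a, human_b, metric_a, metric_b):
        if abs(h_a - h_b) <= min_gap:
            continue  # no clear quality difference; skip this question
        total += 1
        # The metric agrees if it prefers the same model as the humans.
        if (h_a - h_b) * (m_a - m_b) > 0:
            matches += 1
    rate = matches / total if total else 0.0
    return rate, total


# Toy scores for four questions (humans on a 1-5 scale, metric in [0, 1]).
human_a = [5.0, 1.0, 3.0, 4.5]
human_b = [1.0, 4.0, 3.5, 2.0]
metric_a = [0.9, 0.7, 0.5, 0.4]
metric_b = [0.2, 0.3, 0.6, 0.4]
rate, num_pairs = rank_pair_agreement(human_a, human_b, metric_a, metric_b)
```

In the toy data three questions pass the gap filter and the metric agrees with humans on one of them, so the match rate is one third.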
One important next step for current QA systems is to generate answers in natural language for a given question and context. Following this interest, several generative (abstractive) QA datasets (Bajaj et al., 2016; He et al., 2018; Kočiský et al., 2018; Fan et al., 2019), where the answer is not necessarily in the passage, have recently been released. Since the task is to generate natural language for the given question, the QA system is often trained with a seq2seq (Sutskever et al., 2014) objective, similarly to other natural language generation tasks such as neural machine translation. Hence, researchers often use n-gram based similarity metrics such as BLEU to evaluate GenQA systems, following other natural language generation tasks.
However, most of these n-gram metrics, including BLEU, were originally developed to evaluate machine translation, and previous works (Liu et al., 2016; Nema and Khapra, 2018; Kryscinski et al., 2019) have shown that these metrics have poor correlations with human judgments in other language generation tasks such as dialogue systems. As with other text generation systems, it is difficult to assess the performance of GenQA through n-gram metrics. In particular, n-gram similarity metrics can give a high score to a generated answer that is incorrect but shares many unnecessary words with the reference answer. Previous works (Marton and Radul, 2006; Yang et al., 2018a; Chen et al., 2019) have pointed out the difficulty of similar problems and studied automated metrics for evaluating QA systems. Inspired by these works, we focus on studying and developing evaluation metrics for GenQA datasets that have more abstractive and diverse answers. We analyze the problem of using existing n-gram similarity metrics across multiple GenQA datasets and propose alternative metrics for GenQA.
In this paper, we create high-quality human judgments on two GenQA datasets, MS-MARCO and AVSD, and show that previous evaluation metrics are poorly correlated with human judgments in terms of the correctness of an answer. We propose KPQA-metric, which uses the pre-trained model that can predict the importance weights of words in answers to a given question to be integrated with existing metrics. Our approach has a dramatically higher correlation with human judgments than existing metrics, showing that our model-based importance weighting is critical to measure the correctness of a generated answer in GenQA.
Our paper and dataset follow ethical standards. We compensate the annotators with competitive pay. Furthermore, we follow all ethical procedures for data collection, where we use public datasets to train the models.
K. Jung is with ASRI, Seoul National University, Korea. This work was supported by AIRS Company in Hyundai Motor Company & Kia Corporation through HKMC-SNU AI Consortium Fund.
Huda Alamri, Vincent Cartillier, Abhishek Das, Jue Wang, Anoop Cherian, Irfan Essa, Dhruv Batra, Tim K. Marks, Chiori Hori, Peter Anderson, Stefan Lee, and Devi Parikh. 2019. Audio visual scene-aware dialog. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pages 7558–7567. Computer Vision Foundation / IEEE.
Huda Alamri, Chiori Hori, Tim K Marks, Dhruv Batra, and Devi Parikh. 2018. Audio visual scene-aware dialog (avsd) track for natural language generation in dstc7. In DSTC7 at AAAI2019 Workshop, volume 2.
Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, et al. 2016. Ms marco: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268.
Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan. Association for Computational Linguistics.
Lisa Bauer, Yicheng Wang, and Mohit Bansal. 2018. Commonsense for generative multi-hop question answering tasks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4220–4230, Brussels, Belgium. Association for Computational Linguistics.
Bin Bi, Chen Wu, Ming Yan, Wei Wang, Jiangnan Xia, and Chenliang Li. 2019. Incorporating external knowledge into machine reading for generative question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2521–2530, Hong Kong, China. Association for Computational Linguistics.
Bin Bi, Chen Wu, Ming Yan, Wei Wang, Jiangnan Xia, and Chenliang Li. 2020. Generating well-formed answers by machine reading with stochastic selector networks. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 7424–7431. AAAI Press.
Anthony Chen, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. Evaluating question answering evaluation. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pages 119–124, Hong Kong, China. Association for Computational Linguistics.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 13042–13054.
Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli. 2019. ELI5: Long form question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3558–3567, Florence, Italy. Association for Computational Linguistics.
Felix A Gers, Jürgen Schmidhuber, and Fred Cummins. 2000. Learning to forget: Continual prediction with lstm. Neural Computation, 12(10):2451–2471.
Wei He, Kai Liu, Jing Liu, Yajuan Lyu, Shiqi Zhao, Xinyan Xiao, Yuan Liu, Yizhong Wang, Hua Wu, Qiaoqiao She, Xuan Liu, Tian Wu, and Haifeng Wang. 2018. DuReader: a Chinese machine reading comprehension dataset from real-world applications. In Proceedings of the Workshop on Machine Reading for Question Answering, pages 37–46, Melbourne, Australia. Association for Computational Linguistics.
Chiori Hori, Takaaki Hori, Teng-Yok Lee, Ziming Zhang, Bret Harsham, John R. Hershey, Tim K. Marks, and Kazuhiko Sumi. 2017. Attention-based multimodal fusion for video description. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 4203–4212. IEEE Computer Society.
Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2020. SpanBERT: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8:64–77.
Hyun Joon Jung and Matthew Lease. 2011. Improving consensus accuracy via z-score and weighted voting. In Workshops at the Twenty-Fifth AAAI Conference on Artificial Intelligence.
Tomáš Kočiský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. 2018. The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317–328.
Klaus Krippendorff. 1970. Estimating the reliability, systematic error and random error of interval data. Educational and Psychological Measurement, 30(1):61–70.
Klaus Krippendorff. 2011. Computing Krippendorff’s alpha-reliability.
Wojciech Kryscinski, Nitish Shirish Keskar, Bryan McCann, Caiming Xiong, and Richard Socher. 2019. Neural text summarization: A critical evaluation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 540–551, Hong Kong, China. Association for Computational Linguistics.
J Richard Landis and Gary G Koch. 1977. The measurement of observer agreement for categorical data. Biometrics, pages 159–174.
Hung Le, Doyen Sahoo, Nancy Chen, and Steven Hoi. 2019. Multimodal transformer networks for end-to-end video-grounded dialogue systems. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5612–5623, Florence, Italy. Association for Computational Linguistics.
Rensis Likert. 1932. A technique for the measurement of attitudes. Archives of psychology.
Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
Chia-Wei Liu, Ryan Lowe, Iulian Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2122–2132, Austin, Texas. Association for Computational Linguistics.
Ilya Loshchilov and Frank Hutter. 2018. Decoupled weight decay regularization. In International Conference on Learning Representations.
Gregory Marton and Alexey Radul. 2006. Nuggeteer: Automatic nugget-based evaluation using descriptions and judgements. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pages 375–382, New York City, USA. Association for Computational Linguistics.
Preksha Nema and Mitesh M. Khapra. 2018. Towards a better metric for evaluating question generation systems. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3950–3959, Brussels, Belgium. Association for Computational Linguistics.
Kyosuke Nishida, Itsumi Saito, Kosuke Nishida, Kazutoshi Shinoda, Atsushi Otsuka, Hisako Asano, and Junji Tomita. 2019. Multi-style generative reading comprehension. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2273–2284, Florence, Italy. Association for Computational Linguistics.
Simon Ostermann, Michael Roth, Ashutosh Modi, Stefan Thater, and Manfred Pinkal. 2018. SemEval-2018 task 11: Machine comprehension using commonsense knowledge. In Proceedings of The 12th International Workshop on Semantic Evaluation, pages 747–757, New Orleans, Louisiana. Association for Computational Linguistics.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.
Linfeng Song, Zhiguo Wang, and Wael Hamza. 2017. A unified query-based generative model for question generation and question answering. arXiv preprint arXiv:1709.01058.
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 3104–3112.
Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 4566–4575. IEEE Computer Society.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace’s transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.
An Yang, Kai Liu, Jing Liu, Yajuan Lyu, and Sujian Li. 2018a. Adaptations of ROUGE and BLEU to better evaluate machine reading comprehension task. In Proceedings of the Workshop on Machine Reading for Question Answering, pages 98–104, Melbourne, Australia. Association for Computational Linguistics.
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. 2018b. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380.
Jun Yin, Xin Jiang, Zhengdong Lu, Lifeng Shang, Hang Li, and Xiaoming Li. 2016. Neural generative question answering. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9-15 July 2016, pages 2972–2978. IJCAI/AAAI Press.
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.
A.1 Datasets
We collect human judgments of correctness for two GenQA datasets, MS-MARCO (Bajaj et al., 2016) and AVSD (Alamri et al., 2019). We describe the properties of each dataset in this section.
MS-MARCO MS-MARCO is a large-scale English machine reading comprehension dataset that provides ten candidate passages for each question. The model should consider the relevance of the passages for the given question and answer the question. One of the main features of this dataset is that it contains free-form answers that are abstractive. MS-MARCO provides two tasks, the Natural Language Generation (NLG) task and the Q&A task. For the NLG task, the model should generate an abstractive summary of the passages for a given question, which is a well-formed answer rather than an answer span in the passage. Although the Q&A task also provides some abstractive answers, most of the answers are short and do not contain the context or rationale of the question. Hence, we use the NLG subset of the MS-MARCO dataset as a GenQA dataset to study the metrics for GenQA. We also use the training set of the Q&A subset to train and evaluate KPQA, since most of the samples in this subset have exact answer spans in the passage, as in SQuAD.
Figure 6: Instruction for MTurk workers
Table 8: Performance of the model we trained to generate answers
Audio Visual Scene-aware Dialog (AVSD) To study more general metrics for GenQA, we also use a multimodal GenQA dataset for our work. Audio Visual Scene-aware Dialog (AVSD) is a multimodal dialogue dataset composed of QA pairs about Charades videos. Although the name of the dataset contains the word dialog, all of the dialog turns are question-answer pairs about a video. The task of this dataset is to generate an answer to a question given a video, its audio, and the history of previous turns in the dialog. In other words, the task is to generate a free-form answer for a given multimodal context, which can be considered GenQA.
A.2 Instructions to Annotators
The full instructions to annotators in MTurk are shown in Figure 6. We hire annotators whose HIT approval rate is higher than 95% and pay $0.03 for each assignment.
A.3 Models
To investigate the performance of automatic metrics, we gather pairs of sentences, {generated answer, reference answer}. Collecting high-quality answer candidates for a given context and question is an essential step; thus, we choose two models for each dataset from the latest research in the literature. We train two models, UniLM (Dong et al., 2019) and MHPGM (Bauer et al., 2018), for the MS-MARCO dataset. For the AVSD dataset, we train two models, MTN (Le et al., 2019) and AMF (Alamri et al., 2018). We present the performance of each model we trained in Table 8. Below, we briefly describe the models and the training details used to generate the answers for the two datasets.
UniLM UniLM, which stands for unified language model pre-training, is a powerful seq2seq model based on pre-trained representations from BERT (Devlin et al., 2019). UniLM is a pre-trained transformer network that can be easily fine-tuned for NLU and NLG. UniLM achieves high performance on various NLG tasks, such as abstractive summarization and question generation. We fine-tune UniLM for GenQA in the same way it is fine-tuned for NLG: the source sequence consists of the question and the paragraphs, and the target sequence is the answer. We add [SEP] tokens between the question and each paragraph. We then fine-tune UniLM for 3 epochs with this setting using the public code².
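The source-side format described above can be sketched as follows. This is a minimal illustration that assumes plain string concatenation; the helper name `build_source` is hypothetical, and the public UniLM code may tokenize, order, and truncate the inputs differently.

```python
from typing import List

SEP = "[SEP]"  # separator token named in the text


def build_source(question: str, paragraphs: List[str]) -> str:
    """Join the question and each paragraph with [SEP] between them.

    Hypothetical helper for illustration only; truncation to the
    model's maximum input length is deliberately left out.
    """
    parts = [question.strip()] + [p.strip() for p in paragraphs]
    return f" {SEP} ".join(parts)


if __name__ == "__main__":
    # Toy inputs, not taken from MS-MARCO.
    print(build_source(
        "what is the capital of france",
        ["Paris is the capital of France.", "France is in Europe."],
    ))
```

The target sequence is simply the reference answer string, so no helper is needed on that side.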
MHPGM MHPGM, which stands for multi-hop pointer generator networks, is a multi-hop reasoning QA model that can integrate commonsense information. This model uses a pointer-generator decoder to generate the answer. We train the model for three epochs with batch size 24 using the public code³.
MTN MTN (Le et al., 2019), which is a multimodal transformer encoder-decoder framework, is a state-of-the-art model for AVSD. MTN employs multimodal attention blocks to fuse multiple modalities such as text, video, and audio. We train the model for 10 epochs with batch size 256 and generate the answers for the test set released in the DSTC7 workshop (Alamri et al., 2018) using the publicly available code⁴.
Table 9: Pearson correlation (r) and Spearman’s correlation (ρ) between various automatic metrics and human judgments of correctness for the MS-MARCO dataset and the AVSD dataset. We generate the answers and collect human judgments for two models on each dataset. All of the results are statistically significant (p-value < 0.01).
AMF AMF is an Attentional Multimodal Fusion based model (Hori et al., 2017) introduced as a baseline system for the DSTC7 AVSD workshop (Alamri et al., 2018). It is composed of an RNN and a multimodal attention architecture. This model encodes the multimodal inputs with an LSTM (Gers et al., 2000) and fuses the information with a modality-dependent attention mechanism. We train this model for 15 epochs with batch size 64 using the public code⁵.
B.1 Correlation by Models
The dataset we collect has human judgments on generated answers from two models for each dataset; thus, we can observe how the performance of each metric depends on the type of GenQA model. The experimental results in Table 9 show that our proposed metric outperforms the other metrics for both GenQA models on each dataset.
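To make the per-model breakdown concrete, the sketch below groups (model, metric score, human score) records by model and computes Pearson's r and Spearman's ρ within each group. The function names and record layout are hypothetical, and the code is only an illustration of the standard definitions, not the evaluation script used for Table 9.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's r; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def correlation_by_model(
    records: Iterable[Tuple[str, float, float]],
) -> Dict[str, Tuple[float, float]]:
    """Map each model name to (Pearson r, Spearman rho) over its own answers."""
    grouped = defaultdict(lambda: ([], []))
    for model, metric_score, human_score in records:
        grouped[model][0].append(metric_score)
        grouped[model][1].append(human_score)
    return {m: (pearson(xs, ys), spearman(xs, ys)) for m, (xs, ys) in grouped.items()}
```

Computing the correlations per group rather than over the pooled answers keeps differences in answer quality between the two models from dominating the coefficient.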
In this section, we describe experimental details that are not mentioned in the previous sections including some items in the reproducibility checklist.
C.1 Reproducibility Checklist
Source Code We provide the source code for both training KPQA and computing the KPQA-metric as supplementary material. We will publicly release the full source code together with the pre-trained model so that the KPQA-metric can be computed easily.
Computing Infrastructure We use Intel(R) Core(TM) i7-6850K CPU (3.60 GHz) with GeForce GTX 1080 Ti for the experiments. The software environments are Python 3.6 and PyTorch 1.3.1.
Average runtime for each approach Each training epoch of KPQA takes 150 minutes on average on a single GPU. Evaluation takes 5 minutes.
Number of Model Parameters The number of parameters in the KPQA model is about 109.4M.
Hyperparameters We use a max sequence length of 256 for the inputs of KPQA. We use the AdamW (Loshchilov and Hutter, 2018) optimizer with a learning rate of 2e-5 and a mini-batch size of 16 for all of the experiments. We use bert-base-uncased with one additional fully-connected layer of 768 units and a tanh activation function, followed by a softmax layer. We train KPQA for 5 epochs and choose the model that shows the minimum evaluation loss on the development set. We repeat training 5 times for each best-performing model.
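The head and the checkpoint selection described above can be sketched in plain Python. This is an illustration under stated assumptions, not the released implementation: the two-class output (keyphrase vs. non-keyphrase), the toy 2-dimensional inputs in the demo, and all function names are hypothetical; in the actual model the input to the head would be a 768-dimensional token representation from bert-base-uncased.

```python
import math
from typing import Dict, List, Union

Vector = List[float]
Matrix = List[List[float]]

# Values stated in the text; the dictionary itself is only a convenience.
HPARAMS: Dict[str, Union[int, float, str]] = {
    "encoder": "bert-base-uncased",
    "max_seq_length": 256,
    "learning_rate": 2e-5,
    "batch_size": 16,
    "hidden_units": 768,
    "epochs": 5,
}


def linear(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Compute weight @ x + bias, with one weight row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def head_forward(token_repr: Vector, w_hidden: Matrix, b_hidden: Vector,
                 w_out: Matrix, b_out: Vector) -> Vector:
    """Fully-connected layer with tanh, then an output layer with softmax."""
    hidden = [math.tanh(h) for h in linear(token_repr, w_hidden, b_hidden)]
    return softmax(linear(hidden, w_out, b_out))


def select_best_epoch(dev_losses: List[float]) -> int:
    """Index of the epoch with the minimum development-set loss."""
    return min(range(len(dev_losses)), key=lambda i: dev_losses[i])


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]  # toy weights, not trained values
    print(head_forward([1.0, 0.0], identity, [0.0, 0.0], identity, [0.0, 0.0]))
    print(select_best_epoch([0.9, 0.5, 0.6, 0.55, 0.7]))
```

In practice the same structure would be expressed with a deep learning framework and trained with AdamW; the pure-Python form is used here only to keep the sketch dependency-free.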
C.2 Significance Test
For all of the correlation coefficients computed in the paper, we report the p-value from a t-test whose null hypothesis is the absence of association, which is the standard way to test a correlation coefficient.
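As an illustration of this test, the standard statistic for a correlation coefficient r over n samples is t = r·sqrt((n − 2) / (1 − r²)), which follows a Student t distribution with n − 2 degrees of freedom under the null hypothesis. The sketch below computes it and approximates the two-sided p-value by numerical integration. The function names are hypothetical, and the Simpson-rule integration is an assumption made to stay within the standard library; it loses precision for extremely small p-values, where a statistics package would be preferable.

```python
import math


def correlation_t_statistic(r: float, n: int) -> float:
    """t statistic for testing that a correlation r over n samples is zero."""
    if n < 3:
        raise ValueError("need at least 3 samples")
    if abs(r) >= 1.0:
        return math.copysign(math.inf, r)
    return r * math.sqrt((n - 2) / (1.0 - r * r))


def student_t_pdf(t: float, df: int) -> float:
    """Density of the Student t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(t * t / df))


def two_sided_p_value(t: float, df: int, steps: int = 2000) -> float:
    """Approximate P(|T| >= |t|) with Simpson's rule on [0, |t|]; steps must be even."""
    t = abs(t)
    if math.isinf(t):
        return 0.0
    h = t / steps
    total = student_t_pdf(0.0, df) + student_t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * student_t_pdf(i * h, df)
    area = total * h / 3  # P(0 <= T <= |t|)
    return max(0.0, 1.0 - 2.0 * area)


if __name__ == "__main__":
    # Toy values, not results from the paper.
    t_stat = correlation_t_statistic(0.5, 102)
    print(t_stat, two_sided_p_value(t_stat, 100))
```

The same statistic can be applied to Spearman's coefficient by treating the rank correlation as r, which is a common approximation for moderately large samples.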
C.3 KPQA Performance
We present the performance of KPQA on keyphrase prediction for evaluation data in Table 10.
Table 10: Performance of our keyphrase predictor on the development set of each dataset.
C.4 BERTScore
For computing BERTScore, we use the bert-large-uncased-whole-word-masking-finetuned-squad variant from Wolf et al. (2019)⁶, which is a BERT model fine-tuned on the QA dataset SQuAD. We observe that computing BERTScore with this BERT model shows slightly higher correlation with human judgments than the BERT model without fine-tuning. We use the first hidden layer after the word embedding layer to compute the embeddings. We experimented with different layers and found that the first hidden layer yielded the best result. We compute all BERTScore results, including the original BERTScore and the BERTScore variants, using this BERT model.
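For readers unfamiliar with how BERTScore turns token embeddings into a score, the sketch below shows the unweighted greedy-matching computation of Zhang et al. (2020) on already-extracted embeddings. The toy vectors and the function names are hypothetical; in the setting described above, the embeddings would instead be taken from the first hidden layer of the fine-tuned BERT model, and the KPQA variants would additionally weight each token.

```python
import math
from typing import Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_bertscore(
    cand: Sequence[Vector], ref: Sequence[Vector]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from greedy cosine matching.

    Precision averages, over candidate tokens, the best similarity to any
    reference token; recall does the same over reference tokens.
    """
    precision = sum(max(cosine(c, r) for r in ref) for c in cand) / len(cand)
    recall = sum(max(cosine(r, c) for c in cand) for r in ref) / len(ref)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Toy 2-dimensional "embeddings", not real BERT outputs.
    candidate = [[1.0, 0.0], [0.0, 1.0]]
    reference = [[1.0, 0.0]]
    print(greedy_bertscore(candidate, reference))
```

Because each token is matched to its single most similar counterpart, the choice of layer used for the embeddings directly changes which tokens are paired, which is consistent with the layer sensitivity reported above.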