Machine reading comprehension aims to assess the ability to comprehend natural language and answer questions from a given document or passage. As a classical method of assessing language proficiency (Fotos, 1991; Jonz, 1991; Tremblay, 2011), the cloze test (Taylor, 1953) has been widely employed due to its simplicity in form. Recently, a number of cloze test datasets have been proposed for different languages. For instance, CNN/Daily Mail (Hermann et al., 2015) provides a benchmark for machine comprehension of English text, while the People Daily and Children’s Fairy Tale dataset (Cui et al., 2016) and CMRC-2017 (Cui et al., 2018) pioneer explorations in Chinese.
In this paper we explore idiom comprehension (Wray, 2002; Jackendoff, 2002; Cacciari and Tabossi, 2014; Jiang et al., 2018) in the cloze test. The idiom, which is called “成语” (chengyu) in Chinese, is an interesting linguistic phenomenon in the Chinese language, and this work parallels several datasets (Hill et al., 2016; Xie et al., 2018) that have considered different language phenomena in English. Compared to other types of words, many idioms are unique in their non-compositionality and metaphorical meaning (see an example in Table 1). This feature requires a good representation of idioms. Meanwhile, the existence of near-synonyms, i.e., words that have similar but not identical meanings (see an example in Table 2), may challenge a machine to choose an accurate idiom in a given context. Because idioms are widely used in daily communication and in various literary genres, assessing the ability to understand and represent idioms is a new challenge for Chinese reading comprehension.
Table 1: An example of metaphor in idiom. The sense of the idiom should be inferred figuratively rather than composed literally from the meanings of its four constituent characters.
Table 2: An example of near-synonyms in idiom, where idioms share similar meanings but are different in language usage.
To this end, we propose ChID, a large-scale Chinese IDiom dataset for cloze test. ChID contains 581K passages and 729K blanks, and covers multiple domains. In ChID, the idioms in a passage are replaced with blank symbols. For each blank, a list of candidate idioms including the golden idiom is provided as choices. As the difficulty level of a cloze test depends on the candidate choices, we investigate several strategies for selecting candidate idioms. We evaluate several state-of-the-art models on the proposed corpus with different representations of idioms. Results show that machines perform much worse than humans, which indicates ample room for further research.
Table 3: Comparison of ChID with other cloze-style reading comprehension datasets. Extractive denotes whether the answer is extracted directly from the given context. Option denotes whether candidate choices are provided. In the Answer Type column, the answers of all datasets except Story Cloze Test are single words. Size denotes the total number of queries or blanks in the dataset.
Our contributions are summarized as follows:
• We propose a new dataset, ChID, for cloze-style reading comprehension in Chinese. ChID contains 581K passages and 729K blanks from three domains (news, novels, and essays).
• We conduct extensive experiments on the design of candidate idioms and on idiom representation methods, and compare state-of-the-art models. Results show that the performance of these models is substantially worse than that of humans.
• ChID provides a benchmark to evaluate the ability of understanding idioms, a unique yet common language phenomenon in Chinese. To our knowledge, this is the first work where this linguistic phenomenon is studied in the form of machine reading comprehension.
Recently, machine reading comprehension has been advanced by many corpora with various task settings. For instance, CNN/Daily Mail (Hermann et al., 2015) collects news articles and uses the cloze test (Taylor, 1953) to assess the ability of reading comprehension in English. RACE (Lai et al., 2017) and CLOTH (Xie et al., 2018) are constructed from questions in examinations designed for secondary and high school students. A number of question-answer datasets (Rajpurkar et al., 2016; Reddy et al., 2018) are also proposed and there are many other large-scale datasets (Nguyen et al., 2016; He et al., 2018). These corpora inspire various neural models (Chen et al., 2016; Cui et al., 2016; Seo et al., 2017; Dhingra et al., 2017; Cui et al., 2017). In Table 3, we present a survey on existing cloze-style reading comprehension datasets.
As the earliest cloze-style dataset for machine reading comprehension, CNN/Daily Mail (Hermann et al., 2015) has a very large scale. It collects news articles paired with a number of bullet points, which summarise key aspects of an article. Because these summary points are abstractive and do not simply copy sentences from a news article, the corpus is constructed by transforming these bullet points into cloze-style questions, i.e., replacing one entity with a placeholder. Children’s Book Test (CBT) (Hill et al., 2016) also provides a benchmark for machine reading comprehension. Its key differences from CNN/Daily Mail are that a list of candidate choices is provided for each query, and that more types of words are removed, including named entities, (common) nouns, verbs and prepositions. Who-did-What (Onishi et al., 2016) collects its corpus from news and, similar to CBT, provides options for questions. Each question is formed from two independent articles: one article is treated as the context to be read and a separate article on the same event is used to form the query. LAMBADA (Paperno et al., 2016) removes the last word from a given passage and evaluates the ability of word prediction. By contrast, the Story Cloze Test dataset (Mostafazadeh et al., 2017) evaluates the ability of story understanding and script learning, where the task requires selecting or generating a reasonable sentence to complete the story context.
To the best of our knowledge, People Daily (PD) and Children’s Fairy Tale (CFT) (Cui et al., 2016) and CMRC-2017 (Cui et al., 2018) are the only two existing cloze-style datasets for Chinese reading comprehension. Similar to CNN/Daily Mail and CBT, PD & CFT and CMRC-2017 replaced a word (usually a noun or named entity) in the document with a blank placeholder and treated the sentence containing this word as a query. PD collects data from news while CFT and CMRC-2017 are from children’s reading materials.
In most datasets, the answer can be directly found from context. CLOTH (Xie et al., 2018) has a similar setting to ChID, where the answer should be selected from given choices. However, CLOTH is collected from English examinations for secondary/high school students, whose size is limited because documents, blanks, and options are all manually created.
Idiom is a common language phenomenon and is usually called “成语” (chengyu) in Chinese. Thanks to its conciseness in form and expressiveness in meaning, idiom is widely used in daily communication and in various text genres. The main challenges for machine reading comprehension with idioms lie in two aspects: idiom representation, which captures the meaning of an idiom, and thorough discrimination among the near-synonyms of an idiom.
3.1 Idiom Representation
Many idioms are non-compositional and have metaphorical meanings (see an example in Table 1), which has also made idiom translation a challenging problem and attracted considerable research attention (Anastasiou, 2010; Salton et al., 2014; Cap et al., 2015; Shao et al., 2017). The meaning of such idioms is generally different from the literal meanings of the constituent characters. Such idioms usually originate from ancient cultural stories, and their meaning has been preserved throughout the long history of language use. For instance, “塞翁失马” has a metaphorical meaning, which is derived from this story:
Near China’s northern borders lived an old man who bred many horses. One day, one of his horses, for no reason at all, escaped into the territory of the northern tribes. Everyone commiserated with him. “Perhaps this will soon turn out to be a blessing,” said the old man. After a few months, his horse came back, and brought back a fine horse from the north.
So the idiom “塞翁失马” usually refers to a blessing in disguise. Thus comprehending and representing an idiom may require access to the corresponding cultural history. In addition, due to the polysemy of single characters, even compositional idioms are likely to be ambiguous, which also makes idiom representation a challenging problem.
3.2 Near-synonyms
It is common for an idiom to have near-synonyms. These idioms may be confused in language use due to their similar but not identical meanings (see an example in Table 2). To discriminate between near-synonyms, a machine is required to figure out their subtle differences in usage, which is also challenging.
To verify the near-synonym phenomenon, we conducted a user study. Based on the idiom vocabulary we collected (see Section 4.1), we manually evaluated the number of near-synonyms per idiom. We randomly sampled 200 idioms. For each idiom, we selected the 20 most similar idioms whose embedding similarity score to the input idiom is less than some threshold. According to the similarity annotation result of Section 4.3 and Table 6, we set this threshold to 0.85. Then we hired four annotators to label these 4,000 idiom pairs in terms of whether a pair is near-synonyms or not. All the annotators have a good command of Chinese.
Figure 1: An example in ChID. Each data item contains a given passage with several blanks that replace the original idioms (in this example, there is only one blank). For each blank, several options are provided. Among the list of candidate choices, there is one golden answer, three similar idioms and another three random ones.
Table 4: Annotation result of near-synonyms. It shows the number of idioms in the 200 sampled idioms that have at least K near-synonyms, for K = 1, 2, 3, 4. Fleiss’ kappa is 0.479, indicating moderate agreement.
The evaluation result is shown in Table 4. Note that for each idiom, we rounded down the mean of the numbers of near-synonyms labeled by the four annotators. We estimate that about 90% of the idioms have at least 1 near-synonym, and about 23% of the idioms have 4 or more near-synonyms. Fleiss’ kappa (Fleiss, 1971) for measuring inter-annotator agreement is 0.479, indicating moderate agreement (within [0.4, 0.6]). This evaluation result strongly supports our claim that near-synonyms are very common among Chinese idioms.
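To make the two statistics used in this study concrete, the following is a minimal sketch of the rounded-down mean of per-annotator counts and of Fleiss' kappa for items rated by a fixed number of raters. The function names and the toy inputs are illustrative assumptions and are not the annotation data of this study.

```python
import math


def consensus_count(per_annotator_counts):
    """Round down the mean number of near-synonyms labeled by the annotators."""
    return math.floor(sum(per_annotator_counts) / len(per_annotator_counts))


def fleiss_kappa(counts):
    """Fleiss' kappa.

    counts[i][j] is the number of raters who assigned item i to category j.
    Every item must be rated by the same number of raters (at least two),
    and the ratings must not all fall into a single category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    # Observed agreement for each item, then averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_items) / n_items
    # Chance agreement from the overall category proportions.
    p_cats = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    p_expected = sum(p * p for p in p_cats)
    return (p_observed - p_expected) / (1 - p_expected)
```

With four annotators and a binary near-synonym label, each row of `counts` would hold two numbers that sum to four.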
Figure 1 presents an example in ChID. In each sample, idioms in a passage are replaced by blank symbols, and each blank is provided with several candidate idioms including the golden idiom. The task is to select the golden answer from the candidate choices given the context. Note that the answer usually does not occur in the context in our setting, which is different from most existing cloze test corpora.
Table 5: Idiom frequency statistics in the whole corpus. The minimum and the maximum are 20 and 534 respectively.
In the following subsections, we will explain the three steps in data collection: (1) Constructing the idiom vocabulary; (2) Extracting passages within a proper length; (3) Designing candidate choices.
4.1 Vocabulary Construction
We collected the idiom vocabulary from Chinese idioms Daquan, which contains over 23K idiom entries. Since the vast majority of idioms consist of 4 characters, we only retained idioms with 4 characters in our vocabulary. In order to facilitate the design of candidate choices, we removed those idioms that do not have a pre-trained embedding in the large-scale open-source embedding corpus provided by Song et al. (2018); approximately 40% of the idioms were filtered out in this step. We normalized synonyms with only slight morphological variation. Idioms that share the same explanation and meaning, but only differ in one character or the order of characters, are treated as the same idiom. This can be done with the Chinese idiom dictionary because some idioms are marked with cross-reference labels meaning “written as”, “like”, “the same as”, or “also see”. Such idioms in the passages are all replaced by their normalized forms.
We then counted the frequency of each idiom in the corpus, and removed those idioms that appear fewer than 20 times. Finally, the idiom vocabulary has 3,848 entries in total, and their frequency statistics on the whole corpus are shown in Table 5. The minimum and the maximum idiom frequencies are 20 and 534 respectively. We simply divide the idiom frequency into five intervals: very low (from 20 to 50), low (from 50 to 100), medium (from 100 to 200), high (from 200 to 400) and very high (higher than 400). The proportions of idioms in the frequency intervals are almost uniformly distributed.
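The frequency intervals above can be sketched as a small bucketing function. The text does not say which interval a boundary value such as 50 belongs to, so the sketch assumes that each boundary belongs to the lower interval (which is consistent with "higher than 400" for the last one); the names `BINS`, `frequency_interval` and `interval_histogram` are hypothetical.

```python
from collections import Counter

# Inclusive upper bound of each interval (assumption: boundaries fall
# into the lower interval, e.g. a frequency of 50 counts as "very low").
BINS = [(50, "very low"), (100, "low"), (200, "medium"), (400, "high")]


def frequency_interval(freq, min_freq=20):
    """Map an idiom frequency to one of the five interval labels."""
    if freq < min_freq:
        raise ValueError("idioms below the minimum frequency are not in the vocabulary")
    for upper, label in BINS:
        if freq <= upper:
            return label
    return "very high"


def interval_histogram(freqs):
    """Count how many idioms fall into each frequency interval."""
    return Counter(frequency_interval(f) for f in freqs)
```

For example, `interval_histogram([20, 75, 150, 534])` counts one idiom each in the very low, low, medium and very high intervals.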
4.2 Passage Extraction
To make the topics and domains more diversified, we collected passages from novels and essays on the Internet, and from the news articles provided by Sun et al. (2016). Since some documents may be very long, we took a paragraph as the basic unit. Each idiom except those in double quotation marks is replaced with a blank symbol. A paragraph that is shorter than 100 characters is merged with the next paragraph to ensure that the context is sufficient for answer selection. Passages that are longer than 600 characters are discarded.
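The extraction rules above can be sketched as follows. This is only an illustration under several assumptions that the text does not specify: a trailing paragraph that stays below the minimum length is dropped, an idiom counts as quoted only when it is immediately enclosed by “ and ”, and the blank symbol `#idiom#` as well as the function names are placeholders.

```python
import re


def merge_paragraphs(paragraphs, min_len=100, max_len=600):
    """Merge short paragraphs with the following ones; drop over-long passages."""
    passages, buffer = [], ""
    for paragraph in paragraphs:
        buffer += paragraph
        if len(buffer) < min_len:
            continue  # still too short: keep merging with the next paragraph
        if len(buffer) <= max_len:
            passages.append(buffer)
        buffer = ""  # passages longer than max_len are discarded
    return passages


def blank_idioms(passage, vocabulary, blank="#idiom#"):
    """Replace unquoted vocabulary idioms with a blank symbol.

    Returns the blanked passage and the removed idioms in order of occurrence.
    """
    if not vocabulary:
        return passage, []
    alternatives = "|".join(re.escape(idiom) for idiom in sorted(vocabulary))
    pattern = re.compile("(?<!“)(" + alternatives + ")(?!”)")
    answers = []

    def replace(match):
        answers.append(match.group(1))
        return blank

    return pattern.sub(replace, passage), answers
```

The order in which blanking and merging are applied is not stated in the text; the two functions are therefore kept independent.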
It is worth noting that if some idiom has a much higher word frequency than others, models may tend to bias answer selection to those more frequent idioms. In order to make frequent and infrequent idioms more balanced, we removed some passages which only contain high frequency idioms.
Table 6: Annotation result of embedding similarity. The three labels are: SYN (synonym), NEAR (near-synonym), OTHER. κ is the Fleiss’ kappa value.
4.3 Candidate Choice Selection
The semantic relevance between two idioms can be measured by the cosine similarity of their embeddings (Mikolov et al., 2013), which helps us to design candidate choices. However, idioms that are similar in embedding may or may not be synonyms or near-synonyms. We thus manually evaluated the correlation between embedding similarity and idiom synonymity. We split the embedding similarity range from 0.9 to 0.5 into 8 intervals. Within each interval, 200 pairs of idioms were sampled. We used three labels to measure the relevance between two idioms: SYN (synonym: the two idioms are identical in meaning and can be used interchangeably), NEAR (near-synonym: the two idioms have close or similar meanings but cannot be used interchangeably), and OTHER (irrelevant or opposite in meaning). We hired five annotators to label these samples.
As shown in Table 6, when the similarity score is larger than 0.75, a large proportion of idiom pairs have the same meaning; when the score is between 0.65 and 0.80, there is a large probability that the two idioms are near-synonyms. For those pairs with high (larger than 0.85) or low (smaller than 0.60) similarity, annotators tend to reach substantial agreement according to Fleiss’ kappa, while agreement is only moderate within the similarity interval [0.65, 0.85].
The above annotation results inspire us to design proper candidate choices for each blank in a passage. First of all, we excluded those idioms that have a similarity score higher than 0.7 to the golden answer. This avoids including synonyms of the golden answer in the candidate choices. Then, we picked the top 10 similar idioms among the remaining idioms, and randomly chose three of them as candidate choices. Note that these three idioms have a large probability of being near-synonyms of the golden answer, which affects the difficulty level of the cloze test to some degree. We further randomly sampled another three idioms from the remaining idioms that do not include the top 10 similar idioms. In this manner, the list of candidate choices consists of three parts: the correct answer, three similar idioms, and three other randomly sampled ones, as shown in Figure 1.
Table 7: ChID dataset statistics. The out-of-domain data have longer passages (127 vs. 99) and more blanks per passage (1.49 vs. 1.25) than the in-domain data.
Table 8: Comparison on idiom frequency distribution between the in-domain and out-of-domain data.
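The candidate selection procedure of Section 4.3 can be sketched as below. The sketch assumes that idiom embeddings are available as a plain dictionary of vectors and that the final option list is shuffled; the function names, the dictionary layout and the shuffling step are illustrative assumptions rather than details given in the text.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def select_candidates(answer, embeddings, rng, sim_threshold=0.7,
                      top_k=10, n_similar=3, n_random=3):
    """Build the option list: golden answer, similar idioms, random idioms."""
    sims = {
        idiom: cosine(embeddings[answer], vec)
        for idiom, vec in embeddings.items()
        if idiom != answer
    }
    # Step 1: drop idioms that are too similar (likely synonyms of the answer).
    remaining = [idiom for idiom, s in sims.items() if s <= sim_threshold]
    # Step 2: sample similar options from the top-k most similar remaining idioms.
    ranked = sorted(remaining, key=lambda idiom: sims[idiom], reverse=True)
    top, rest = ranked[:top_k], ranked[top_k:]
    similar = rng.sample(top, n_similar)
    # Step 3: sample random options from everything outside the top-k.
    randoms = rng.sample(rest, n_random)
    options = [answer] + similar + randoms
    rng.shuffle(options)
    return options
```

Passing an explicit `random.Random(seed)` as `rng` keeps the sampling reproducible.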
4.4 Corpus Statistics
The detailed statistics of ChID are shown in Table 7. News and novels are treated as in-domain data, which are divided into the training set Train, the development set Dev, and the test set Test. Essays are reserved for the out-of-domain test set Out to assess the generalization ability of cloze test models. The in-domain data cover 3,848 Chinese idioms, while Dev/Test/Out respectively cover 3,458/3,502/3,626 idioms.
There are some differences between the in-domain and out-of-domain data. Firstly, the average length of passages in the in-domain data is nearly 100 words, while the out-of-domain data have longer passages (127 words). The average number of blanks per passage is also different (1.25 vs. 1.49). Secondly, the idiom distributions are different. As shown in Table 8, compared to the in-domain data, low-frequency idioms account for a higher proportion of all the idiom occurrences in the out-of-domain data (8.2% vs. 3.5% for the very low frequency interval and 12.0% vs. 7.2% for the low frequency interval), while the high-frequency idioms occur less frequently (31.4% vs. 44.5%). These differences make the out-of-domain test set more challenging.
5.1 Models
In order to evaluate how well state-of-the-art models can comprehend Chinese text containing idioms, we tested the following models:
Language Model (LM): We trained a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) to obtain the hidden state at the blank, and used this hidden state to score candidate choices. The hidden state at the blank is the concatenation of the forward state computed over the words before the blank and the backward state computed over the words after the blank, up to the passage length |p|. Each candidate idiom is represented by its embedding, and the option that has the highest score is chosen as the answer.
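The scoring step of the language model baseline can be illustrated with a minimal sketch. It assumes that the score of a candidate is the dot product between the blank's hidden state and the candidate embedding, normalized with a softmax over the options; this scoring function, the vector layout and the function names are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score_candidates(h_blank, candidate_embeddings):
    """Probability of each candidate given the hidden state at the blank."""
    logits = [
        sum(h * c for h, c in zip(h_blank, embedding))
        for embedding in candidate_embeddings
    ]
    return softmax(logits)


def predict(h_blank, candidate_embeddings):
    """Index of the candidate with the highest probability."""
    probs = score_candidates(h_blank, candidate_embeddings)
    return max(range(len(probs)), key=probs.__getitem__)
```

A dot product requires the hidden state and the idiom embeddings to have the same dimension, which holds when two 100-unit LSTM directions are concatenated and the embeddings are 200-dimensional.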
Attentive Reader (AR) (Hermann et al., 2015): The bidirectional LSTM model is augmented with the attention mechanism (Bahdanau et al., 2015). The hidden state at the blank is used as the query to attentively read the context through a tanh layer with learned parameters, which yields the attention vector r. Then, the attention vector r and the blank vector are used to score each candidate choice, again with learned parameters.
Stanford Attentive Reader (SAR) (Chen et al., 2016): Compared to AR, SAR applies a bilinear matrix to compute attention weights instead of using a tanh layer. The weighted contextual vector o is used for scoring candidates.
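The difference between the two attention mechanisms can be sketched as follows. The sketch follows the general formulations of tanh-based attention and bilinear attention; the parameter names (`W_h`, `W_b`, `w`, `W`) and their shapes are assumptions for illustration and may differ from the exact parameterization used in the experiments.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(matrix, vector):
    return [dot(row, vector) for row in matrix]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


def tanh_attention(context_states, h_blank, W_h, W_b, w):
    """Attention weights proportional to exp(w . tanh(W_h h_t + W_b h_blank))."""
    proj_blank = matvec(W_b, h_blank)
    logits = [
        dot(w, [math.tanh(a + b) for a, b in zip(matvec(W_h, h_t), proj_blank)])
        for h_t in context_states
    ]
    weights = softmax(logits)
    return weighted_sum(weights, context_states), weights


def bilinear_attention(context_states, h_blank, W):
    """Attention weights proportional to exp(h_blank^T W h_t)."""
    logits = [dot(h_blank, matvec(W, h_t)) for h_t in context_states]
    weights = softmax(logits)
    return weighted_sum(weights, context_states), weights
```

Both functions return the attended context vector together with the attention weights, so the vector can then be fed to a candidate scorer.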
5.2 Implementation Details
All the models were implemented with TensorFlow (Abadi et al., 2016). We employed the Jieba Chinese word segmenter to tokenize passages. We set the vocabulary size to 100K and used 200-dimensional word embeddings initialized from Song et al. (2018). Word embeddings that were not matched in Song et al. (2018) were initialized from a uniform distribution over (-0.1, 0.1). We applied a dropout rate of 0.5 on word embeddings. The number of hidden units of the RNN cells was set to 100. The cross-entropy cost function was used to compute the training loss. ADAM (Kingma and Ba, 2015) was used to optimize all the models with the initial learning rate set to 0.001, and the gradient was clipped when its norm was larger than 5. We set the batch size to 32. Training was stopped when the accuracy on Dev did not improve within an epoch.
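As an illustration of the clipping rule, the sketch below rescales a gradient whose norm exceeds the threshold of 5. Whether clipping was applied to the global norm over all parameters or to each tensor separately is not stated, so the global-norm variant over a flat list of values is an assumption, as is the function name.

```python
import math


def clip_by_global_norm(gradients, max_norm=5.0):
    """Rescale a flat list of gradient values so that its L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= max_norm:
        return list(gradients), norm
    scale = max_norm / norm
    return [g * scale for g in gradients], norm
```

A gradient with norm 10 is halved, while a gradient whose norm is already 5 or less is returned unchanged.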
Table 9: Performance of human and models. κ indicates Fleiss’ kappa. The overall best results are shown in bold, and AR performs significantly better than LM and SAR (sign test, p-value < 0.05).
5.3 Option Settings
To evaluate how the method of candidate choice design impacts performance, we prepared two additional test sets, Ran and Sim, both of which have the same passages as Test but differently designed candidate choices. In Ran, all the candidate choices are sampled from the idioms that are not similar to the golden answer. In Sim, by contrast, all the candidates are sampled from the top 10 similar idioms. Therefore, Sim is more challenging than Ran as the former has more distracting options. Note that each blank has seven choices including the golden answer.
5.4 Results
To explore the ceiling of model performance, we also conducted Human Evaluation. We sampled 200 passages respectively from the aforementioned test sets: Test, Ran, Sim and Out. We then hired three annotators to complete the 800 cloze tests. These three annotators are first-year or second-year university students and all have very good command of Chinese language. The average accuracy of the annotators and the corresponding Fleiss’ kappa are reported as the final performance.
The experiment results are shown in Table 9. We analyzed the results from the following perspectives:
Option Setting: The setting of similar options is much harder than that of random options. Firstly, we noted that both humans and models achieve worse performance on Test than on Ran, while the accuracy on Sim is even lower than on Test, which indicates that including more similar candidate idioms makes the task more difficult. Secondly, the inter-annotator agreement on Ran is higher than that on the other test sets, which include similar options. This implies that similar options also make manual annotation harder.
Table 10: Performance comparison using different idiom representations.
Human vs. Models: Firstly, human performance is substantially better than model performance on all the test sets. The smallest gap between human and machine is 14.7 (on Test) and the largest gap is 23.3 (on Out). Secondly, humans perform almost equally well on Test and Out (87.1 vs. 86.2), whereas the models perform much better on Test than on Out (72.4 vs. 62.9). This observation implies that humans have a strong ability to generalize to out-of-domain data, while the models cannot generalize well to Out, which contains more low-frequency idioms.
Model Comparison: AR outperforms all other models significantly. A likely reason is that AR first uses the blank representation to make an attentive read of the context (see Eq. 4 and 5), and then uses the blank vector again, together with the attentive vector r, to score a candidate choice. In this manner, the context is attentively used and the blank vector is used twice.
5.5 Comparison on Idiom Representation
In the previous experiments, an idiom was treated as a token, and its representation was obtained through pretraining on a large corpus (Song et al., 2018). In this section, we explore two other methods for idiom representation, and evaluate the performance with different idiom representations. One method simply uses the average embedding of the 4 constituent characters as the representation of an idiom. This method mimics understanding an idiom purely based on its literal meaning. The other is to apply an MLP (Multi-Layer Perceptron, Bishop et al., 1995; Fine, 1999) which is fed with the concatenation of the 4 character embeddings, and whose output vector is used to represent an idiom. This method also applies a composition assumption: the representation of an idiom is a composite function of its constituent characters. Note that the input to the MLP is an 800-dimensional vector, and the MLP has a hidden layer of 400 units and uses tanh as the activation function. The final output of the MLP is a 200-dimensional vector.
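The two compositional representations can be sketched as follows. The sketch assumes a linear output layer for the MLP, since only the hidden activation (tanh) is specified; the weight names `W1`, `b1`, `W2`, `b2` and the function names are hypothetical.

```python
import math


def average_embedding(char_vectors):
    """Average the embeddings of the constituent characters."""
    dim = len(char_vectors[0])
    count = len(char_vectors)
    return [sum(vec[d] for vec in char_vectors) / count for d in range(dim)]


def mlp_embedding(char_vectors, W1, b1, W2, b2):
    """Concatenate the character embeddings and pass them through a one-hidden-layer MLP.

    With four 200-dimensional character embeddings, W1 would be 400 x 800
    and W2 would be 200 x 400, matching the sizes given in the text.
    """
    x = [value for vec in char_vectors for value in vec]  # concatenation
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + bias)
        for row, bias in zip(W1, b1)
    ]
    # Output layer is kept linear here (an assumption).
    return [
        sum(w * h for w, h in zip(row, hidden)) + bias
        for row, bias in zip(W2, b2)
    ]
```

Unlike the average, the MLP sees the characters in order, so two idioms made of the same characters in a different order can receive different representations.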
Table 10 shows the performance comparison using the three methods for idiom representation. We observe remarkable drops from idiom embedding to character embeddings with an MLP and to average character embedding for all the models, where all the differences are significant (sign test, p-value < 0.01). The results indicate that the two compositional methods are worse than treating an idiom as an independent semantic unit. This study also implies that idiom representation is a key factor for the success of Chinese reading comprehension with idioms. In other words, a good cloze test model should have not only a proper model structure, but also a good method to represent idioms.
In this paper, we propose a large-scale Chinese cloze dataset (ChID) which contains 581K passages and 729K queries from news, novels, and essays, covering 3,848 Chinese idioms. The corpus provides a benchmark to evaluate the ability of Chinese cloze test with idioms. Firstly, we analyze how embedding similarity correlates with synonymity and near-synonymity of Chinese idioms, and find that the difficulty level of the Chinese cloze test with idioms is strongly affected by the method of selecting candidate choices. Secondly, we find that idiom representation is a key factor in the success of reading comprehension models on this task, due to the common non-compositionality and metaphorical meaning of Chinese idioms. Thirdly, we evaluate three state-of-the-art cloze test models on this corpus, and observe that existing model performance is still much worse than human performance. All these findings indicate that the corpus may be a proper benchmark for Chinese cloze test and is worth further research7.
This work was jointly supported by the National Science Foundation of China (Grant No.61876096), and the National Key R&D Program of China (Grant No. 2018YFC0830200).
Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283.
Dimitra Anastasiou. 2010. Idiom treatment experiments in machine translation. Cambridge Scholars Publishing.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations.
Christopher M Bishop et al. 1995. Neural networks for pattern recognition. Oxford university press.
Cristina Cacciari and Patrizia Tabossi. 2014. Idioms: Processing, structure, and interpretation. Psychology Press.
Fabienne Cap, Manju Nirmal, Marion Weller, and Sabine Schulte im Walde. 2015. How to account for idiomatic German support verb constructions in statistical machine translation. In Proceedings of the 11th Workshop on Multiword Expressions, pages 19–28.
Danqi Chen, Jason Bolton, and Christopher D Manning. 2016. A thorough examination of the CNN/Daily Mail reading comprehension task. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 2358–2367.
Yiming Cui, Zhipeng Chen, Si Wei, Shijin Wang, Ting Liu, and Guoping Hu. 2017. Attention-over-attention neural networks for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 593–602.
7 Our dataset is available at https://github.com/chujiezheng/ChID-Dataset.
Yiming Cui, Ting Liu, Zhipeng Chen, Wentao Ma, Shijin Wang, and Guoping Hu. 2018. Dataset for the first evaluation on Chinese machine reading comprehension. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018).
Yiming Cui, Ting Liu, Zhipeng Chen, Shijin Wang, and Guoping Hu. 2016. Consensus attention-based neural networks for Chinese reading comprehension. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1777–1786.
Bhuwan Dhingra, Hanxiao Liu, Zhilin Yang, William Cohen, and Ruslan Salakhutdinov. 2017. Gated-attention readers for text comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1832–1846.
Terrence L Fine. 1999. Feedforward Neural Network Methodology. Springer Science & Business Media.
Joseph L Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378.
Sandra S Fotos. 1991. The cloze test as an integrative measure of EFL proficiency: A substitute for essays on college entrance examinations? Language Learning, 41(3):313–336.
Wei He, Kai Liu, Jing Liu, Yajuan Lyu, Shiqi Zhao, Xinyan Xiao, Yuan Liu, Yizhong Wang, Hua Wu, Qiaoqiao She, et al. 2018. DuReader: a Chinese machine reading comprehension dataset from real-world applications. In Proceedings of the Workshop on Machine Reading for Question Answering, pages 37–46.
Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1693–1701.
Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. 2016. The goldilocks principle: Reading children’s books with explicit memory representations. In International Conference on Learning Representations.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
Ray Jackendoff. 2002. Foundations of language: Brain, meaning, grammar, evolution. Oxford University Press, USA.
Zhiying Jiang, Boliang Zhang, Lifu Huang, and Heng Ji. 2018. Chengyu cloze test. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 154–158.
Jon Jonz. 1991. Cloze item types and second language comprehension. Language testing, 8(1):1–22.
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations.
Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale reading comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In International Conference on Learning Representations Workshop.
Nasrin Mostafazadeh, Michael Roth, Annie Louis, Nathanael Chambers, and James Allen. 2017. LSDSem 2017 shared task: The story cloze test. In Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics, pages 46–51.
Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268.
Takeshi Onishi, Hai Wang, Mohit Bansal, Kevin Gimpel, and David McAllester. 2016. Who did what: A large-scale person-centered cloze dataset. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2230–2235.
Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernandez. 2016. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1525–1534.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392.
Siva Reddy, Danqi Chen, and Christopher D Manning. 2018. CoQA: A conversational question answering challenge. arXiv preprint arXiv:1808.07042.
Giancarlo D. Salton, Robert J. Ross, and John D. Kelleher. 2014. Evaluation of a substitution method for idiom transformation in statistical machine translation. In MWE@EACL.
Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. In International Conference on Learning Representations.
Yutong Shao, Rico Sennrich, Bonnie L. Webber, and Federico Fancellu. 2017. Evaluating machine translation performance on Chinese idioms with a blacklist method. In Language Resources and Evaluation Conference.
Yan Song, Shuming Shi, Jing Li, and Haisong Zhang. 2018. Directional skip-gram: Explicitly distinguishing left and right context for word embeddings. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), volume 2, pages 175–180.
M Sun, J Li, Z Guo, Z Yu, Y Zheng, X Si, and Z Liu. 2016. THUCTC: an efficient Chinese text classifier. GitHub Repository.
Wilson L Taylor. 1953. “cloze procedure”: A new tool for measuring readability. Journalism Bulletin, 30(4):415–433.
Annie Tremblay. 2011. Proficiency assessment standards in second language acquisition research:“clozing” the gap. Studies in Second Language Acquisition, 33(3):339–372.
Alison Wray. 2002. Formulaic language and the lexicon.
Qizhe Xie, Guokun Lai, Zihang Dai, and Eduard Hovy. 2018. Large-scale cloze test dataset created by teachers. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2344–2356.