The ubiquity of English as an online lingua franca offers a rich opportunity for computational research on second language acquisition and on tools for aiding non-native speakers. Most computational research in second language (L2) has focused on spelling and grammar errors, and has been conducted on learners with beginner-to-intermediate proficiency level (henceforth, “learners”) (e.g. Ji et al., 2017; Sakaguchi et al., 2017; Rozovskaya et al., 2017; Lo et al., 2018). Little empirical work has looked at semantic errors, with existing research mostly focusing on collocations (e.g., Dahlmeier and Ng, 2011; Vecchi et al., 2011; Kochmar and Briscoe, 2013). Also, highly proficient, advanced L2 speakers (henceforth, “advanced L2s”) have received little attention (though see Daudaravicius et al., 2016). In contrast to learners, these speakers rarely violate grammatical norms of the L2, but rather deviate from native usage in much more nuanced ways, often exhibiting mild infelicities rather than outright errors.
We aim to explore an elusive aspect of mastering the subtle contours of a word’s meaning that are shaped by its context. Specifically, we investigate patterns of acquisition of English indefinite pronouns by L2 speakers. Indefinite pronouns (IPs) are linguistic devices that refer to an entity (such as a person or thing) that has not yet been introduced in discourse. In English, examples are words like someone, anything, and nobody. Consider the following sentences, taken verbatim from corpora of L2 speakers (original pronoun is boldfaced; less felicitous usages marked with ‘?’).
1. Do you know someone/anyone who was discriminated based on gender?
2. It was a little amazing, because they didn’t stole ?something/anything.
3. ??Anyone/Someone told me the company has millions in debts and isn’t able to pay it.
Clearly, mastery of IPs in English relies on recognizing subtle factors that determine their appropriate usage in various contexts.
Here, in Section 2, we develop a linguistic analysis with detailed hypotheses on precisely how the tangled relations between some- and any- pronouns, exemplified above, pose a challenge for L2 learners. In Sections 3 and 4, we perform a large-scale investigation of these linguistic predictions using productions of both learners and advanced L2s, and find that the predicted infelicities occur not only in the language of the former but also the latter, albeit (as expected) to a lesser extent.
A practical goal of this work is to gain predictive power regarding the nuanced semantic difficulties that L2 speakers face. As a first step in that direction, in Section 5 we consider the ability of deep learning language models (LMs) – shown to be adept at capturing grammatical phenomena (Ji et al., 2017; Sakaguchi et al., 2017; Marvin and Linzen, 2018; Goldberg, 2019) – to identify the subtle infelicities that stem from the semantic confusion introduced by some- and any- IPs. We show that while state-of-the-art models obtain encouraging initial results on this task, they leave room for future improvement (possibly informed by our linguistic findings) in mastering the semantic nuances of the system of English IPs.
Table 1: Usage classes of IPs, an indication of those subsumed by some- and any-, and examples from our corpora.
The contribution of this work is thus three-fold. First, to our knowledge, we develop the first large-scale empirical investigation of second-language acquisition of indefinite pronouns, constituting a case study of taking a computational approach in linguistic analysis to yield novel insights into challenges in L2 acquisition. Second, we suggest and evaluate an automatic approach to detect infelicities stemming from these challenges in a large collection of L2 productions. Finally, in both cases, we extend our experiments to utterances of highly proficient L2 speakers – a population that has heretofore received little attention in the context of automatic error/infelicity detection.
Previous work has suggested that the English system of IPs is crosslinguistically atypical, with precise analogues to some- and any- unusual across languages (Haspelmath, 1997; Beekhuizen et al., 2017). Building on a suggestion from Beekhuizen et al. (2017), we analyze the factors that could lead to difficulty in learning these IPs, and develop detailed hypotheses concerning the challenges that L2 speakers are predicted to face.
Our analysis is based on patterns of colexification (François, 2008): that is, how usages expressing different semantics are grouped (or not) in various combinations under a single word. As the basis for our analysis, we first need to specify the allowable semantic and syntactic usages of IPs. These usage classes are adapted from Haspelmath (1997), who outlines a universal set of IP semantic functions across all languages. Our usage classes are shown in Table 1, with an indication of the classes that some- and any- can express.
Table 1 illustrates a striking fact about colexification of the usage classes in English: some- and any- each cover a very broad range of classes, with a high degree of overlap. This level of overlap in languages appears to be very rare: in the 40 languages studied by Haspelmath (1997), we find that only some 10% of languages have IPs that overlap over such a broad area of the semantic space.
Within any of these classes, some semantic/syntactic contexts call for just one of some- or any-, while others allow both, but with differing meanings (and frequencies/preferences). For example, these similar contexts allow both, but the preferred pronoun differs:
1. ...people care a lot if something is a repost...
2. ...before you know if anything is wrong...
We thus predict a difficulty for English L2 speakers in having to choose between two (not interchangeable) terms that can be used in highly similar semantic/syntactic environments.
In addition to looking at difficulties posed by the colexification of IPs within English, we can consider crosslinguistic patterns of colexification for further insight. Semantic typologists have proposed (and empirically supported, across many domains) that the more two underlying concepts are colexified across languages, the more similar those two concepts are (e.g., Anderson, 1982). In this way, crosslinguistic patterns of colexification can be used to deduce pairwise similarity among concepts, yielding a universal semantic similarity space for a domain (e.g., Berlin and Kay, 1969; Levinson et al., 2003).
Figure 1: Layout of usage classes in crosslinguistic semantic space; light blue illustrates the scope of English any-, pink illustrates the natural grouping of QU/CD with SP/NS.
Here, we derive such a similarity space over the IP usage classes of Table 1, using the colexification data across 40 languages, from Haspelmath (1997). We form a distance matrix (found in supplemental materials, A.1) by recording, for every pair of usage classes, the number of languages that have a term subsuming both those classes (indicating their relative similarity). We then use Multidimensional Scaling (MDS) to project the space onto two dimensions, as exemplified in Figure 1.
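The counting step can be illustrated with a minimal sketch. The three-language inventory below is invented for illustration and is not Haspelmath's data, and the conversion from counts to distances (languages that do not colexify a pair) is an assumption, since the text specifies only the counts. The resulting pairwise distances could then be passed to any MDS implementation.

```python
from itertools import combinations

# Hypothetical toy inventory: each language lists its IP terms, and each term
# is the set of usage classes it covers. This is NOT Haspelmath's (1997) data.
LANGUAGES = {
    "lang_a": [{"SP", "NS", "QU", "CD"}, {"FC", "CP"}, {"DN"}],
    "lang_b": [{"SP", "NS"}, {"QU", "CD", "FC"}, {"DN"}],
    "lang_c": [{"SP", "NS", "QU"}, {"CD", "FC", "CP"}, {"DN"}],
}


def colexification_counts(languages):
    """For each pair of classes, count languages with a term covering both."""
    counts = {}
    for terms in languages.values():
        # Collect pairs per language first so a language is counted once
        # even if several of its terms cover the same pair.
        pairs_in_language = set()
        for term in terms:
            pairs_in_language.update(combinations(sorted(term), 2))
        for pair in pairs_in_language:
            counts[pair] = counts.get(pair, 0) + 1
    return counts


def to_distances(counts, classes, n_languages):
    """Assumed conversion: distance = languages that do not colexify the pair."""
    return {
        (a, b): n_languages - counts.get((a, b), 0)
        for a, b in combinations(sorted(classes), 2)
    }


classes = {c for terms in LANGUAGES.values() for term in terms for c in term}
counts = colexification_counts(LANGUAGES)
distances = to_distances(counts, classes, len(LANGUAGES))
```

In this toy inventory, SP and NS share a term in every language and so end up at distance 0, while DN never shares a term with SP and ends up at the maximum distance.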
Figure 1 demonstrates, first, that SP, FC, and DN form three natural “extremes” of the semantic space. In English, these correspond to the canonical uses of the IPs some-, any-, and no-, respectively; thus some- is anchored at SP and any- at FC (cf. Table 1). Moreover, we find that the usage classes of QU and CD are very close to SP and NS, indicating that QU and CD are most frequently colexified with SP/NS, in particular, much more so than with FC. For English, this means that it is much more natural for some- to express QU/CD than for any- to do so.
To summarize, our linguistic analysis reveals two potential challenges of English some- and any-: their confusability across many classes, and the particular difficulty of any- in the QU/CD classes. We further find empirically that some- IPs are more frequent than any- IPs in native English text, suggesting that some- will be easier for L2 speakers, and that they may overgeneralize it when uncertain about which pronoun to use. Collectively, these findings motivate:
Hypothesis 1: The unusually large and overlapping extents of some- and any- are expected to pose difficulty for L2 speakers; any- is predicted to be especially difficult due to its lower frequency.
Hypothesis 2: Due to greater naturalness of grouping QU and CD with other classes subsumed by some-, we predict that QU and CD usages of any- will be particularly difficult for L2 speakers.
In exploring each of these hypotheses, we look for evidence in two forms: overuse of some- compared to native speakers, and more errors involving any-. We focus on the frequent semantic categories of people and things, specifically the set of IPs someone, anyone, something, and anything.
3.1 Datasets
We expect that mastery of IPs will depend on a speaker’s command of English, and therefore consider language productions both of learners (largely beginner-to-intermediate), and of L2 speakers on Reddit (shown to be highly proficient, almost on par with Reddit natives; Rabinovich et al. 2018). Our learner dataset comprises several sub-corpora: EFCAMDAT (Geertzen et al., 2013), TOEFL11 (Blanchard et al., 2013), and the freely available part of the FCE corpus (Yannakoudakis et al., 2011). The advanced L2 dataset includes online posts by advanced non-native English speakers from the L2-Reddit corpus (released by Rabinovich et al., 2018, and comprising utterances by native as well as highly-proficient non-native speakers, published on the Reddit platform). We extended the L2-Reddit corpus (originally collected in 2017) with data published through September 2018; the final dataset includes over 320M native and L2 English sentences. Table 2 presents details of the two corpora.
Table 2: Statistics on datasets.
3.2 Classification of IP Usages
Evaluating our hypotheses in Section 2 depends on assessing which usage class an utterance with a some-/any- pronoun belongs to, so we can compare patterns of usage and infelicities across classes. In English, the IP usage classes are often associated with particular lexical or syntactic cues in the clause with the IP – e.g., a negative adverb for DN (I don’t want anything from this collection.), or a question mark for QU (Would you like to buy something online?). This enabled us to develop a rule-based classifier (see supplemental materials (A.3) for details), using a parser (Kitaev and Klein, 2018) and a set of heuristic rules.
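A toy sketch can make the cue-based idea concrete. It is not the classifier from the supplemental materials: it uses regular expressions instead of a parser, the cue lists and rule order are invented for illustration, and reading CD as conditional and CP as comparative is an assumption based on Haspelmath's function names.

```python
import re

# Hypothetical cue lists, for illustration only. The classifier described
# above uses a constituency parser plus heuristic rules, not these regexes.
NEGATION = re.compile(r"\b(?:not|never|no)\b|n['’]t\b", re.IGNORECASE)
CONDITIONAL = re.compile(r"\b(?:if|unless)\b", re.IGNORECASE)
COMPARATIVE = re.compile(r"\bthan\b", re.IGNORECASE)


def classify_usage(clause):
    """Assign one of {QU, DN, CD, CP, MIXED} from surface cues in the clause."""
    text = clause.strip()
    if text.endswith("?"):
        return "QU"      # question mark signals a question
    if NEGATION.search(text):
        return "DN"      # negative adverb signals direct negation
    if CONDITIONAL.search(text):
        return "CD"      # assumed: CD = conditional
    if COMPARATIVE.search(text):
        return "CP"      # assumed: CP = comparative
    return "MIXED"       # SP, NS, FC and IN are not separated here


label = classify_usage("I don’t want anything from this collection.")
```

The fixed rule order means that a negated question is labeled QU; a parser-based system can resolve such cases by looking at the clause that actually contains the IP.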
We evaluated the classifier on sentences manually annotated by three in-house native English speakers with a background in linguistics. A sample of 750 sentences produced by Reddit native English speakers was selected for annotation, and the annotators assigned a label to each sentence from within the set of {DN, QU, CD, CP, MIXED}, where the MIXED class comprises the SP, NS, FC, and IN classes (cf. Table 1). The MIXED grouping contains classes that are (1) difficult to distinguish using simple lexical and syntactic cues (essentially, an “other” class), and (2) predicted by our linguistic analysis to be relatively similar in their error patterns. Average annotator agreement on our task was […]; annotation guidelines can be found in supplemental materials (A.2).
Table 3 shows that our rule-based classification is a reliable way to categorize a sentence with an IP (five-way classification baseline is 0.2). Because we use a subset of sentences associated with each usage class throughout our experiments, we focus on classification precision, while maintaining recall. We use this classifier to automatically label L2 sentences by usage class.
Table 3: Evaluation of classification of IP usage classes.
3.3 Annotation of (In)felicitous Usages
We used the FigureEight crowdsourcing platform for collecting annotations to be used as ground truth of L2 infelicities. We extracted a randomly sampled set of 3,711 sentences from our learner corpus representing a balanced distribution over the five usage classes, and a similar set of 10,000 sentences from our advanced L2 (Reddit) corpus, each containing a usage of someone, something, anyone, or anything. Each sentence was annotated by five native English speakers in a choice-based annotation scheme. The occurrence of the IP in the sentence was replaced with a blank line, and each annotator marked their preference for the some- or any- pronoun in that context (or “other”), reflecting the most natural choice between the two. The gold annotation for each sentence was determined by its majority choice, and the confidence score was computed based on the number of selections (out of five annotators) of each of the two pronouns. Annotation guidelines and a sample of 500 manually annotated sentences can be found in the supplemental materials (A.4).
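The aggregation of the five judgments can be sketched as follows. The exact confidence formula is not spelled out here, so the sketch assumes that confidence is the share of annotators who chose the majority option (giving 0.6, 0.8, or 1.0 when there is a majority among five votes). The function names and the default threshold parameter are illustrative.

```python
from collections import Counter


def gold_and_confidence(votes):
    """Return the majority choice and the share of annotators who chose it.

    Assumption: confidence = majority votes / total votes.
    """
    counts = Counter(votes)
    label, top = counts.most_common(1)[0]
    return label, top / len(votes)


def is_infelicitous(original, votes, threshold=0.8):
    """Flag a usage when annotators prefer the other option with enough confidence."""
    label, confidence = gold_and_confidence(votes)
    return label != original and confidence >= threshold


votes = ["any", "any", "any", "any", "some"]
gold, confidence = gold_and_confidence(votes)
```

With four of five annotators choosing any-, the gold label is any- with confidence 0.8, so an original some- would be flagged at the default threshold. A 3-to-2 split yields confidence 0.6 and no flag. Ties (possible when some annotators choose "other") are broken arbitrarily in this sketch.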
Table 4 presents example sentences produced by learners and L2 Reddit authors where the majority annotation unanimously differed from the original pronoun (as indicated). The utterances are provided verbatim, maintaining grammatical errors typical to productions in our corpora.
Sentences with a confidence level below 0.8 are considered close to equally felicitous with either pronoun, while a confidence of 1 represents a unanimous preference for one of the alternatives. Because we used a forced-choice task, if both pronouns were acceptable (e.g., Did you see something/anything you like?), we expect that the confidence score will indicate the level of naturalness or typicality of the pronoun in that context. For this reason, we only consider an example infelicitous when it differs from the annotator choice with a confidence ≥ 0.8, which indicates a stronger preference for one pronoun over the other.
The final annotation results for learners include 50% (1556) and 77% (2857) of sentences with a confidence of 1 and ≥ 0.8, respectively. Our advanced L2 data has 56% (5639) of sentences with a confidence of
A question arises as to how meaningful it is to label an IP usage as infelicitous – i.e., the preferred IP in annotation differed from the original – if both some- and any- are in fact acceptable. To explore this, we also collected crowdsourced annotations on 500 native utterances from Reddit, and compared the percentages of usages annotated as infelicitous to those of 500 randomly sampled sentences by advanced L2s. We found that 3% of native utterances were annotated as infelicitous at a confidence level of ≥ 0.8, indicating a high agreement among native writers and our annotators, while for advanced L2s, the percentage was around twice that high – 6.7%. Despite acceptable variation in some-/any- usage in a given context, even advanced L2 speakers differ from natives in their relative preferences.
Table 4: Example sentences annotated by human annotators for infelicitous pronoun choice (original pronoun is boldfaced). The top part refers to learners’ utterances, the bottom part refers to advanced L2s’.
4.1 Distribution of IPs by Usage Types
First, considering Hypothesis 1 from Section 2, we expect the confusability of some- and any- to be reflected in overgeneralization of some- due to its higher frequency. The subtle distinction between these pronoun types is assumed to be better mastered by advanced L2 speakers, so we expect the divergence from the native distribution to be amplified in learners’ productions.
Figure 2 presents relative frequencies of some- and any- pronouns in a random sample of 5M native, advanced L2, and learner productions, both in the entire sample (left) and distributed by usage class (right). In line with our predictions, we find in Figure 2 (left) that overall, L2 speakers use some- pronouns more than any- pronouns compared to native speakers. We can further see in Figure 2 (right), and discussed in detail below, that this pattern occurs in almost all the IP usage classes, and is especially pronounced for learners.
Elaborating on Hypothesis 1, we further suggest that in addition to general overuse of some- vs. any- (which may partly be due to avoidance of any-), L2 speakers are also expected in their infelicities to more often use some- where native speakers would use any-, than vice versa. This prediction is also supported by our annotated data: In cases where the preferred pronoun is some-, learners infelicitously use any- 8.4% of the time, but in cases where the preferred pronoun is any-, learners infelicitously use some- almost 23% of the time. That is, learners have almost three times as many infelicities of using some- instead of any- than the reverse. Our advanced L2 speakers also show more infelicities using some- instead of any- than vice versa, but the difference is less pronounced (5.8% and 10.1% respectively), as we expect given their greater proficiency.
4.2 Distribution of Infelicitous Usages
Next we turn to Hypothesis 2 from Section 2, which further predicts that the precise extent of deviation from native-like usage patterns will not be distributed uniformly across the different usage classes, but rather there will be a higher degree of deviation in classes that are atypically grouped under any- – that is, QU and CD – than in those that introduce less of a semantic challenge (DN, CP, and those in the MIXED class). L2 speakers are expected to exhibit both more overuse of some- and more infelicities in the QU and CD classes.
Our predictions regarding the non-uniform overuse of some- are largely borne out in Figure 2: the classes expected to be most difficult for L2 speakers – QU and CD – show a significant difference not only for learners, but even for advanced L2 speakers compared to natives, while DN and CP show only a difference for learners.
A few observations from Figure 2 do not follow our hypothesis. First, the difference in learner usage of some- vs. any- for DN goes in the direction opposite to the prediction: i.e., learners use any- more than some- pronouns in direct negation. We attribute this to the sheer frequency of any- in direct negation, such that learners are overgeneralizing any- here. Second, the MIXED grouping also shows a difference for the advanced L2 speakers, although these usages are not predicted to be especially difficult by our linguistic analysis. This class contains a very large and diverse set of usages, making it difficult to predict what is driving this effect, and we leave this for future work. Finally, the largest gap in overuse of some- vs. any- is observed in the CP class for learners, thereby not complying with our prediction of the highest difficulty being introduced by the QU and CD classes. Note, however, that this result is based on a relatively small amount of data in the CP class for learners (only 124 sentences; see Table 5).
Figure 2: Distribution of some- and any- pronouns by usage class (native, advL2, learner, left-to-right in each); see Table 1 for definitions of classes. ‘total’ refers to some- and any- counts extracted from the sample of 5M sentences for each population. ‘***’ indicates significant difference at the level of p < .001; ‘ns’ indicates non-significant difference.
To consider the pattern of infelicities across the usage classes, Table 5 shows the results from our crowdsourced annotation of IP usages of learners (top) and advanced L2s (bottom), separated by the classes. As expected, learners exhibit a very high percentage of infelicities in the QU class (24%); the percentage in the CD class is considerably lower (12%), but is still higher than in the other three (8–9%). Although advanced L2s have far fewer infelicities than learners, they also have more in the QU and CD classes (7% and over 9% respectively) than in the others (5–6%). Thus, as with Hypothesis 1, Hypothesis 2 is largely borne out by the data, and we find additional evidence that the IP system of English is particularly challenging for beginning to intermediate learners.
Our motivation for the above analysis is to use these insights to drive development of tools for L2 learners. Here we consider the first step, that of detection of infelicities with a language model (LM).
Table 5: Distribution of annotated infelicities by usage class. Top panel: learners; bottom: advanced L2s.
Neural network based approaches are currently among the most successful LMs. While being easily applied to a wide range of tasks, they provide significant improvements over classic backoff n-gram models. A common use of a pre-trained LM – typically trained on an extremely large corpus – is to predict the likelihood of an ‘unseen’ sample of text: The higher the score (or the lower the perplexity) a text is assigned, the more probable it is, given the model. In particular, a fluent, well-formed text is likely to be scored higher by an LM than a text containing linguistic anomalies.
Encouraged by results on the task of grammatical error detection (Yuan and Briscoe, 2016; Ji et al., 2017), we adhere to a similar approach, casting the detection of infelicities as a binary classification scenario: An LM is applied on a sentence with an original pronoun (e.g., something) and on the same sentence where the pronoun is substituted with its alternative (e.g., anything); then the one predicted as more probable (scored highest) is chosen as a model decision.
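The substitution-and-compare decision can be sketched independently of any particular LM. In the minimal sketch below, the scorer is a toy add-one-smoothed bigram model trained on three invented sentences; it stands in for the neural LMs and is not one of the models evaluated here. The names `SWAP`, `ToyBigramLM`, and `flag_infelicity` are illustrative.

```python
import math
from collections import Counter

# Each pronoun is paired with its some-/any- alternative.
SWAP = {"someone": "anyone", "anyone": "someone",
        "something": "anything", "anything": "something"}


class ToyBigramLM:
    """Stand-in scorer (add-one smoothed bigrams); NOT one of the paper's LMs."""

    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sentence in sentences:
            tokens = ["<s>"] + sentence.lower().split()
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(self.unigrams)

    def log_prob(self, tokens):
        """Approximate sentence log-probability under the bigram counts."""
        tokens = ["<s>"] + [t.lower() for t in tokens]
        total = 0.0
        for prev, cur in zip(tokens, tokens[1:]):
            numerator = self.bigrams[(prev, cur)] + 1
            denominator = self.unigrams[prev] + self.vocab_size
            total += math.log(numerator / denominator)
        return total


def flag_infelicity(tokens, index, log_prob):
    """True if the sentence with the alternative pronoun scores higher."""
    alternative = list(tokens)
    alternative[index] = SWAP[tokens[index].lower()]
    return log_prob(alternative) > log_prob(tokens)


lm = ToyBigramLM(["i did not see anything",
                  "they did not take anything",
                  "someone told me"])
flagged = flag_infelicity("they did not see something".split(), 4, lm.log_prob)
```

Any function that maps a token list to a score can be passed as `log_prob`, so the same decision rule applies whether the scorer is an n-gram model or a neural LM.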
5.1 Models
Aiming to test the effect of various factors, such as training data size and register, on the predictive power of LMs in our task, we used both pre-trained models and models trained locally on in-domain, albeit much smaller, data.
Gulordava et al.: A successful variant of RNNs, the long short-term memory model (LSTM, Hochreiter and Schmidhuber, 1997), used for syntactic error detection in Gulordava et al. (2018). We trained the model using a similar set of parameters to Gulordava et al. (2018), on 10M sentences by native English speakers of Reddit (see Section 3), using a 20K sentence validation set and a 50K sentence test set. This model allows us to test the benefits of using in-domain data (for advanced L2s), despite its significantly lower volume, compared to other models.
Google 1B: A very large publicly available LM released by Jozefowicz et al. (2016). This fine-tuned language model, trained on a billion-word corpus (Chelba et al., 2013), requires a massive infrastructure for training. It achieves impressive perplexity scores on common benchmarks, and has been shown effective on a range of NLP tasks.
BERT: A recent bidirectional encoder representations from transformers (BERT) LM released by Google (Devlin et al., 2018). Proven highly effective in several language modeling tasks, it achieves state-of-the-art results in syntax-sensitive scenarios (Goldberg, 2019), pushing the limits of what is feasible with current language modeling tools.
We report the models’ precision, recall and F1 scores for infelicitous and correct classes separately. We also report the overall accuracy of each, computed as the ratio of correctly classified cases out of all sentences. Following the intuition laid out in Section 3.3, we conducted two sets of experiments: (1) considering cases where annotators’ confidence score was 0.8 or higher, and (2) considering cases with confidence of 1. Sentences with a lower confidence score (i.e., where both some- and any- were roughly equally preferred) were excluded from these experiments.
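For concreteness, the per-class scores and overall accuracy can be computed as in the sketch below. The label strings and the five-item example are invented for illustration and are not drawn from our data.

```python
def precision_recall_f1(gold, predicted, target):
    """Precision, recall and F1 for one class (e.g. 'infelicitous')."""
    pairs = list(zip(gold, predicted))
    true_pos = sum(1 for g, p in pairs if g == target and p == target)
    false_pos = sum(1 for g, p in pairs if g != target and p == target)
    false_neg = sum(1 for g, p in pairs if g == target and p != target)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def accuracy(gold, predicted):
    """Share of sentences where the model decision matches the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Invented toy labels, for illustration only.
gold = ["infelicitous", "correct", "correct", "correct", "infelicitous"]
predicted = ["infelicitous", "correct", "infelicitous", "correct", "correct"]
scores = precision_recall_f1(gold, predicted, "infelicitous")
overall = accuracy(gold, predicted)
```

Reporting the two classes separately matters because the correct class is much larger, so a high overall accuracy can coexist with low recall on infelicitous usages.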
5.2 Results and discussion
Tables 6 and 7 present the results for learners and advanced L2 speakers, each split by the degree of annotation confidence. Baseline accuracy is computed as the ratio of felicitous usages (the majority class) out of all instances. The Gulordava et al. LM yields results inferior to the baseline, despite training on in-domain (but much smaller) data. BERT performs best overall, and both it and Google 1B exceed the baseline for learners, but BERT performs only at baseline for advanced L2s, confirming the extreme difficulty of this task. Results obtained for the correct class are far superior to those for the infelicitous class, suggestive of the inherent difficulty of the latter cases, compared to (occasionally clear-cut) correct usage patterns.
Systematically higher scores obtained for learner utterances (Table 6), compared to advanced L2s (Table 7), imply that the mild infelicities of the latter pose a higher challenge to automatic tools. That is, not only do advanced L2s show fewer errors, but their errors are likely more subtle and more difficult to detect. The high-confidence setup (= 1.0) yields results superior to those produced by the lower-confidence setup (≥ 0.8), further supporting that clear-cut infelicities are more easily captured by an LM.
Returning to our linguistic predictions, the preference of some- over any- predicted by Hypothesis 1 and shown for non-native speakers (Section 4.1) does not hold for our best-performing LM. We found a roughly equal rate (within two percentage points) of infelicities in model preferences in cases with some- vs. any- gold annotations, showing that the model (unlike non-natives) does not have greater difficulty with any- overall.
We also consider the non-uniform difficulty of IPs across various usage cases, predicted by Hypothesis 2 and shown for non-natives (Section 4.2). To address this question, we test BERT for infelicitous choices compared to annotators’ decisions: That is, for each sentence, we compare the pronoun preferred by the model to the gold annotation. Table 8 presents statistics across usage classes, for learners and advanced L2s (taken from Table 5), as well as for BERT. The top panel refers to learner data; the bottom panel, to advanced L2 data. While (expectedly) outperforming the two non-native populations, the model exhibits similar distributional patterns, with more infelicities in the CD and QU classes. The model also has a higher number of infelicities in the CP class for learners; again, we note the small sample of data in this class, entailing a need for further investigation of this particular pattern. The model results here pose intriguing questions for future work regarding the nature of challenges faced by automatic neural methods, and their potential analogues to those of humans.
Table 6: Automatic detection of infelicities in learner data (sentences where annotation disagrees with author usage of IP), with confidence level ≥ 0.8 (top), and with confidence level = 1 (bottom). Baseline accuracy is 0.850 for the former and 0.887 for the latter. Best result in a column (for each part) is boldfaced.
Table 7: Automatic detection of infelicities in advanced L2 data (sentences where annotation disagrees with author usage of IP), with confidence level ≥ 0.8 (top), and with confidence level = 1 (bottom). Baseline accuracy is 0.918 for the former and 0.956 for the latter. Best result in a column (for each part) is boldfaced.
Table 8: Distribution of % of infelicities (difference from gold annotation) across classes for humans and for BERT on the corresponding data.
Computational approaches to grammatical error correction (GEC) in learners’ productions have been a prolific field of research in recent years. A standard approach to dealing with grammar and spelling errors makes use of a machine-learning classification paradigm; a comprehensive survey of these methods can be found in Ng et al. (2014). Recent advances in the field of GEC were achieved by using neural models (Yuan and Briscoe, 2016; Ji et al., 2017; Sakaguchi et al., 2017; Lo et al., 2018). Most studies used a supervised setup for selecting a correct choice (e.g., a preposition) out of a set of multiple alternatives, rendering our experimental setup not directly comparable.
Another line of work has assessed the capability of neural LMs to capture errors stemming from violation of syntax-sensitive dependencies (Linzen et al., 2016; Gulordava et al., 2018; Marvin and Linzen, 2018). The recent BERT model (Devlin et al., 2018) has been shown to be highly effective for detection of syntactic anomalies stemming from subject-verb disagreement (Goldberg, 2019).
Most research on L2 error correction focuses on function words, such as prepositions and determiners. Very little work has been done on detecting and correcting incorrect usage of content words. Most of it has focused on the felicity of word combinations, such as identifying disfluencies stemming from L1 paraphrases (e.g., eat medicine or look movies, Brooke and Hirst, 2011; Dahlmeier and Ng, 2011), or using models of compositionality to detect semantically deviant pairs (residential steak, Vecchi et al., 2011) or infelicitous collocations (?big importance vs. great importance, Kochmar and Briscoe, 2013). A shared task on automatic evaluation of scientific writing (Daudaravicius et al., 2016) addressed automatic detection of a variety of grammatical errors (e.g., misuse of an article or punctuation) and lexical infelicities (e.g., phrasing choices stemming from style requirements of the genre) in scientific papers, edited by a professional company.
While most closely related to the field of semantic error detection, our work deals with subtle linguistic choices that shape the ultimate attainment of L2 in non-native speakers. Compared to grammatical and semantic anomalies explored in previous work, the choice of indefinite pronoun is often guided by implicit contextual clues that are not necessarily reflected in superficial collocational patterns, thereby posing a higher challenge for automatic techniques.
We develop and evaluate linguistic hypotheses on the difficulties for second language learners of the atypical system of English indefinite pronouns. We find that the tangled relation between some- and any- pronouns poses challenges that are evident in the productions of both learners and advanced L2 speakers. This work thus demonstrates the promise of extending computational approaches for error detection in L2 productions to more subtle semantic usages. Moreover, our results reveal the challenges that these subtleties can pose for even advanced non-native speakers.
Much research in second language acquisition establishes native language transfer as one of the major factors that shape productions of non-native speakers. While the work here addresses universal (i.e., native-language independent) challenges posed to L2 speakers, a plausible assumption is that mastery of English IPs is also affected by the proximity of the analogous system in a speaker’s L1. We leave this direction for future research.
We also evaluate here the ability of language models to detect the errors arising in the use of English indefinite pronouns in L2 productions. Not surprisingly, we find that the more clear-cut errors exhibited by learners are easier to automatically identify than the potentially more subtle errors that arise with advanced L2 speakers. The best performing language model shows a varying match to human patterns of difficulty, raising issues for further research regarding the factors that influence difficulty for both humans and language models.
The practical impact of this work will be in facilitating the development of educational applications for L2 English speakers at various levels of proficiency. At present, most error correction and detection tools focus on explicit spelling or grammar errors. Enriching these tools with the ability to capture subtle semantic infelicities in the usage of IPs would advance the current state of the art in educational applications for language learners.
This research is supported by an NSERC Discovery Grant RGPIN-2017-06506 to Suzanne Stevenson. We are thankful to Paola Merlo for her insight and advice. We are also grateful to our anonymous reviewers for their constructive feedback.
Lloyd B Anderson. 1982. The ‘perfect’ as a universal and as a language-specific category. In Paul J. Hopper, editor, Tense-aspect: Between semantics and pragmatics, pages 227–264. John Benjamins, Amsterdam.
Barend Beekhuizen, Julia Watson, and Suzanne Stevenson. 2017. Semantic typology and parallel corpora: Something about indefinite pronouns. In Proceedings of the 39th Annual Conference of the Cognitive Science Society.
Brent Berlin and Paul Kay. 1969. Basic color terms: Their universality and evolution. California UP.
Daniel Blanchard, Joel Tetreault, Derrick Higgins, Aoife Cahill, and Martin Chodorow. 2013. TOEFL11: A corpus of non-native English. ETS Research Report Series, 2013(2):i–15.
Julian Brooke and Graeme Hirst. 2011. Lexicalizing computational stylistics for language learner feedback. In Proceedings, Conference on Stylistics Across Disciplines, Leiden.
Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. Technical report.
Daniel Dahlmeier and Hwee Tou Ng. 2011. Correcting semantic collocation errors with L1-induced paraphrases. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 107–117. Association for Computational Linguistics.
Vidas Daudaravicius, Rafael E Banchs, Elena Volodina, and Courtney Napoles. 2016. A report on the automatic evaluation of scientific writing shared task. In Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications, pages 53–62.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. Technical Report arXiv:1810.04805 [cs.CL].
Alexandre François. 2008. Semantic maps and the typology of colexification: intertwining polysemous networks across languages. In Martine Vanhove, editor, From polysemy to semantic change, pages 163–215. Benjamins, Amsterdam.
Jeroen Geertzen, Theodora Alexopoulou, and Anna Korhonen. 2013. Automatic linguistic annotation of large scale L2 databases: The EF-Cambridge open language database (EFCAMDAT). In Proceedings of the 31st Second Language Research Forum. Somerville, MA: Cascadilla Proceedings Project.
Yoav Goldberg. 2019. Assessing BERT’s syntactic abilities. Technical Report arXiv:1901.05287 [cs.CL].
Kristina Gulordava, Piotr Bojanowski, Edouard Grave, Tal Linzen, and Marco Baroni. 2018. Colorless green recurrent networks dream hierarchically. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1195–1205. Association for Computational Linguistics.
Martin Haspelmath. 1997. Indefinite pronouns. Oxford University Press.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
Jianshu Ji, Qinlong Wang, Kristina Toutanova, Yongen Gong, Steven Truong, and Jianfeng Gao. 2017. A nested attention neural hybrid model for grammatical error correction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, volume 1, pages 753–762.
Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. 2016. Exploring the limits of language modeling. Technical Report arXiv:1602.02410 [cs.CL].
Nikita Kitaev and Dan Klein. 2018. Multilingual constituency parsing with self-attention and pre-training. Technical Report arXiv:1812.11760 [cs.CL].
Ekaterina Kochmar and Ted Briscoe. 2013. Capturing anomalies in the choice of content words in compositional distributional semantic space. In Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013, pages 365–372.
Stephen Levinson, Sérgio Meira, and The Language and Cognition Group. 2003. ‘Natural concepts’ in the spatial topological domain – adpositional meanings in crosslinguistic perspective: An exercise in semantic typology. Language, pages 485–516.
Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association of Computational Linguistics, 4(1):521–535.
Yu-Chun Lo, Jhih-Jie Chen, Chingyu Yang, and Jason Chang. 2018. Cool English: A grammatical error correction system based on large learner corpora. In Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pages 82–85.
Rebecca Marvin and Tal Linzen. 2018. Targeted syntactic evaluation of language models. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1192–1202. Association for Computational Linguistics.
Hwee Tou Ng, Siew Mei Wu, Ted Briscoe, Christian Hadiwinoto, Raymond Hendy Susanto, and Christopher Bryant. 2014. The CoNLL-2014 shared task on grammatical error correction. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task, pages 1–14.
Ella Rabinovich, Yulia Tsvetkov, and Shuly Wintner. 2018. Native language cognate effects on second language lexical choice. Transactions of the Association of Computational Linguistics, 6:329–342.
Alla Rozovskaya, Dan Roth, and Mark Sammons. 2017. Adapting to learner errors with minimal supervision. Computational Linguistics, 43(4):723–760.
Keisuke Sakaguchi, Matt Post, and Benjamin Van Durme. 2017. Grammatical error correction with neural reinforcement learning. In Proceedings of the Eighth International Joint Conference on Natural Language Processing, volume 2, pages 366–372.
Eva Maria Vecchi, Marco Baroni, and Roberto Zamparelli. 2011. (Linear) maps of the impossible: capturing semantic anomalies in distributional space. In Proceedings of the Workshop on Distributional Semantics and Compositionality, pages 1–9. Association for Computational Linguistics.
Helen Yannakoudakis, Ted Briscoe, and Ben Medlock. 2011. A new dataset and method for automatically grading ESOL texts. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 180–189.
Zheng Yuan and Ted Briscoe. 2016. Grammatical error correction using neural machine translation. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.