The scope of this study is limited to SN that have their origin as specialized knowledge units (terms) as defined by [6]:
[...]a lexical unit, whose structure is related to an origin lexical unit or product of the lexicalization of a syntagm, that has a specific meaning in the ambit to which it is related and it is necessary in the conceptual structure of the domain in which it takes part. [6, p.77]
These terminological units are part of a broader general concept defined as specialized significance units (SSU). There are different types of SSU, such as lexical, verbal, nominal, adjectival or adverbial [7]. In this study we explore verbal, nominal and adjectival SSU obtained from our SN database. To carry out the detection of SN in a semiautomatic way, we designed a system that uses topic detection, keyword extraction and word embeddings to detect new word meanings. We named this system DENISE, a multilingual tool that mimics the work that specialists perform at OBNEO.
DENISE takes plain text as input. The user can set the working language or let the system detect it automatically; the currently supported languages are Catalan, French and Spanish. The next step is text preprocessing: normalization and tokenization, as well as the removal of stopwords, non-Latin characters and symbols. Once the text is preprocessed, the user can either enter the theme of the text or let the system detect the topics within the text using a model based on TF-IDF and logistic regression. Once the text has a topic assigned, the system automatically extracts keywords using TextRank with POS filtering.
Figure 1: Excerpt from the OBNEO Database
Each of these keywords serves as a query item for our word embedding models; with these queries we obtain the 140 most similar terms and then create a "semantic field" (SF) for each keyword. This SF serves as a representation of the most common meaning of each queried word in the general language. Then, the system evaluates each SF to detect its theme and checks whether concordance exists between the topic of the input text and the topic of the SF. To obtain valid candidates for SN, DENISE keeps only the keywords that meet the following criterion: a candidate for SN is a keyword whose detected topic in the input text differs from the topic of its embeddings, since this would indicate that a known word is being used in a context that is different from its most common SF.
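The construction of a semantic field can be pictured as a nearest-neighbor query over the embedding space. The following is a minimal sketch, assuming cosine similarity as the ranking measure; the function names and the toy two-dimensional vectors are hypothetical and are not taken from DENISE.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def semantic_field(keyword, vectors, topn):
    """Return the topn words closest to keyword, or [] if it is out of vocabulary."""
    if keyword not in vectors:
        return []
    query = vectors[keyword]
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w != keyword]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [w for w, _ in scored[:topn]]

# Toy vectors invented for illustration only (not real embeddings).
toy_vectors = {
    "virus": [0.9, 0.1],
    "infección": [0.95, 0.05],
    "vacuna": [0.8, 0.2],
    "software": [0.1, 0.9],
    "servidor": [0.05, 0.95],
}
field = semantic_field("virus", toy_vectors, topn=2)
```

With real models the value of topn would be 140, as described above, and a keyword that is missing from the vocabulary simply yields no SF.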
The automatic detection of SN, because of its nature, has been a more complicated task than the detection of other kinds of neologisms such as transferred words or derived words [30, 13, 26, 29, 28]. Therefore, there is presently a need that is only partially covered. One of the first specialized systems to detect SN is April [25, 26, 27]; this approach uses statistical and linguistic rules, collocation patterns and heuristic rules to track semantic change over time. These methods include bootstrapping, a chronologically divided corpus used as a control database and a reference dictionary. April analyzes common collocations to track SN, which means that a word that appears in a context different from its most usual context might be a valid candidate for SN. However, no evaluation is provided and the authors mention two specific problems: the superficial definition of novelty does not distinguish between a new sense and a new reference, and April cannot identify collocations that have at least four concordances.
A second system that could be used to detect SN is Logoscope [9, 8, 10, 11, 2, 3], which combines topic modeling based on Latent Dirichlet Allocation (LDA) with a linear support vector machine classifier to identify possible new word meanings. In order to detect SN, Logoscope analyzes theme concordance between the collocation of a keyword and its definition in the dictionary: when the collocation and the definition do not share the same topic, the keyword might be a candidate for SN. The authors were able to detect a new sense for the word quenelle using this methodology, but they state that relying on dictionary definitions complicates this task given the nature of SN. While the authors provide all the formulas that were used, there is no formal evaluation for this particular task.
There are also other methodological approaches, such as those proposed by [12, 13, 14], who developed a POS tagger that uses statistical rules and could obtain SN as a byproduct of this process: ambiguous words are assigned a special tag that, when inspected, can indicate a semantic change in progress. Finally, [20, 21, 22] proposes a methodology that combines word sense induction (WSI) and clustering to group word senses. The author states that this method could be implemented to develop a tool that detects SN automatically. While both methodologies explain how word meanings can be grouped and classified, neither provides an implementation for the detection of SN.
As part of our methodology, we expect to classify the different themes or topics that are being treated in the input text, because we assume that new word meaning depends on the topic in which we find this word. For example, if a word has one known meaning related to economy and we find this same word in a CS text, this word might have a change of meaning related to this new theme. We treated this step as a classification problem, which means that each text that is entered to DENISE is evaluated using a logistic regression to predict its main theme or topic.
Our corpus was compiled using articles from specialized publications in Spanish: PC World (CS) with a total of 308,930 words; Marca (sports), 275,872 words; and El Financiero (economy), 280,404 words. This corpus was used to generate a TF-IDF model and then train the logistic regression model using the following parameters: L2 penalty, an intercept scaling of 1, a tolerance of 1e-4 and an inverse regularization strength of 1.0. The resulting confusion matrix can be seen in Figure 2.
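To make the feature representation concrete, the following minimal sketch computes TF-IDF weights for a few tokenized toy sentences. It assumes raw term counts and a smoothed inverse document frequency, which is only one common variant; the exact weighting used for DENISE is not stated here, and the function name and the toy documents are hypothetical. The resulting weight vectors are the kind of input on which a logistic regression classifier would then be trained.

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.

    Assumption: raw term frequency and a smoothed IDF,
    idf(t) = ln((1 + N) / (1 + df(t))) + 1.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]

# Toy documents invented for illustration.
docs = [
    "el equipo gana el partido".split(),
    "el banco sube la tasa".split(),
    "el procesador del equipo".split(),
]
weights = tfidf(docs)
```

Terms that occur in every document, such as "el", receive the lowest IDF, while terms restricted to one document, such as "partido", are weighted more heavily.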
The model is capable of detecting three topics: sports (deportes), economy, and computer science (informática). As we mentioned before, our goal is to detect new word meanings related to computer science; therefore we expect that, for every text we evaluate, DENISE will return “informática” (labeled as 1) as the main topic, and if it does not detect CS as the main topic it will return “not informática” (labeled as 0). We compared the precision of three classifiers using a cross-validation score: logistic regression obtained 0.982; multinomial naïve Bayes, 0.898; and random forest classifier, 0.790. After obtaining these results, we selected the logistic regression model and proceeded to evaluate its mean accuracy on a train-test split: it obtained 0.913 on the train set and 0.889 on the test set.
Figure 2: Confusion Matrix of the Logistic Regression Model
After detecting the theme of the input text, DENISE extracts the keywords that will be evaluated as candidates. One of the particularities of SN is that they are known words; therefore, we cannot use a set of lexicographical rules or dictionaries, because we are not looking for new words at a formal (structural) level, but for new meanings of known words. For this reason we decided to use the TextRank [17] algorithm (as defined by Equation 1) with POS tag filtering. This graph-based algorithm was inspired by the PageRank [23] algorithm originally used by the Google search engine. TextRank is widely used in ranking and recommendation systems, keyword extraction and automatic summarization systems [16, 1, 24].
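For reference, the weighted vertex score defined in the TextRank paper [17] can be written as follows, where $d$ is the damping factor (set to 0.85 in [17]), $In(V_i)$ and $Out(V_j)$ are the sets of incoming and outgoing neighbors of a vertex, and $w_{ji}$ is the weight of the edge from $V_j$ to $V_i$. Whether DENISE uses exactly this weighted form is an assumption.

```latex
WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j)
```

For keyword extraction the graph is built from word co-occurrences, so edges are undirected and the incoming and outgoing neighbor sets of a vertex coincide.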
DENISE’s implementation of TextRank uses the original graph evaluation and incorporates POS tag filtering to prioritize the extraction of verbs, nouns and adjectives. These types of units are of interest because in our database of neologisms we found that most SN, as shown in Table 4, fall into these POS categories. With the implementation of a POS filter we obtained 14% more accuracy in comparison with the regular TextRank implementation, which might indicate that the algorithm is correctly extracting possible candidates.
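The idea of combining TextRank with a POS filter can be sketched as follows. This is a simplified, unweighted illustration and not DENISE's implementation: the function name, the tag names, the window size and the choice to compute co-occurrences over the already filtered word sequence are all assumptions made to keep the example short.

```python
from collections import defaultdict

def textrank_keywords(tagged_tokens, keep_pos=("NOUN", "VERB", "ADJ"),
                      window=2, damping=0.85, iterations=50, top_k=3):
    """Rank words with an unweighted TextRank over a co-occurrence graph."""
    # POS filter: only words of the selected categories become graph vertices.
    words = [w for w, pos in tagged_tokens if pos in keep_pos]
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        # Link each word to the words that follow it inside the window.
        for other in words[i + 1:i + window]:
            if other != w:
                neighbors[w].add(other)
                neighbors[other].add(w)
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        # Each vertex receives a share of the score of each of its neighbors.
        scores = {
            w: (1 - damping) + damping * sum(
                scores[n] / len(neighbors[n]) for n in neighbors[w])
            for w in neighbors
        }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy tagged sentence invented for illustration.
tagged = [("el", "DET"), ("virus", "NOUN"), ("infecta", "VERB"),
          ("el", "DET"), ("sistema", "NOUN"), ("y", "CCONJ"),
          ("el", "DET"), ("virus", "NOUN"), ("borra", "VERB"),
          ("archivos", "NOUN")]
top = textrank_keywords(tagged, top_k=1)
```

In this toy sentence "virus" is linked to the largest number of other content words and is therefore ranked first, while determiners and conjunctions never enter the graph.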
The last step in DENISE’s analysis process is sense disambiguation using word embeddings. With the resulting keywords from the previous step, our working hypothesis is the following: a new word sense might be found when a known word is used in a text about a topic that is different from the topics where this word is usually collocated. Therefore we assume that the most common word representation of a given keyword is closely related to its main (or most common) meaning, and that this meaning might also be related to a certain topic.
We carried out an analysis using three different neural network based models: Word2Vec [18, 19], FastText [4, 15] and Sense2Vec [33]. To train the models we used Wikipedia in Spanish as corpus and the training was performed using the same training values described in the bibliography. In the following subsections we describe some of the particularities of each model and the training parameters that were used.
Word2Vec. Word2Vec is a model that uses neural networks to produce dense vector representations of words. Its two main architectures are skip-gram and CBOW, the first one being slower but better for projecting uncommon words. In order to perform the disambiguation process, we require embeddings that represent the most common meaning of the input keywords, therefore we trained our skip-gram based model with a dimension size of 300, a window of 5 and a min count of 20 elements.
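To illustrate what the window and min count parameters control, the sketch below generates the (center, context) training pairs that a skip-gram model learns from. It is a simplification under stated assumptions: the function name and the toy sentences are hypothetical, and the details of the actual training procedure (such as sampling and the neural network itself) are left out.

```python
from collections import Counter

def skipgram_pairs(sentences, window=5, min_count=20):
    """Return (center, context) pairs after dropping words rarer than min_count."""
    counts = Counter(tok for sent in sentences for tok in sent)
    pairs = []
    for sent in sentences:
        # Words below the frequency threshold are removed before windowing.
        kept = [tok for tok in sent if counts[tok] >= min_count]
        for i, center in enumerate(kept):
            lo = max(0, i - window)
            hi = min(len(kept), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((center, kept[j]))
    return pairs

# Toy corpus invented for illustration; real values would be window=5, min_count=20.
toy = [["el", "virus", "infecta", "el", "sistema"], ["el", "virus", "muta"]]
pairs = skipgram_pairs(toy, window=1, min_count=2)
```

A high min count such as 20 removes rare words from the vocabulary altogether, which is one reason why some queried keywords may have no representation in a trained model.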
Sense2Vec. To obtain a Sense2Vec model we tagged the same Wikipedia corpus that was used to train the Word2Vec model, using the Universal Dependencies tagset. This approach allows for the generation of a model that has similar characteristics to a Word2Vec model with the added advantage of POS tag disambiguation. This means that a word that is used both as a noun and as a verb has two different representations in the model, one for each case. We followed the training parameters proposed by [33]: a dimension size of 500, a window of 5 and a min count of 10 elements, again using continuous skip-grams.
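The effect of POS tagging on the vocabulary can be illustrated with a small sketch in which each token is joined with its tag before training, so that the same surface form yields separate entries. The "word|POS" key format, the function names and the toy sentence are assumptions made for illustration.

```python
def to_sense_tokens(tagged_tokens):
    """Join each lowercased word with its POS tag, e.g. ('Clic', 'NOUN') -> 'clic|NOUN'."""
    return [f"{word.lower()}|{pos}" for word, pos in tagged_tokens]

def senses_of(word, vocabulary):
    """List every POS-specific entry that the vocabulary holds for a word."""
    prefix = word.lower() + "|"
    return sorted(key for key in vocabulary if key.startswith(prefix))

# Toy tagged sentence invented for illustration.
tagged = [("La", "DET"), ("descarga", "NOUN"), ("termina", "VERB"),
          ("y", "CCONJ"), ("él", "PRON"), ("descarga", "VERB"),
          ("otro", "DET"), ("archivo", "NOUN")]
vocab = set(to_sense_tokens(tagged))
entries = senses_of("descarga", vocab)
```

Here "descarga" ends up with one entry as a noun and one as a verb, so each use can later be queried for its own semantic field.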
FastText. Although the FastText model also uses neural networks to generate word representations, it uses subword data as the minimum units to train these representations, whereas the minimum units trained by Word2Vec and Sense2Vec are words. For example, in a FastText model the vector for the word "viral" would be composed from the n-grams within "viral" in the following way: "<vi", "vir", "vira", "viral", "viral>", "ira", "iral", "iral>", "ral", "ral>", "al>". We used the pretrained vectors available at the FastText website, which were trained using CBOW with position-weights, in dimension 300, with character n-grams of length 5, a window of size 5 and 10 negatives.
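The subword decomposition can be reproduced with a few lines of code. The sketch below follows the scheme described in [4], where the word is wrapped in the boundary symbols "<" and ">" before its character n-grams are extracted; the function name and the default range of 3 to 6 characters are assumptions of this example, and the full word is additionally kept as its own unit in [4].

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return all character n-grams of the boundary-marked word."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

trigrams = char_ngrams("viral", min_n=3, max_n=3)
fivegrams = char_ngrams("viral", min_n=5, max_n=5)
```

Because a word vector is assembled from its n-gram vectors, a word that never appeared in the training corpus can still receive a representation as long as its n-grams were seen.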
To extract SN from a text, our algorithm (6) first detects the working language; if the language is not supported, it continues with a new text. If the language is supported, DENISE proceeds to assign a theme to the input text using the logistic regression model. If the user knows the language and the topic, both can be set manually; the number of keywords to extract and the number of words for the SF can also be set manually.
The next step is to extract keywords using TextRank; the implementation returns a list with the number n of keywords (KW) selected by the user. For each item of the TextRank list, the algorithm returns an SF composed of the TOPN most similar words according to the word embedding model, which by default is set to Sense2Vec.
Finally, the system evaluates whether there is theme concordance between the semantic field and the input text. When concordance exists, it indicates that the most common meaning from the embedding model is being used, which means that the given word is not a candidate for SN. If concordance does not exist between both topics, the keyword is a possible candidate for SN. In this case, the algorithm returns a list with the possible candidates and their detected themes so the user can determine whether they are true SN candidates.
To determine the performance of DENISE with each model, we used as input text a list of keywords and their concordances taken from the SN database. Our database contains a total of 5562 registered and manually validated SN; from this total we selected the SN that correspond to the CS field and discarded concordances shorter than 130 words in order to have enough context words. All the selected SN had therefore already been manually analyzed, and all of them belong to the CS field.
Pseudocode Description
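    # The enclosing function and its name are illustrative; analyze_topic,
    # extract_keywords and model are DENISE components defined elsewhere.
    def detect_sn(text, lang_flag, topic, KW, TOPN, model):
        # lang_flag is a pair: (language is supported, language code).
        if lang_flag[0]:
            sn_list = []
            # Use the topic given by the user, or detect it when set to 'auto'.
            if topic == 'auto':
                text_topic = analyze_topic([text], lang_flag[1])
            else:
                text_topic = topic
            # Extract the KW highest-ranked keywords with TextRank.
            text_rank = extract_keywords(text, lang_flag[1], KW)
            for i in text_rank:
                # Semantic field: the TOPN most similar terms in the embedding model.
                sf = model.most_similar(i, TOPN)
                sf_topic = analyze_topic([sf], lang_flag[1])
                # A topic mismatch marks the keyword as an SN candidate.
                if text_topic != sf_topic:
                    sn_list.append([i, text_topic, sf_topic])
            return sn_list
        else:
            print("Language not supported")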
With these keywords and concordances we generated a CSV table containing the 125 term-concordance items; we then queried each item of the list on each model to obtain its 140 most similar terms. As shown in Table 7, FastText was the model that yielded the best coverage, since it returned an SF for each of the 125 queried terms, followed by the Word2Vec model with 100 of 125 terms and, last, the Sense2Vec model with 97 of 125.
After obtaining the most similar terms for each item, we performed the topic detection process on these SF. This process is performed in order to assess whether the generated SF have the same topic as the concordances. If the set of embeddings and the concordance share the same topic, the keyword in question might not be a candidate for SN, whereas a mismatch between the topic of the embeddings and the topic of the concordance might indicate that the keyword is a candidate for SN. The results of this process are shown in Tables 7 and 7.
While the Sense2Vec model retrieved the lowest number of unique terms, it provided the most representations in total: 172 when combining all three categories. This gives us more detailed information regarding use: a word might be used mainly as a verb but it could also be used as a noun, and one of these uses could be the neological meaning.
We evaluated each of the SF manually to ensure that the automatic topic detection process was accurate and that the model classifies the terms correctly, that is, whether they belong to CS or not. After manually evaluating all the embeddings, we counted the number of correct cases (those in which the predicted topic is the same as the manually observed topic) and computed the percentage of agreement between the automatically and the manually labeled SF. We expected that the classifier could determine whether a context and an SF are, in fact, related to the CS field or not. We report F1-score, precision, recall and support for each model; in the case of the Sense2Vec model we calculated these metrics for each POS category independently.
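The metrics mentioned above can be computed per class from the manual labels and the predicted labels, as in the following minimal sketch. The function name and the toy label lists are hypothetical; labels follow the convention used earlier, with 1 for CS and 0 for non-CS, and the support-weighted average is the usual way of combining the per-class F1-scores.

```python
def per_class_report(y_true, y_pred, labels=(0, 1)):
    """Compute precision, recall, F1 and support per label, plus weighted F1."""
    report = {}
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        # Support is the number of true instances of the label.
        report[label] = {"precision": precision, "recall": recall,
                         "f1": f1, "support": tp + fn}
    total = sum(report[label]["support"] for label in labels)
    weighted_f1 = sum(report[label]["f1"] * report[label]["support"]
                      for label in labels) / total
    report["weighted_f1"] = weighted_f1
    return report

# Toy labels invented for illustration: manual annotation vs. classifier output.
manual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]
report = per_class_report(manual, predicted)
```

High precision combined with low recall for one class, as reported for some models below, means that the classifier rarely mislabels other items as that class but misses many of its true members.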
The results for the FastText model can be seen in Table 7 and the results for Word2Vec in Table 7. Both models obtained similar F1-scores: both show high precision and low recall when classifying SF that do not belong to CS, and low precision and high recall when classifying SF that belong to CS.
On the other hand, the Sense2Vec model obtained better F1-scores than the Word2Vec and FastText models, with NOUN being the most productive and balanced category, with a weighted average F1-score of 0.84. This value is greater than the F1-scores of both the Word2Vec model and the FastText model. Overall, while the Sense2Vec model retrieved the smallest number of SF, the resulting embeddings were better classified.
Finally, following the condition of disagreement between the topic of the embeddings and the topic of the input text, each model generated a list of candidates for SN. The Sense2Vec model generated a list of 55 candidates from the original list of 125 SN; FastText, 42 candidates; and Word2Vec, 35 candidates. The lists that each model generated are shown below:
Sense2Vec Candidates: ’almacenado’, ’navegabilidad’, ’palm’, ’mini’, ’cablear’, ’controladora’, ’terminal’, ’viral’, ’descarga’, ’navegación’, ’cargarse’, ’objeto’, ’cuenta’, ’perfil’, ’visual’, ’directorio’, ’asistente’, ’bitácora’, ’acelerar’, ’chip’, ’caída’, ’caerse’, ’conversión’, ’muro’, ’word’, ’cortafuego’, ’vacuna’, ’nube’, ’infectar’, ’celular’, ’gusano’, ’troyano’, ’dominio’, ’navegar’, ’alojamiento’, ’electrónico’, ’portal’, ’migración’, ’aplicación’, ’ipod’, ’motor’, ’procesador’, ’agujero’, ’avatar’, ’androide’, ’piratería’, ’virus’, ’enlace’, ’apuntador’, ’subir’, ’clonación’, ’vínculo’, ’api’, ’herramienta’, ’guru’.
FastText Candidates: ’almacenado’, ’navegabilidad’, ’palm’, ’jaquear’, ’controladora’, ’game boy’, ’navegación’, ’cargarse’, ’iserie’, ’cuenta’, ’clonar’, ’visual’, ’menú’, ’asistente’, ’acelerar’, ’caída’, ’caerse’, ’conversión’, ’mapeo’, ’muro’, ’vacuna’, ’nube’, ’infectar’, ’gusano’, ’dominio’, ’navegar’, ’alojamiento’, ’correo_electrónico’, ’electrónico’, ’migración’, ’clic’, ’motor’, ’agujero’, ’avatar’, ’virus’, ’apuntador’, ’subir’, ’vínculo’, ’disco duro’, ’descargar’, ’guru’, ’descargarse’.
It is common practice to assume that word representations created using the methods mentioned above provide useful information for building NLP applications. Nevertheless, upon manually analyzing all the resulting SF, we observed that the representations fall into four groups: ideal, ambiguous, expressed in a foreign language (L2) used inside the working language (L1), and non-informative. Some examples of the last three groups include nube (cloud) and dominio (domain) from FastText, and palm from Word2Vec.
In the case of DENISE, the use of a different method of classification ensures that the data goes through a double-check step that recovers candidates that would otherwise be discarded. As a general recommendation, the linguistic content should be taken into account when implementing neural word embeddings. Regarding the particularities of each model, one key disadvantage of Word2Vec (especially for this task) is that it only yields a representation of one meaning for each word in the vocabulary and, as a consequence, other known meanings are not represented in this model. This kind of modeling could create ambiguous embeddings.
The overall performance of FastText was adequate. Even though it might not be useful for this particular task, this kind of model might be better suited for detecting new words on a formal level, since it creates word representations for words that are not included in the vocabulary. This model could also be useful for analyzing composition and derivation processes on a lexical and morphological level.
The Sense2Vec model gave the best results for this particular task, in great part due to the use of POS tags. These tags add information that can be used to disambiguate the meaning of new or polysemous words. However, of the 125 keywords it only had representations for 97. This might be due to the training parameters suggested by the authors, or to the fact that, in comparison with FastText, more training data is required. Wikipedia is commonly used as a corpus, but a system that requires a general and broad representation of a language needs more diverse data.
In this study we have shown the application of word embeddings for the detection of semantic neologisms. For this particular task, the Sense2Vec model gave the best performance. We explored some of the advantages that FastText models have over Word2Vec; for instance, representations of uncommon words.
After further manual analysis of the most similar terms that each model generated, we observed three types of representations that are not useful for the development of the DENISE system: ambiguous embeddings, L2-in-L1 embeddings and non-informative embeddings. These kinds of embeddings should be taken into account when designing an NLP application, since the final goal is to implement rich linguistic knowledge. In the case of DENISE, we use TF-IDF and logistic regression for theme classification so that the system does not rely on a single method to analyze semantic change.
While analyzing the characteristics of the generated embeddings we observed that ambiguous representations usually contain words related to two or more different topics. While the Sense2Vec model can differentiate between words that can be used as a verb or as a noun, this process still generates only one representation per POS tag. Figure 3 shows the most common words related to troyano (Spanish for "Trojan") in the Word2Vec model. DENISE classified this word as a valid candidate but, on further inspection, when selecting the 300 most common words we can observe two clusters of words: one on the upper-left part that is related to its mythological sense and a small cluster on the lower-right that contains words related to the CS field.
Figure 3: Most Common Words Related to “troyano” in Spanish
Based on this observation, a possible future line of work might be the development of polysemous embeddings, either as an added layer during the training process that could generate more than one representation for each word, or as a post-training process using clustering or classification techniques. Such embeddings could be a good addition to other NLP tasks such as automatic translation, automatic text summarization [32, 31] or word sense disambiguation.
The authors thank CONACYT (https://www.conacyt.gob.mx), Convocatoria de Becas al Extranjero 2015-2019, for supporting this research. They also thank the IULATERM group from the Pompeu Fabra University for giving access to their neologisms database, and in particular Prof. Rosa Estopà for her insights regarding the theoretical background.
[1] Federico Barrios, Federico López, Luis Argerich, and Rosa Wachenchauzer. Variations of the Similarity Function of TextRank for Automated Summarization. CoRR, abs/1602.0, February 2016.
[2] Delphine Bernhard, Lauren Bruneau, Ingrid Falk, and Christophe Gérard. Création lexicale et corpus dynamique: quelles variables contextuelles? In en situation: texte, genres, cultures, Strasbourg, France, September 2015.
[3] Delphine Bernhard, Lauren Bruneau, Ingrid Falk, and Christophe Gérard. Le logoscope : une approche textuelle de la veille néologique. des raisons linguistiques à l’interface web. In CINEO 2015 - III Congreso Internacional , Salamanca, Spain, October 2015.
[4] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606, 2016.
[5] Maria Teresa Cabré. La classificació dels neologismes: una tasca complexa. In María Teresa Cabré and Rosa Estopà, editors, Les paraules noves: criteris per detectar i mesurar els neologismes, pages 11–37. Eumo Editorial, Universitat Pompeu Fabra, Vic, Barcelona, 2009.
[6] María Teresa Cabré and Rosa Estopà. Unidades de conocimiento especializado: caracterización y tipología. In María Teresa Cabré and Bach Carme, editors, Coneixement, llenguatge i discurs especialitzat, pages 69–93. Institut Universitari de Lingüística Aplicada. Universitat Pompeu Fabra, Barcelona, 2005.
[7] Rosa Estopà. Les aplicacions terminològiques. In Ona Domènech and Rosa Estopà, editors, . Universitat Oberta de Catalunya, Barcelona, 2013.
[8] Ingrid Falk, Delphine Bernhard, and Christophe Gérard. De la quenelle culinaire à la quenelle politique : identification de changements sémantiques à l’aide des Topic Models. In Automatique des Langues Naturelles, Marseille, France, July 2014.
[9] Ingrid Falk, Delphine Bernhard, and Christophe Gérard. From Non Word to New Word: Automatically Identifying Neologisms in French Newspapers. In LREC - The 9th edition of the Language Resources and Evaluation Conference, Proceedings of the International Conference on Language Resources and Evaluation, Reykjavik, Iceland, May 2014.
[10] Ingrid Falk, Delphine Bernhard, Christophe Gérard, and Romain Potier-Ferry. Étiquetage morpho-syntaxique pour des mots nouveaux. In , Marseille, France, July 2014.
[11] Christophe Gérard, Ingrid Falk, and Delphine Bernhard. Traitement automatisé de la néologie : pourquoi et comment intégrer l’analyse thématique ? In 2014), volume 8 of SHS Web of Conferences., pages 2627 – 2646, Berlin, Germany, July 2014.
[12] Maarten Janssen. NeoTrack: semi-automatic neologism detection. In APL XXI, Porto, Portugal, 2005.
[13] Maarten Janssen. Detección de Neologismos: una perspectiva computacional. , 5(05):68–75, 2009.
[14] Maarten Janssen. NeoTag: a POS Tagger for Grammatical Neologism Detection. Proceedings of the Eight International Conference on Language , (1):2118–2124, 2012.
[15] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759, 2016.
[16] Guangyi Li and Houfeng Wang. Improved Automatic Keyword Extraction Based on TextRank Using Domain Knowledge. In Natural Language Processing and Chinese Computing, pages 403–413. 2014.
[17] Rada Mihalcea and Paul Tarau. TextRank: Bringing Order into Text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, 2004.
[18] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. CoRR, abs/1301.3, 2013.
[19] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. CoRR, abs/1310.4546, 2013.
[20] Rogelio Nazar. Neología semántica: un enfoque desde la lingüística cuantitativa, 2011.
[21] Rogelio Nazar. Word sense discrimination using statistic analysis of texts. , 1(1):5–26, 2013.
[22] Rogelio Nazar. Una metodología para depurar los resultados de los extractores de términos, 2014.
[23] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The pagerank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford InfoLab, November 1999.
[24] Tayfun Pay, Stephen Lucci, and Jim Cox. An Ensemble of Automatic Keyphrase Extractors: TextRank, RAKE and TAKE. temas, 23(3):703–710, 2019.
[25] Antoinette Renouf. Explorations in Corpus Linguistics, chapter Aviating among the Hapax Legomena: morphological grammaticalisation in current British Newspaper English. Rodopi, 1998.
[26] Antoinette Renouf. Identification automatique de la néologie lexicologique et sémantique : questions soulevées par notre méthode. In M. Teresa Cabré, Ona Domènech, Rosa Estopà, Judith Freixa, and Mercè Lorente, editors, , pages 129–141, Barcelona, 2010. Institut Universitari de Lingüística Aplicada.
[27] Antoinette Renouf. A finer definition of neology in English: the life-cycle of a word. Corpus perspectives on patterns of lexis, pages 177–208, 2012.
[28] Coralie Reutenauer, Evelyne Jacquey, Sandrine Ollinger, Neologismes De, and Langues Romanes. Neologismes de sens: contribution à leur caracterisation dans un corpus autour du thème de la crise financière., 2011.
[29] Jean-François Sablayrolles. Extraction automatique et types de néologismes : une nécessaire clarification. Cahier de lexicologie, 100(1):37–53, 2012.
[30] Carles Tebé. Bases pour une sélection de neologismes. In Elisabet Solé M. Teresa Cabré, Judit Freixa, editor, , pages 43–50. Observatori de Neologia and Universitat Pompeu Fabra, Barcelona, 2002.
[31] Juan Manuel Torres-Moreno. Résumé automatique de documents - une approche statistique. Hermès-Lavoisier, 2011.
[32] Juan-Manuel Torres-Moreno. Automatic Text Summarization. Cognitive science and knowledge management series. ISTE Ltd ; John Wiley & Sons, Inc, London : Hoboken, NJ, 2014.
[33] Andrew Trask, Phil Michalak, and John Liu. sense2vec - a fast and accurate method for word sense disambiguation in neural word embeddings. CoRR, abs/1511.06388, 2015.