In today’s globalized world, companies need to understand and analyze what is being said about them, their products, services, or competitors, regardless of the human language used. Many organizations have spent tremendous resources to develop cognitive applications and services for dealing with customers in different countries. For example, cognitive systems may use machine learning techniques to process input messages or statements to determine their meaning and to provide associated confidence scores based on knowledge acquired by the cognitive system. Typically, the use of such cognitive systems requires training individual natural language understanding models in a specific human language. For example, a tone analyzer model can be built to predict tones from English conversations (Liu et al., 2018), but such a model would not work effectively with other languages. While translation techniques can be applied to translate data from an existing language to another language, human translation is labor-intensive and time-consuming, and machine translation can be costly and unreliable. As a result, attempts to scale existing applications to multiple human languages have traditionally proven difficult, mainly due to the language-dependent nature of the preprocessing and feature engineering techniques employed in traditional approaches (Akkiraju et al., 2018).
In this work, we empirically investigate the feasibility of using multilingual representations to build language-independent models, which can be trained with data from multiple source languages and then serve multiple target languages (target languages can be different from source languages). We explore this question using a unified language model, Multilingual BERT (Devlin et al., 2019), which is pre-trained on the combination of monolingual Wikipedia corpora from 104 languages. Through a series of experiments on multiple task types, language sets and data resources, we contribute empirical findings on how the following factors affect language-independent models:
• Task Type. We analyze and compare language-independent models on two of the most representative NLP tasks: sentence classification and sequence labeling. On both tasks, we show that language-independent models can be comparable to or even outperform the models trained using monolingual data. Language-independent models are generally more effective on sentence classification.
• Language Set. Theoretically, language-independent models can be trained using any language set, and be used to make predictions in any language. Through training and testing language-independent models with many different languages, we show that they are more suitable for typologically similar languages.
• Data Resource. We explore the effects of different data sizes when training language-independent models. We demonstrate that language-independent models are not only suitable for high-resource languages, but also very effective in low-resource languages.
We derive insights from our experiments to facilitate the development and customization of natural language understanding models and solutions in new languages. First, language-independent models can be used to solve the cold-start problem, where no initial model is available for a new target language and building such a model from scratch is costly. Second, they largely save the cost and time of acquiring annotated data for a new target language by reusing data already annotated in previously supported languages. Third, they simplify the deployment of a new model and save the effort of simultaneously maintaining multiple monolingual models in a production setting. Our annotated data for low-resource languages will be made publicly available.
Multilingual representation learning has been an active area of research, starting from word embedding alignment that uses small dictionaries to align word representations from different languages (Mikolov et al., 2013). Faruqui and Dyer (2014) demonstrated that multilingual representations can be leveraged to improve the quality of monolingual representations. Conneau et al. (2017) proposed an unsupervised learning method to align multilingual word embeddings without parallel data. In addition to word embedding alignment, aligning sentence representations from multiple languages has also been studied in machine translation, with both supervised learning (Johnson et al., 2017; Artetxe and Schwenk, 2018) and unsupervised learning (Lample et al., 2017; Artetxe et al., 2017). However, most of these approaches focus on pairwise multilingual representation learning. In this work, we empirically investigate the impact of multilingual representations learned from a large number of languages on tasks that involve more languages than a single language pair.
Our work builds on recent advances in pre-trained language modeling. ELMo (Peters et al., 2018) extracts context-sensitive features from a bidirectional LSTM language model and provides additional features for a task-specific architecture. ULMFiT (Howard and Ruder, 2018) advocates discriminative fine-tuning and slanted triangular learning rates to stabilize the fine-tuning process with respect to end tasks. OpenAI GPT (Radford et al., 2018) builds on multi-layer transformer (Vaswani et al., 2017) decoders instead of LSTMs to achieve effective transfer while requiring minimal changes to the model architecture. More recently, BERT (Devlin et al., 2019) pre-trains bidirectional transformer encoders on a large corpus, and fine-tunes the pre-trained model with almost no task-specific architecture for each end task. In this work, we leverage the multilingual representations learned from multilingual BERT (Devlin et al., 2019) to build models that can scale to many languages.
In this section, we describe the motivation of language-independent models, and how to create such models via multilingual representation learning and fine-tuning.
3.1 One Model, Many Languages
To scale our efforts to support the diversity of people in the world, it is important to build and customize machine learning models for many different languages in various NLP tasks. For each target language, however, this often requires going through the whole lifecycle of data collection, data cleansing, data annotation, data storage, feature creation and selection, machine learning model training, model validation, benchmarking and deployment of these models as services in production (Akkiraju et al., 2018). This easily becomes overwhelming as the number of target languages increases. To address this problem, we advocate building one model for all target languages together, which we call a Language-Independent Model (LIM), as the target languages to serve in production do not necessarily depend on which source languages were used in training. Figure 1 shows a conceptual example: an LIM can be trained using annotated data from source languages such as English (EN) and French (FR), and then serve target languages including Spanish (ES), Italian (IT) and Japanese (JA), which are different from the source languages. This not only accelerates the enablement of a new language by reusing data already annotated in previously supported languages, but also simplifies the deployment process and saves the effort of maintaining multiple monolingual models in production.
Figure 1: A conceptual example of a Language-Independent Model (LIM). The target languages to serve in production do not necessarily depend on which source languages were used in training. For instance, an LIM can be trained using annotated data from source languages such as English (EN) and French (FR), and then serve target languages including Spanish (ES), Italian (IT), Japanese (JA) and so on.
3.2 Multilingual Representation Learning with BERT
The basis for building LIMs lies in learning a representation that can cover multiple languages. Among the recent significant advances in deep contextualized representation learning for natural language understanding, BERT (Devlin et al., 2019) stands out as its pre-training process naturally supports multilingual representation learning. Specifically, multilingual BERT was pre-trained on the Wikipedia pages (excluding user and talk pages) of 104 languages with a 110K shared WordPiece (Wu et al., 2016) vocabulary. It is a 12-layer, 768-hidden, 12-head transformer model (Vaswani et al., 2017) with 110M parameters. To alleviate the bias towards high-resource languages such as English, data from high-resource languages were under-sampled and data from low-resource languages were over-sampled. The pre-training of multilingual BERT does not use any marker denoting the input language, and does not rely on a parallel corpus to explicitly encourage translation-equivalent pairs to have similar representations.
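The under- and over-sampling step can be illustrated with a small sketch. It assumes exponential smoothing of the per-language sampling probabilities, which is one common way to rebalance corpora; the exponent value 0.7 and the corpus sizes below are illustrative assumptions, not figures taken from this paper.

```python
def smoothed_sampling_probs(corpus_sizes, exponent=0.7):
    """Return per-language sampling probabilities after exponential smoothing.

    corpus_sizes maps a language code to its corpus size. The exponent is an
    assumed hyperparameter: values below 1 flatten the distribution.
    """
    total = sum(corpus_sizes.values())
    # Raw probability of drawing from each language, proportional to size.
    raw = {lang: size / total for lang, size in corpus_sizes.items()}
    # A power below 1 shrinks large shares and enlarges small ones.
    powered = {lang: p ** exponent for lang, p in raw.items()}
    norm = sum(powered.values())
    return {lang: value / norm for lang, value in powered.items()}


# Hypothetical corpus sizes (number of sentences), for illustration only.
sizes = {"en": 1_000_000, "fr": 300_000, "is": 10_000}
probs = smoothed_sampling_probs(sizes)
raw_en = sizes["en"] / sum(sizes.values())
raw_is = sizes["is"] / sum(sizes.values())
print(f"en: raw={raw_en:.3f} smoothed={probs['en']:.3f}")
print(f"is: raw={raw_is:.3f} smoothed={probs['is']:.3f}")
```

In this toy setting the high-resource language is drawn less often than its raw share and the low-resource language more often, which is the rebalancing effect described above.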
Figure 2: An illustration of generalized multilingual representation learning for different NLP tasks.
3.3 Fine-Tuning Multilingual BERT for End Tasks
The multilingual representations learned with BERT can be generalized for many natural language understanding tasks such as Sentiment Analysis, Named Entity Recognition, Categorization, and so on (as illustrated in Figure 2). The input representation of multilingual BERT is a sequence of tokens in any language, which may be a single sentence or two sentences packed together. The input representation of each token is constructed as the sum of the corresponding token, segment, and position embeddings. For sentence classification tasks, the first token of each sequence is a special classification embedding ([CLS]) and its final hidden state is used as the aggregate representation of the whole sequence. For sequence labeling tasks, the final hidden state of each token encodes its contextualized representation with respect to the whole sequence. To fine-tune multilingual BERT, a classification layer is added on top of the final representation layer, and the probabilities of all label classes are computed with a standard softmax. The parameters of multilingual BERT and the classification layer are fine-tuned jointly to maximize the log-probability of the correct label. The labeled data of end tasks are shuffled across different languages when fine-tuning multilingual BERT.
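The classification layer described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the 4-dimensional hidden states, the weight matrix `W`, and the bias `b` are hand-picked toy values, whereas multilingual BERT produces 768-dimensional hidden states and learns these parameters during fine-tuning.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(hidden, weights, bias):
    """Linear layer followed by softmax, giving one probability per class."""
    logits = [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Toy parameters for 3 label classes (illustrative values only).
W = [[0.1, 0.0, 0.2, 0.3],
     [0.0, 0.1, -0.2, 0.0],
     [-0.3, 0.2, 0.0, -0.1]]
b = [0.0, 0.0, 0.0]

# Sentence classification: use the final hidden state of [CLS] only.
cls_hidden = [0.5, -1.0, 0.25, 2.0]
sentence_probs = classify(cls_hidden, W, b)

# Sequence labeling: apply the same kind of layer to every token's state.
token_hiddens = [[0.5, -1.0, 0.25, 2.0], [-1.0, 1.0, 0.0, -0.5]]
token_probs = [classify(h, W, b) for h in token_hiddens]

# Fine-tuning maximizes the log-probability of the correct label.
gold_label = 0
log_prob = math.log(sentence_probs[gold_label])
print(sentence_probs, log_prob)
```

The only difference between the two task types in this sketch is which hidden states are fed to the layer: one vector per sequence for classification, one vector per token for labeling.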
The effectiveness of LIMs can be influenced by at least three factors: task type, language set and data resource. In this section, we empirically investigate the effects of these factors on the performance of LIMs.
4.1 Factor Characterization
Task Type We explore whether LIMs are equally effective across different end tasks. For the scope of this paper, we consider sentence classification and sequence labeling as two of the most popular NLP tasks. In particular, we select and compare two representative tasks: Sentiment Analysis and Named Entity Recognition (NER). Sentiment Analysis represents a typical sentence classification task, while NER is a popular sequence labeling task.
Language Set While theoretically an LIM can be trained using any language set, and be used to make predictions in any language, multilingual representations may not be equally effective across different languages (Gerz et al., 2018). For instance, it has been shown that a multilingual word embedding alignment between English and Chinese is much more difficult to learn than that between English and Spanish (Conneau et al., 2017). We explore many different languages when training and testing LIMs.
Data Resource For high-resource languages, the annotated data can be of different sizes; for low-resource languages, large amounts of data do not often exist (Kasai et al., 2019). We explore the effects of different data sizes when training and testing LIMs.
4.2 Case Study on Sentiment Analysis
We take Sentiment Analysis as a 3-class classification problem: given a sentence $s$ in a target language $T$, which consists of a series of words $s = \{w_1, w_2, \ldots, w_n\}$, predict the sentiment polarity $y \in \{\text{positive}, \text{neutral}, \text{negative}\}$.
For this case study, we consider 7 high-resource languages: English, Spanish, Italian, Brazilian Portuguese, Dutch, Japanese and Chinese, covering both western and eastern languages. The high-resource training set consists of 770K data points (230K in English and 90K in each of the other 6 languages). The test set contains both publicly available test data and high-quality in-house test data: 630K English, 10K Spanish, 57K Japanese, 10K Chinese and 15K French. In addition, we collect 5K data points in each of 5 languages: Danish, Swedish, Norwegian, Russian, and Turkish, which are considered low-resource languages in our experiments. For each low-resource language, we use 4K data points as the training set and 1K as the test set.
We randomly hold out 1/10 of the training set as the development set for model selection and use the rest for model training (i.e., fine-tuning the parameters of Multilingual BERT and the sentence classification layer). Following the original BERT fine-tuning procedure (Devlin et al., 2019), we fine-tune multilingual BERT with the following parameter choices: (1) batch size: 16, 32; (2) learning rate: 5e-5, 3e-5, 2e-5; (3) number of epochs: 3, 4. The model with batch size 32, learning rate 2e-5 and 4 epochs was selected as the best model based on its performance on the development set. We denote the LIM for Sentiment Analysis trained with high-resource languages as LIM-H, and the LIM trained with the mix of high-resource and low-resource languages as LIM-M.
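The data preparation and model selection procedure above can be sketched as follows. The helper names (`pool_and_split`, `grid_configs`, `select_best`), the fixed seed, and the toy data are assumptions made for illustration; `dev_score` stands for a hypothetical function that fine-tunes a model with a given configuration and returns its development-set accuracy.

```python
import itertools
import random


def pool_and_split(data_by_language, dev_fraction=0.1, seed=0):
    """Pool examples from all languages, shuffle, and hold out a dev set."""
    pooled = [
        (lang, example)
        for lang, examples in data_by_language.items()
        for example in examples
    ]
    rng = random.Random(seed)
    rng.shuffle(pooled)  # mixes languages so batches are not monolingual
    n_dev = round(len(pooled) * dev_fraction)
    return pooled[n_dev:], pooled[:n_dev]


# The parameter choices listed in the text.
GRID = {
    "batch_size": [16, 32],
    "learning_rate": [5e-5, 3e-5, 2e-5],
    "num_epochs": [3, 4],
}


def grid_configs(grid):
    """Enumerate every combination of the parameter choices."""
    keys = list(grid)
    return [
        dict(zip(keys, values))
        for values in itertools.product(*(grid[k] for k in keys))
    ]


def select_best(configs, dev_score):
    """Pick the configuration with the highest development-set score."""
    return max(configs, key=dev_score)


# Toy data: placeholder strings stand in for annotated sentences.
toy_data = {
    "en": [f"en-{i}" for i in range(20)],
    "es": [f"es-{i}" for i in range(20)],
    "ja": [f"ja-{i}" for i in range(10)],
}
train, dev = pool_and_split(toy_data)
configs = grid_configs(GRID)
print(len(train), len(dev), len(configs))
```

With two batch sizes, three learning rates and two epoch counts, the grid contains 12 candidate configurations, each of which would be fine-tuned and scored on the held-out tenth of the data.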
4.2.1 Results on High-Resource Languages
For high-resource languages, we compare LIM-H with the following methods:
• CNN (Kim, 2014) is a convolutional neural network (CNN) trained on top of pre-trained word vectors for sentence-level classification tasks. We use this method to train monolingual Sentiment Analysis models as a baseline because of its popularity and its simple implementation, which aids reproducibility.
• ULMFiT (Howard and Ruder, 2018) is a recent generative pretrained language model with task-specific fine-tuning. We follow ULMFiT by adopting discriminative fine-tuning and slanted triangular learning rates to stabilize the fine-tuning process and create monolingual Sentiment Analysis models.
• Monolingual-BERT. We trained monolingual Sentiment Analysis models by fine-tuning BERT with monolingual datasets for every language, respectively. For example, a Chinese-only BERT model refers to the BERT model fine-tuned using Chinese-only annotated data for Sentiment Analysis.
In Table 1, we report the accuracy results of Sentiment Analysis on English and Spanish across various models. In English, LIM-H achieves a significant boost in performance of 7.4% over CNN and 3.2% over ULMFiT. In Spanish, it outperforms these two methods by 4.5% and 2.3% respectively.
Table 1: Accuracy results of Sentiment Analysis on English and Spanish across various models.
Furthermore, we show that our method is able to compete with the monolingual BERT models on Sentiment Analysis in Table 2. By leveraging data from non-native languages, our LIM outperforms the English-only BERT model by 1.8% and the Japanese-only BERT model by 0.7%, but falls behind the Chinese-only BERT model by 1.2%. It should be noted that the Chinese-only BERT model was pre-trained specifically for Chinese to account for its unique character tokenization. Therefore, it is still very encouraging to see that our LIM is comparable to a specially customized monolingual BERT model.
Table 2: Accuracy results of Sentiment Analysis on English, Japanese and Chinese between monolingual BERT and LIM-H.
In Table 3, we evaluate the impact of LIM on Sentiment Analysis via zero-shot transfer learning. When we do not include any French annotated data for training, we can still obtain a significant improvement of 5.7% over the monolingual CNN model trained using French annotated data.
Table 3: Accuracy results of Sentiment Analysis on French between CNN and LIM-H. This demonstrates a zero-shot transfer learning case for LIM-H as it does not involve any French annotated data when training the model.
4.2.2 Results on Low-Resource Languages
For low-resource languages, we compare LIM-H and LIM-M in Table 4. LIM-H demonstrates the effects of zero-shot transfer learning on low-resource languages, with an average accuracy of 60%. Since we do not use any low-resource training data in LIM-H, this shows that an LIM can be used to address the cold-start problem, where no initial model is available for a new target low-resource language and building such a model from scratch is costly. Furthermore, LIM-M demonstrates how much improvement an LIM can gain by adding only a small amount of data in low-resource languages. In particular, by adding 4K annotated data points in each low-resource language, we obtain an average improvement of 11%. This largely saves the cost and time of acquiring annotated data for a new target low-resource language by transferring the knowledge learned from a larger amount of annotated data available in high-resource languages.
Table 4: Accuracy results of Sentiment Analysis on low-resource languages. We compare the performance of zero-shot transfer learning in LIM-H (without any annotated data from the target languages) and low-resource transfer training in LIM-M (only 4K annotated data points from each target language were used in training).
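As a small illustration of how per-language accuracy and an average across languages can be computed, the sketch below assumes an unweighted (macro) average over languages; the paper does not state which averaging scheme was used, and the predictions shown are toy values, not experimental results.

```python
from collections import defaultdict


def accuracy_by_language(examples):
    """Compute accuracy per language.

    examples is an iterable of (language, gold_label, predicted_label).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for lang, gold, pred in examples:
        total[lang] += 1
        correct[lang] += int(gold == pred)
    return {lang: correct[lang] / total[lang] for lang in total}


def macro_average(scores):
    """Unweighted mean over languages (an assumed averaging scheme)."""
    return sum(scores.values()) / len(scores)


# Toy predictions for two languages, for illustration only.
toy_examples = [
    ("da", "positive", "positive"),
    ("da", "negative", "neutral"),
    ("sv", "neutral", "neutral"),
    ("sv", "positive", "positive"),
]
per_language = accuracy_by_language(toy_examples)
average = macro_average(per_language)
print(per_language, average)
```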
4.3 Case Study on Named Entity Recognition
Given a sentence $s$ in a target language $T$, which consists of a series of words $s = \{w_1, w_2, \ldots, w_n\}$, NER outputs a sequence of labels $\{y_1, y_2, \ldots, y_n\}$ with respect to the named entity type in {Person, Location, Organization, Date, Time, JobTitle, Duration, Facility, GeographicFeature, Measure, Ordinal, Money}. This is much more fine-grained and complex than the traditional CoNLL NER task that only considers 4 entity types (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003). We follow the Inside-outside-beginning (IOB2) tagging format (Ramshaw and Marcus, 1999): a B- prefix means that the tag is the beginning of a chunk, an I- prefix indicates that the tag is inside a chunk, and an O tag indicates that a token belongs to no chunk.
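The IOB2 format can be made concrete with a short sketch that converts entity spans into per-token tags. The function name `spans_to_iob2`, the span encoding (start index, exclusive end index, entity type), and the example sentence are assumptions chosen for illustration.

```python
def spans_to_iob2(tokens, spans):
    """Convert entity spans to IOB2 tags.

    spans is a list of (start, end_exclusive, entity_type) over token
    indices; spans are assumed not to overlap.
    """
    tags = ["O"] * len(tokens)  # tokens outside any chunk
    for start, end, entity_type in spans:
        tags[start] = f"B-{entity_type}"  # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = f"I-{entity_type}"  # remaining tokens of the chunk
    return tags


tokens = ["Marie", "Curie", "worked", "in", "Paris", "for", "two", "years"]
spans = [(0, 2, "Person"), (4, 5, "Location"), (6, 8, "Duration")]
tags = spans_to_iob2(tokens, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

With 12 entity types, this scheme yields 25 possible tags per token: a B- and an I- tag for each type, plus O.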
We build an LIM for NER with annotated data in 3 languages: French, Italian and German. The training set consists of 679K data points (148K in French, 470K in Italian and 61K in German). We randomly hold out 1/10 of the training set as the development set for model selection and use the rest for model training (i.e., fine-tuning the parameters of Multilingual BERT and the sequence labeling layer). After fine-tuning with the parameter choices described in Section 4.2, we selected the model with batch size 32, learning rate 2e-5 and 3 epochs as the best model based on its performance on the development set.
4.3.1 Compared Methods
We compare LIM with the following methods:
• BiLSTM+CRF (Lample et al., 2016) is a bidirectional LSTM with a sequential conditional random field above it. We use this method to train monolingual NER models as a baseline because it has been effective and widely used on sequence labeling tasks.
• FLAIR (Akbik et al., 2019) is one of the latest NLP frameworks that achieved state-of-the-art results on sequence labeling tasks. It models words as sequences of characters and leverages contextual string embeddings produced from a trained character language model (Akbik et al., 2018). We adopt the pre-trained multilingual FLAIR embeddings to build multilingual NER models using the FLAIR framework.
4.3.2 Results
We evaluate the models on high-quality in-house benchmark datasets for NER in various languages including French (3870 entities), Italian (3776 entities), and German (5023 entities).
First, we report the F-measure results of NER on French, Italian and German in Table 5. On French, our LIM achieves a significant improvement in performance of 9.9% over BiLSTM+CRF and 7.1% over FLAIR. Similarly, on German, it outperforms these two methods by 6.1% and 2.4% respectively. On Italian, our LIM approach is comparable to BiLSTM+CRF and outperforms FLAIR by 3.5%.
Second, we evaluate the effects of our LIM approach for zero-shot transfer learning on NER. We trained another FLAIR model and another LIM using only the concatenation of French and Italian annotated data, excluding German annotated data. Table 6 shows that our LIM method retains a performance of 58.6% while FLAIR drops to 20.3%. This demonstrates the power of our LIM method in accelerating the development of models for a new language where no annotated data is available.
Table 5: F-measure results of NER on French, Italian and German. The BiLSTM+CRF models were trained using monolingual data in each language respectively. The FLAIR and LIM models were trained using the concatenation of French, Italian and German annotated data.
Table 6: F-measure results of NER on German (zero-shot transfer learning). The FLAIR and LIM models were trained using the concatenation of French and Italian annotated data, while German annotated data was excluded.
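For readers unfamiliar with how an F-measure is typically computed for NER, the sketch below shows an entity-level, exact-match variant over IOB2 tags. Whether the reported numbers use exactly this variant is an assumption, since the scoring procedure is not specified here; the function names and the toy tag sequences are illustrative.

```python
def iob2_to_spans(tags):
    """Extract (start, end_exclusive, entity_type) spans from IOB2 tags."""
    spans = set()
    start, entity_type = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        # Close the open span unless the current tag continues it.
        if start is not None and tag != f"I-{entity_type}":
            spans.add((start, i, entity_type))
            start, entity_type = None, None
        if tag.startswith("B-"):
            start, entity_type = i, tag[2:]
    return spans


def f_measure(gold_tags, pred_tags):
    """Entity-level F1: a predicted span counts only on an exact match."""
    gold = iob2_to_spans(gold_tags)
    pred = iob2_to_spans(pred_tags)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example: the prediction finds the person but misses the location.
gold = ["B-Person", "I-Person", "O", "O", "B-Location"]
pred = ["B-Person", "I-Person", "O", "O", "O"]
score = f_measure(gold, pred)
print(round(score, 4))
```

In this toy case precision is 1.0 and recall is 0.5, so the F-measure is 2/3. A prediction that tags only part of an entity earns no credit under exact matching.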
4.4 Discussion
Task Type While the results demonstrate the effectiveness of LIMs on two of the most representative NLP tasks, we found that LIMs are generally more effective on a sentence classification task than on a sequence labeling task, particularly for zero-shot transfer learning. For example, when no annotated data from the target language was used in model training, LIM outperforms the corresponding baseline on Sentiment Analysis (Table 3), but falls behind the corresponding baseline on NER (Tables 5 and 6).
Language Set Powered by the multilingual representations learned in pre-trained BERT, LIMs seem more suitable for typologically similar languages. For instance, the LIM-H is not as good as the model trained using Chinese-only BERT on Sentiment Analysis, though the difference is relatively small (Table 2). This is consistent with the findings from multilingual representation learning using word embeddings (Conneau et al., 2017).
Data Resource Language-independent models are not only suitable for high-resource languages, but also very effective in low-resource languages. In particular, adding a relatively small amount of low-resource training data can result in a significant improvement in performance (Table 4).
Implications These insights bring unique value to the development and customization of natural language understanding models and solutions in new languages. First, LIMs can be used to solve the cold-start problem, where no initial model is available for a new target language and building such a model from scratch is costly. Second, they largely save the cost and time of acquiring annotated data for a new target language by reusing data already annotated in previously supported languages. Third, they simplify the deployment of a new model and save the effort of simultaneously maintaining multiple monolingual models in a production setting.
As the use of machine learning becomes more pervasive all over the world, people speaking different languages will come to expect a seamless and customized experience of their own. Building a language-independent model can accelerate the enablement of machine learning and cognitive solutions in new languages at a large scale. We demonstrate the power of this language-independent modeling approach through a series of experiments on multiple task types, language sets and data resources. Our annotated data for low-resource languages will be made publicly available. We hope that the insights gained from these experiments will help researchers and practitioners develop solutions and tools that enable better scalability, integration and operations in many other languages. In the future, we will continue to explore the effects of different combinations of languages with respect to various end tasks. We also plan to extend the studies to more NLP tasks, and investigate the feasibility of multi-task learning for building a task- and language-independent framework.
Alan Akbik, Tanja Bergmann, Duncan Blythe, Kashif Rasul, Stefan Schweter, and Roland Vollgraf. 2019. FLAIR: An easy-to-use framework for state-of-the-art NLP. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 54–59.
Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1638–1649.
Rama Akkiraju, Vibha Sinha, Anbang Xu, Jalal Mahmud, Pritam Gundecha, Zhe Liu, Xiaotong Liu, and John Schumacher. 2018. Characterizing machine learning process: A maturity framework. arXiv preprint arXiv:1811.04871.
Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2017. Unsupervised neural machine translation. arXiv preprint arXiv:1710.11041.
Mikel Artetxe and Holger Schwenk. 2018. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. arXiv preprint arXiv:1812.10464.
Alexis Conneau, Guillaume Lample, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2017. Word translation without parallel data. arXiv preprint arXiv:1710.04087.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
Manaal Faruqui and Chris Dyer. 2014. Improving vector space word representations using multilingual correlation. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 462–471.
Daniela Gerz, Ivan Vulić, Edoardo Maria Ponti, Roi Reichart, and Anna Korhonen. 2018. On the relation between linguistic typology and (limitations of) multilingual language modeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 316–327.
Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339.
Melvin Johnson, Mike Schuster, Quoc V Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, et al. 2017. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.
Jungo Kasai, Kun Qian, Sairam Gurajada, Yunyao Li, and Lucian Popa. 2019. Low-resource deep entity resolution with transfer and active learning. arXiv preprint arXiv:1906.08042.
Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of NAACL-HLT, pages 260–270.
Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc’Aurelio Ranzato. 2017. Unsupervised machine translation using monolingual corpora only. arXiv preprint arXiv:1711.00043.
Xiaotong Liu, Anbang Xu, Vibha Sinha, and Rama Akkiraju. 2018. Voice of customer: a tone-based analysis system for online user engagement. In Extended Abstracts of the 2018 CHI Conference on Human Factors in Computing Systems, page LBW001. ACM.
Tomas Mikolov, Quoc V Le, and Ilya Sutskever. 2013. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168.
Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf.
Lance A Ramshaw and Mitchell P Marcus. 1999. Text chunking using transformation-based learning. In Natural language processing using very large corpora, pages 157–176. Springer.
Erik F. Tjong Kim Sang. 2002. Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition. In Proceedings of the 6th Conference on Natural Language Learning - Volume 20, COLING-02, pages 1–4, Stroudsburg, PA, USA. Association for Computational Linguistics.
Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, CONLL ’03, pages 142–147, Stroudsburg, PA, USA. Association for Computational Linguistics.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.