Neural machine translation (NMT) has achieved impressive performance on machine translation tasks in recent years. However, for efficiency, model training employs a limited-size vocabulary that contains only the top-N most frequent words, which leads to many rare and unknown words. Translation is particularly difficult from low-resource and morphologically-rich agglutinative languages, which have complex morphology and large vocabularies. In this paper, we propose a source-side morphological word segmentation method for NMT that incorporates morphology knowledge to preserve the linguistic and semantic information in the word structure while reducing the vocabulary size at training time. It can also be used as a preprocessing tool to segment words in agglutinative languages for other natural language processing (NLP) tasks. Experimental results show that our morphologically motivated word segmentation method is better suited to the NMT model: by reducing data sparseness and language complexity, it achieves significant improvements on Turkish-English and Uyghur-Chinese machine translation tasks.
Neural machine translation (NMT) has achieved impressive performance on machine translation tasks in recent years for many language pairs (Sutskever et al., 2014; Bahdanau et al., 2015; Luong et al., 2015). However, owing to time and memory constraints, the NMT model generally employs a limited-size vocabulary that contains only the top-N most frequent words (commonly in the range of 30K to 80K) (Jean et al., 2015), which leads to the out-of-vocabulary (OOV) problem and, in turn, to inaccurate translations. Research has indicated that sentences with too many unknown words tend to be translated much more poorly than sentences with mainly frequent words. For low-resource machine translation tasks with a morphologically-rich source side, such as Turkish-English and Uyghur-Chinese, these issues are more serious because, with so many rare and unknown words in the training corpus, the NMT model cannot effectively identify the complex morpheme structure or capture the linguistic and semantic information.
Both Turkish and Uyghur are agglutinative and highly-inflected languages in which a word is formed by attaching suffixes to a stem (Ablimit et al., 2010). A word consists of smaller morpheme units without any delimiter between them, and its structure can be denoted as “stem + suffix1 + suffix2 + ... + suffixN”. A stem is followed by zero or more suffixes, which have many inflected and morphological variants depending on case, number, gender, and so on. The complex morpheme structure and relatively free constituent order can produce a very large vocabulary because of derivational morphology, so when translating from agglutinative languages, many words are unseen at training time. Moreover, depending on the semantic context, the same word can have different segmentation forms in the training corpus.
Table 1: The sentence examples with different segmentation strategies for Turkish-English.
For the purpose of incorporating morphology knowledge of agglutinative languages into word segmentation for NMT, we propose a morphological word segmentation method on the source-side of Turkish-English and Uyghur-Chinese machine translation tasks, which segments the complex words into simple and effective morpheme units while reducing the vocabulary size for model training. In this paper, we investigate and compare the following segmentation strategies:
• Stem with combined suffix
• Stem with singular suffix
• Byte Pair Encoding (BPE)
• BPE on stem with combined suffix
• BPE on stem with singular suffix
The latter two segmentation strategies are our newly proposed methods. Experimental results show that our morphologically motivated word segmentation method achieves significant improvements of up to 1.2 and 2.5 BLEU points on the Turkish-English and Uyghur-Chinese machine translation tasks, respectively, over the strong baseline of the pure BPE method, indicating that it provides better translation performance for the NMT model.
In this section, we describe two popular word segmentation methods and our newly proposed segmentation strategies. The two popular methods are morpheme segmentation (Ablimit et al., 2010) and Byte Pair Encoding (BPE) (Sennrich et al., 2015). After word segmentation, we add a specific symbol behind each separated subword unit to help the NMT model identify the morpheme boundaries and capture the semantic information effectively. Sentence examples with different segmentation strategies for the Turkish-English machine translation task are shown in Table 1.
2.1 Morpheme Segmentation
Words in Turkish and Uyghur are formed by a stem followed by an unlimited number of suffixes. Both the stem and the suffixes are called morphemes, and they are the smallest functional units in agglutinative languages. Studies have indicated that modeling language based on morpheme units can provide better performance (Ablimit et al., 2014). Morpheme segmentation splits a complex word into morpheme units of stem and suffix. This representation maintains a full description of the morphological properties of subwords while minimizing the data sparseness caused by inflection and allomorphy in highly-inflected languages.
2.1.1 Stem with Combined Suffix
In this segmentation strategy, each word is segmented into a stem unit and a combined suffix unit. We add “##” behind the stem unit and add “$$” behind the combined suffix unit. We denote this method as SCS. The segmented word can be denoted as two parts of “stem##” and “suffix1suffix2...suffixN$$”. If the original word has no suffix unit, the word is treated as its stem unit. All the following segmentation strategies will follow this rule.
2.1.2 Stem with Singular Suffix
In this segmentation strategy, each word is segmented into a stem unit and a sequence of suffix units. We add “##” behind the stem unit and add “$$” behind each singular suffix unit. We denote this method as SSS. The segmented word can be denoted as a sequence of “stem##”, “suffix1$$”, “suffix2$$” until “suffixN$$”.
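The two marking schemes can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tool: it assumes the morphological analyzer has already returned a stem and a list of suffixes, the function names `segment_scs` and `segment_sss` are hypothetical, and it reads the no-suffix rule as "the whole word is the stem unit and therefore receives “##”", which is one possible interpretation of the text.

```python
def segment_scs(stem, suffixes):
    """Stem with combined suffix (SCS): 'stem##' plus one merged suffix unit."""
    # Assumption: a word without suffixes is treated as a stem unit and
    # still receives the '##' marker.
    units = [stem + "##"]
    if suffixes:
        # All suffixes are concatenated into a single unit marked with '$$'.
        units.append("".join(suffixes) + "$$")
    return units


def segment_sss(stem, suffixes):
    """Stem with singular suffix (SSS): 'stem##' plus one unit per suffix."""
    return [stem + "##"] + [suffix + "$$" for suffix in suffixes]


# Illustrative analysis (assumed, not taken from Table 1):
# Turkish "evlerimizde" ("in our houses") = ev + ler + imiz + de.
stem, suffixes = "ev", ["ler", "imiz", "de"]
print(" ".join(segment_scs(stem, suffixes)))  # ev## lerimizde$$
print(" ".join(segment_sss(stem, suffixes)))  # ev## ler$$ imiz$$ de$$
```

SCS keeps the sequence short at the cost of a larger suffix vocabulary, while SSS yields longer sequences built from a small set of reusable suffix units.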
2.2 Byte Pair Encoding (BPE)
Table 2: The training corpus statistics of Turkish-English machine translation task.
Table 3: The training corpus statistics of Uyghur-Chinese machine translation task.
BPE (Gage, 1994) is originally a data compression technique, and it was adapted by Sennrich et al. (2015) for word segmentation and vocabulary reduction by encoding the rare and unknown words as a sequence of subword units, in which the most frequent character sequences are merged iteratively. Frequent character n-grams are eventually merged into a single symbol. This is based on the intuition that various word classes are translatable via smaller units than words. This method makes the NMT model capable of open-vocabulary translation, which can generalize to translate and produce new words on the basis of these subword units. The BPE algorithm can be run on the dictionary extracted from a training text, with each word being weighted by its frequency. In this segmentation strategy, we add “@@” behind each non-final subword unit of the segmented word.
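The merge procedure can be illustrated with a simplified Python sketch. It is not the reference implementation: it omits the end-of-word symbol that the original BPE tooling uses, breaks frequency ties lexicographically (an arbitrary choice), and the names `learn_bpe`, `merge_pair`, and `apply_bpe` are hypothetical.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by one symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge operations from a {word: frequency} dict."""
    # Every word starts as a sequence of single characters.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair; ties are broken lexicographically (assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges


def apply_bpe(word, merges):
    """Segment `word` with the learned merges, marking non-final units with '@@'."""
    if not word:
        return []
    symbols = tuple(word)
    for pair in merges:  # replay the merges in the order they were learned
        symbols = merge_pair(symbols, pair)
    return [s + "@@" for s in symbols[:-1]] + [symbols[-1]]


# Toy frequency dictionary (illustrative only).
merges = learn_bpe({"ab": 3, "abc": 2}, num_merges=10)
print(merges)                    # [('a', 'b'), ('ab', 'c')]
print(apply_bpe("abd", merges))  # ['ab@@', 'd']
```

Because unseen words fall back to smaller units (ultimately single characters), any input can be encoded, which is what enables open-vocabulary translation.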
2.3 Morphologically Motivated Segmentation
The problem with morpheme segmentation is that the vocabulary of stem units is still very large, which leads to many rare and unknown words at training time. The problem with BPE is that it does not consider the morpheme boundaries inside words, which might cause a loss of morphological properties and semantic information. Hence, based on the analysis of the above popular word segmentation methods, we propose a morphologically motivated segmentation strategy that combines morpheme segmentation and BPE to further improve the translation performance of NMT.
Compared with a sentence of word surface forms, the corresponding sentence of stem units contains only the structure information without the morphological information, which generalizes better over inflectional variants of the same word and reduces data sparseness (Tamchyna et al., 2016). Therefore, we learn a BPE model on the stem units in the training corpus rather than on the words, and then apply it to the stem unit of each word after morpheme segmentation.
2.3.1 BPE on Stem with Combined Suffix
In this segmentation strategy, we first segment each word into a stem unit and a combined suffix unit as in SCS. Second, we apply BPE on the stem unit. Third, we add “$$” behind the combined suffix unit. If the stem unit is not segmented, we add “##” behind it. Otherwise, we add “@@” behind each non-final subword of the segmented stem unit. We denote this method as BPE-SCS.
2.3.2 BPE on Stem with Singular Suffix
In this segmentation strategy, we first segment each word into a stem unit and a sequence of suffix units as in SSS. Second, we apply BPE on the stem unit. Third, we add “$$” behind each singular suffix unit. If the stem unit is not segmented, we add “##” behind it. Otherwise, we add “@@” behind each non-final subword of the segmented stem unit. We denote this method as BPE-SSS.
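The marker rules of BPE-SCS and BPE-SSS can be sketched as follows. This is an illustration under stated assumptions: `stem_bpe` is a hypothetical callable standing in for a BPE model trained on stem units, `toy_stem_bpe` is a fixed-length splitter used only to make the example runnable, and the final subword of a split stem is left unmarked because the text specifies a marker only for the non-final subwords.

```python
def mark_stem(stem_pieces):
    """Attach markers to the subwords that BPE produced for one stem."""
    if len(stem_pieces) == 1:
        # Stem left intact by BPE: mark it with '##'.
        return [stem_pieces[0] + "##"]
    # Stem split by BPE: '@@' on every non-final subword; the final subword
    # is left unmarked (assumption, the text does not give it a marker).
    return [p + "@@" for p in stem_pieces[:-1]] + [stem_pieces[-1]]


def segment_bpe_scs(stem, suffixes, stem_bpe):
    """BPE on stem with combined suffix (BPE-SCS)."""
    units = mark_stem(stem_bpe(stem))
    if suffixes:
        units.append("".join(suffixes) + "$$")
    return units


def segment_bpe_sss(stem, suffixes, stem_bpe):
    """BPE on stem with singular suffix (BPE-SSS)."""
    return mark_stem(stem_bpe(stem)) + [s + "$$" for s in suffixes]


def toy_stem_bpe(stem):
    """Hypothetical stand-in for a trained stem-level BPE model."""
    return [stem] if len(stem) <= 3 else [stem[:3], stem[3:]]


# Illustrative analysis (assumed): "kitaplarda" ("in the books") = kitap + lar + da.
print(segment_bpe_scs("kitap", ["lar", "da"], toy_stem_bpe))  # ['kit@@', 'ap', 'larda$$']
print(segment_bpe_sss("kitap", ["lar", "da"], toy_stem_bpe))  # ['kit@@', 'ap', 'lar$$', 'da$$']
print(segment_bpe_sss("ev", ["ler", "de"], toy_stem_bpe))     # ['ev##', 'ler$$', 'de$$']
```

Since BPE only ever sees stems, a merge can never cross the boundary between a stem and a suffix, which is the property that distinguishes these strategies from pure BPE.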
3.1 Experimental Setup
Turkish-English Data: Following Sennrich et al. (2016), we use the WIT corpus (Cettolo et al., 2012) and the SETimes corpus (Tyers et al., 2010) for model training, and use newsdev2016 from the Workshop on Machine Translation in 2016 (WMT2016) for validation. The test data are newstest2016 and newstest2017.
Uyghur-Chinese Data: We use the news data from the China Workshop on Machine Translation in 2017 (CWMT2017) for model training, validation, and testing.
Data Preprocessing: We utilize Zemberek with a morphological disambiguation tool to segment the Turkish words into morpheme units, and utilize the morphology analysis tool (Tursun et al., 2016) to segment the Uyghur words into morpheme units. We employ the Python toolkit jieba for Chinese word segmentation. We apply BPE on the target-side words, setting the number of merge operations to 35K for Chinese and 30K for English, and we set the maximum sentence length to 150 tokens. The training corpus statistics of Turkish-English and Uyghur-Chinese machine translation tasks are shown in Table 2 and Table 3 respectively.
Table 4: The training corpus statistics with different segmentation strategies of Turkish.
Table 5: The training corpus statistics with different segmentation strategies of Uyghur.
Number of Merge Operations: We set the number of merge operations on the stem units so that the vocabulary sizes of the BPE, BPE-SCS and BPE-SSS segmentation strategies stay on the same scale. The settings are as follows.
In the Turkish-English machine translation task, we set the number of merge operations to 35K on the words for the pure BPE strategy, to 15K on the stem units for the BPE-SCS strategy, and to 25K on the stem units for the BPE-SSS strategy. In the Uyghur-Chinese machine translation task, we set the number of merge operations to 38K on the words for the pure BPE strategy, to 10K on the stem units for the BPE-SCS strategy, and to 35K on the stem units for the BPE-SSS strategy. The detailed training corpus statistics with different segmentation strategies of Turkish and Uyghur are shown in Table 4 and Table 5 respectively.
According to Table 4 and Table 5, both Turkish and Uyghur have a very large vocabulary even in the low-resource training corpus. We therefore propose the morphological word segmentation strategies BPE-SCS and BPE-SSS, which additionally apply BPE on the stem units after morpheme segmentation and thus not only take the morphological properties into account but also eliminate the rare and unknown words.
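The effect of segmentation on vocabulary size can be checked with a small counting helper. This is a sketch only: which statistics Table 4 and Table 5 report is not recoverable here, so token and type counts are assumed, the function name `corpus_stats` is hypothetical, and the Turkish toy corpus with its hand-made SSS segmentation is illustrative.

```python
from collections import Counter


def corpus_stats(sentences):
    """Count running tokens and distinct types in whitespace-tokenized text."""
    counts = Counter(token for sentence in sentences for token in sentence.split())
    return {"tokens": sum(counts.values()), "types": len(counts)}


# Toy corpus: six inflected forms of "ev" ("house").
word_level = ["evler evde evlerde", "evim evimde evlerim"]
# The same corpus after a hand-made SSS-style segmentation (assumed analyses).
sss_level = [
    "ev## ler$$ ev## de$$ ev## ler$$ de$$",
    "ev## im$$ ev## im$$ de$$ ev## ler$$ im$$",
]

print(corpus_stats(word_level))  # {'tokens': 6, 'types': 6}
print(corpus_stats(sss_level))   # {'tokens': 15, 'types': 4}
```

The toy numbers show the trade-off the merge-operation settings are meant to balance: segmentation shrinks the vocabulary but lengthens the sequences.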
3.2 NMT Configuration
Table 6: Experimental results of Turkish-English machine translation task.
Table 7: Experimental results of Uyghur-Chinese machine translation task.
We employ the Transformer model (Vaswani et al., 2017) with a self-attention architecture, as implemented in the Sockeye toolkit (Hieber et al., 2017). Both the encoder and decoder have 6 layers. We set the number of hidden units to 512, the number of heads for self-attention to 8, the source and target word embedding size to 512, and the number of hidden units in feed-forward layers to 2048. We train the NMT model by using the Adam optimizer (Kingma et al., 2014) with a batch size of 128 sentences, and we shuffle all the training data at each epoch. The label smoothing is set to 0.1. We report the result of averaging the parameters of the 4 checkpoints with the best validation perplexity. Decoding is performed by beam search with a beam size of 5. To effectively evaluate the machine translation quality, we report the case-sensitive BLEU score with standard tokenization and the character n-gram ChrF3 score.
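The checkpoint-averaging step in the configuration above can be sketched in plain Python. This is not Sockeye's implementation: parameters are represented as flat lists of floats for simplicity, and the names `best_checkpoints` and `average_checkpoints` are hypothetical.

```python
def best_checkpoints(perplexities, k=4):
    """Return the ids of the k checkpoints with the lowest validation perplexity."""
    return sorted(perplexities, key=perplexities.get)[:k]


def average_checkpoints(checkpoints):
    """Average parameters element-wise over a list of {name: [floats]} dicts."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    n = len(checkpoints)
    averaged = {}
    for name in checkpoints[0]:
        # Assumes every checkpoint has the same parameter names and shapes.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(values) / n for values in columns]
    return averaged


# Toy example with made-up perplexities and two-element parameter vectors.
ppl = {"ckpt1": 9.5, "ckpt2": 8.0, "ckpt3": 8.5, "ckpt4": 12.0, "ckpt5": 8.25}
params = {name: {"w": [float(i), float(i) * 2]} for i, name in enumerate(ppl, start=1)}
chosen = best_checkpoints(ppl, k=4)
print(chosen)  # ['ckpt2', 'ckpt5', 'ckpt3', 'ckpt1']
print(average_checkpoints([params[name] for name in chosen]))  # {'w': [2.75, 5.5]}
```

Averaging neighboring good checkpoints is a common way to smooth out noise from individual training steps without training an ensemble.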
In this paper, we investigate and compare morpheme segmentation, BPE and our proposed morphological segmentation strategies on the low-resource and morphologically-rich agglutinative languages. Experimental results of Turkish-English and Uyghur-Chinese machine translation tasks are shown in Table 6 and Table 7 respectively.
According to Table 6 and Table 7, both the BPE-SCS and BPE-SSS strategies outperform morpheme segmentation and the strong baseline of the pure BPE method. In particular, the BPE-SSS strategy performs best, achieving significant improvements of up to 1.2 BLEU points on the Turkish-English machine translation task and 2.5 BLEU points on the Uyghur-Chinese machine translation task. Furthermore, the improvement from our proposed segmentation strategy is less pronounced on the Turkish-English task than on the Uyghur-Chinese task. The probable reasons are as follows. The Turkish-English training corpus consists of talk and news data, and most of the talk data are short informal sentences compared with the news data, which provide less language information for the NMT model. Moreover, the test corpus consists of news data, so the domain mismatch between training and test data limits the improvement in translation quality.
In addition, we evaluate how the number of merge operations on the stem units for the BPE-SSS strategy affects machine translation quality. Experimental results are shown in Table 8 and Table 9. We find that 25K merge operations for Turkish, and 30K and 35K for Uyghur, maximize the translation performance. The probable reason is that these numbers of merge operations generate a more appropriate vocabulary that contains effective morpheme units and moderate subword units, which generalizes better over morphologically-rich words.
The NMT system is typically trained with a limited vocabulary, which creates a bottleneck on translation accuracy and generalization capability. Many word segmentation methods that consider the morphological properties of different languages have been proposed to cope with these problems.
Table 8: Different numbers of merge operations for BPE-SSS strategy on Turkish-English.
Table 9: Different numbers of merge operations for BPE-SSS strategy on Uyghur-Chinese.
Bradbury and Socher (Bradbury et al., 2014) employed a modified Morfessor to bring morphology knowledge into word segmentation, but they neglected the morphological variation between subword units, which might result in ambiguous translation results. Sánchez-Cartagena and Toral (Sánchez-Cartagena et al., 2016) proposed a rule-based morphological word segmentation for Finnish, which applies BPE on all the morpheme units uniformly without distinguishing their inner morphological roles. Huck et al. (Huck et al., 2017) explored a target-side segmentation method for German, showing that cascading suffix splitting and compound splitting with BPE can achieve better translation results. Ataman et al. (Ataman et al., 2017) presented a linguistically motivated vocabulary reduction approach for Turkish, which optimizes the segmentation complexity with a constraint on the vocabulary based on a category-based hidden Markov model (HMM). Our work is closely related to their idea, while ours is simpler and easier to implement. Tawfik et al. (Tawfik et al., 2019) confirmed that there is some advantage from using a high-accuracy dialectal segmenter jointly with a language-independent word segmentation method like BPE. The main difference is that their approach additionally needs sufficient monolingual data to train a segmentation model, while ours does not need any external resources, which is very convenient for word segmentation on the low-resource and morphologically-rich agglutinative languages.
In this paper, we investigate morphological segmentation strategies on the low-resource and morphologically-rich languages of Turkish and Uyghur. Experimental results show that our proposed morphologically motivated word segmentation method is better suited to NMT. The BPE-SSS strategy achieves the best machine translation performance, as it better preserves the syntactic and semantic information of words with complex morphology while reducing the vocabulary size for model training. Moreover, we evaluate how the number of merge operations on the stem units for the BPE-SSS strategy affects the translation quality, and we find that an appropriate vocabulary size is more useful for the NMT model.
In future work, we are planning to incorporate more linguistic and morphology knowledge into the training process of NMT to enhance its capacity of capturing syntactic structure and semantic information on the low-resource and morphologically-rich languages.
This work is supported by the National Natural Science Foundation of China, the Open Project of Key Laboratory of Xinjiang Uygur Autonomous Region, the Youth Innovation Promotion Association of the Chinese Academy of Sciences, and the High-level Talents Introduction Project of Xinjiang Uyghur Autonomous Region.
Tawfik, Ahmed and Emam, Mahitab and Essam, Khaled and Nabil, Robert and Hassan, Hany. Morphology-Aware word-segmentation in dialectal Arabic adaptation of neural machine translation. In Proceedings of the Fourth Arabic Natural Language Processing Workshop, pages 11-17, 2019.
Tamchyna, Aleš and Fraser, Alexander and Bojar, Ondřej and Junczys-Dowmunt, Marcin. Target-side context for discriminative models in statistical machine translation. In arXiv preprint arXiv:1607.01149, 2016.
Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, Lukasz and Polosukhin, Illia. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998-6008, 2017.
Bahdanau, Dzmitry and Cho, Kyunghyun and Bengio, Yoshua. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
Ataman, Duygu and Negri, Matteo and Turchi, Marco and Federico, Marcello. Linguistically motivated vocabulary reduction for neural machine translation from Turkish to English. In The Prague Bulletin of Mathematical Linguistics, volume 108(1), pages 331-342, 2017.
Tursun, Eziz and Ganguly, Debasis and Osman, Turghun and Yang, Ya-Ting and Abdukerim, Ghalip and Zhou, Jun-Lin and Liu, Qun. A semisupervised tag-transition-based markovian model for uyghur morphology analysis. In ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), volume 16(2), pages 8, 2016.
Hieber, Felix and Domhan, Tobias and Denkowski, Michael and Vilar, David and Sokolov, Artem and Clifton, Ann and Post, Matt. Sockeye: A toolkit for neural machine translation. In arXiv preprint arXiv:1712.05690, 2017.
Sutskever, Ilya and Vinyals, Oriol and Le, Quoc V. Sequence to sequence learning with neural networks. In Advances in NIPS, 2014.
Bradbury, James and Socher, Richard. MetaMind neural machine translation system for WMT 2016. In Proceedings of the First Conference on Machine Translation, pages 264-267, 2014.
Cettolo, Mauro and Girardi, Christian and Federico, Marcello. Wit3: Web inventory of transcribed and translated talks. In Conference of European Association for Machine Translation, pages 261-268, 2012.
Ablimit, Mijit and Neubig, Graham and Mimura, Masato and Mori, Shinsuke and Kawahara, Tatsuya and Hamdulla, Askar. Uyghur morpheme-based language models and ASR. In IEEE International Conference on Signal Processing, 2010.
Tyers, Francis M and Alperen, Murat Serdar. South-east european times: A parallel corpus of balkan languages. In Proceedings of the LREC Workshop on Exploitation of Multilingual Resources and Tools for Central and (South-) Eastern European Languages, pages 49-53, 2010.
Huck, Matthias and Riess, Simon and Fraser, Alexander. Target-side word segmentation strategies for neural machine translation. In Proceedings of the Second Conference on Machine Translation, pages 56-67, 2017.
Ablimit, Mijit and Kawahara, Tatsuya and Hamdulla, Askar. Lexicon optimization based on discriminative learning for automatic speech recognition of agglutinative language. In Speech Communication, volume 60, pages 78-87, 2014.
Sánchez-Cartagena, Víctor M. and Toral, Antonio. Abu-matran at wmt 2016 translation task: Deep learning, morphological segmentation and tuning on character sequences. In Proceedings of the First Conference on Machine Translation, pages 362-370, 2016.
Gage, Philip. A new algorithm for data compression. In The C Users Journal, volume 12(2), pages 23-38, 1994.
Kingma, Diederik P and Ba, Jimmy. Adam: A method for stochastic optimization. In Computer Science, 2014.
Sennrich, Rico and Haddow, Barry and Birch, Alexandra. Neural machine translation of rare words with subword units. In Computer Science, 2015.
Sennrich, Rico and Haddow, Barry and Birch, Alexandra. Improving neural machine translation models with monolingual data. In Computer Science, 2016.
Jean, Sébastien and Cho, Kyunghyun and Memisevic, Roland and Bengio, Yoshua. On using very large target vocabulary for neural machine translation. In ACL, 2015.
Luong, Minh Thang and Pham, Hieu and Manning, Christopher D. Effective approaches to attention-based neural machine translation. In Computer Science, 2015.