Neural Machine Translation [1]–[3] has been the state-of-the-art approach to translation in recent years [4], [5], outperforming Phrase-Based Statistical Machine Translation [6] when high-quality parallel data is abundant for the language pair [7]. For many language pairs, however, parallel data of this size is scarce and expensive to compile. Researchers have therefore proposed methods that exploit the more readily available monolingual data in one or both of the languages to augment the parallel data and improve the performance of the translation models. Such methods include integrating a language model [8], back-translation [9]–[11], forward translation [12] and dual learning [13].
Back-translation is simple and has been the most effective of these techniques so far [5], [9]. A target-to-source (backward) model is trained on the real bitext and used to translate a large monolingual corpus in the target language into the source language, producing synthetic parallel data. The real and synthetic data are then mixed to train a source-to-target (forward) model. The works of [5], [14], [15] found that indicating to the model which data is back-translated improves its performance. This was done by adding noise or tags (and gates) to the synthetic inputs.
This work aims to simplify the approaches of [5], [14], [15], which explicitly differentiate between the two kinds of data using noise, tags or gates. Instead of noising or tagging the back-translated data, our approach, tag-less back-translation, aims to enable the model to learn efficiently from both kinds of data through pretraining and finetuning. Pretraining involves training a model for some time on one dataset. The resulting model is not final: its learned parameters are either finetuned on in-domain data (domain adaptation) or transferred to a different dataset (transfer learning). As shown in Fig. 1, the forward model is first trained on the synthetic data and then finetuned on the natural data. Training on synthetic data alone has been shown to attain performance close to that of training on natural data only [5], [16], and finetuning has been shown to improve a model even when it was trained on general-domain data [10].
This section presents prior work on neural machine translation, back-translation and pretraining in neural machine translation.
A. Neural Machine Translation (NMT)
The neural machine translation (NMT) system is based on a sequence-to-sequence encoder-decoder architecture with an attention mechanism [1], [17], [18]. The encoder and decoder are neural networks that together model the conditional probability of a target sentence $y = (y_1, \dots, y_{T_y})$ given a source sentence $x = (x_1, \dots, x_{T_x})$. The encoder converts the input in the source language into a set of vectors, while the decoder converts this set of vectors into the target language through an attention mechanism, one word at a time. The attention mechanism was introduced to keep track of context in longer sentences [1].
The NMT model produces the translation by generating one target word at every time step. Given an input sequence $x$ and the previously translated words $y_1, \dots, y_{i-1}$, the probability of the next word $y_i$ is
$$p(y_i \mid y_1, \dots, y_{i-1}, x) = g(y_{i-1}, s_i, c_i),$$
where $s_i$ is the decoder hidden state for time step $i$, computed as
$$s_i = f(s_{i-1}, y_{i-1}, c_i).$$
Here, $f$ and $g$ are nonlinear transform functions, which can be implemented as a long short-term memory (LSTM) network [19] or gated recurrent units (GRU) [20], and $c_i$ is a distinct context vector at time step $i$, which is calculated as a weighted sum of the input annotations $h_j$:
$$c_i = \sum_{j=1}^{T_x} a_{i,j} h_j,$$
where $h_j$ is the annotation of $x_j$ calculated by a bidirectional recurrent neural network. The weight $a_{i,j}$ is calculated as
$$a_{i,j} = \frac{\exp(e_{i,j})}{\sum_{k=1}^{T_x} \exp(e_{i,k})}$$
and
$$e_{i,j} = v_a^{\top} \tanh(W s_{i-1} + U h_j),$$
where $v_a$ is the weight vector and $W$ and $U$ are the weight matrices.
All of the parameters in the NMT model, represented as $\theta$, are optimized to maximize the following conditional log-likelihood of the $N$ sentence-aligned bilingual samples:
$$L(\theta) = \frac{1}{N} \sum_{n=1}^{N} \sum_{i=1}^{T_y^{(n)}} \log p\big(y_i^{(n)} \mid y_1^{(n)}, \dots, y_{i-1}^{(n)}, x^{(n)}; \theta\big).$$
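As an illustration of the attention step described above, the following minimal sketch computes the alignment scores, the normalized weights and the context vector for a single decoder time step in plain Python. The function names, the toy dimensions and the hand-picked parameter values (`W`, `U`, `v`) are assumptions made for this example; it is not the implementation of any of the cited systems.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def additive_attention(prev_state, annotations, W, U, v):
    """Return (weights, context) for one decoder time step.

    prev_state  : previous decoder hidden state
    annotations : encoder annotations, one vector per source word
    W, U, v     : illustrative parameters of the scoring network
    """
    projected_state = matvec(W, prev_state)
    scores = []
    for h in annotations:
        projected_h = matvec(U, h)
        hidden = [math.tanh(a + b) for a, b in zip(projected_state, projected_h)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    # Softmax over source positions (shifted by the max for numerical stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: weighted sum of the annotations.
    dim = len(annotations[0])
    context = [sum(w * h[k] for w, h in zip(weights, annotations)) for k in range(dim)]
    return weights, context

# Toy example: three source positions, 2-dimensional states and annotations.
annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prev_state = [0.5, -0.5]
W = [[1.0, 0.0], [0.0, 1.0]]
U = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, 1.0]
weights, context = additive_attention(prev_state, annotations, W, U, v)
```

The weights form a probability distribution over the source positions, so the context vector is a convex combination of the annotations. In this toy input the third annotation receives the largest weight.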
To remove the recurrence and enable parallelization across multiple GPUs during training, convolutional neural networks were used to create the convolutional neural machine translation (CNMT) encoder-decoder architecture [2], [21]. CNMT uses 1-dimensional convolutional layers followed by gated linear units (GLU) [22]. The decoder computes and applies attention in each of its layers. The model uses positional embeddings along with residual connections.
The transformer architecture [3], [23] was introduced to remove both the recurrence and the convolutions of the previous architectures. It is based solely on multi-headed self-attention layers and enables parallelization across multiple GPUs, thereby reducing training time. The architecture is used in current state-of-the-art translation systems [4], [5].
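To make the self-attention idea concrete, here is a minimal single-head sketch of scaled dot-product attention in plain Python. It is a simplification under stated assumptions: the learned query, key and value projections and the multiple heads of the full architecture are omitted, and the input vectors are made-up toy values.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    top = max(values)
    exps = [math.exp(x - top) for x in values]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Single-head scaled dot-product attention over a sequence."""
    d_k = len(keys[0])
    width = len(values[0])
    outputs = []
    # Each position is processed independently of the others, which is
    # what makes this layer easy to parallelize.
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * val[j] for w, val in zip(weights, values))
                        for j in range(width)])
    return outputs

# Toy sequence of three 2-dimensional vectors attending to itself.
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(sequence, sequence, sequence)
```

No position depends on the output of a previous position, in contrast to a recurrent decoder state.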
B. Leveraging Monolingual Data for NMT
The use of monolingual data in the target and/or source language has been studied extensively as a way to improve translation models, especially in low-resource settings. Gulcehre et al. [8] explored the use of language models trained on monolingual data; [24] and [25] proposed using a copy and a slightly modified copy, respectively, of the target data as the source side; Sennrich et al. [10] proposed back-translation; Zhang and Zong [12] proposed forward translation; and [13], [26] used both forward and back-translation to improve the translation models. Back-translation has been shown to outperform the other approaches for both low- and high-resource languages [5], [9].
Studies of back-translation have aimed to improve the backward model, to select the most suitable generation (decoding) method and to reduce the impact of the ratio of synthetic to real bitext. Several studies have found that the quality of models trained using back-translation depends on the quality of the backward model [5], [9], [11], [15], [25], [27], [28]. To improve the quality of the synthetic parallel data, Hoang et al. [9] used iterative back-translation, in which the back-translated data is used iteratively to improve the backward model; Kocmi and Bojar [28] and Dabre et al. [29] used high-resource languages through transfer learning; and Zhang et al. [30] explored the use of both target and source monolingual data to improve both the backward and forward models.
Niu et al. [31] trained a bilingual system based on [32] to perform both forward and backward translation, eliminating the need for two separate models. Poncelas et al. [33] used transductive data selection methods to select monolingual data in the same domain as the test set for back-translation, which improved performance.
Fig. 1. Tag-less back-translation: the forward model is pre-trained on the synthetic parallel data generated using the backward model and is then finetuned on the natural data.
The works of [16], [27] found that the ratio of synthetic to natural data strongly affects the performance of the models. When the ratio is high, the model tends to learn mostly from the synthetic data, which contains more mistakes than the natural data. Other investigations found that sampling, or adding noise to the beam search output, outperforms regular beam decoding [5], [34]; these approaches are said to improve the models by enhancing source-side diversity. Caswell et al. [14] argued instead that the noise only indicates to the model whether an input is synthetic or natural, enabling the model to make better use of both kinds of data.
Yang et al. [15] and Caswell et al. [14] used tags (and gates) to enable the model to distinguish between the two kinds of data. This approach has been shown to make use of more synthetic data, to outperform standard back-translation and to enhance the efficiency of iterative back-translation.
C. Pretraining
Pretraining has been used successfully in various machine learning tasks to improve performance when there is not enough data to train a good model. It has been used for training word embeddings [35] and in computer vision [36], domain adaptation [5] and low-resource neural machine translation [7].
In neural machine translation, pretraining has been used for transfer learning: a model is trained on a high-resource language pair and the learned parameters are then transferred to a low-resource pair. The works of [7], [37], [38] reported large improvements over models trained from scratch on the low-resource data.
Pretraining has also been used in back-translation. Sennrich et al. [10] showed that finetuning a pre-trained model improves the quality of the back-translation model. Popel [39] pre-trained the model on the mixed synthetic and natural data and finetuned it on the natural data. Kocmi and Bojar [28] and Dabre et al. [29] pre-trained a model on a high-resource language pair and finetuned it on a low-resource language pair.
The approach is shown in Algorithm 1. The natural parallel data is used to train a target-to-source model. This model, the backward model, is used to translate the monolingual target data and thereby generate the synthetic parallel data. The synthetic data is then used to train the forward model until no improvement is observed on the development set. Finally, the forward model is finetuned on the natural data.
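The procedure described above can be sketched as a short Python function. This is only an outline under stated assumptions: `train(pairs, init)` and `translate(model, sentences)` are hypothetical callables standing in for whatever toolkit is used, `init=None` means training from scratch, and the stopping criterion is left to `train`. The stub functions at the end only record calls so that the order of the steps can be inspected.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)

def tagless_back_translation(
    natural: List[Pair],
    mono_target: List[str],
    train: Callable[[List[Pair], object], object],
    translate: Callable[[object, List[str]], List[str]],
):
    """Return a forward model pre-trained on synthetic data, finetuned on natural data."""
    # 1. Backward model: swap each pair so the target side becomes the input.
    backward = train([(tgt, src) for src, tgt in natural], None)
    # 2. Back-translate the monolingual target sentences into the source language.
    synthetic_sources = translate(backward, mono_target)
    synthetic = list(zip(synthetic_sources, mono_target))
    # 3. Pre-train the forward model on the synthetic pairs only (no tags, no noise).
    forward = train(synthetic, None)
    # 4. Finetune the pre-trained forward model on the natural pairs.
    return train(natural, forward)

# Stub "toolkit" (hypothetical) that only records what it was asked to do.
log = []

def fake_train(pairs, init):
    log.append(("train", len(pairs), init))
    return "model%d" % len(log)

def fake_translate(model, sentences):
    log.append(("translate", len(sentences), model))
    return ["synthetic source for: " + s for s in sentences]

final = tagless_back_translation(
    [("natural source", "natural target")],
    ["monolingual target 1", "monolingual target 2"],
    fake_train,
    fake_translate,
)
```

The last recorded call is a training run on the natural pairs that is initialized from the model pre-trained on the synthetic pairs. This initialization is the only place where the two kinds of data meet.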
A. Set-up
We used the TensorFlow [40] implementation of the OpenNMT [41] framework to train the models. The set-up is based on the NMTSmallV1 configuration, a 2-layer unidirectional LSTM encoder-decoder model with Luong attention [18]. It has 512 hidden units and a vocabulary size of 50,000 for both the source and target languages. The models were trained for 50,000 steps using the Adam optimizer [42] with a batch size of 64, a dropout probability of 0.3 and a static learning rate of 0.0002, and were evaluated on the development set after every 5,000 training steps.
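For quick reference, the hyperparameters stated above can be collected in a small Python dictionary. The key names are illustrative only and are not the option names of OpenNMT or any other toolkit. The derived quantity at the end assumes that the batch size counts sentence pairs, which the text does not state.

```python
# Hyperparameters as stated in the text; key names are illustrative assumptions.
setup = {
    "encoder_decoder": "2-layer unidirectional LSTM",
    "attention": "Luong",
    "hidden_units": 512,
    "vocabulary_size": 50_000,
    "train_steps": 50_000,
    "optimizer": "Adam",
    "batch_size": 64,
    "dropout": 0.3,
    "learning_rate": 0.0002,
    "eval_every_steps": 5_000,
}

# Training steps at which the development set is evaluated.
eval_steps = list(range(setup["eval_every_steps"],
                        setup["train_steps"] + 1,
                        setup["eval_every_steps"]))

# Sentence pairs consumed during training, assuming batch size counts pairs.
pairs_seen = setup["train_steps"] * setup["batch_size"]
```

With these values each model is evaluated at ten checkpoints, from step 5,000 to step 50,000.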
We first trained a backward (En-Vi) model on the IWSLT’15 English-Vietnamese parallel data. The model was used to translate the monolingual data to generate the synthetic parallel data. For the tagged baseline, the synthetic sources were labelled with the <BT> token and mixed with the natural data to form the mixed parallel corpus, which was used to train the forward tagged back-translation model, tagged_bt. The untagged synthetic corpus was used to pre-train a forward model, tagless_bt, for 30,000 training steps. Finally, the real parallel data was used to finetune the tagless_bt model for a further 20,000 training steps.
The two models, tagged_bt and tagless_bt, were then trained for a further 20,000 steps to determine whether, and by how much, they continue to improve. The models were evaluated using the bilingual evaluation understudy (BLEU) metric [43], specifically the multi-bleu implementation [44].
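The difference between the two training set-ups can be sketched in a few lines of Python. The sketch assumes that the <BT> token is prepended to the source sentence and that the mixed corpus is shuffled; the text states neither detail. The function names and the placeholder sentences are likewise assumptions for illustration.

```python
import random

BT_TAG = "<BT>"

def tag_synthetic(pairs, tag=BT_TAG):
    """Prepend the tag to the (synthetic) source side of every pair."""
    return [(tag + " " + src, tgt) for src, tgt in pairs]

def build_tagged_corpus(natural, synthetic, seed=0):
    """tagged_bt data: natural pairs mixed with tagged synthetic pairs."""
    mixed = list(natural) + tag_synthetic(synthetic)
    random.Random(seed).shuffle(mixed)  # shuffling is an assumption
    return mixed

def tagless_schedule(natural, synthetic):
    """tagless_bt plan: (stage name, training data, training steps)."""
    return [
        ("pretrain", list(synthetic), 30_000),  # untagged synthetic data only
        ("finetune", list(natural), 20_000),    # natural data only
    ]

# Placeholder sentence pairs (source, target).
natural = [("natural source 1", "natural target 1")]
synthetic = [("synthetic source 1", "monolingual target 1")]

mixed = build_tagged_corpus(natural, synthetic)
schedule = tagless_schedule(natural, synthetic)
```

Both set-ups use 50,000 training steps in total. The tagged model sees both kinds of data in every stage, whereas the tag-less model separates them in time and needs no change to the input.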
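For readers unfamiliar with the metric, the following is a simplified corpus-level BLEU computation in Python (clipped n-gram precisions up to 4-grams, geometric mean, brevity penalty). It is a sketch, not the multi-bleu script used in the experiments: it assumes pre-tokenized input, a single reference per sentence and no smoothing, so its scores need not match those of the script exactly.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100) for tokenized hypotheses, one reference each."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Clipping: an n-gram is credited at most as often as it
            # occurs in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if min(matches) == 0:
        return 0.0  # no smoothing: any zero precision gives a zero score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)

reference = "the cat sat on the mat".split()
perfect = corpus_bleu([reference], [reference])
partial = corpus_bleu(["the cat sat on the mat today".split()], [reference])
unrelated = corpus_bleu(["a b c d e f".split()], [reference])
```

An exact match scores 100, a hypothesis with no matching n-grams scores 0, and the hypothesis with one extra word falls in between.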
TABLE I. DATASETS USED
TABLE II. BLEU SCORES FOR EACH CHECKPOINT OF THE TWO MODELS.
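Algorithm 1. Tag-less back-translation
Input: natural parallel data; monolingual target data
1: Train the backward (target-to-source) model on the natural parallel data
2: Translate the monolingual target data using the backward model
3: Pair the translations with the monolingual target data to form the synthetic parallel data
4: Pre-train the forward model on the synthetic parallel data
5: Finetune the forward model on the natural parallel data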
B. Data
For this work, we used the preprocessed [45] low-resource English-Vietnamese parallel data of the IWSLT 2015 translation task [46]. We used the 2012 and 2013 test sets for development and testing, respectively. For the monolingual data, we used the preprocessed English monolingual data of the WMT 2017 English-Czech translation task [47]. The statistics of the datasets are shown in Table I. We shuffled the monolingual data and selected the same number of monolingual sentences as there are sentence pairs in the En-Vi parallel data.
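The shuffle-and-select step can be sketched as follows. The function name, the fixed random seed and the toy sizes are assumptions for illustration; the text does not say how the shuffling was seeded.

```python
import random

def select_monolingual(sentences, n, seed=1):
    """Shuffle a copy of the monolingual sentences and keep the first n."""
    pool = list(sentences)
    random.Random(seed).shuffle(pool)
    return pool[:n]

# Toy stand-ins: ten monolingual sentences, four parallel sentence pairs.
mono = ["sentence %d" % i for i in range(10)]
parallel_size = 4
subset = select_monolingual(mono, parallel_size)
```

Fixing the seed makes the selection reproducible, so the same synthetic corpus can be regenerated for both the tagged and the tag-less models.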
The backward model achieved a BLEU score of 24.66 after 50,000 training steps. Table II shows the BLEU scores of the forward models at various training steps, and Fig. 2 shows how the scores continue to improve as the number of training steps increases. The tagless_bt model continues to slightly outperform the tagged_bt model during training.
All the models, backward and forward, were trained for a fixed number of training steps. Their performance could be improved further by instead stopping training only when further training no longer improves the models.
The pre-trained model, although trained for fewer steps than the other models, performed very poorly. Finetuning the model resulted in a sharp rise in performance. The finetuning was run for as many steps as needed to make the total number of training steps of the tagless_bt model equal to that of the tagged_bt model.
This work has shown that a neural machine translation model pre-trained on synthetic data and finetuned on natural data performs comparably to the successful method of tagging the synthetic data. The technique, tag-less back-translation, consistently outperformed the tagging approach during training on low-resource English-Vietnamese translation.
In ongoing work, we plan to use more monolingual data and to investigate the effects of further training of the backward and pre-trained models. In the future, we will also investigate the technique on high-resource machine translation and with the more advanced Transformer architecture.
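A stopping criterion of this kind is often implemented with a patience counter, as in the following sketch. The function name, the patience value and the scores are made up for illustration and are not results from this work.

```python
def stopping_index(dev_scores, patience=3):
    """Return the index of the evaluation at which training would stop.

    dev_scores : development-set scores, one per evaluation (higher is better)
    patience   : number of evaluations without a new best score to tolerate
    """
    best = float("-inf")
    since_best = 0
    for i, score in enumerate(dev_scores):
        if score > best:
            best, since_best = score, 0
        else:
            since_best += 1
            if since_best >= patience:
                return i
    # The scores never stalled for `patience` evaluations in a row.
    return len(dev_scores) - 1

scores = [10.0, 14.0, 15.0, 15.0, 14.5, 14.0, 16.0]  # made-up values
stop_at = stopping_index(scores, patience=3)
```

With a patience of three, training on this made-up curve stops at the sixth evaluation and never reaches the later, higher score. Too small a patience value can therefore end training prematurely.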
[1] D. Bahdanau, K. Cho, and Y. Bengio, “Neural Machine Translation by Jointly Learning to Align and Translate,” arXiv:1409.0473, 2014.
[2] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin, “Convolutional Sequence to Sequence Learning,” arXiv:1705.03122v3, 2017.
[3] A. Vaswani et al., “Attention Is All You Need,” in 31st Conference on Neural Information Processing Systems, 2017.
[4] M. Ott, S. Edunov, D. Grangier, and M. Auli, “Scaling Neural Machine Translation,” arXiv:1806.00187v3, 2018.
[5] S. Edunov, M. Ott, M. Auli, and D. Grangier, “Understanding Back-Translation at Scale,” arXiv:1808.09381v2, 2018.
[6] P. Koehn, F. J. Och, and D. Marcu, “Statistical Phrase-Based Translation,” in Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, 2003, pp. 127–133.
[7] B. Zoph, D. Yuret, J. May, and K. Knight, “Transfer Learning for Low-Resource Neural Machine Translation,” arXiv:1604.02201v1, 2016.
[8] C. Gulcehre, O. Firat, K. Xu, K. Cho, and Y. Bengio, “On integrating a language model into neural machine translation,” Comput. Speech Lang., vol. 45, no. 2017, pp. 137–148, 2017.
[9] V. C. D. Hoang, P. Koehn, G. Haffari, and T. Cohn, “Iterative Back-Translation for Neural Machine Translation,” in Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, 2018, pp. 18–24.
[10] R. Sennrich, B. Haddow, and A. Birch, “Improving Neural Machine Translation Models with Monolingual Data,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 2016, pp. 86–96.
[11] M. Graca, Y. Kim, J. Schamper, S. Khadivi, and H. Ney, “Generalizing Back-Translation in Neural Machine Translation,” arXiv:1906.07286v1 [cs.CL], 2019.
[12] J. Zhang and C. Zong, “Exploiting Source-side Monolingual Data in Neural Machine Translation,” in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016, pp. 1535–1545.
[13] D. He et al., “Dual Learning for Machine Translation,” CoRR, vol. abs/1611.0, 2016.
[14] I. Caswell, C. Chelba, and D. Grangier, “Tagged Back-Translation,” arXiv:1906.06442v1 [cs.CL], 2019.
[15] Z. Yang, W. Chen, F. Wang, and B. Xu, “Effectively training neural machine translation models with monolingual data,” Neurocomputing, vol. 333, pp. 240–247, 2019.
[16] A. Poncelas, D. Shterionov, A. Way, G. W. Maillette de Buy, and P. Passban, “Investigating Backtranslation in Neural Machine Translation,” arXiv:1804.06189v1 [cs.CL], 2018.
[17] I. Sutskever, O. Vinyals, and Q. V Le, “Sequence to Sequence Learning with Neural Networks,” in NIPS, 2014.
[18] M. T. Luong, H. Pham, and C. D. Manning, “Effective Approaches to Attention-based Neural Machine Translation,” arXiv:1508.04025v5, 2015.
[19] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[20] K. Cho, B. van Merrienboer, Ç. Gülçehre, F. Bougares, H. Schwenk, and Y. Bengio, “Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation,” CoRR, vol. abs/1406.1, 2014.
Fig. 2. BLEU scores showing performance of the two approaches. The tagless_bt model outperforms the tagged_bt model at all training checkpoints.
[21] F. Wu, A. Fan, A. Baevski, Y. N. Dauphin, and M. Auli, “Pay Less
[22] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, “Language Modeling
[23] M. Dehghani, S. Gouws, O. Vinyals, J. Uszkoreit, and Ł. Kaiser,
[24] A. Currey, A. V. Miceli Barone, and K. Heafield, “Copied Monolingual Data Improves Low-Resource Neural Machine Translation,” in Proceedings of the Second Conference on Machine Translation, 2017, vol. 1, pp. 148–156.
[25] F. Burlot and F. Yvon, “Using Monolingual Data in Neural Machine
[26] Y. Xia et al., “Dual Learning for Machine Translation,” CoRR, vol.
[27] M. Fadaee and C. Monz, “Back-Translation Sampling by Targeting Difficult Words in Neural Machine Translation,” arXiv:1808.09006v2 [cs.CL], 2018.
[28] T. Kocmi and O. Bojar, “CUNI Submission for Low-Resource Languages in WMT News 2019,” in Proceedings of the Fourth Conference on Machine Translation (WMT), 2019, vol. 2, no. Shared Task Papers (Day 1), pp. 234–240.
[29] R. Dabre et al., “NICT’s Supervised Neural Machine Translation Systems for the WMT19 News Translation Task,” in Proceedings of the Fourth Conference on Machine Translation (WMT), 2019, vol. 2, no. Shared Task Papers (Day 1), pp. 168–174.
[30] Z. Zhang, S. Liu, M. Li, M. Zhou, and E. Chen, “Joint Training for Neural
[31] X. Niu, M. Denkowski, and M. Carpuat, “Bi-Directional Neural Machine Translation with Synthetic Parallel Data,” arXiv:1805.11213v2 [cs.CL], 2018.
[32] M. Johnson et al., “Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation,” Trans. Assoc. Comput. Linguist., vol. 5, pp. 339–351, 2017.
[33] A. Poncelas, G. W. Maillette de Buy, and A. Way, “Adaptation of Machine Translation Models with Back-translated Data using Transductive Data Selection Methods,” arXiv:1906.07808v1 [cs.CL], 2019.
[34] K. Imamura, A. Fujita, and E. Sumita, “Enhancement of Encoder and Attention Using Target Monolingual Corpora in Neural Machine Translation,” in Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, 2018, pp. 55–63.
[35] T. Mikolov, G. Corrado, K. Chen, and J. Dean, “Efficient Estimation of Word Representations in Vector Space,” arXiv:1301.3781v3, pp. 1–12, 2013.
[36] P. N. Whatmough, C. Zhou, P. Hansen, S. K. Venkataramanaiah, J. Seo, and M. Mattina, “FixyNN: Efficient Hardware for Mobile Computer Vision via Transfer Learning,” CoRR, vol. abs/1902.1, 2019.
[37] T. Kocmi and O. Bojar, “Trivial Transfer Learning for Low-Resource Neural Machine Translation,” in Proceedings of the Third Conference on Machine Translation (WMT), 2018, vol. 1, pp. 244–252.
[38] T. Q. Nguyen and D. Chiang, “Transfer Learning across Low-Resource, Related Languages for Neural Machine Translation,” in Proceedings of the Eighth International Joint Conference on Natural Language Processing, 2017, vol. 2, pp. 296–301.
[39] M. Popel, “Machine Translation Using Syntactic Analysis,” Charles
[40] M. Abadi et al., “TensorFlow: A System for Large-scale Machine Learning,” in Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, 2016, pp. 265–283.
[41] G. Klein, Y. Kim, Y. Deng, J. Senellart, and A. M. Rush, “OpenNMT: Open-Source Toolkit for Neural Machine Translation,” arXiv:1701.02810, Jan. 2017.
[42] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,”
[43] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: A Method for
40th Annual Meeting on Association for Computational Linguistics, 2002, pp. 311–318.
[44] P. Koehn et al., “Moses: Open Source Toolkit for Statistical Machine Translation,” in Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, 2007, pp. 177–180.
[45] M.-T. Luong and C. D. Manning, “Stanford Neural Machine Translation Systems for Spoken Language Domain,” in International Workshop on Spoken Language Translation, 2015.
[46] M. Cettolo, G. Christian, and M. Federico, “WIT3: Web Inventory of