2 BACKGROUND
2.1 NEURAL MACHINE TRANSLATION
of the (student) multilingual model, and denote the parameters of the (teacher) individual model
D, and denotes the total loss on training data
, which consists of the original
Table 5: BLEU score improvements of our method over the individual models (∆1) and the multilingual baseline model (∆2) on the 44 languages→English in the Ted talk dataset.
Selective Distillation We study the effectiveness of selective distillation (discussed in Section 3.3) on the Ted talk dataset, as shown in Table 6. We list the 16 languages on which the two methods (selective distillation and distillation all the time) differ by more than 0.5 BLEU. Selective distillation performs better on 13 of these 16 languages, with large BLEU score improvements, which demonstrates its effectiveness.
Table 6: BLEU scores of selective distillation (our method) and distillation all the time during the training process on the Ted talk dataset.
Top-K Distillation In our experiments, the student model matches only the top-K output distribution of the teacher model, instead of the full distribution, in order to reduce the memory cost. We analyze whether there is an accuracy difference between the top-K distribution and the full distribution. We conduct experiments on the IWSLT dataset with varying K (from 1 to |V|, where |V| is the vocabulary size), and show only the BLEU scores on the validation set of De-En translation due to space limitations, as listed in Table 7. Increasing K from 1 to 8 improves the accuracy, while larger K brings no further gains, even with the full distribution (K = |V|).
Table 7: BLEU scores on De-En translation with varying Top-K distillation on the IWSLT dataset.
Back Distillation In our current distillation algorithm, we fix the individual models and use them to teach and improve the multilingual model. After such a distillation process, the multilingual model outperforms the individual models on most of the languages. Naturally, we may wonder whether this improved multilingual model can in turn be used to teach and improve the individual models through knowledge distillation. We call such a process back distillation. We conduct the experiments on the IWSLT dataset and find that the accuracy on 9 out of 12 languages improves, as shown in Table 8. The other 3 languages (He, Pt, Zh) do not improve because the improved multilingual model performs very close to the individual models, as shown in Table 1.
Table 8: BLEU score improvements of the individual models with back distillation on the IWSLT dataset.
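To make the top-K matching discussed in this section concrete, the following is a minimal sketch for a single target position, not our actual implementation. The function names and toy numbers are hypothetical, and renormalizing the kept teacher probabilities so that they sum to 1 is an assumption made for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def top_k_distillation_loss(teacher_probs, student_logits, k):
    """Cross-entropy between the teacher's top-k distribution and the student.

    Assumption (for illustration only): the kept top-k teacher
    probabilities are renormalized so that they sum to 1.
    """
    if not 1 <= k <= len(teacher_probs):
        raise ValueError("k must be in [1, |V|]")
    # Indices of the k most probable tokens under the teacher.
    top = sorted(range(len(teacher_probs)),
                 key=lambda i: teacher_probs[i], reverse=True)[:k]
    mass = sum(teacher_probs[i] for i in top)
    student_log_probs = log_softmax(student_logits)
    # Only k (index, probability) pairs per position need to be stored.
    return -sum(teacher_probs[i] / mass * student_log_probs[i] for i in top)


teacher = [0.70, 0.20, 0.05, 0.03, 0.02]   # toy teacher distribution, |V| = 5
student = [2.0, 1.0, 0.1, -0.5, -1.0]      # toy student logits
loss_k1 = top_k_distillation_loss(teacher, student, k=1)
loss_k2 = top_k_distillation_loss(teacher, student, k=2)
loss_full = top_k_distillation_loss(teacher, student, k=len(teacher))
```

With K = 1 the loss reduces to the negative log-probability that the student assigns to the teacher's most likely token, and with K = |V| it is the ordinary cross-entropy against the full teacher distribution. The memory saving comes from storing K entries per position instead of |V|.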
Comparison with Sequence-Level Knowledge Distillation We conduct experiments to compare
Figure 1: The loss (Figure a) and BLEU score (Figure b: Ar-En, Figure c: Cs-En, Figure d: De-En) changes on the test set of the IWSLT dataset, with varying perturbation parameter σ.
Rohan Anil, Gabriel Pereyra, Alexandre Passos, Robert Ormandi, George E Dahl, and Geoffrey E Hinton. Large scale distributed neural network training through online distillation. arXiv preprint arXiv:1804.03235, 2018.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. ICLR 2015, 2015.
Cristian Bucilu, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 535–541. ACM, 2006.
Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-sgd: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.
Daxiang Dong, Hua Wu, Wei He, Dianhai Yu, and Haifeng Wang. Multi-task learning for multiple language translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pp. 1723–1732, 2015.
Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. Multi-way, multilingual neural machine translation with a shared attention mechanism. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego California, USA, June 12-17, 2016, pp. 866–875, 2016.
Markus Freitag, Yaser Al-Onaizan, and Baskaran Sankaran. Ensemble distillation for neural machine translation. arXiv preprint arXiv:1702.01802, 2017.
Tommaso Furlanello, Zachary C Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. Born again neural networks. arXiv preprint arXiv:1805.04770, 2018.
Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pp. 1243–1252, 2017.
Chengyue Gong, Xu Tan, Di He, and Tao Qin. Sentence-wise smooth regularization for sequence to sequence learning. In AAAI, 2018.
Jiatao Gu, Hany Hassan, Jacob Devlin, and Victor O. K. Li. Universal neural machine translation for extremely low resource languages. In NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pp. 344–354, 2018a.
Jiatao Gu, Yong Wang, Yun Chen, Kyunghyun Cho, and Victor OK Li. Meta-learning for low-resource neural machine translation. arXiv preprint arXiv:1808.08437, 2018b.
Junliang Guo, Xu Tan, Di He, Tao Qin, Linli Xu, and Tie-Yan Liu. Non-autoregressive neural machine translation with enhanced decoder input. In AAAI, 2018.
Thanh-Le Ha, Jan Niehues, and Alexander H. Waibel. Toward multilingual neural machine translation with universal encoder and decoder. CoRR, abs/1611.04798, 2016. URL http://arxiv.org/abs/1611.04798.
Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, Shujie Liu, Tie-Yan Liu, Renqian Luo, Arul Menezes, Tao Qin, Frank Seide, Xu Tan, Fei Tian, Lijun Wu, Shuangzhi Wu, Yingce Xia, Dongdong Zhang, Zhirui Zhang, and Ming Zhou. Achieving human parity on automatic chinese to english news translation. CoRR, abs/1803.05567, 2018. URL http://arxiv.org/abs/1803.05567.
Tianyu He, Xu Tan, Yingce Xia, Di He, Tao Qin, Zhibo Chen, and Tie-Yan Liu. Layer-wise coordination between encoder and decoder for neural machine translation. In NIPS, pp. 7955–7965, 2018.
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda B. Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google’s multilingual neural machine translation system: Enabling zero-shot translation. TACL, 5:339–351, 2017. URL https://transacl.org/ojs/index.php/tacl/article/view/1081.
Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.
Yoon Kim and Alexander M. Rush. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pp. 1317–1327, 2016a.
Yoon Kim and Alexander M Rush. Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947, 2016b.
Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Xu Lan, Xiatian Zhu, and Shaogang Gong. Knowledge distillation by on-the-fly native ensemble. arXiv preprint arXiv:1806.04606, 2018.
Yuncheng Li, Jianchao Yang, Yale Song, Liangliang Cao, Jiebo Luo, and Li-Jia Li. Learning from noisy labels with distillation. In ICCV, pp. 1928–1936, 2017.
Yichao Lu, Phillip Keung, Faisal Ladhak, Vikas Bhardwaj, Shaonan Zhang, and Jason Sun. A neural interlingua for multilingual machine translation. CoRR, abs/1804.08198, 2018. URL http://arxiv.org/abs/1804.08198.
Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. Multi-task sequence to sequence learning. CoRR, abs/1511.06114, 2015a. URL http://arxiv.org/abs/1511.06114.
Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pp. 1412–1421, 2015b.
Graham Neubig and Junjie Hu. Rapid adaptation of neural machine translation to new languages. arXiv preprint arXiv:1808.04189, 2018.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA, pp. 311–318, 2002. URL http://www.aclweb.org/anthology/P02-1040.pdf.
Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers, 2016. URL http://aclweb.org/anthology/P/P16/P16-1162.pdf.
Yanyao Shen, Xu Tan, Di He, Tao Qin, and Tie-Yan Liu. Dense information flow for neural machine translation. In NAACL, volume 1, pp. 1294–1303, 2018.
Kaitao Song, Xu Tan, Di He, Jianfeng Lu, Tao Qin, and Tie-Yan Liu. Double path networks for sequence to sequence learning. In COLING, pp. 3064–3074, 2018.
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In NIPS 2014, December 8-13 2014, Montreal, Quebec, Canada, pp. 3104–3112, 2014.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS 2017, 4-9 December 2017, Long Beach, CA, USA, pp. 6000–6010, 2017.
Lijun Wu, Xu Tan, Di He, Fei Tian, Tao Qin, Jianhuang Lai, and Tie-Yan Liu. Beyond error propagation in neural machine translation: Characteristics of language also matter. In EMNLP, pp. 3602–3611, 2018.
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google’s neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144, 2016. URL http://arxiv.org/abs/1609.08144.
Yingce Xia, Tianyu He, Xu Tan, Fei Tian, Di He, and Tao Qin. Tied transformers: Neural machine translation with shared encoder and decoder. In AAAI, 2018.
Chenglin Yang, Lingxi Xie, Siyuan Qiao, and Alan Yuille. Knowledge distillation in generations: More tolerant teachers educate better students. arXiv preprint arXiv:1805.05551, 2018.
Qi Ye, Sachan Devendra, Felix Matthieu, Padmanabhan Sarguna, and Neubig Graham. When and why are pre-trained word embeddings useful for neural machine translation. In HLT-NAACL, 2018.
Junho Yim, Donggyu Joo, Jihoon Bae, and Junmo Kim. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, 2017.
Ying Zhang, Tao Xiang, Timothy M Hospedales, and Huchuan Lu. Deep mutual learning. arXiv preprint arXiv:1706.00384, 6, 2017.
We give a detailed description of the IWSLT, WMT and Ted talk datasets used in the experiments.
IWSLT: We collect 12 language-English translation pairs from the IWSLT evaluation campaign from 2014 to 2016. Each language pair contains roughly 80K to 200K sentence pairs. We use the official validation and test sets for each language pair. The training set sizes for each language-English pair are listed in Table 10.
Table 10: The training data size for the 12 language-English pairs on the IWSLT dataset.
WMT: We collect 6 language-English translation pairs from the WMT translation task. We use 5 language-English translation pairs from the WMT 2016 dataset (Cs-En, De-En, Fi-En, Ro-En, Ru-En) and one translation pair from the WMT 2017 dataset (Lv-En). We use the officially released validation and test sets for each language pair. The training data sizes for each language-English pair are shown in Table 11.
Table 11: The training data size for the 6 language-English pairs on the WMT dataset.
Ted Talk: We use the common corpus of TED talks, which contains translations between multiple languages (Ye et al., 2018). We select the 44 languages in this corpus that have sufficient data for our experiments. We use the official validation and test sets for each language pair. The training set sizes for each language-English pair are listed in Table 12.
Table 12: The training data size for the 44 language-English pairs on the Ted talk dataset.
The language names and their corresponding language codes according to the ISO 639-1 standard are listed in Table 13.
Table 13: The ISO 639-1 code of each language in our experiments. There are two extra language codes in our datasets: Ptbr represents Portuguese spoken in Brazil, Frca represents French spoken in Canada.
The detailed results for the 44 language-English pairs on the Ted talk dataset are listed in Table 14. While the multilingual baseline performs worse than the individual models, the multilingual model trained with our method nearly matches and on some languages even outperforms the individual models. Note that the multilingual model handles 44 languages in total, which means our method reduces the total number of model parameters to 1/44 of that of the 44 individual models without loss of accuracy.
Table 14: BLEU scores of the individual and multilingual models for the 44 language-English pairs on the Ted talk dataset.