Modeling Future Cost for Neural Machine Translation
2020·arXiv
Abstract

Existing neural machine translation (NMT) systems utilize sequence-to-sequence neural networks to generate the target translation word by word, and are trained to make the word generated at each time-step as consistent as possible with its counterpart in the reference. However, the trained translation model tends to focus on ensuring the accuracy of the generated target word at the current time-step and does not consider its future cost, that is, the expected cost of generating the subsequent target translation (i.e., the next target word). To respond to this issue, we propose a simple and effective method to model the future cost of each target word for NMT systems. In detail, a time-dependent future cost is estimated based on the current generated target word and its contextual information to boost the training of the NMT model. Furthermore, the learned future context representation at the current time-step is used to help the generation of the next target word in the decoding. Experimental results on three widely-used translation datasets, including the WMT14 English-to-German, WMT14 English-to-French, and WMT17 Chinese-to-English, show that the proposed approach achieves significant improvements over a strong Transformer-based NMT baseline.

Future cost estimation plays an important role in traditional phrase-based statistical machine translation (PBSMT) [Koehn, 2009]. Typically, it utilizes pre-learned translation knowledge (i.e., the translation model and the language model) to compute, in advance, a future cost for any span of input words in a source sentence. The computed future cost estimates how hard it is to translate the untranslated part of the source sentence. For example, among all translation options that cover the same number of input words, a higher future cost means that the untranslated part of the source sentence is more difficult to translate. During decoding, PBSMT adds up the partial translation probability score of the current span and its future cost to measure the quality of each translation option. As a result, one (or several) translation hypotheses, extended by translation options with cheaper future cost, are retained in the beam-search stack as the best paths for generating the subsequent translation.

Neural machine translation (NMT) systems [Bahdanau et al., 2015; Vaswani et al., 2017] often utilize sequence-to-sequence neural networks to model translation between the source language and the target language, and achieve state-of-the-art performance on most translation tasks [Barrault et al., 2019]. Compared with traditional PBSMT, NMT systems model translation knowledge through neural networks. This means that there is no need to learn large-scale translation rules as in traditional PBSMT. However, the lack of translation rules prevents the future cost from being estimated in advance for NMT systems. Therefore, it is difficult to directly use the effective future cost mechanism of PBSMT to enhance the beam-search stack decoding in NMT systems.

In addition, NMT systems generally model translation between a source language and a target language in an auto-regressive way, that is, they generate the target translation word by word based on the previously translated target words (or context) and the source representation. However, this makes the trained translation model focus only on ensuring the accuracy of the generated target word at the current time-step, without considering its future cost as PBSMT does. In other words, NMT systems have no mechanism to estimate the future cost of the current generated target word for generating the subsequent target translation (i.e., the next target word).

In this paper, we propose a future cost mechanism to learn the expected cost of generating the next target word for NMT systems, for example, the state-of-the-art Transformer-based NMT system [Vaswani et al., 2017]. Specifically, the future cost is dynamically estimated based on the current target word and its contextual representation instead of being pre-estimated as in PBSMT. We then use the estimated future cost to compute an additional loss item to boost the training of the Transformer-based NMT model. This allows the Transformer-based NMT model to preview the future cost of the current generated target word for the generation of the target word at the next time-step. In addition, the learned future context representation at the current time-step is further used to help the generation of the next target word in the decoding. This allows the future cost information to be applied to the beam-search stack decoding in an auto-regressive way rather than in isolation as in PBSMT, and thereby enhances the translation performance of the Transformer-based NMT model.

This paper primarily makes the following contributions:

It introduces a novel future cost mechanism to estimate the impact of the current generated target word for generating subsequent target translation (i.e., next target word) in NMT.

The proposed two models can integrate the proposed future cost mechanism into the state-of-the-art Transformer-based NMT system to improve translation performance.

Experimental results on the WMT14 English-to-German, WMT14 English-to-French, and WMT17 Chinese-to-English translation tasks verify the effectiveness and universality of the proposed future cost mechanism.

An advanced Transformer-based NMT model [Vaswani et al., 2017], which relies solely on self-attention networks (SANs), generally consists of a SAN-based encoder and a SAN-based decoder. Formally, given a source input sequence $x = \{x_1, \cdots, x_J\}$ of length $J$, the encoder is adopted to encode the source input sequence $x$. In particular, each encoder layer includes a SAN sub-layer SelfATT(·) and a position-wise fully connected feed-forward network sub-layer FFN(·). A residual connection [He et al., 2016] is applied around each of the two sub-layers, followed by layer normalization LN(·) [Ba et al., 2016]. Thus, the output of the first sub-layer $C_e^n$ and that of the second sub-layer $H_e^n$ are calculated sequentially as Eq.(1) and Eq.(2):

image
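As a sketch, assuming the standard post-norm Transformer layer that the paragraph above describes, Eq.(1) and Eq.(2) can be written as follows. The layer input $H_e^{n-1}$ (with $H_e^0$ the embedded source sequence) is notation assumed here rather than taken from the original equations.

```latex
\begin{align}
C_e^n &= \mathrm{LN}\big(\mathrm{SelfATT}(H_e^{n-1}) + H_e^{n-1}\big), \tag{1} \\
H_e^n &= \mathrm{LN}\big(\mathrm{FFN}(C_e^n) + C_e^n\big). \tag{2}
\end{align}
```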

Typically, this encoder is composed of a stack of N identical layers. As a result, $H_e^N$ is the final source sentence representation used to model translation.

Furthermore, the decoder, which is also composed of a stack of N identical layers, models the context information for predicting translations. In addition to the two sub-layers in each decoder layer, the decoder inserts a third sub-layer ATT($C_i^n$, $H_e^N$) that performs attention over the output of the encoder $H_e^N$:

image
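A plausible form of the three decoder sub-layers, Eq.(3) to Eq.(5), is sketched below under the same post-norm assumption as the encoder. The intermediate symbol $D_i^n$ is introduced here for illustration only; ATT($C_i^n$, $H_e^N$) is the encoder-decoder attention named in the text.

```latex
\begin{align}
C_i^n &= \mathrm{LN}\big(\mathrm{SelfATT}(H_i^{n-1}) + H_i^{n-1}\big), \tag{3} \\
D_i^n &= \mathrm{LN}\big(\mathrm{ATT}(C_i^n, H_e^N) + C_i^n\big), \tag{4} \\
H_i^n &= \mathrm{LN}\big(\mathrm{FFN}(D_i^n) + D_i^n\big). \tag{5}
\end{align}
```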

At the i-th time-step, the top layer of the decoder $H_i^N$ is then used to generate the target word $y_i$ by a linear, potentially multi-layered function (or a softmax function):

image

where $W_o$ and $W_w$ are projection matrices. Thus, the cross-entropy loss is computed over a bilingual parallel sentence pair {[x, y]}:

image
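The word-level cross-entropy described above can be sketched in a few lines. This is a minimal illustration, assuming the per-step logits have already been produced by projecting the decoder state $H_i^N$; the function names and toy numbers are hypothetical and not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sentence_cross_entropy(step_logits, reference_ids):
    """Sum of -log P(y_i | y_<i, x) over all target positions."""
    loss = 0.0
    for logits, ref in zip(step_logits, reference_ids):
        probs = softmax(logits)
        loss -= math.log(probs[ref])
    return loss


# Toy example (hypothetical numbers): 3-word vocabulary, 2 target positions.
step_logits = [[2.0, 0.5, 0.1], [0.2, 0.1, 3.0]]
reference_ids = [0, 2]
loss = sentence_cross_entropy(step_logits, reference_ids)
```

Each position is scored only against its own reference word, which is the property the paper goes on to question.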

In the traditional PBSMT, the future cost aims to estimate the difficulty of each translation option for one source sentence. Generally, PBSMT takes the beam-search stack decoding algorithm to select translation options to expand the current hypotheses. In detail, given a source sentence, all available translation options for any span of input words are collected in advance from the pre-learned translation model and language model. The future cost of each translation option is then computed based on the statistical scores of the translation model and the language model. By adding up the partial translation score and the future cost, PBSMT selects a translation option with the cheapest future cost to expand the current translation hypothesis. Finally, this makes a much better basis for pruning decisions in the beam-search stack decoding.
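To make the PBSMT side concrete, the following sketch precomputes a future cost table over all source spans with the usual dynamic program: a span costs either its cheapest translation option or the cheapest way of splitting it into two sub-spans. The option costs are hypothetical numbers standing in for scores derived from the translation model and language model.

```python
import math


def future_cost_table(n_words, option_costs):
    """Cheapest estimated cost for every source span [start, end).

    option_costs maps a span (start, end) to the cost (for example a
    negative log-probability) of its cheapest translation option.
    Spans without any option start at infinity.
    """
    cost = {}
    for length in range(1, n_words + 1):
        for start in range(0, n_words - length + 1):
            end = start + length
            best = option_costs.get((start, end), math.inf)
            # Try every split point; shorter spans are already filled in.
            for mid in range(start + 1, end):
                best = min(best, cost[(start, mid)] + cost[(mid, end)])
            cost[(start, end)] = best
    return cost


# Hypothetical option costs for a 3-word source sentence.
option_costs = {(0, 1): 1.0, (1, 2): 2.0, (2, 3): 1.5, (0, 2): 2.5, (1, 3): 4.0}
table = future_cost_table(3, option_costs)
```

During decoding, a hypothesis would be ranked by its partial score plus the table entries for the spans it has not yet covered. This table depends on pre-collected translation options, which is exactly what an NMT system does not have.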

However, it is difficult to directly apply the future cost of PBSMT to existing NMT systems because of the lack of pre-learned translation rules and the auto-regressive nature of NMT. Compared with the traditional PBSMT system, the NMT system models translation knowledge as a time-dependent context vector for translation prediction through large-scale neural networks instead of translation rules. In particular, the time-dependent context representation is input to Eq.(6) to compute the translation probability of the target word, and the translation probability is an important part of computing the future cost in PBSMT. Meanwhile, NMT can be seen as a neural network language model with an attention mechanism [Kalchbrenner and Blunsom, 2013; Bahdanau et al., 2015]. For example, the time-dependent context representation of NMT can be regarded as that of a neural network language model [Bengio et al., 2003].

Based on the above analysis, we propose a new method to model the future cost for existing NMT systems. Specifically, the proposed approach utilizes the current target word and its context representation to learn a future context representation. This future context representation is then input to a softmax layer to compute the future cost of the current target word. Formally, we utilize the top layer $H_i^N$ learned by the stacked Eq.(1)∼Eq.(5) to model the current context information. $H_i^N$ is combined with the current generated target word $y_i$ to learn a future context representation $F_i$ as follows:

image

where E is the embedding matrix of the target vocabulary, and $W_r$, $U_r$, $W_z$, $U_z$, $W$, and $U$ are model parameters. σ(·) is the sigmoid function and ⊙ denotes the element-wise product. Note that for the initial future context representation, we use the special source end token “</s>” and the mean of the vectors in the source representation $H_e^N$ as the input to Eq.(8)∼Eq.(11) to learn $F_0$.
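The parameter names ($W_r$, $U_r$, $W_z$, $U_z$, $W$, $U$), the sigmoid, and the element-wise product suggest a GRU-style update. The sketch below is written under that assumption; the exact form of Eq.(8)∼Eq.(11) is not recoverable from the text, so the gating layout, the function names, and the random toy parameters here are all illustrative.

```python
import math
import random


def matvec(matrix, vector):
    """Matrix-vector product on plain Python lists."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def future_context(params, y_emb, h):
    """Assumed GRU-style combination of the word embedding E[y_i] and H_i^N."""
    Wr, Ur, Wz, Uz, W, U = (params[k] for k in ("Wr", "Ur", "Wz", "Uz", "W", "U"))
    # Reset gate r and update gate z, both computed from E[y_i] and H_i^N.
    r = [sigmoid(a + b) for a, b in zip(matvec(Wr, y_emb), matvec(Ur, h))]
    z = [sigmoid(a + b) for a, b in zip(matvec(Wz, y_emb), matvec(Uz, h))]
    # Candidate state uses the reset-gated context.
    gated_h = [ri * hi for ri, hi in zip(r, h)]
    cand = [math.tanh(a + b) for a, b in zip(matvec(W, y_emb), matvec(U, gated_h))]
    # Interpolate between the current context and the candidate.
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def random_matrix(rng, d):
    return [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(d)]


rng = random.Random(0)
d = 4  # toy model dimension
params = {name: random_matrix(rng, d) for name in ("Wr", "Ur", "Wz", "Uz", "W", "U")}
y_emb = [rng.uniform(-1.0, 1.0) for _ in range(d)]  # stands in for E[y_i]
h = [rng.uniform(-1.0, 1.0) for _ in range(d)]      # stands in for H_i^N
F_i = future_context(params, y_emb, h)
```

For $F_0$, the same function would be called with the embedding of “</s>” and the mean of the encoder states, as the note above describes.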

Finally, the learned future context representation $F_i$ is input to a softmax layer to compute approximate probabilities of a temporary target word $\hat{y}_{i+1}$ at the current time-step, which we call the future cost of the current generated target word:

image

where $W_o$ and $W_w$ are projection matrices. Later, $\hat{P}(\hat{y}_{i+1} \mid y_{<i}, y_i, x)$ will be used to guide the training of NMT.

In this section, we design two NMT models, shown in Figure 1, to make use of the future cost proposed in the previous section. For the first model, we compute an additional loss item of future cost at each time-step, and thereby obtain a future cost-aware translation model. In addition to this loss item, the second model utilizes the learned future context representation to help the generation of the target word at the next time-step, thus improving the translation performance of the Transformer-based NMT model.

4.1 Model I

The training objective of NMT is to minimize the loss between the words in the translated sentences and those in the references. Specifically, the word-level cross-entropy between the generated target word by NMT and the reference serves as the loss item at each time-step. However, as we analyzed in Section 1, this existing training objective does not consider the future cost of the current generated target word for generating the next target word.

Therefore, we introduce an additional loss term F(θ) to preview the future cost of the generated target word at the current time-step according to Eq.(12):

image

F(θ) encourages the translation model to select a target word that is beneficial to the generation of the target word at the next time-step. Thus, the loss of the proposed Model I is computed over a bilingual parallel sentence pair {[x, y]}:

image

where λ is a hyper-parameter that weights the importance of the future cost loss relative to the translation loss. Finally, the trained Model I performs the translation decoding according to Eq.(6).
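The Model I objective can be sketched as the usual cross-entropy plus λ times a future cost term. The sketch assumes that F(θ) is the negative log-likelihood of the reference word at position i+1 under the distribution estimated at position i; that reading, the function names, and the toy probabilities are assumptions, not the paper's exact Eq.(13) and Eq.(14).

```python
import math


def nll(probs, ref_id):
    """Negative log-likelihood of one reference word."""
    return -math.log(probs[ref_id])


def model_one_loss(step_probs, future_probs, reference_ids, lam):
    """Cross-entropy plus lam times the (assumed) future cost loss.

    step_probs[i]   : distribution over y_i from the decoder output layer.
    future_probs[i] : distribution over the next word estimated from F_i.
    """
    ce = sum(nll(p, y) for p, y in zip(step_probs, reference_ids))
    # The estimate made at step i is scored against the reference at i+1.
    # zip stops at the shorter sequence, so the estimate made at the last
    # position, which has no following reference word, is ignored.
    fc = sum(nll(p, y) for p, y in zip(future_probs, reference_ids[1:]))
    return ce + lam * fc


# Toy example (hypothetical numbers): 3-word vocabulary, 2 target positions.
step_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
future_probs = [[0.2, 0.6, 0.2], [0.3, 0.3, 0.4]]
reference_ids = [0, 1]
total = model_one_loss(step_probs, future_probs, reference_ids, lam=0.7)
```

With `lam=0.0` the function reduces to the baseline cross-entropy, which makes the role of λ easy to check in isolation.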

4.2 Model II

In PBSMT, the future cost mechanism can help the generation of the next target word (or phrase) in addition to minimizing search errors in the beam-search stack decoding. However, the proposed Model I only focuses on learning a future cost-aware NMT model so that the best translation hypotheses are retained in the search stack. In other words, this future cost information may not be adequately utilized to predict the target translation in NMT. Therefore, we further make use of the learned future context representation to help the generation of the target word at the next time-step.

Figure 1: The proposed Transformer-based NMT architecture.

Formally, at the (i+1)-th time-step, the future context representation $F_i$ learned at the i-th time-step is first concatenated with the top layer of the decoder $H_{i+1}^N$, and the result is input to a sigmoid function to learn a gate scalar $g_{i+1}$:

image

where σ is a sigmoid function and $g_{i+1} \in [0, 1]$ is used to weight the expected importance of the learned future context representation $F_i$ to obtain a fused context representation $\overline{H}_{i+1}^N$ as follows:

image

where $W_g \in \mathbb{R}^{2d_{model} \times 1}$ is a trainable parameter, and ⊙ is the element-wise product.

Finally, the fused context representation $\overline{H}_{i+1}^N$ is input to a softmax layer to compute the translation probabilities of the target word $y_{i+1}$ at the (i+1)-th time-step:

image
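The gating step of Model II can be sketched as follows. The gate follows the description above (a sigmoid over the concatenation of $H_{i+1}^N$ and $F_i$, projected by $W_g$ to a scalar). The fusion rule used here, adding the gated future context to the decoder state, is one plausible reading of Eq.(16) and is an assumption, as are the function names and toy values.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fuse(h_next, f_prev, w_g):
    """Gate the previous future context F_i into the decoder state H_{i+1}^N.

    w_g has length 2 * d_model, so the gate is a single scalar in [0, 1].
    """
    concat = h_next + f_prev  # list concatenation, i.e. [H; F]
    g = sigmoid(sum(w * v for w, v in zip(w_g, concat)))
    # Assumed fusion: decoder state plus the gated future context.
    fused = [h + g * f for h, f in zip(h_next, f_prev)]
    return g, fused


# Toy example (hypothetical numbers) with d_model = 2.
h_next = [0.5, -0.2]
f_prev = [0.1, 0.3]
w_g = [0.4, -0.1, 0.2, 0.3]
g, fused = fuse(h_next, f_prev, w_g)
```

The fused vector would then replace $H_{i+1}^N$ as the input to the output softmax. During training $F_i$ comes from the ground-truth word, and during decoding from the generated word.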

Meanwhile, the training objective of Model II is the same as that of Model I, given in Eq.(14). Note that the future cost is estimated over the ground-truth target word during training, and over the generated target word during decoding.

image

Table 1: Results for the WMT14 EN-DE, WMT14 EN-FR, and WMT17 ZH-EN translation tasks. “#Speed” denotes the training speed measured in source tokens per second. “#Param.” indicates the number of model parameters. “+/++” after a score indicates that the proposed method was better than the Trans.base model at significance level p <0.05/0.01 [Collins et al., 2005].

5.1 Datasets

The proposed methods were evaluated on the WMT14 English-to-German (EN-DE), WMT14 English-to-French (EN-FR), and WMT17 Chinese-to-English (ZH-EN) translation tasks. The EN-DE training set contains 4.5M bilingual sentence pairs, and the newstest2013 and newstest2014 datasets were used as the validation and test sets, respectively. The EN-FR training set contains 36M bilingual sentence pairs; the newstest2012 and newstest2013 datasets were combined for validation, and newstest2014 was used as the test set. The ZH-EN training set contains 22M bilingual sentence pairs, and the newsdev2017 and newstest2017 datasets were used as the validation and test sets, respectively. The following baselines were compared:

Trans.base/big: a vanilla Transformer-based NMT system without future cost [Vaswani et al., 2017], for example Transformer (base) and Transformer (big) models.

+Future and Past [Zheng et al., 2019]: introduces a capsule network into the Transformer NMT system to recognize the translated and untranslated contents and to pay more attention to the untranslated parts.

In addition, we report the results of state-of-the-art works [Hao et al., 2019; Li et al., 2019; Li et al., 2020] for the three translation tasks.

5.2 Settings

We implemented the proposed method in the fairseq [Ott et al., 2019] toolkit, following most settings in Vaswani et al. (2017). In training the NMT model (base), byte pair encoding (BPE) [Sennrich et al., 2016] was adopted. The vocabulary size was set to 40K for EN-DE and EN-FR and to 32K for ZH-EN. The dimensions of all input and output layers were set to 512, and that of the inner feed-forward neural network layer was set to 2,048. The number of heads in all multi-head modules was set to 8 in both the encoder and decoder layers. Each training batch contained a set of sentence pairs with approximately 4,096×8 source tokens and 4,096×8 target tokens. The value of label smoothing was set to 0.1, and the attention dropout and residual dropout were p = 0.1. We adopted the Adam optimizer [Kingma and Ba, 2015] to learn the parameters of the model. The learning rate was varied under a warmup strategy with 8,000 warmup steps. For evaluation, we validated the model at an interval of 2,000 batches on the dev set. After training for 300,000 batches, the model with the highest BLEU score on the validation set was selected for evaluation on the test sets. We used the multi-bleu.perl script to obtain the case-sensitive 4-gram BLEU score. All models were trained on eight V100 GPUs.
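The warmup strategy is not spelled out beyond the number of warmup steps. As an illustration only, the sketch below uses the schedule from Vaswani et al. (2017), which rises linearly for `warmup_steps` updates and then decays with the inverse square root of the step. Whether the experiments used exactly this formula or a fairseq variant with an explicit peak learning rate is an assumption left open here.

```python
def transformer_lr(step, d_model=512, warmup_steps=8000):
    """Learning rate at a given update step (schedule of Vaswani et al., 2017)."""
    step = max(step, 1)  # avoid 0 ** -0.5 at the very first update
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


# The rate peaks at the end of warmup and decays afterwards.
rates = {s: transformer_lr(s) for s in (1, 4000, 8000, 16000, 300000)}
```

With d_model = 512 and 8,000 warmup steps, the two branches of the `min` meet at step 8,000, which is where the rate is highest.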

5.3 Overall Results

The main translation results are shown in Table 1. We made the following observations:

1) The performance of our implemented Trans.base/big is slightly superior to that of the original Trans.base/big on the EN-DE and EN-FR datasets. This indicates that it is a strong baseline NMT system, which makes the evaluation convincing.

2) The proposed Model I and Model II significantly outperformed the baseline Trans.base. This indicates that the future cost information is beneficial for Transformer-based NMT. Meanwhile, Model II outperformed the comparison system +Future and Past [Zheng et al., 2019], which we attribute to the future cost estimated at the previous time-step being further used to help the generation of the current target word in the decoding.

Figure 2: Trends of BLEU scores with different source lengths on the EN-DE, EN-FR and ZH-EN test sets.

3) Compared with Model I, Model II achieved a slight advantage on all tasks. This means that it is more effective to enhance the translation of the next target word by integrating the learned future hidden representation into the contextual representation of the next word.

4) We also compared our methods with the baseline Trans.big model. In particular, the proposed models yielded similar improvements on the three translation tasks, indicating that the proposed future cost mechanism is a universal method for improving the performance of the Transformer-based NMT model.

5) The proposed models contain approximately 3% additional parameters. Training and decoding speeds are nearly the same as those of Trans.base. This indicates that the proposed method is efficient, adding only a small training and decoding cost.

5.4 Translating Sentences of Different Lengths

The proposed future cost mechanism estimates the future cost of the generated target word at each time-step, thus measuring how good that word is for generating the next target word. We therefore report the translation performance on source sentences of different lengths to further verify the effectiveness of our method. Specifically, we divided each test set into 6 groups according to the length of the source sentence. Figure 2 shows the results of the proposed models and the Trans.base model on the three translation tasks. We made the following observations:

1) Model I and Model II were superior to the Trans.base model in almost every length group on all three tasks. This means that the future cost information captured by the proposed approach is beneficial to Transformer-based NMT.

2) Compared with Model I, Model II achieved a slight advantage in most groups on each task. This indicates that this future cost information also helps the generation of the next target word in addition to that of the current target word.

3) BLEU scores of all models decreased when the length was greater than 30 on the ZH-EN task. In contrast, BLEU scores tended to increase with the sentence length on the EN-DE and EN-FR tasks. We think that NMT may be better at modeling translation between distant language pairs (i.e., ZH-EN) than similar language pairs (i.e., EN-DE and EN-FR).
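The length-based grouping used in this subsection can be sketched as follows. The bucket boundaries are not stated in the text, so the edges below (10, 20, 30, 40, 50 tokens, with a sixth group for longer sentences) are hypothetical, as are the function names and the whitespace tokenization.

```python
from bisect import bisect_left

# Hypothetical upper edges of the first five groups; the sixth is "> 50".
BOUNDS = [10, 20, 30, 40, 50]


def length_group(n_tokens):
    """Index (0..5) of the length group for a sentence with n_tokens tokens."""
    return bisect_left(BOUNDS, n_tokens)


def group_by_length(sources):
    """Return, for each group, the indices of the source sentences in it."""
    groups = [[] for _ in range(len(BOUNDS) + 1)]
    for idx, sentence in enumerate(sources):
        groups[length_group(len(sentence.split()))].append(idx)
    return groups


# Toy example: three whitespace-tokenized source sentences.
sources = ["a b c", " ".join(["w"] * 12), " ".join(["w"] * 60)]
groups = group_by_length(sources)
```

BLEU would then be computed separately on the hypotheses and references whose indices fall in each group.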

image

Figure 3: BLEU scores of Trans.base, the proposed Model I, and Model II with different λ on the EN-DE and ZH-EN validation sets.

5.5 Learning Curve of Hyper-parameter λ

In Eq.(14), we introduce a hyper-parameter λ to adjust the weight of the future cost loss relative to the translation loss. To tune its value, we conducted experiments with different λ on the validation sets of the three tasks. As shown in Figure 3, the proposed models achieved an advantage over the Trans.base model with different λ on the EN-DE and ZH-EN tasks. For the EN-DE task, Model I and Model II both achieved the best BLEU score with λ=0.7. The trend on EN-FR (not shown) is similar to that on EN-DE, and we set λ=0.7 for EN-FR. For the ZH-EN task, Model I achieved the best BLEU score with λ=0.3, while Model II did so with λ=0.5. Finally, the results in Table 1 were obtained with these tuned values of λ.

5.6 Translation Cases

Figure 4(a) shows a translation case to illustrate the effect of the proposed method. Our method translated “具有[had] 额外的[additional] 政治[political] 价值[value]” into “had additional political value”, which is the same as the reference, while Trans.base translated it into “has added political value”, which differs from the reference in meaning. We think the reason is that +Model II not only predicts “had” accurately at the current step, but also captures future cost information that is beneficial for generating “additional” at the next time-step. Concretely, Figure 4(b) and Figure 4(c) illustrate the beam-search process for the Trans.base model and +Model II, respectively. In the proposed +Model II, the candidate “additional” is produced both after “had” and after “has”. However, in the Trans.base model, only “added” follows “has”. This indicates that the learned future contextual representation is beneficial for NMT.

Figure 4: (a) A translation case; (b) Beam search with beam size=5 for the Trans.base model; (c) Beam search with beam size=5 for +Model II. The end score is the translation probability of each decoding path in subfigures (b) and (c), that is, higher scores denote better translations.
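The beam-search behaviour discussed in this case study can be illustrated with a small sketch. The toy next-word distributions below are hypothetical numbers chosen only to mimic the “had/has” and “additional/added” alternatives; they are not the scores shown in Figure 4, and `toy_model` stands in for a real NMT decoder.

```python
import math


def beam_search(next_log_probs, beam_size, max_len, eos):
    """Keep the beam_size best prefixes by accumulated log-probability."""
    beams = [((), 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, lp in next_log_probs(prefix).items():
                candidates.append((prefix + (tok,), score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            # Hypotheses ending in the end-of-sentence token are set aside.
            if prefix[-1] == eos:
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    finished.extend(beams)
    return sorted(finished, key=lambda c: c[1], reverse=True)


# Hypothetical next-word distributions keyed by the prefix generated so far.
TOY = {
    (): {"had": math.log(0.6), "has": math.log(0.4)},
    ("had",): {"additional": math.log(0.9), "added": math.log(0.1)},
    ("has",): {"added": math.log(0.7), "additional": math.log(0.3)},
}


def toy_model(prefix):
    # Any prefix not listed above ends the sentence with probability 1.
    return TOY.get(prefix, {"</s>": 0.0})


results = beam_search(toy_model, beam_size=2, max_len=3, eos="</s>")
best_tokens, best_score = results[0]
```

A path survives only if its accumulated score stays within the beam at every step, so a model that scores the current word well but leaves poor continuations can lose good hypotheses early. This is the search error the future cost is meant to reduce.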

Modeling translated and untranslated information in a source sentence is beneficial for generating the target translation in NMT. Tu et al. (2016) employed a coverage vector to track the translated part of the source sequence. Similarly, Mi et al. (2016) proposed to use a coverage embedding to model the degree of translation for each word in the source sentence. Later, Li et al. (2018b) presented a coverage score to describe to what extent the source words are translated. Recently, Zheng et al. (2018) introduced two extra recurrent layers in the decoder to maintain the representations of the past and future translation contents. Furthermore, Zheng et al. (2019) adopted a capsule network to model the past and future translation contents explicitly.

In contrast to modeling source-side information, Lin et al. (2018) proposed to adopt a deconvolution network to model the global information of the target sentence. He et al. (2017) applied a value network to dynamically compute a BLEU score for the remaining part of the target sequence based on the difference between the generated sub-sequence and the source sequence. Xia et al. (2017) designed a deliberation network to preview future words through multi-pass decoding. Li et al. (2018a) enhanced the attention model with the implicit information of a target foresight word oriented to both alignment and translation. Zhou et al. (2019) employed two bidirectional decoders to generate a target sentence in an interactive translation setting. Closely related to our work, Weng et al. (2017) proposed a word prediction mechanism to enhance the contextual representation during training. The main difference is that the proposed future cost mechanism can not only minimize search errors but also explicitly help the generation of the next word, while Weng et al. (2017) used word prediction only as an extra training objective.

In short, the above mentioned works focused on adopting the future context to enhance the contextual information of the target word at the current time-step. Inspired by the future cost in PBSMT, we propose a simple and effective method to estimate the future cost of the current generated target word. This approach enables the NMT model to preview the translation cost of the subsequent target word at the current time-step and thereby helps the generation of the target word at the next time-step. This proposed future cost mechanism is integrated into the existing Transformer-based NMT model to improve translation performance.

In this paper, we propose a simple and effective future cost mechanism that enables the translation model to preview the translation cost of the next target word at the current time-step. We empirically demonstrate that such an explicit future cost mechanism benefits NMT with considerable and consistent improvements on three language pairs. In the future, we will further extend this work to other NLP tasks.

[Ba et al., 2016] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, July 2016.

[Bahdanau et al., 2015] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, San Diego, CA, USA, May 2015.

[Barrault et al., 2019] Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. Findings of the 2019 conference on machine translation. In WMT, pages 1–61, Florence, Italy, Aug 2019.

[Bengio et al., 2003] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, March 2003.

[Collins et al., 2005] Michael Collins, Philipp Koehn, and Ivona Kučerová. Clause restructuring for statistical machine translation. In ACL, pages 531–540, Ann Arbor, Michigan, June 2005.

[Hao et al., 2019] Jie Hao, Xing Wang, Baosong Yang, Longyue Wang, Jinfeng Zhang, and Zhaopeng Tu. Modeling recurrence for transformer. In NAACL-HLT, pages 1198–1207, Minneapolis, Minnesota, June 2019.

[He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, June 2016.

[He et al., 2017] Di He, Hanqing Lu, Yingce Xia, Tao Qin, Liwei Wang, and Tie-Yan Liu. Decoding with value networks for neural machine translation. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, NeurIPS, pages 178–187, Dec 2017.

[Kalchbrenner and Blunsom, 2013] Nal Kalchbrenner and Phil Blunsom. Recurrent continuous translation models. In EMNLP, pages 1700–1709, Seattle, Washington, USA, Oct 2013.

[Kingma and Ba, 2015] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, April 2015.

[Koehn, 2009] Philipp Koehn. Statistical machine translation. Cambridge University Press, 2009.

[Li et al., 2018a] Xintong Li, Lemao Liu, Zhaopeng Tu, Shuming Shi, and Max Meng. Target foresight based attention for neural machine translation. In NAACL-HLT, pages 1380–1390, New Orleans, Louisiana, June 2018.

[Li et al., 2018b] Yanyang Li, Tong Xiao, Yinqiao Li, Qiang Wang, Changming Xu, and Jingbo Zhu. A simple and effective approach to coverage-aware neural machine translation. In ACL, pages 292–297, Melbourne, Australia, July 2018.

[Li et al., 2019] Jian Li, Baosong Yang, Zi-Yi Dou, Xing Wang, Michael R. Lyu, and Zhaopeng Tu. Information aggregation for multi-head attention with routing-by-agreement. In NAACL-HLT, pages 3566–3575, Minneapolis, Minnesota, June 2019.

[Li et al., 2020] Zuchao Li, Rui Wang, Kehai Chen, Masao Utiyama, Eiichiro Sumita, Zhuosheng Zhang, and Hai Zhao. Data-dependent gaussian prior objective for language generation. In ICLR, 2020.

[Lin et al., 2018] Junyang Lin, Xu Sun, Xuancheng Ren, Shuming Ma, Jinsong Su, and Qi Su. Deconvolution-based global decoding for neural machine translation. In COLING, pages 3260–3271, Aug 2018.

[Mi et al., 2016] Haitao Mi, Baskaran Sankaran, Zhiguo Wang, and Abe Ittycheriah. Coverage embedding models for neural machine translation. In EMNLP, pages 955–960, Austin, Texas, Nov 2016.

[Ott et al., 2019] Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling. In NAACL-HLT, pages 48–53, Minneapolis, Minnesota, June 2019.

[Sennrich et al., 2016] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In ACL, pages 1715–1725, Berlin, Germany, Aug 2016.

[Tu et al., 2016] Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. Modeling coverage for neural machine translation. In ACL, pages 76–85, Berlin, Germany, Aug 2016.

[Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, pages 5998–6008, Dec 2017.

[Weng et al., 2017] Rongxiang Weng, Shujian Huang, Zaixiang Zheng, Xinyu Dai, and Jiajun Chen. Neural machine translation with word predictions. In EMNLP, pages 136–145, Copenhagen, Denmark, Sep 2017.

[Xia et al., 2017] Yingce Xia, Fei Tian, Lijun Wu, Jianxin Lin, Tao Qin, Nenghai Yu, and Tie-Yan Liu. Deliberation networks: Sequence generation beyond one-pass decoding. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, NeurIPS, pages 1784–1794, 2017.

[Zheng et al., 2018] Zaixiang Zheng, Hao Zhou, Shujian Huang, Lili Mou, Xinyu Dai, Jiajun Chen, and Zhaopeng Tu. Modeling past and future for neural machine translation. TACL, 6:145–157, 2018.

[Zheng et al., 2019] Zaixiang Zheng, Shujian Huang, Zhaopeng Tu, Xin-Yu Dai, and Jiajun Chen. Dynamic past and future for neural machine translation. In EMNLP-IJCNLP, Hong Kong, China, Nov 2019.

[Zhou et al., 2019] Long Zhou, Jiajun Zhang, and Chengqing Zong. Synchronous bidirectional neural machine translation. TACL, 7:91–105, March 2019.

