arxiv: v1 [cs.cl] 16 Jan 2018


Asynchronous Bidirectional Decoding for Neural Machine Translation

Xiangwen Zhang 1, Jinsong Su 1, Yue Qin 1, Yang Liu 2, Rongrong Ji 1, Hongji Wang 1
1 Xiamen University, Xiamen, China
2 Tsinghua University, Beijing, China
xwzhang@stu.xmu.edu.cn, jssu@xmu.edu.cn, qinyue@stu.xmu.edu.cn, liuyang2011@tsinghua.edu.cn, rrji@xmu.edu.cn, hjw@xmu.edu.cn

Abstract

The dominant neural machine translation (NMT) models apply unified attentional encoder-decoder neural networks for translation. Traditionally, the NMT decoders adopt recurrent neural networks (RNNs) to perform translation in a left-to-right manner, leaving the target-side contexts generated from right to left unexploited during translation. In this paper, we equip the conventional attentional encoder-decoder NMT framework with a backward decoder, in order to explore bidirectional decoding for NMT. Attending to the hidden state sequence produced by the encoder, our backward decoder first learns to generate the target-side hidden state sequence from right to left. Then, the forward decoder performs translation in the forward direction, while in each translation prediction timestep, it simultaneously applies two attention models to consider the source-side and reverse target-side hidden states, respectively. With this new architecture, our model is able to fully exploit source- and target-side contexts to improve translation quality altogether. Experimental results on NIST Chinese-English and WMT English-German translation tasks demonstrate that our model achieves substantial improvements over the conventional NMT by 3.14 and 1.38 BLEU points, respectively. The source code of this work can be obtained from NMT.

Introduction

Recently, end-to-end neural machine translation (NMT) (Kalchbrenner and Blunsom 2013; Sutskever, Vinyals, and Le 2014; Cho et al. 2014) has achieved promising results and gained increasing attention.
Compared with conventional statistical machine translation (SMT) (Koehn, Och, and Marcu 2003; Chiang 2007), which needs explicitly designed features to capture translation regularities, NMT aims to construct a unified encoder-decoder framework based on neural networks to model the entire translation process. Further, the introduction of the attention mechanism (Bahdanau, Cho, and Bengio 2015) enhances the capability of NMT in capturing long-distance dependencies. Despite being a relatively new framework, the attentional encoder-decoder NMT has quickly become the de facto method.

Corresponding author. Copyright 2018, Association for the Advancement of Artificial Intelligence. All rights reserved.

Source: rì fángwèitīng zhǎngguān : bú wàng jūnguó lìshǐ zūnzhòng línguó zūnyán
Reference: japan defense chief : never forget militaristic history, respect neighboring nations' dignity
L2R: japan's defense agency chief : death of militarism respects its neighbors' dignity
R2L: japanese defense agency has never forgotten militarism's history to respect the dignity of neighboring countries
Table 1: Translation examples of NMT systems with different decoding manners. L2R/R2L denotes the translation produced by the NMT system with left-to-right/right-to-left decoding. Texts highlighted in wavy/dashed lines are incorrect/correct translations, respectively.

Generally, most NMT decoders are based on recurrent neural networks (RNNs) and generate translations in a left-to-right manner. Thus, despite the advantage of encoding an unbounded number of previously predicted target words for the prediction at each time step, these decoders are incapable of capturing the reverse target-side context for translation. Once errors occur in previous predictions, the quality of subsequent predictions is undermined by the noisy forward-encoded target-side contexts.
Intuitively, the reverse target-side contexts are also crucial for translation predictions, since they not only provide complementary signals but also bring different biases to the NMT model (Hoang, Haffari, and Cohn 2017). Take the example in Table 1 into consideration. The latter half of the Chinese sentence, misinterpreted by the conventional NMT system, is accurately translated by the NMT system with right-to-left decoding. Therefore, it is important to investigate how to integrate reverse target-side contexts into the decoder to improve the translation performance of NMT.

To this end, many researchers resorted to introducing bidirectional decoding into NMT (Liu et al. 2016; Sennrich, Haddow, and Birch 2016a; Hoang, Haffari, and Cohn 2017). Most of them re-ranked candidate translations using the scores of both decoding directions together, in order to select a translation with both a proper prefix and a proper suffix. However, such methods come with drawbacks that limit the potential of bidirectional decoding in NMT. On the one hand, due to the limited search space and search errors of beam search, the generated 1-best translation is often far from satisfactory and thus fails to provide sufficient information as a complement for the other decoder. On the other hand, because the bidirectional decoders are often independent from each other during translation, the unidirectional decoder is unable to fully exploit the target-side contexts produced by the other decoder, and consequently the generated candidate translations are still undesirable. Therefore, how to effectively exert the influence of bidirectional decoding on NMT is still worthy of further study.

In this paper, we significantly extend the conventional attentional encoder-decoder NMT framework by introducing a backward decoder, for the purpose of fully exploiting reverse target-side contexts to improve NMT. As shown in Fig. 1, along with our novel asynchronous bidirectional decoders, the proposed model remains an end-to-end attentional NMT framework, which mainly consists of three components: 1) an encoder embedding the input source sentence into bidirectional hidden states; 2) a backward decoder that is similar to the conventional NMT decoder but performs translation in a right-to-left manner, where the generated hidden states encode the reverse target-side contexts; 3) a forward decoder that generates the final translation from left to right and introduces two attention models that simultaneously consider the source-side bidirectional and target-side reverse hidden state vectors for translation prediction.

Compared with the previous related NMT models, our model has the following advantages: 1) The backward decoder learns to produce hidden state vectors that essentially encode the semantics of potential hypotheses, allowing the following forward decoder to utilize richer target-side contexts for translation.
2) By integrating right-to-left target-side context modeling and left-to-right translation generation into an end-to-end joint framework, our model alleviates the error propagation of reverse target-side context modeling to some extent.

The major contributions of this paper are summarized as follows:
- We thoroughly analyze and point out the existing drawbacks of research on NMT with bidirectional decoding.
- We introduce a backward decoder to encode the right-to-left target-side contexts, as a supplement to the conventional context modeling mechanism of NMT. To the best of our knowledge, this is the first attempt to investigate the effectiveness of the end-to-end attentional NMT model with asynchronous bidirectional decoders.
- Experiments on Chinese-English and English-German translation show that our model achieves significant improvements over the conventional NMT model.

Our Model

As described above, our model mainly includes three components: 1) a neural encoder with parameter set $\theta_e$; 2) a neural backward decoder with parameter set $\theta_b$; and 3) a neural forward decoder with parameter set $\theta_f$, which will be elaborated in the following subsections. In particular, we choose the Gated Recurrent Unit (GRU) (Cho et al. 2014) to build the encoder and decoders, as it is widely used in the NMT literature and requires relatively few parameters. However, it should be noted that our model is also applicable to other RNNs, such as Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber 1997).

The Neural Encoder

The neural encoder of our model is identical to that of the dominant NMT model, which is modeled using a bidirectional RNN. The forward RNN reads a source sentence $x = x_1, x_2, \ldots, x_N$ in a left-to-right order. At each timestep, we apply a recurrent activation function $\phi(\cdot)$ to learn the semantic representation of the word sequence $x_{1:i}$ as $\overrightarrow{h}_i = \phi(\overrightarrow{h}_{i-1}, x_i)$.
Likewise, the backward RNN scans the source sentence in the reverse order and generates the semantic representation $\overleftarrow{h}_i$ of the word sequence $x_{i:N}$. Finally, we concatenate the hidden states of these two RNNs to form an annotation sequence $h = \{h_1, h_2, \ldots, h_i, \ldots, h_N\}$, where $h_i = [\overrightarrow{h}_i^T, \overleftarrow{h}_i^T]^T$ encodes information about the $i$-th word with respect to all the other surrounding words in the source sentence. In our model, these annotations provide source-side contexts for not only the forward decoder but also the backward one via different attention models.

The Neural Backward Decoder

The neural backward decoder of our model is also similar to the decoder of the dominant NMT model; the only difference is that it performs decoding in a right-to-left way. Given the source-side hidden state vectors of the encoder and all target words generated previously, the backward decoder models how to reversely produce the next target word. Using this decoder, we calculate the conditional probability of the reverse translation $\overleftarrow{y} = (y_0, y_1, y_2, \ldots, y_M)$ as follows:

$$P(\overleftarrow{y} \mid x; \theta_e, \theta_b) = \prod_{j=0}^{M} P(y_j \mid y_{>j}, x; \theta_e, \theta_b) = \prod_{j=0}^{M} \overleftarrow{g}(y_{j+1}, \overleftarrow{s}_j, m_j^{eb}), \tag{1}$$

where $\overleftarrow{g}(\cdot)$ is a non-linear function, $\overleftarrow{s}_j$ and $m_j^{eb}$ denote the decoding state and the source-side context vector at the $j$-th time step, respectively, and $M$ indicates the length of the reverse translation. Among $\overleftarrow{s}_j$ and $m_j^{eb}$, $\overleftarrow{s}_j$ is computed by the GRU activation function $f(\cdot)$: $\overleftarrow{s}_j = f(\overleftarrow{s}_{j+1}, y_{j+1}, m_j^{eb})$, and $m_j^{eb}$ is defined by an encoder-backward decoder attention model as the weighted sum of the source annotations $\{h_i\}$:

$$m_j^{eb} = \sum_{i=1}^{N} \alpha_{j,i}^{eb} \, h_i, \tag{2}$$

$$\alpha_{j,i}^{eb} = \frac{\exp(e_{j,i}^{eb})}{\sum_{i'=1}^{N} \exp(e_{j,i'}^{eb})}, \tag{3}$$

$$e_{j,i}^{eb} = (v_a^{eb})^T \tanh(W_a^{eb} \overleftarrow{s}_{j+1} + U_a^{eb} h_i), \tag{4}$$

where $v_a^{eb}$, $W_a^{eb}$ and $U_a^{eb}$ are the parameters of the encoder-backward decoder attention model. In doing so, the decoder is also able to automatically select the effective source words to reversely predict target words.

By introducing this backward decoder, our NMT model is able to better exploit target-side contexts for translation prediction. In addition to generating the target word sequence, and more importantly, our backward decoder produces the target-side hidden states $\overleftarrow{s}$, which essentially capture richer reverse target-side contexts for the further use of the forward decoder.

Figure 1: The architecture of the proposed NMT model. Note that the forward decoder directly attends to the reverse hidden state sequence $\overleftarrow{s} = \{\overleftarrow{s}_0, \overleftarrow{s}_1, \ldots, \overleftarrow{s}_M\}$ rather than the word sequence produced by the backward decoder.

The Neural Forward Decoder

The neural forward decoder of our model is extended from the decoder of the dominant NMT model. It performs decoding in a left-to-right manner under the semantic guidance of source-side and reverse target-side contexts, which are separately captured by the encoder and the backward decoder. The forward decoder is trained to sequentially predict the next target word given the source-side hidden state vectors of the encoder, the reverse target-side hidden state sequence generated by the backward decoder, and all target words generated previously. Formally, the conditional probability of the translation $y = (y_0, y_1, \ldots, y_M)$ is defined as follows:

$$P(y \mid x; \theta_e, \theta_b, \theta_f) = \prod_{j=0}^{M} P(y_j \mid y_{<j}, x; \theta_e, \theta_b, \theta_f) = \prod_{j=0}^{M} g(y_{j-1}, s_j, m_j^{ef}, m_j^{bf}), \tag{5}$$

where $g(\cdot)$ is a non-linear function, $s_j$ is the decoding state, and $m_j^{ef}$ and $m_j^{bf}$ denote the source-side and reverse target-side context vectors at the $j$-th timestep, respectively. As illustrated in Fig. 1, we use the first hidden state of the reverse encoder, denoted as $\overleftarrow{h}_1$, to initialize the first hidden state $s_0$ of the forward decoder.
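The additive attention of Eqs. (2)-(4) can be illustrated with a small sketch. The code below is not the authors' implementation: it is a minimal pure-Python version written under the assumption that vectors are plain lists of floats and matrices are lists of rows, and the names `matvec` and `additive_attention` are hypothetical helpers introduced only for this illustration. The `query` argument plays the role of the previous decoder state ($\overleftarrow{s}_{j+1}$ for the backward decoder), and `annotations` plays the role of $\{h_i\}$.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, x: Sequence[float]) -> Vector:
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


def additive_attention(
    query: Sequence[float],
    annotations: Sequence[Sequence[float]],
    W_a: Matrix,
    U_a: Matrix,
    v_a: Sequence[float],
) -> Tuple[Vector, Vector]:
    """Return (weights, context) in the spirit of Eqs. (2)-(4).

    query       -- previous decoder state (illustrative stand-in)
    annotations -- the vectors being attended to
    W_a, U_a, v_a -- attention parameters (illustrative stand-ins)
    """
    projected_query = matvec(W_a, query)

    # Eq. (4): one scalar energy per attended position.
    energies = []
    for h in annotations:
        projected_h = matvec(U_a, h)
        hidden = [math.tanh(q + k) for q, k in zip(projected_query, projected_h)]
        energies.append(sum(v * t for v, t in zip(v_a, hidden)))

    # Eq. (3): softmax over positions (max subtracted for numerical stability).
    max_e = max(energies)
    exps = [math.exp(e - max_e) for e in energies]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Eq. (2): context vector as the weighted sum of the annotations.
    dim = len(annotations[0])
    context = [sum(w * h[d] for w, h in zip(weights, annotations)) for d in range(dim)]
    return weights, context
```

The same function shape would also cover the other attention models of the paper, since they differ only in which state is used as the query and which sequence is attended to; that reuse is a design choice of this sketch, not something the paper prescribes.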
More importantly, we introduce two attention models to respectively capture the source-side and reverse target-side contexts: one is the encoder-forward decoder attention model that focuses on the source annotations, and the other is the backward decoder-forward decoder attention model that considers all reverse target-side hidden states. Specifically, we produce $m_j^{ef}$ from the hidden states $\{h_i\}$ of the encoder as follows:

$$m_j^{ef} = \sum_{i=1}^{N} \alpha_{j,i}^{ef} \, h_i, \tag{6}$$

$$\alpha_{j,i}^{ef} = \frac{\exp(e_{j,i}^{ef})}{\sum_{i'=1}^{N} \exp(e_{j,i'}^{ef})}, \tag{7}$$

$$e_{j,i}^{ef} = (v_a^{ef})^T \tanh(W_a^{ef} s_{j-1} + U_a^{ef} h_i), \tag{8}$$

where $v_a^{ef}$, $W_a^{ef}$, and $U_a^{ef}$ are the parameters of the encoder-forward decoder attention model. Note that we directly choose the hidden state sequence rather than the word sequence to model the target-side contexts, because the former enables our model to better avoid, to some extent, the negative effect of translation prediction errors. Likewise, we define $m_j^{bf}$ as a weighted sum of the hidden states $\{\overleftarrow{s}_{j'}\}$ of the backward decoder:

$$m_j^{bf} = \sum_{j'=0}^{M} \alpha_{j,j'}^{bf} \, \overleftarrow{s}_{j'}, \tag{9}$$

$$\alpha_{j,j'}^{bf} = \frac{\exp(e_{j,j'}^{bf})}{\sum_{j''=0}^{M} \exp(e_{j,j''}^{bf})}, \tag{10}$$

$$e_{j,j'}^{bf} = (v_a^{bf})^T \tanh(W_a^{bf} s_{j-1} + U_a^{bf} \overleftarrow{s}_{j'}), \tag{11}$$

where $v_a^{bf}$, $W_a^{bf}$, and $U_a^{bf}$ are the parameters of the backward decoder-forward decoder attention model.

Then, we incorporate $m_j^{ef}$ and $m_j^{bf}$ into the GRU hidden unit of the forward decoder. Formally, the hidden state $s_j$ of the forward decoder is computed by

$$s_j = (1 - z_j^d) \odot s_{j-1} + z_j^d \odot \tilde{s}_j, \qquad \tilde{s}_j = \tanh(W_d \, v(y_{j-1}) + U_d [r_j^d \odot s_{j-1}] + C_{ef} \, m_j^{ef} + C_{bf} \, m_j^{bf}), \tag{12}$$

where $W_d$, $U_d$, $C_{ef}$, and $C_{bf}$ are the weight matrices, and $z_j^d$ and $r_j^d$ are the update and reset gates of the GRU, respectively, depending on $y_{j-1}$, $s_{j-1}$, $m_j^{ef}$ and $m_j^{bf}$. Finally, we further define the probability of $y_j$ as

$$p(y_j \mid y_{<j}, x; \theta_e, \theta_b, \theta_f) \propto \exp(g(y_{j-1}, s_j, m_j^{ef}, m_j^{bf})), \tag{13}$$

where $y_{j-1}$, $s_j$, $m_j^{ef}$ and $m_j^{bf}$ are concatenated and fed through a single feed-forward layer.

Training and Testing

Given a training corpus $D = \{(x, y)\}$, we train the proposed model according to the following objective:

$$J(D; \theta_e, \theta_b, \theta_f) = \frac{1}{|D|} \arg\max_{\theta_e, \theta_b, \theta_f} \sum_{(x,y) \in D} \{\lambda \log P(y \mid x; \theta_e, \theta_b, \theta_f) + (1 - \lambda) \log P(\overleftarrow{y} \mid x; \theta_e, \theta_b)\}, \tag{14}$$

where $\overleftarrow{y}$ is obtained by inverting $y$, and $\lambda$ is a hyper-parameter used to balance the preference between the two terms. The first term $\log P(y \mid x; \theta_e, \theta_b, \theta_f)$ models the translation procedure illustrated in Figure 1. To ensure the consistency between model training and testing, we perform beam search to generate the reverse hidden states $\overleftarrow{s}$ when optimizing $\log P(y \mid x; \theta_e, \theta_b, \theta_f)$. In addition, to guarantee that the $\overleftarrow{s}$ produced by beam search is of high quality, we further introduce the second term $\log P(\overleftarrow{y} \mid x; \theta_e, \theta_b)$ to maximize the conditional likelihood of $\overleftarrow{y}$. Note that beam search has high time complexity; therefore, we directly adopt greedy search to implement right-to-left decoding, which proves to be sufficiently effective in our experiments.
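To make the role of $\lambda$ in Eq. (14) concrete, the following sketch evaluates the interpolated log-likelihood for given per-token probabilities. It is only an illustration under stated assumptions: the per-token probabilities are supplied directly as lists of floats (in a real system they would come from the forward and backward decoders), and the function names `sequence_log_prob`, `joint_objective` and `corpus_objective` are hypothetical, not taken from the paper or its code.

```python
import math
from typing import Sequence, Tuple


def sequence_log_prob(token_probs: Sequence[float]) -> float:
    """Log-probability of a sequence given its per-token probabilities."""
    return sum(math.log(p) for p in token_probs)


def joint_objective(
    forward_probs: Sequence[float],
    backward_probs: Sequence[float],
    lam: float = 0.7,
) -> float:
    """lam * log P(y | x) + (1 - lam) * log P(reversed y | x) for one pair."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    forward_term = sequence_log_prob(forward_probs)
    backward_term = sequence_log_prob(backward_probs)
    return lam * forward_term + (1.0 - lam) * backward_term


def corpus_objective(
    pairs: Sequence[Tuple[Sequence[float], Sequence[float]]],
    lam: float = 0.7,
) -> float:
    """Average of the per-pair objective over a (non-empty) corpus."""
    return sum(joint_objective(f, b, lam) for f, b in pairs) / len(pairs)
```

With `lam = 1.0` the backward term vanishes and only the forward translation likelihood is scored; smaller values give more weight to the likelihood of the reversed target, which is the term that keeps the backward decoder's hidden states informative.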
Once the proposed model is trained, we adopt a two-phase scheme to translate the unseen input sentence $x$: First, we use the backward decoder with greedy search to sequentially generate $\overleftarrow{s}$ until the target-side start symbol ⟨s⟩ occurs with the highest probability. Then, we perform beam search on the forward decoder to find the best translation that approximately maximizes $\log P(y \mid x; \theta_e, \theta_b, \theta_f)$.

Experiments

We evaluated the proposed model on NIST Chinese-English and WMT English-German translation tasks.

Setup

For Chinese-English translation, the training data consists of 1.25M bilingual sentences with 27.9M Chinese words and 34.5M English words. These sentence pairs are mainly extracted from LDC2002E18, LDC2003E07, LDC2003E14, the Hansards portion of LDC2004T07, LDC2004T08 and LDC2005T06. We chose the NIST 2002 (MT02) dataset as our development set, and the NIST 2003 (MT03), 2004 (MT04), 2005 (MT05), and 2006 (MT06) datasets as our test sets. Finally, we evaluated the translations using BLEU (Papineni et al. 2002).

For English-German translation, we used WMT 2015 training data that contains 4.46M sentence pairs with 116.1M English words and 108.9M German words. Particularly, we segmented words via byte pair encoding (BPE) (Sennrich, Haddow, and Birch 2016b). The news-test 2013 was used as the development set and the news-test 2015 as the test set.

To efficiently train NMT models, we trained each model with sentences of length up to 50 words. In doing so, 90.12% and 89.03% of the Chinese-English and English-German parallel sentences were covered in the experiments. Besides, we set the vocabulary size to 30K for Chinese-English translation, and 50K for English-German translation, and mapped all the out-of-vocabulary words in the Chinese-English corpus to a special token UNK. Finally, such vocabularies contained 97.4% Chinese words and 99.3% English words of the Chinese-English corpus, and almost 100.0% English words and 98.2% German words of the English-German corpus, respectively.
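The preprocessing just described (length filtering, a fixed-size vocabulary, UNK mapping, and vocabulary coverage) can be sketched as follows. This is a simplified illustration rather than the authors' pipeline: it assumes sentences are already tokenized into lists of strings, that the vocabulary keeps the most frequent tokens, and that the length limit is applied to both sides of a sentence pair. All function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Set, Tuple

UNK = "UNK"  # special token for out-of-vocabulary words

Sentence = List[str]


def filter_by_length(
    pairs: Iterable[Tuple[Sentence, Sentence]], max_len: int = 50
) -> List[Tuple[Sentence, Sentence]]:
    """Keep pairs whose source and target both have at most max_len tokens."""
    return [(s, t) for s, t in pairs if len(s) <= max_len and len(t) <= max_len]


def build_vocab(sentences: Iterable[Sequence[str]], size: int) -> Set[str]:
    """Return the `size` most frequent tokens (assumed selection criterion)."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return {tok for tok, _ in counts.most_common(size)}


def map_oov(sentence: Sequence[str], vocab: Set[str]) -> Sentence:
    """Replace every token outside the vocabulary with UNK."""
    return [tok if tok in vocab else UNK for tok in sentence]


def coverage(sentences: Sequence[Sequence[str]], vocab: Set[str]) -> float:
    """Fraction of running tokens that are covered by the vocabulary."""
    total = sum(len(sent) for sent in sentences)
    known = sum(1 for sent in sentences for tok in sent if tok in vocab)
    return known / total if total else 0.0
```

For example, `coverage(corpus, build_vocab(corpus, 30000))` would give the kind of token-level coverage figure quoted above for a 30K vocabulary, assuming `corpus` holds the tokenized training sentences of one language.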
We applied Rmsprop (Graves 2013) (momentum = 0, ρ = 0.95, and ε = ) to train models for 5 epochs and selected the best model parameters according to the model performance on the development set. During this procedure, we set the following hyper-parameters: word embedding dimension as 620, hidden layer size as 1000, learning rate as , batch size as 80, gradient norm as 1.0, and dropout rate as 0.3. All the other settings are the same as in (Bahdanau, Cho, and Bengio 2015).

Baselines

We compared the proposed model against the following state-of-the-art SMT and NMT systems:
- Moses: an open source phrase-based translation system with default configuration and a 4-gram language model trained on the target portion of training data. Note that we used all data to train Moses.
- RNNSearch: a re-implementation of the attention-based NMT system (Bahdanau, Cho, and Bengio 2015) with slight changes from the dl4mt tutorial.
- RNNSearch(R2L): a variant of RNNSearch that produces translation in a right-to-left direction.

- ATNMT: an attention-based NMT system with two directional decoders (Liu et al. 2016) which explores the agreement on target-bidirectional NMT. Using this model, we first run beam search for the forward and backward models independently to obtain two k-best lists, and then re-score the combination of these two lists using the joint model to find the best candidate. Following (Liu et al. 2016), we set the beam sizes of both decoders to 10. Note that we replaced the LSTM adopted in (Liu et al. 2016) with GRU to ensure a fair comparison.
- NSC(RT): a variant of the neural system combination framework proposed by Zhou et al. (2017). It first uses an attentional NMT model consisting of one standard encoder and one backward decoder to produce the best reverse translation. Then, another attentional NMT model generates the final output from its standard encoder and a reverse translation encoder which embeds the best reverse translation, in a way similar to the multi-source NMT model (Zoph and Knight 2016). This model differs from ours in two aspects: (1) it is not an end-to-end model, and (2) it considers the embedded hidden states of the reverse translation, while our model considers the hidden states produced by the backward decoder.
- NSC(HS): similar to NSC(RT), with the only difference that it directly considers the reverse hidden states produced by the backward decoder.

We set the beam sizes of all above-mentioned models to 10, and the beam sizes of the backward and forward decoders of our model to 1 and 10, respectively.

Table 2 (columns: SYSTEM, MT03, MT04, MT05, MT06, Average; rows: COVERAGE, MemDec, DeepLAU, DMAtten, Moses, RNNSearch, RNNSearch(R2L), ATNMT, NSC(RT), NSC(HS), Our Model): Evaluation of the NIST Chinese-English translation task using case-insensitive BLEU scores (λ=0.7). Here we displayed the experimental results of the first four models reported in (Wang et al. 2017; Zhang et al. 2017). COVERAGE (Tu et al. 2016) is a basic NMT model with a coverage model. MemDec (Wang et al. 2016) improves translation quality with external memory. DeepLAU (Wang et al. 2017) reduces the gradient propagation length inside the recurrent unit of RNN-based NMT. DMAtten (Zhang et al. 2017) incorporates word reordering knowledge into attentional NMT.

Results on Chinese-English Translation

Parameters. The RNNSearch, RNNSearch(R2L), ATNMT, NSC(RT), and NSC(HS) models have 85.6M, 85.6M, 171.2M, 120.0M and 130.0M parameters, respectively. By contrast, the parameter size of our model is about 130.0M.

Speed. We used a single 1080Ti GPU to train models. In one hour, 6,500, 6,500, 6,500, 4,700 and 3,708 mini-batches are processed for the RNNSearch, RNNSearch(R2L), ATNMT, NSC(RT), and NSC(HS) models, respectively. The training speed of the proposed model is relatively slow: about 1,758 mini-batches are processed in one hour.

We first investigated the impact of the hyper-parameter λ (see Eq. (14)) on the development set. To this end, we gradually varied λ from 0.5 to 1.0 with an increment of 0.1 in each step. As shown in Fig. 2, our model achieved the best performance when λ=0.7. Therefore, we set λ=0.7 for all experiments thereafter.

Figure 2: Experiment results on the development set using different λs.

The experimental results on Chinese-English translation are depicted in Table 2. We also displayed the performances of some dominant individual models such as COVERAGE (Tu et al. 2016), MemDec (Wang et al. 2016), DeepLAU (Wang et al. 2017) and DMAtten (Zhang et al. 2017) on the same data set. Specifically, the proposed model significantly outperforms Moses, RNNSearch, RNNSearch(R2L), ATNMT, NSC(RT) and NSC(HS) by 7.38, 3.14, 3.26, 1.86, 2.34, and 1.92 BLEU points, respectively. Even when compared with (Tu et al. 2016; Wang et al. 2016; 2017; Zhang et al. 2017), our model still has better performance in the same setting.
Moreover, we draw the following conclusions:

(1) In contrast to RNNSearch and RNNSearch(R2L), our model exhibits much better performance. These results support our hypothesis that the forward and backward decoders are complementary to each other in target-side context modeling, and therefore, the simultaneous exploration of bidirectional decoders leads to better translations.

(2) On all test sets, our model outperforms ATNMT, which indicates that, compared with k-best hypotheses rescoring (Liu et al. 2016), joint modeling with attention to reverse hidden states is better at exploiting reverse target-side contexts. The underlying reason is that the reverse hidden states encode richer target-side contexts than a single translation. In addition, compared with k-best hypotheses rescoring, our model can refine the translation at a more fine-grained level via the attention mechanism.

(3) Particularly, the fact that NSC(HS) outperforms NSC(RT) reveals the advantage of the reverse hidden state representations of the backward decoder in overcoming data sparsity. Besides, our model performs better than NSC(HS), which accords with our intuition that, to some extent, the joint model is able to alleviate error propagation when encoding target-side contexts.

(4) Note that the performance of our model is better than that of our model (RR). This result verifies our speculation that model training with the translations obtained by greedy search is superior due to the consistency between the training and testing procedures.

Finally, based on the length of source sentences, we divided our test sets into different groups and then compared the system performances in each group. Fig. 3 illustrates the BLEU scores on these groups of test sets. We observe that our model achieves the best performance in all groups, although the performances of all systems drop as the length of source sentences increases.

Figure 3: BLEU scores on different translation groups divided according to source sentence length. (x-axis: Sentence Length, with groups [1,10], [11,20], [21,30], [31,40], [41,50], [51,...]; y-axis: BLEU Score; systems: RNNSearch, RNNSearch(R2L), ATNMT, NSC(RT), NSC(HS), Our Model.)
These results clearly demonstrate once again the effectiveness of our model.

Case Study

To better understand how our model outperforms others, we studied the 1-best translations using different models. Table 3 provides a Chinese-English translation example. We find that RNNSearch produces the translation with a good prefix, while RNNSearch(R2L) generates the translation with a desirable suffix. Although there are various models with bidirectional decoding that could exploit bidirectional contexts, most of them are unable to translate the whole sentence precisely, and our model is the only one that produces a high-quality translation in this case.

Results on English-German Translation

To make our experiments more convincing, we also report the results of some other systems on the same data set, including BPEChar (Chung, Cho, and Bengio 2016), RecAtten (Yang et al. 2017), and ConvEncoder (Gehring et al. 2017). We determined the optimal λ as 0.8 according to the performance of our model on the development set.

Table 4 (columns: SYSTEM, TEST; rows: BPEChar, RecAtten, ConvEncoder, Moses, RNNSearch, RNNSearch(R2L), ATNMT, NSC(RT), NSC(HS), Our Model): Evaluation of the WMT English-German translation task using case-sensitive BLEU scores (λ=0.8). We directly cited the experimental results of the first three models provided by (Gehring et al. 2017). BPEChar (Chung, Cho, and Bengio 2016) is an attentional NMT model with a character-level decoder. RecAtten (Yang et al. 2017) uses a recurrent attention model to explicitly model the dependence between attentions among target words. ConvEncoder (Gehring et al. 2017) introduces a convolutional encoder into NMT.

Table 4 presents the results on English-German translation. Our model still significantly outperforms the others, including some dominant NMT systems with other improved techniques. We believe that our work can be applied to other architectures easily.
It should be noted that the BLEU score gaps between our model and the others on English-German translation are much smaller than those on Chinese-English translation. The underlying reasons lie in the following two aspects, which have also been mentioned in (Shen et al. 2016). First, the Chinese-English datasets contain four reference translations for each sentence, while the English-German dataset has only a single reference. Second, compared with German, Chinese is more distantly related to English, leading to the predominant advantage of utilizing target-side contexts in Chinese-English translation.

Source: yīyuè kāishǐ, zǒngwùshěng jiāng yǒu liù míng zhíyuán yī zhōu zhìshǎo yī tiān bù xūyào jìn bàngōngshì, kěyǐ zài jiā lǐ, dàxué huò túshūguǎn tòuguò gāosù wǎnglùo fúwù gōngzuò.
Reference: starting from january, the ministry of internal affairs and communications will have six employees who do n't need to go to their offices at least one day a week ; instead they may work from home, universities or libraries through high - speed internet services.
Moses: since january, there will be six staff members a week for least one day in office, they can at home, university or through high - speed internet library services.
RNNSearch: as early as january, six staff members will not be required to enter office at least one day in one week, which can be done through high - speed internet services through high - speed internet services.
RNNSearch(R2L): beginning in january, least six staff members have to go to the office for least one week and can work at home, and university or library through high - speed internet services.
ATNMT: at the beginning of january, there will be six staff members to go to office least one week, which can be done through high - speed internet services at home and university or libraries.
NSC(RT): at least six staff members will leave office for least one week at least one week, and can work at home and university or library through high - speed internet services.
NSC(HS): in january, there will be six staff members who are required to enter offices for at least one day at least one day, and we can work at home, university or library through high - speed internet services.
Our Model: starting in january, six staff members will not need to enter the office at least one day in one week, and they can work at home, universities or libraries through high - speed internet services.
Table 3: Translation examples of different systems. Texts highlighted in wavy lines are incorrectly translated. Please note that the translations produced by RNNSearch and RNNSearch(R2L) are complementary to each other, and the translation generated by our model is the most accurate and complete.

Related Work

In this work, we mainly focus on how to exploit bidirectional decoding to refine translation, which has always been a research focus in machine translation.

In SMT, many approaches based on a backward language model (BLM) or target-bidirectional decoding have been explored to capture right-to-left target-side contexts for translation. For example, Watanabe and Sumita (2002) explored two decoding methods: one is right-to-left decoding based on the left-to-right beam search algorithm; the other decodes in both directions and merges the two hypothesized partial sentences into one. Finch and Sumita (2009) integrated both mono-directional approaches to reduce the effects caused by language specificity. Particularly, they integrated the BLM into their reverse translation decoder. Beyond left-to-right decoding, Zhang et al. (2013) studied the effects of multiple decomposition structures as well as dynamic bidirectional decomposition on SMT.

When it comes to NMT, the dominant RNN-based NMT models also perform translation in a left-to-right manner, leading to the same drawback of underutilizing target-side contexts. To address this issue, Liu et al. (2016) first jointly train LSTM models for both directions, and then at test time search for target-side translations which are supported by both models. Similarly, Sennrich et al. (2016a) attempted to re-rank the left-to-right decoding results by right-to-left decoding, leading to diversified translation results. Recently, Hoang et al. (2017) proposed an approximate inference framework based on continuous optimization that enables decoding with bidirectional translation models. Finally, it is noteworthy that our work is also related to pre-translation (Niehues et al. 2016; Zhou et al. 2017) and neural automatic post-editing (Pal et al. 2017; Dowmunt and Grundkiewicz 2017) for NMT, because our model involves two stages of translation.

Overall, the most relevant models include (Liu et al. 2016; Sennrich, Haddow, and Birch 2016a; Hoang, Haffari, and Cohn 2017; Zhou et al. 2017; Pal et al. 2017; Dowmunt and Grundkiewicz 2017).
Our model significantly differs from these works in the following aspects: 1) The motivation of our work differs from theirs. Specifically, in this work, we aim to fully exploit the reverse target-side contexts encoded by right-to-left hidden state vectors to improve NMT with left-to-right decoding. In contrast, Liu et al. (2016), Sennrich et al. (2016a), and Hoang et al. (2017) investigated how to exploit bidirectional decoding scores to produce better translations; both Niehues et al. (2016) and Zhou et al. (2017) intended to combine the advantages of NMT and SMT; and (Pal et al. 2017; Dowmunt and Grundkiewicz 2017) explored multiple neural architectures for the task of automatic post-editing of machine translation output. 2) Our model attends to right-to-left hidden state vectors, while (Niehues et al. 2016; Zhou et al. 2017; Pal et al. 2017; Dowmunt and Grundkiewicz 2017) considered the raw best output of the machine translation system instead. 3) Our model is an end-to-end NMT model, while the bidirectional decoders adopted in (Liu et al. 2016; Sennrich, Haddow, and Birch 2016a; Hoang, Haffari, and Cohn 2017) were independent from each other, and the component used to produce the raw translation was independent from the NMT model in (Niehues et al. 2016; Zhou et al. 2017; Pal et al. 2017; Dowmunt and Grundkiewicz 2017).

Conclusions and Future Work

In this paper, we have equipped the conventional attentional encoder-decoder NMT model with a backward decoder. In our model, the backward decoder first produces hidden state vectors encoding reverse target-side contexts. Then, the two individual hidden state sequences generated by the encoder and the backward decoder are simultaneously exploited via attention mechanisms by the forward decoder for translation. Compared with the previous models, ours is an end-to-end NMT model that fully utilizes reverse target-side contexts for translation.
Experimental results on Chinese-English and English-German translation tasks demonstrate the effectiveness of our model.

Our model is generally applicable to other models with an RNN-based decoder. Therefore, the effectiveness of our approach on other tasks related to RNN-based decoder modeling, such as image captioning, will be investigated in future research. Moreover, in our work, the attention mechanisms acting on the encoder and the backward decoder are independent from each other. However, intuitively, these two mechanisms should be closely associated with each other. Therefore, we are interested in exploring better ways of combining the attention mechanisms to further refine our model.

Acknowledgments

The authors were supported by National Natural Science Foundation of China (Nos , and ), Scientific Research Project of National Language Committee of China (Grant No. YB135-49), Natural Science Foundation of Fujian Province of China (No. 2016J05161), and National Key R&D Program of China (Nos. 2017YFC and 2016YFB ). We also thank the reviewers for their insightful comments.

References

Bahdanau, D.; Cho, K.; and Bengio, Y. 2015. Neural machine translation by jointly learning to align and translate. In Proc. of ICLR2015.
Chiang, D. 2007. Hierarchical phrase-based translation. Computational Linguistics 33.
Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proc. of EMNLP2014.
Chung, J.; Cho, K.; and Bengio, Y. 2016. A character-level decoder without explicit segmentation for neural machine translation. In Proc. of ACL2016.
Dowmunt, M. J., and Grundkiewicz, R. 2017. An exploration of neural sequence-to-sequence architectures for automatic post-editing. In arXiv: v1.
Finch, A., and Sumita, E. 2009. Bidirectional phrase-based statistical machine translation. In Proc. of EMNLP2009.
Gehring, J.; Auli, M.; Grangier, D.; and Dauphin, Y. 2017. A convolutional encoder model for neural machine translation. In Proc. of ACL2017.
Graves, A. 2013. Generating sequences with recurrent neural networks. In arXiv: v5.
Hoang, C. D. V.; Haffari, G.; and Cohn, T. 2017. Decoding as continuous optimization in neural machine translation. In arXiv.
Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation.
Kalchbrenner, N., and Blunsom, P. 2013. Recurrent continuous translation models. In Proc. of EMNLP2013.
Koehn, P.; Och, F. J.; and Marcu, D. 2003. Statistical phrase-based translation. In Proc. of NAACL2003.
Liu, L.; Utiyama, M.; Finch, A.; and Sumita, E. 2016. Agreement on target-bidirectional neural machine translation. In Proc. of NAACL2016.
Niehues, J.; Cho, E.; Ha, T.-L.; and Waibel, A. 2016. Pre-translation for neural machine translation. In Proc. of COLING2016.
Pal, S.; Naskar, S. K.; Vela, M.; Liu, Q.; and van Genabith, J. 2017. Neural automatic post-editing using prior alignment and reranking. In Proc. of EACL2017.
Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W. 2002. Bleu: A method for automatic evaluation of machine translation. In Proc. of ACL2002.
Sennrich, R.; Haddow, B.; and Birch, A. 2016a. Edinburgh neural machine translation systems for WMT 16. In arXiv: v2.
Sennrich, R.; Haddow, B.; and Birch, A. 2016b. Neural machine translation of rare words with subword units. In Proc. of ACL2016.
Shen, S.; Cheng, Y.; He, Z.; He, W.; Wu, H.; Sun, M.; and Liu, Y. 2016. Minimum risk training for neural machine translation. In Proc. of ACL2016.
Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In Proc. of NIPS2014.
Tu, Z.; Lu, Z.; Liu, Y.; Liu, X.; and Li, H. 2016. Modeling coverage for neural machine translation. In Proc. of ACL2016.
Wang, M.; Lu, Z.; Li, H.; and Liu, Q. 2016. Memory-enhanced decoder for neural machine translation. In Proc. of EMNLP2016.
Wang, M.; Lu, Z.; Zhou, J.; and Liu, Q. 2017. Deep neural machine translation with linear associative unit. In Proc. of ACL2017.
Watanabe, T., and Sumita, E. 2002. Bidirectional decoding for statistical machine translation. In Proc. of COLING2002.
Yang, Z.; Hu, Z.; Deng, Y.; Dyer, C.; and Smola, A. 2017. Neural machine translation with recurrent attention modeling. In Proc. of EACL2017.
Zhang, H.; Toutanova, K.; Quirk, C.; and Gao, J. 2013. Beyond left-to-right: Multiple decomposition structures for SMT. In Proc. of NAACL2013.
Zhang, J.; Wang, M.; Liu, Q.; and Zhou, J. 2017. Incorporating word reordering knowledge into attention-based neural machine translation. In Proc. of ACL2017.
Zhou, L.; Hu, W.; Zhang, J.; and Zong, C. 2017. Neural system combination for machine translation. In Proc. of ACL2017.
Zoph, B., and Knight, K. 2016. Multi-source neural translation. In Proc. of NAACL2016.


The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18) Asynchronous Bidirectional Decoding for Neural Machine Translation Xiangwen Zhang, 1 Jinsong Su, 1 Yue Qin, 1 Yang Liu, 2 Rongrong