arxiv: v1 [cs.cl] 16 Jan 2018


Asynchronous Bidirectional Decoding for Neural Machine Translation

Xiangwen Zhang (1), Jinsong Su (1), Yue Qin (1), Yang Liu (2), Rongrong Ji (1), Hongji Wang (1)
(1) Xiamen University, Xiamen, China
(2) Tsinghua University, Beijing, China
xwzhang@stu.xmu.edu.cn, ssu@xmu.edu.cn, qinyue@stu.xmu.edu.cn, liuyang2011@tsinghua.edu.cn, rri@xmu.edu.cn, hw@xmu.edu.cn

Abstract

The dominant neural machine translation (NMT) models apply unified attentional encoder-decoder neural networks for translation. Traditionally, NMT decoders adopt recurrent neural networks (RNNs) to perform translation in a left-to-right manner, leaving the target-side contexts generated from right to left unexploited during translation. In this paper, we equip the conventional attentional encoder-decoder NMT framework with a backward decoder, in order to explore bidirectional decoding for NMT. Attending to the hidden state sequence produced by the encoder, our backward decoder first learns to generate the target-side hidden state sequence from right to left. Then, the forward decoder performs translation in the forward direction, while at each translation prediction timestep it simultaneously applies two attention models to consider the source-side and reverse target-side hidden states, respectively. With this new architecture, our model is able to fully exploit source- and target-side contexts to improve translation quality altogether. Experimental results on NIST Chinese-English and WMT English-German translation tasks demonstrate that our model achieves substantial improvements over the conventional NMT by 3.14 and 1.38 BLEU points, respectively. The source code of this work can be obtained from NMT.

Introduction

Recently, end-to-end neural machine translation (NMT) (Kalchbrenner and Blunsom 2013; Sutskever, Vinyals, and Le 2014; Cho et al. 2014) has achieved promising results and gained increasing attention.
Compared with conventional statistical machine translation (SMT) (Koehn, Och, and Marcu 2003; Chiang 2007), which needs explicitly designed features to capture translation regularities, NMT aims to construct a unified encoder-decoder framework based on neural networks to model the entire translation process. Further, the introduction of the attention mechanism (Bahdanau, Cho, and Bengio 2015) enhances the capability of NMT in capturing long-distance dependencies. Despite being a relatively new framework, attentional encoder-decoder NMT has quickly become the de facto method.

Corresponding author. Copyright © 2018, Association for the Advancement of Artificial Intelligence. All rights reserved.

Table 1: Translation examples of NMT systems with different decoding manners. L2R/R2L denotes the translation produced by the NMT system with left-to-right/right-to-left decoding. Texts highlighted in wavy/dashed lines are incorrect/correct translations, respectively.
Source: rì fángwèitīng zhǎngguān : bú wàng jūnguó lìshǐ zūnzhòng línguó zūnyán
Reference: japan defense chief : never forget militaristic history , respect neighboring nations ' dignity
L2R: japan 's defense agency chief : death of militarism respects its neighbors ' dignity
R2L: japanese defense agency has never forgotten militarism 's history to respect the dignity of neighboring countries

Generally, most NMT decoders are based on recurrent neural networks (RNNs) and generate translations in a left-to-right manner. Thus, despite the advantage of encoding an unbounded number of previously predicted target words for the prediction at each time step, these decoders are incapable of capturing the reverse target-side context for translation. Once errors occur in previous predictions, the quality of subsequent predictions is undermined by the noisy forward-encoded target-side contexts.
Intuitively, the reverse target-side contexts are also crucial for translation predictions, since they not only provide complementary signals but also bring different biases to the NMT model (Hoang, Haffari, and Cohn 2017). Take the example in Table 1. The latter half of the Chinese sentence, misinterpreted by the conventional NMT system, is accurately translated by the NMT system with right-to-left decoding. Therefore, it is important to investigate how to integrate reverse target-side contexts into the decoder to improve the translation performance of NMT.

To this end, many researchers have resorted to introducing bidirectional decoding into NMT (Liu et al. 2016; Sennrich, Haddow, and Birch 2016a; Hoang, Haffari, and Cohn 2017). Most of them re-rank candidate translations using the scores of both decoding directions, in order to select a translation with both a proper prefix and a proper suffix. However, such methods come with drawbacks that limit the potential of bidirectional decoding in NMT. On the one hand,

due to the limited search space and search errors of beam search, the generated 1-best translation is often far from satisfactory and thus fails to provide sufficient information as a complement for the other decoder. On the other hand, because the bidirectional decoders are often independent from each other during translation, the unidirectional decoder is unable to fully exploit target-side contexts produced by the other decoder, and consequently the generated candidate translations are still undesirable. Therefore, how to effectively exert the influence of bidirectional decoding on NMT is still worthy of further study.

In this paper, we significantly extend the conventional attentional encoder-decoder NMT framework by introducing a backward decoder, for the purpose of fully exploiting reverse target-side contexts to improve NMT. As shown in Fig. 1, along with our novel asynchronous bidirectional decoders, the proposed model remains an end-to-end attentional NMT framework, which mainly consists of three components: 1) an encoder embedding the input source sentence into bidirectional hidden states; 2) a backward decoder that is similar to the conventional NMT decoder but performs translation in the right-to-left manner, where the generated hidden states encode the reverse target-side contexts; 3) a forward decoder that generates the final translation from left to right and introduces two attention models that simultaneously consider the source-side bidirectional and target-side reverse hidden state vectors for translation prediction.

Compared with the previous related NMT models, our model has the following advantages:
1) The backward decoder learns to produce hidden state vectors that essentially encode the semantics of potential hypotheses, allowing the following forward decoder to utilize richer target-side contexts for translation.
2) By integrating right-to-left target-side context modeling and left-to-right translation generation into an end-to-end joint framework, our model alleviates the error propagation of reverse target-side context modeling to some extent.

The major contributions of this paper are summarized as follows:
1) We thoroughly analyze and point out the existing drawbacks of research on NMT with bidirectional decoding.
2) We introduce a backward decoder to encode the right-to-left target-side contexts, as a supplement to the conventional context modeling mechanism of NMT. To the best of our knowledge, this is the first attempt to investigate the effectiveness of the end-to-end attentional NMT model with asynchronous bidirectional decoders.
3) Experiments on Chinese-English and English-German translation show that our model achieves significant improvements over the conventional NMT model.

Our Model

As described above, our model mainly includes three components: 1) a neural encoder with parameter set $\theta_e$; 2) a neural backward decoder with parameter set $\theta_b$; and 3) a neural forward decoder with parameter set $\theta_f$, which will be elaborated in the following subsections. In particular, we choose the Gated Recurrent Unit (GRU) (Cho et al. 2014) to build the encoder and decoders, as it is widely used in the NMT literature and requires relatively few parameters. However, it should be noted that our model is also applicable to other RNNs, such as Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber 1997).

The Neural Encoder

The neural encoder of our model is identical to that of the dominant NMT model, which is modeled using a bidirectional RNN. The forward RNN reads a source sentence $x = x_1, x_2, \ldots, x_N$ in left-to-right order. At each timestep, we apply a recurrent activation function $\phi(\cdot)$ to learn the semantic representation of the word sequence $x_{1:i}$ as $\overrightarrow{h}_i = \phi(\overrightarrow{h}_{i-1}, x_i)$.
Likewise, the backward RNN scans the source sentence in the reverse order and generates the semantic representation $\overleftarrow{h}_i$ of the word sequence $x_{i:N}$. Finally, we concatenate the hidden states of these two RNNs to form an annotation sequence $h = \{h_1, h_2, \ldots, h_i, \ldots, h_N\}$, where $h_i = [\overrightarrow{h}_i^T, \overleftarrow{h}_i^T]^T$ encodes information about the $i$-th word with respect to all the other surrounding words in the source sentence. In our model, these annotations provide source-side contexts for not only the forward decoder but also the backward one via different attention models.

The Neural Backward Decoder

The neural backward decoder of our model is similar to the decoder of the dominant NMT model; the only difference is that it performs decoding in a right-to-left way. Given the source-side hidden state vectors of the encoder and all target words generated previously, the backward decoder models how to produce the next target word in reverse order. Using this decoder, we calculate the conditional probability of the reverse translation $\overleftarrow{y} = (y_0, y_1, y_2, \ldots, y_M)$ as follows:

$$P(\overleftarrow{y} \mid x; \theta_e, \theta_b) = \prod_{j=0}^{M} P(y_j \mid y_{>j}, x; \theta_e, \theta_b) = \prod_{j=0}^{M} \overleftarrow{g}(y_{j+1}, \overleftarrow{s}_j, \overleftarrow{m}^{eb}_j), \tag{1}$$

where $\overleftarrow{g}(\cdot)$ is a non-linear function, $\overleftarrow{s}_j$ and $\overleftarrow{m}^{eb}_j$ denote the decoding state and the source-side context vector at the $j$-th time step, respectively, and $M$ indicates the length of the reverse translation. Here $\overleftarrow{s}_j$ is computed by the GRU activation function $f(\cdot)$: $\overleftarrow{s}_j = f(\overleftarrow{s}_{j+1}, y_{j+1}, \overleftarrow{m}^{eb}_j)$, and $\overleftarrow{m}^{eb}_j$ is defined by an encoder-backward decoder attention model as the weighted sum of the source annotations $\{h_i\}$:

$$\overleftarrow{m}^{eb}_j = \sum_{i=1}^{N} \alpha^{eb}_{j,i} h_i, \tag{2}$$

$$\alpha^{eb}_{j,i} = \frac{\exp(e^{eb}_{j,i})}{\sum_{i'=1}^{N} \exp(e^{eb}_{j,i'})}, \tag{3}$$

$$e^{eb}_{j,i} = (v^{eb}_a)^T \tanh(W^{eb}_a \overleftarrow{s}_{j+1} + U^{eb}_a h_i), \tag{4}$$

where $v^{eb}_a$, $W^{eb}_a$ and $U^{eb}_a$ are the parameters of the encoder-backward decoder attention model. In doing so, the decoder is also able to automatically select the effective source words when predicting target words in reverse order.

By introducing this backward decoder, our NMT model is able to better exploit target-side contexts for translation prediction. Beyond generating the target word sequence, and more importantly, our backward decoder produces the target-side hidden states $\overleftarrow{s}$, which capture richer reverse target-side contexts for further use by the forward decoder.

Figure 1: The architecture of the proposed NMT model. Note that the forward decoder directly attends to the reverse hidden state sequence $\overleftarrow{s} = \{\overleftarrow{s}_0, \overleftarrow{s}_1, \ldots, \overleftarrow{s}_M\}$ rather than the word sequence produced by the backward decoder.

The Neural Forward Decoder

The neural forward decoder of our model is extended from the decoder of the dominant NMT model. It performs decoding in a left-to-right manner under the semantic guidance of source-side and reverse target-side contexts, which are separately captured by the encoder and the backward decoder.

The forward decoder is trained to sequentially predict the next target word given the source-side hidden state vectors of the encoder, the reverse target-side hidden state sequence generated by the backward decoder, and all target words generated previously. Formally, the conditional probability of the translation $y = (y_0, y_1, \ldots, y_M)$ is defined as follows:

$$P(y \mid x; \theta_e, \theta_b, \theta_f) = \prod_{j=0}^{M} P(y_j \mid y_{<j}, x; \theta_e, \theta_b, \theta_f) = \prod_{j=0}^{M} g(y_{j-1}, s_j, m^{ef}_j, m^{bf}_j), \tag{5}$$

where $g(\cdot)$ is a non-linear function, $s_j$ is the decoding state, and $m^{ef}_j$ and $m^{bf}_j$ denote the source-side and reverse target-side context vectors at the $j$-th timestep, respectively. As illustrated in Fig. 1, we use the first hidden state of the reverse encoder, denoted as $\overleftarrow{h}_1$, to initialize the first hidden state $s_0$ of the forward decoder.
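The additive attention of Eqs. (2)-(4) can be illustrated with a small, dependency-free Python sketch. It is not the authors' implementation: the function names, the toy dimensions and the identity weight matrices are assumptions chosen only to make the computation easy to follow. The same routine would apply to any attention of this form by swapping the query and the attended vectors.

```python
import math


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def additive_attention(query, keys, W_a, U_a, v_a):
    """Return (context, weights) for one decoding step.

    query: the decoder state used for scoring (s_{j+1} in Eq. (4))
    keys:  the vectors being attended to (the annotations h_1..h_N)
    """
    Wq = matvec(W_a, query)
    scores = []
    for h in keys:
        Uh = matvec(U_a, h)
        hidden = [math.tanh(a + b) for a, b in zip(Wq, Uh)]
        scores.append(sum(v * t for v, t in zip(v_a, hidden)))  # Eq. (4)

    # Softmax over positions, Eq. (3); subtracting the max avoids overflow.
    top = max(scores)
    exps = [math.exp(e - top) for e in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the attended vectors, Eq. (2).
    dim = len(keys[0])
    context = [sum(w * h[d] for w, h in zip(weights, keys)) for d in range(dim)]
    return context, weights


# Toy example with hypothetical 2-dimensional states and identity weights.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.5, -0.5]
identity = [[1.0, 0.0], [0.0, 1.0]]
context, weights = additive_attention(query, keys, identity, identity, [1.0, 1.0])
print(weights)
print(context)
```

In this toy setting the third annotation receives the largest weight because both of its components raise the tanh activations; with trained parameters the weights would instead reflect learned alignment.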
More importantly, we introduce two attention models to capture the source-side and reverse target-side contexts, respectively: one is the encoder-forward decoder attention model that focuses on the source annotations, and the other is the backward decoder-forward decoder attention model that considers all reverse target-side hidden states. Specifically, we produce $m^{ef}_j$ from the hidden states $\{h_i\}$ of the encoder as follows:

$$m^{ef}_j = \sum_{i=1}^{N} \alpha^{ef}_{j,i} h_i, \tag{6}$$

$$\alpha^{ef}_{j,i} = \frac{\exp(e^{ef}_{j,i})}{\sum_{i'=1}^{N} \exp(e^{ef}_{j,i'})}, \tag{7}$$

$$e^{ef}_{j,i} = (v^{ef}_a)^T \tanh(W^{ef}_a s_{j-1} + U^{ef}_a h_i), \tag{8}$$

where $v^{ef}_a$, $W^{ef}_a$, and $U^{ef}_a$ are the parameters of the encoder-forward decoder attention model. Note that we directly choose the hidden state sequence rather than the word sequence to model the target-side contexts, because the former enables our model to better avoid, to some extent, the negative effect of translation prediction errors. Likewise, we define $m^{bf}_j$ as a weighted sum of the hidden states $\{\overleftarrow{s}_{j'}\}$ of the backward decoder:

$$m^{bf}_j = \sum_{j'=0}^{M} \alpha^{bf}_{j,j'} \overleftarrow{s}_{j'}, \tag{9}$$

$$\alpha^{bf}_{j,j'} = \frac{\exp(e^{bf}_{j,j'})}{\sum_{j''=0}^{M} \exp(e^{bf}_{j,j''})}, \tag{10}$$

$$e^{bf}_{j,j'} = (v^{bf}_a)^T \tanh(W^{bf}_a s_{j-1} + U^{bf}_a \overleftarrow{s}_{j'}), \tag{11}$$

where $v^{bf}_a$, $W^{bf}_a$, and $U^{bf}_a$ are the parameters of the backward decoder-forward decoder attention model.

Then, we incorporate $m^{ef}_j$ and $m^{bf}_j$ into the GRU hidden unit of the forward decoder. Formally, the hidden state $s_j$ of the forward decoder is computed by

$$s_j = (1 - z^d_j) \odot s_{j-1} + z^d_j \odot \tilde{s}_j, \qquad \tilde{s}_j = \tanh\big(W^d v(y_{j-1}) + U^d [r^d_j \odot s_{j-1}] + C^{ef} m^{ef}_j + C^{bf} m^{bf}_j\big), \tag{12}$$

where $W^d$, $U^d$, $C^{ef}$, and $C^{bf}$ are the weight matrices, and $z^d_j$ and $r^d_j$ are the update and reset gates of the GRU, respectively, depending on $y_{j-1}$, $s_{j-1}$, $m^{ef}_j$ and $m^{bf}_j$. Finally, we further define the probability of $y_j$ as

$$p(y_j \mid y_{<j}, x; \theta_e, \theta_b, \theta_f) \propto \exp\big(g(y_{j-1}, s_j, m^{ef}_j, m^{bf}_j)\big), \tag{13}$$

where $y_{j-1}$, $s_j$, $m^{ef}_j$ and $m^{bf}_j$ are concatenated and fed through a single feed-forward layer.

Training and Testing

Given a training corpus $D = \{(x, y)\}$, we train the proposed model according to the following objective:

$$J(D; \theta_e, \theta_b, \theta_f) = \frac{1}{|D|} \arg\max_{\theta_e, \theta_b, \theta_f} \sum_{(x,y) \in D} \big\{ \lambda \log P(y \mid x; \theta_e, \theta_b, \theta_f) + (1 - \lambda) \log P(\overleftarrow{y} \mid x; \theta_e, \theta_b) \big\}, \tag{14}$$

where $\overleftarrow{y}$ is obtained by inverting $y$, and $\lambda$ is a hyper-parameter used to balance the preference between the two terms.

The first term $\log P(y \mid x; \theta_e, \theta_b, \theta_f)$ models the translation procedure illustrated in Figure 1. To ensure consistency between model training and testing, we perform beam search to generate the reverse hidden states $\overleftarrow{s}$ when optimizing $\log P(y \mid x; \theta_e, \theta_b, \theta_f)$. In addition, to guarantee that the $\overleftarrow{s}$ produced by beam search is of high quality, we further introduce the second term $\log P(\overleftarrow{y} \mid x; \theta_e, \theta_b)$ to maximize the conditional likelihood of $\overleftarrow{y}$. Note that beam search has high time complexity; therefore, we directly adopt greedy search to implement right-to-left decoding, which proves to be sufficiently effective in our experiments.
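To make Eqs. (12) and (14) concrete, here is a minimal pure-Python sketch of one forward-decoder state update and of the interpolated objective. It rests on stated assumptions rather than on the released code: the parameter dictionary `p` and its key names are hypothetical, and the gates are parameterised as sigmoid layers over the same four inputs, which the text implies but does not spell out.

```python
import math


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def vec_sum(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(values) for values in zip(*vectors)]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def forward_decoder_step(y_prev_emb, s_prev, m_ef, m_bf, p):
    """One state update in the spirit of Eq. (12).

    p maps hypothetical names to weight matrices. The gate parameterisation
    (keys ending in _z and _r) is an assumption; the text only says the gates
    depend on y_{j-1}, s_{j-1}, m^{ef}_j and m^{bf}_j.
    """
    def gate(tag):
        pre = vec_sum(matvec(p["W_" + tag], y_prev_emb),
                      matvec(p["U_" + tag], s_prev),
                      matvec(p["Cef_" + tag], m_ef),
                      matvec(p["Cbf_" + tag], m_bf))
        return [sigmoid(a) for a in pre]

    z = gate("z")  # update gate z^d_j
    r = gate("r")  # reset gate r^d_j
    reset_state = [ri * si for ri, si in zip(r, s_prev)]
    pre = vec_sum(matvec(p["W"], y_prev_emb), matvec(p["U"], reset_state),
                  matvec(p["Cef"], m_ef), matvec(p["Cbf"], m_bf))
    s_tilde = [math.tanh(a) for a in pre]  # candidate state
    return [(1.0 - zi) * si + zi * ti for zi, si, ti in zip(z, s_prev, s_tilde)]


def joint_objective(forward_logprobs, backward_logprobs, lam=0.7):
    """Corpus average of lam*log P(y|x) + (1-lam)*log P(reversed y|x), cf. Eq. (14)."""
    if len(forward_logprobs) != len(backward_logprobs):
        raise ValueError("need one forward and one backward score per sentence pair")
    terms = [lam * f + (1.0 - lam) * b
             for f, b in zip(forward_logprobs, backward_logprobs)]
    return sum(terms) / len(terms)


# Toy check of the objective with made-up sentence-level log-probabilities.
print(joint_objective([-2.0, -4.0], [-1.0, -3.0], lam=0.7))
```

Setting `lam=1.0` drops the backward-decoder likelihood entirely, so the reverse states would be supervised only indirectly through the forward decoder; smaller values put more weight on the quality of the right-to-left translation itself.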
Once the proposed model is trained, we adopt a two-phase scheme to translate an unseen input sentence $x$: First, we use the backward decoder with greedy search to sequentially generate $\overleftarrow{s}$ until the target-side start symbol <s> occurs with the highest probability. Then, we perform beam search on the forward decoder to find the best translation that approximately maximizes $\log P(y \mid x; \theta_e, \theta_b, \theta_f)$.

Experiments

We evaluated the proposed model on NIST Chinese-English and WMT English-German translation tasks.

Setup

For Chinese-English translation, the training data consists of 1.25M bilingual sentences with 27.9M Chinese words and 34.5M English words. These sentence pairs are mainly extracted from LDC2002E18, LDC2003E07, LDC2003E14, the Hansards portion of LDC2004T07, LDC2004T08 and LDC2005T06. We chose the NIST 2002 (MT02) dataset as our development set, and the NIST 2003 (MT03), 2004 (MT04), 2005 (MT05), and 2006 (MT06) datasets as our test sets. Finally, we evaluated the translations using BLEU (Papineni et al. 2002).

For English-German translation, we used the WMT 2015 training data, which contains 4.46M sentence pairs with 116.1M English words and 108.9M German words. In particular, we segmented words via byte pair encoding (BPE) (Sennrich, Haddow, and Birch 2016b). News-test 2013 was used as the development set and news-test 2015 as the test set.

To train NMT models efficiently, we trained each model with sentences of length up to 50 words. In doing so, 90.12% and 89.03% of the Chinese-English and English-German parallel sentences were covered in the experiments. Besides, we set the vocabulary size to 30K for Chinese-English translation and 50K for English-German translation, and mapped all the out-of-vocabulary words in the Chinese-English corpus to a special token UNK. As a result, the vocabularies covered 97.4% of the Chinese words and 99.3% of the English words of the Chinese-English corpus, and almost 100.0% of the English words and 98.2% of the German words of the English-German corpus, respectively.
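The preprocessing statistics above (length cut-off, vocabulary truncation, UNK mapping, token coverage) can be reproduced in spirit with a few lines of Python. This is an illustrative sketch, not the authors' pipeline: the helper names are hypothetical, and applying the 50-token limit to both the source and the target side is an assumption.

```python
from collections import Counter


def build_vocab(sentences, max_size):
    """Keep the max_size most frequent tokens of a tokenised corpus."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return {tok for tok, _ in counts.most_common(max_size)}


def coverage(sentences, vocab):
    """Fraction of running tokens that are in the vocabulary."""
    total = sum(len(sent) for sent in sentences)
    known = sum(1 for sent in sentences for tok in sent if tok in vocab)
    return known / total if total else 0.0


def map_oov(sentence, vocab, unk="UNK"):
    """Replace out-of-vocabulary tokens with the special UNK token."""
    return [tok if tok in vocab else unk for tok in sentence]


def filter_by_length(pairs, max_len=50):
    """Keep sentence pairs whose source and target both have at most max_len tokens."""
    return [(src, tgt) for src, tgt in pairs
            if len(src) <= max_len and len(tgt) <= max_len]


# Toy corpus: with a vocabulary of two types, five of six tokens are covered.
corpus = [["a", "b", "a"], ["a", "b", "c"]]
vocab = build_vocab(corpus, max_size=2)
print(sorted(vocab), coverage(corpus, vocab))
print(map_oov(["a", "c"], vocab))
```

Coverage is measured over running tokens rather than over distinct word types, which is why a 30K vocabulary can account for well over 97% of a corpus.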
We applied RMSProp (Graves 2013) (momentum = 0, ρ = 0.95, and ɛ = ) to train models for 5 epochs and selected the best model parameters according to the model performance on the development set. During this procedure, we set the following hyper-parameters: word embedding dimension as 620, hidden layer size as 1000, learning rate as , batch size as 80, gradient norm as 1.0, and dropout rate as 0.3. All the other settings are the same as in (Bahdanau, Cho, and Bengio 2015).

Baselines

We compared the proposed model against the following state-of-the-art SMT and NMT systems:

Moses: an open-source phrase-based translation system with default configuration and a 4-gram language model trained on the target portion of the training data. Note that we used all data to train Moses.

RNNSearch: a re-implementation of the attention-based NMT system (Bahdanau, Cho, and Bengio 2015) with slight changes from the dl4mt tutorial.

RNNSearch(R2L): a variant of RNNSearch that produces translations in a right-to-left direction.

ATNMT: an attention-based NMT system with two directional decoders (Liu et al. 2016), which explores the agreement on target-bidirectional NMT. Using this model, we first run beam search for the forward and backward models independently to obtain two k-best lists, and then re-score the combination of these two lists using the joint model to find the best candidate. Following (Liu et al. 2016), we set the beam sizes of both decoders to 10. Note that we replaced the LSTM adopted in (Liu et al. 2016) with a GRU to ensure a fair comparison.

NSC(RT): a variant of the neural system combination framework proposed by Zhou et al. (2017). It first uses an attentional NMT model consisting of one standard encoder and one backward decoder to produce the best reverse translation. Then, another attentional NMT model generates the final output from its standard encoder and a reverse translation encoder that embeds the best reverse translation, in a way similar to the multi-source NMT model (Zoph and Knight 2016). This model differs from ours in two aspects: (1) it is not an end-to-end model, and (2) it considers the embedded hidden states of the reverse translation, while our model considers the hidden states produced by the backward decoder.

NSC(HS): similar to NSC(RT), with the only difference that it directly considers the reverse hidden states produced by the backward decoder.

We set the beam sizes of all above-mentioned models to 10, and the beam sizes of the backward and forward decoders of our model to 1 and 10, respectively.

Table 2: Evaluation of the NIST Chinese-English translation task using case-insensitive BLEU scores (λ=0.7). Columns: SYSTEM, MT03, MT04, MT05, MT06, Average. Rows: COVERAGE, MemDec, DeepLAU, DMAtten, Moses, RNNSearch, RNNSearch(R2L), ATNMT, NSC(RT), NSC(HS), Our Model. The results of the first four models are those reported in (Wang et al. 2017; Zhang et al. 2017). COVERAGE (Tu et al. 2016) is a basic NMT model with a coverage model. MemDec (Wang et al. 2016) improves translation quality with external memory. DeepLAU (Wang et al. 2017) reduces the gradient propagation length inside the recurrent unit of RNN-based NMT. DMAtten (Zhang et al. 2017) incorporates word reordering knowledge into attentional NMT.

Results on Chinese-English Translation

Parameters. The RNNSearch, RNNSearch(R2L), ATNMT, NSC(RT), and NSC(HS) models have 85.6M, 85.6M, 171.2M, 120.0M and 130.0M parameters, respectively. By contrast, the parameter size of our model is about 130.0M.

Speed. We used a single 1080Ti GPU to train the models. In one hour, 6,500, 6,500, 6,500, 4,700, and 3,708 mini-batches are processed for the RNNSearch, RNNSearch(R2L), ATNMT, NSC(RT), and NSC(HS) models, respectively. The training speed of the proposed model is relatively slow: about 1,758 mini-batches are processed in one hour.

Figure 2: Experiment results on the development set using different λs.

We first investigated the impact of the hyper-parameter λ (see Eq. (14)) on the development set. To this end, we gradually varied λ from 0.5 to 1.0 with an increment of 0.1 in each step. As shown in Fig. 2, our model achieved the best performance when λ=0.7. Therefore, we set λ=0.7 for all experiments thereafter.

The experimental results on Chinese-English translation are shown in Table 2. We also list the performance of some dominant individual models, namely COVERAGE (Tu et al. 2016), MemDec (Wang et al. 2016), DeepLAU (Wang et al. 2017) and DMAtten (Zhang et al. 2017), on the same data set. Specifically, the proposed model significantly outperforms Moses, RNNSearch, RNNSearch(R2L), ATNMT, NSC(RT) and NSC(HS) by 7.38, 3.14, 3.26, 1.86, 2.34, and 1.92 BLEU points, respectively. Even when compared with (Tu et al. 2016; Wang et al. 2016; 2017; Zhang et al. 2017), our model still performs better in the same setting.
Moreover, we draw the following conclusions:

(1) In contrast to RNNSearch and RNNSearch(R2L), our model exhibits much better performance. These results support our hypothesis that the forward and backward decoders are complementary to each other in target-side context modeling, and therefore, the simultaneous exploration of bidirectional decoders leads to better translations.

(2) On all test sets, our model outperforms ATNMT, which indicates that, compared with k-best hypotheses rescoring (Liu et al. 2016), joint modeling with attention to reverse hidden states behaves better in exploiting reverse target-side contexts. The underlying reason is that the reverse hidden states encode richer target-side contexts than a single translation. In addition, compared with k-best hypotheses rescoring, our model can refine translation at a more fine-grained level via the attention mechanism.

(3) In particular, the fact that NSC(HS) outperforms NSC(RT) reveals the advantage of the reverse hidden state representations of the backward decoder in overcoming data sparsity. Besides, our model behaves better than NSC(HS), which accords with our intuition that, to some extent, the joint model is able to alleviate error propagation when encoding target-side contexts.

(4) Note that the performance of our model is better than that of our model (RR). This result verifies our speculation that model training with the translations obtained by greedy search is superior due to the consistency between the training and testing procedures.

Figure 3: BLEU scores on different translation groups divided according to source sentence length. Horizontal axis: sentence length groups [1, 10], [11, 20], [21, 30], [31, 40], [41, 50], [51, ...]; vertical axis: BLEU score; systems: RNNSearch, RNNSearch(R2L), ATNMT, NSC(RT), NSC(HS), Our Model.

Finally, based on the length of the source sentences, we divided our test sets into different groups and then compared the system performances in each group. Fig. 3 illustrates the BLEU scores on these groups of test sets. We observe that our model achieves the best performance in all groups, although the performance of all systems drops as the length of the source sentences increases.
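The length-based grouping used for Fig. 3 can be sketched as follows. The bucket boundaries come from the figure's axis labels; the function names and the choice to return sentence indices per group are assumptions made for illustration, and the per-group BLEU computation is left out.

```python
def length_group(n_tokens, width=10, last_start=51):
    """Map a source length to its group label, e.g. 12 -> '[11, 20]'."""
    if n_tokens < 1:
        raise ValueError("sentence length must be at least 1")
    if n_tokens >= last_start:
        return "[{}, ...]".format(last_start)
    low = ((n_tokens - 1) // width) * width + 1
    return "[{}, {}]".format(low, low + width - 1)


def group_by_length(sources):
    """Return {group label: indices of the source sentences in that group}."""
    groups = {}
    for idx, src in enumerate(sources):
        groups.setdefault(length_group(len(src)), []).append(idx)
    return groups


# Toy test set of tokenised source sentences with lengths 3, 12, 7 and 64.
sources = [["w"] * 3, ["w"] * 12, ["w"] * 7, ["w"] * 64]
print(group_by_length(sources))
```

Each group's indices can then be used to select the matching hypotheses and references before scoring, so that every system is evaluated on exactly the same subset.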
These results clearly demonstrate once again the effectiveness of our model.

Case Study

To better understand how our model outperforms the others, we studied the 1-best translations produced by different models. Table 3 provides a Chinese-English translation example. We find that RNNSearch produces a translation with a good prefix, while RNNSearch(R2L) generates a translation with a desirable suffix. Although there are various models with bidirectional decoding that could exploit bidirectional contexts, most of them are unable to translate the whole sentence precisely, and our model is currently the only one capable of producing a high-quality translation in this circumstance.

Results on English-German Translation

Table 4: Evaluation of the WMT English-German translation task using case-sensitive BLEU scores (λ=0.8). Columns: SYSTEM, TEST. Rows: BPEChar, RecAtten, ConvEncoder, Moses, RNNSearch, RNNSearch(R2L), ATNMT, NSC(RT), NSC(HS), Our Model. The results of the first three models are cited directly from (Gehring et al. 2017). BPEChar (Chung, Cho, and Bengio 2016) is an attentional NMT model with a character-level decoder. RecAtten (Yang et al. 2017) uses a recurrent attention model to explicitly model the dependence between attentions among target words. ConvEncoder (Gehring et al. 2017) introduces a convolutional encoder into NMT.

To make our experiments more convincing, we also report some published results on the same data set, including BPEChar (Chung, Cho, and Bengio 2016), RecAtten (Yang et al. 2017), and ConvEncoder (Gehring et al. 2017). We determined the optimal λ as 0.8 according to the performance of our model on the development set. Table 4 presents the results on English-German translation. Our model still significantly outperforms the others, including some dominant NMT systems with other improved techniques. We believe that our work can be applied to other architectures easily.
It should be noted that the BLEU score gaps between our model and the others on English-German translation are much smaller than those on Chinese-English translation. The underlying reasons lie in the following two aspects, which have also been mentioned in (Shen et al. 2016). First, the Chinese-English datasets contain four reference translations for each sentence, while the English-German dataset has only a single reference. Second, compared with German, Chinese is more distantly related to English, which makes the advantage of utilizing target-side contexts more pronounced in Chinese-English translation.

Table 3: Translation examples of different systems. Texts highlighted in wavy lines are incorrectly translated. Please note that the translations produced by RNNSearch and RNNSearch(R2L) are complementary to each other, and the translation generated by our model is the most accurate and complete.
Source: yīyuè kāishǐ, zǒngwùshěng jiāng yǒu liù míng zhíyuán yī zhōu zhìshǎo yī tiān bù xūyào jìn bàngōngshì, kěyǐ zài jiā lǐ, dàxué huò túshūguǎn tòuguò gāosù wǎnglùo fúwù gōngzuò.
Reference: starting from january, the ministry of internal affairs and communications will have six employees who do n't need to go to their offices at least one day a week ; instead they may work from home, universities or libraries through high - speed internet services.
Moses: since january, there will be six staff members a week for least one day in office, they can at home, university or through high - speed internet library services.
RNNSearch: as early as january, six staff members will not be required to enter office at least one day in one week, which can be done through high - speed internet services through high - speed internet services.
RNNSearch(R2L): beginning in january, least six staff members have to go to the office for least one week and can work at home, and university or library through high - speed internet services.
ATNMT: at the beginning of january, there will be six staff members to go to office least one week, which can be done through high - speed internet services at home and university or libraries.
NSC(RT): at least six staff members will leave office for least one week at least one week, and can work at home and university or library through high - speed internet services.
NSC(HS): in january, there will be six staff members who are required to enter offices for at least one day at least one day, and we can work at home, university or library through high - speed internet services.
Our Model: starting in january, six staff members will not need to enter the office at least one day in one week, and they can work at home, universities or libraries through high - speed internet services.

Related Work

In this work, we mainly focus on how to exploit bidirectional decoding to refine translation, which has always been a research focus in machine translation.

In SMT, many approaches based on a backward language model (BLM) or target-bidirectional decoding have been explored to capture right-to-left target-side contexts for translation. For example, Watanabe and Sumita (2002) explored two decoding methods: one is right-to-left decoding based on the left-to-right beam search algorithm; the other decodes in both directions and merges the two hypothesized partial sentences into one. Finch and Sumita (2009) integrated both mono-directional approaches to reduce the effects caused by language specificity. In particular, they integrated the BLM into their reverse translation decoder. Beyond left-to-right decoding, Zhang et al. (2013) studied the effects of multiple decomposition structures as well as dynamic bidirectional decomposition on SMT.

When it comes to NMT, the dominant RNN-based NMT models also perform translation in a left-to-right manner, leading to the same drawback of underutilizing target-side contexts. To address this issue, Liu et al. (2016) first jointly train both directional LSTM models, and then at test time search for target-side translations that are supported by both models. Similarly, Sennrich et al. (2016a) attempted to re-rank the left-to-right decoding results by right-to-left decoding, leading to diversified translation results. Recently, Hoang et al. (2017) proposed an approximate inference framework based on continuous optimization that enables decoding with bidirectional translation models. Finally, it is noteworthy that our work is also related to pre-translation (Niehues et al. 2016; Zhou et al. 2017) and neural automatic post-editing (Pal et al. 2017; Dowmunt and Grundkiewicz 2017) for NMT, because our model involves two stages of translation.

Overall, the most relevant models include (Liu et al. 2016; Sennrich, Haddow, and Birch 2016a; Hoang, Haffari, and Cohn 2017; Zhou et al. 2017; Pal et al. 2017; Dowmunt and Grundkiewicz 2017).
Our model significantly differs from these works in the following aspects: 1) The motivation of our work differs from theirs. Specifically, in this work, we aim to fully exploit the reverse target-side contexts encoded by right-to-left hidden state vectors to improve NMT with left-to-right decoding. In contrast, Liu et al. (2016), Sennrich et al. (2016a), and Hoang et al. (2017) investigated how to exploit bidirectional decoding scores to produce better translations; both Niehues et al. (2016) and Zhou et al. (2017) intended to combine the advantages of NMT and SMT; and Pal et al. (2017) and Dowmunt and Grundkiewicz (2017) explored multiple neural architectures for the task of automatic post-editing of machine translation output. 2) Our model attends to right-to-left hidden state vectors, while (Niehues et al. 2016; Zhou et al. 2017; Pal et al. 2017; Dowmunt and Grundkiewicz 2017) considered the raw best output of the machine translation system instead. 3) Our model is an end-to-end NMT model, while the bidirectional decoders adopted in (Liu et al. 2016; Sennrich, Haddow, and Birch 2016a; Hoang, Haffari, and Cohn 2017) were independent from each other, and the component used to produce the raw translation was independent from the NMT model in (Niehues et al. 2016; Zhou et al. 2017; Pal et al. 2017; Dowmunt and Grundkiewicz 2017).

Conclusions and Future Work

In this paper, we have equipped the conventional attentional encoder-decoder NMT model with a backward decoder. In our model, the backward decoder first produces hidden state vectors encoding reverse target-side contexts. Then, the two individual hidden state sequences generated by the encoder and the backward decoder are simultaneously exploited by the forward decoder via attention mechanisms for translation. Compared with the previous models, ours is an end-to-end NMT model that fully utilizes reverse target-side contexts for translation.
Experimental results on Chinese-English and English-German translation tasks demonstrate the effectiveness of our model. Our approach is generally applicable to other models with an RNN-based decoder. Therefore, the effectiveness of our approach on other tasks related to RNN-based decoder modeling, such as image captioning, will be investigated in future research. Moreover, in our work, the attention mechanisms acting on the encoder and the backward decoder are independent from each other. However, intuitively, these two mechanisms should be closely associated with each other. Therefore, we are interested in exploring better ways of combining the attention mechanisms to further refine our model.

Acknowledgments

The authors were supported by National Natural Science Foundation of China (Nos , and ), Scientific Research Project of National Language Committee of China (Grant No. YB135-49), Natural Science Foundation of Fujian Province of China (No. 2016J05161), and National Key R&D Program of China (Nos. 2017YFC and 2016YFB ). We also thank the reviewers for their insightful comments.

References

Bahdanau, D.; Cho, K.; and Bengio, Y. 2015. Neural machine translation by jointly learning to align and translate. In Proc. of ICLR2015.
Chiang, D. Hierarchical phrase-based translation. Computational Linguistics 33.
Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proc. of EMNLP2014.
Chung, J.; Cho, K.; and Bengio, Y. 2016. A character-level decoder without explicit segmentation for neural machine translation. In Proc. of ACL2016.
Dowmunt, M. J., and Grundkiewicz, R. 2017. An exploration of neural sequence-to-sequence architectures for automatic post-editing. In arXiv: v1.
Finch, A., and Sumita, E. 2009. Bidirectional phrase-based statistical machine translation. In Proc. of EMNLP2009.
Gehring, J.; Auli, M.; Grangier, D.; and Dauphin, Y. 2017. A convolutional encoder model for neural machine translation. In Proc. of ACL2017.
Graves, A. Generating sequences with recurrent neural networks. In arXiv: v5.
Hoang, C. D. V.; Haffari, G.; and Cohn, T. 2017. Decoding as continuous optimization in neural machine translation. In arXiv.
Hochreiter, S., and Schmidhuber, J. Long short-term memory. Neural Computation.
Kalchbrenner, N., and Blunsom, P. 2013. Recurrent continuous translation models. In Proc. of EMNLP2013.
Koehn, P.; Och, F. J.; and Marcu, D. 2003. Statistical phrase-based translation. In Proc. of NAACL2003.
Liu, L.; Utiyama, M.; Finch, A.; and Sumita, E. 2016. Agreement on target-bidirectional neural machine translation. In Proc. of NAACL2016.
Niehues, J.; Cho, E.; Ha, T.-L.; and Waibel, A. 2016. Pre-translation for neural machine translation. In Proc. of COLING2016.
Pal, S.; Naskar, S. K.; Vela, M.; Liu, Q.; and van Genabith, J. 2017. Neural automatic post-editing using prior alignment and reranking. In Proc. of EACL2017.
Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W. 2002. BLEU: A method for automatic evaluation of machine translation. In Proc. of ACL2002.
Sennrich, R.; Haddow, B.; and Birch, A. 2016a. Edinburgh neural machine translation systems for WMT 16. In arXiv: v2.
Sennrich, R.; Haddow, B.; and Birch, A. 2016b. Neural machine translation of rare words with subword units. In Proc. of ACL2016.
Shen, S.; Cheng, Y.; He, Z.; He, W.; Wu, H.; Sun, M.; and Liu, Y. 2016. Minimum risk training for neural machine translation. In Proc. of ACL2016.
Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In Proc. of NIPS2014.
Tu, Z.; Lu, Z.; Liu, Y.; Liu, X.; and Li, H. 2016. Modeling coverage for neural machine translation. In Proc. of ACL2016.
Wang, M.; Lu, Z.; Li, H.; and Liu, Q. 2016. Memory-enhanced decoder for neural machine translation. In Proc. of EMNLP2016.
Wang, M.; Lu, Z.; Zhou, J.; and Liu, Q. 2017. Deep neural machine translation with linear associative unit. In Proc. of ACL2017.
Watanabe, T., and Sumita, E. 2002. Bidirectional decoding for statistical machine translation. In Proc. of COLING2002.
Yang, Z.; Hu, Z.; Deng, Y.; Dyer, C.; and Smola, A. 2017. Neural machine translation with recurrent attention modeling. In Proc. of EACL2017.
Zhang, H.; Toutanova, K.; Quirk, C.; and Gao, J. 2013. Beyond left-to-right: Multiple decomposition structures for SMT. In Proc. of NAACL2013.
Zhang, J.; Wang, M.; Liu, Q.; and Zhou, J. 2017. Incorporating word reordering knowledge into attention-based neural machine translation. In Proc. of ACL2017.
Zhou, L.; Hu, W.; Zhang, J.; and Zong, C. 2017. Neural system combination for machine translation. In Proc. of ACL2017.
Zoph, B., and Knight, K. 2016. Multi-source neural translation. In Proc. of NAACL2016.



More information

Character-based Embedding Models and Reranking Strategies for Understanding Natural Language Meal Descriptions

Character-based Embedding Models and Reranking Strategies for Understanding Natural Language Meal Descriptions INTERSPEECH 2017 August 20 24, 2017, Stockholm, Sweden Character-based Embedding Models and Reranking Strategies for Understanding Natural Language Meal Descriptions Mandy Korpusik, Zachary Collins, and

More information

Modeling Scientific Influence for Research Trending Topic Prediction

Modeling Scientific Influence for Research Trending Topic Prediction Modeling Scientific Influence for Research Trending Topic Prediction Chengyao Chen 1, Zhitao Wang 1, Wenjie Li 1, Xu Sun 2 1 Department of Computing, The Hong Kong Polytechnic University, Hong Kong 2 Department

More information

An Artificial Neural Network Architecture Based on Context Transformations in Cortical Minicolumns

An Artificial Neural Network Architecture Based on Context Transformations in Cortical Minicolumns An Artificial Neural Network Architecture Based on Context Transformations in Cortical Minicolumns 1. Introduction Vasily Morzhakov, Alexey Redozubov morzhakovva@gmail.com, galdrd@gmail.com Abstract Cortical

More information

Synthesizing Missing PET from MRI with Cycle-consistent Generative Adversarial Networks for Alzheimer s Disease Diagnosis

Synthesizing Missing PET from MRI with Cycle-consistent Generative Adversarial Networks for Alzheimer s Disease Diagnosis Synthesizing Missing PET from MRI with Cycle-consistent Generative Adversarial Networks for Alzheimer s Disease Diagnosis Yongsheng Pan 1,2, Mingxia Liu 2, Chunfeng Lian 2, Tao Zhou 2,YongXia 1(B), and

More information

Predicting Blood Glucose with an LSTM and Bi-LSTM Based Deep Neural Network

Predicting Blood Glucose with an LSTM and Bi-LSTM Based Deep Neural Network Predicting Blood Glucose with an LSTM and Bi-LSTM Based Deep Neural Network Qingnan Sun, Marko V. Jankovic, Lia Bally, Stavroula G. Mougiakakou, Member IEEE Abstract A deep learning network was used to

More information

arxiv: v2 [cs.cl] 4 Sep 2018

arxiv: v2 [cs.cl] 4 Sep 2018 Training Deeper Neural Machine Translation Models with Transparent Attention Ankur Bapna Mia Xu Chen Orhan Firat Yuan Cao ankurbpn,miachen,orhanf,yuancao@google.com Google AI Yonghui Wu arxiv:1808.07561v2

More information

Comparison of Two Approaches for Direct Food Calorie Estimation

Comparison of Two Approaches for Direct Food Calorie Estimation Comparison of Two Approaches for Direct Food Calorie Estimation Takumi Ege and Keiji Yanai Department of Informatics, The University of Electro-Communications, Tokyo 1-5-1 Chofugaoka, Chofu-shi, Tokyo

More information

An Analysis on the Emotion in the Field of Translator's Subjectivity. Wei Yuehong1, a

An Analysis on the Emotion in the Field of Translator's Subjectivity. Wei Yuehong1, a International Conference on Education, E-learning and Management Technology (EEMT 2016) An Analysis on the Emotion in the Field of Translator's Subjectivity Wei Yuehong1, a Department of English, North

More information

Deep Learning Models for Time Series Data Analysis with Applications to Health Care

Deep Learning Models for Time Series Data Analysis with Applications to Health Care Deep Learning Models for Time Series Data Analysis with Applications to Health Care Yan Liu Computer Science Department University of Southern California Email: yanliu@usc.edu Yan Liu (USC) Deep Health

More information

Case-based reasoning using electronic health records efficiently identifies eligible patients for clinical trials

Case-based reasoning using electronic health records efficiently identifies eligible patients for clinical trials Case-based reasoning using electronic health records efficiently identifies eligible patients for clinical trials Riccardo Miotto and Chunhua Weng Department of Biomedical Informatics Columbia University,

More information

Toward the Evaluation of Machine Translation Using Patent Information

Toward the Evaluation of Machine Translation Using Patent Information Toward the Evaluation of Machine Translation Using Patent Information Atsushi Fujii Graduate School of Library, Information and Media Studies University of Tsukuba Mikio Yamamoto Graduate School of Systems

More information

arxiv: v2 [cs.lg] 3 Apr 2019

arxiv: v2 [cs.lg] 3 Apr 2019 ALLEVIATING CATASTROPHIC FORGETTING USING CONTEXT-DEPENDENT GATING AND SYNAPTIC STABILIZATION arxiv:1802.01569v2 [cs.lg] 3 Apr 2019 Nicolas Y. Masse Department of Neurobiology The University of Chicago

More information

arxiv: v1 [cs.cv] 13 Mar 2018

arxiv: v1 [cs.cv] 13 Mar 2018 RESOURCE AWARE DESIGN OF A DEEP CONVOLUTIONAL-RECURRENT NEURAL NETWORK FOR SPEECH RECOGNITION THROUGH AUDIO-VISUAL SENSOR FUSION Matthijs Van keirsbilck Bert Moons Marian Verhelst MICAS, Department of

More information

CSC2541 Project Paper: Mood-based Image to Music Synthesis

CSC2541 Project Paper: Mood-based Image to Music Synthesis CSC2541 Project Paper: Mood-based Image to Music Synthesis Mary Elaine Malit Department of Computer Science University of Toronto elainemalit@cs.toronto.edu Jun Shu Song Department of Computer Science

More information