arxiv: v1 [cs.cl] 16 Jan 2018
Asynchronous Bidirectional Decoding for Neural Machine Translation

Xiangwen Zhang 1, Jinsong Su 1, Yue Qin 1, Yang Liu 2, Rongrong Ji 1, Hongji Wang 1
Xiamen University, Xiamen, China 1
Tsinghua University, Beijing, China 2
xwzhang@stu.xmu.edu.cn, jssu@xmu.edu.cn, qinyue@stu.xmu.edu.cn, liuyang2011@tsinghua.edu.cn, rrji@xmu.edu.cn, hjw@xmu.edu.cn

Abstract

The dominant neural machine translation (NMT) models apply unified attentional encoder-decoder neural networks for translation. Traditionally, NMT decoders adopt recurrent neural networks (RNNs) to perform translation in a left-to-right manner, leaving the target-side contexts generated from right to left unexploited during translation. In this paper, we equip the conventional attentional encoder-decoder NMT framework with a backward decoder, in order to explore bidirectional decoding for NMT. Attending to the hidden state sequence produced by the encoder, our backward decoder first learns to generate the target-side hidden state sequence from right to left. Then, the forward decoder performs translation in the forward direction, while at each prediction timestep it simultaneously applies two attention models to consider the source-side and reverse target-side hidden states, respectively. With this new architecture, our model is able to fully exploit both source- and target-side contexts to improve translation quality. Experimental results on NIST Chinese-English and WMT English-German translation tasks demonstrate that our model achieves substantial improvements over the conventional NMT by 3.14 and 1.38 BLEU points, respectively. The source code of this work can be obtained from NMT.

Introduction

Recently, end-to-end neural machine translation (NMT) (Kalchbrenner and Blunsom 2013; Sutskever, Vinyals, and Le 2014; Cho et al. 2014) has achieved promising results and gained increasing attention.
Compared with conventional statistical machine translation (SMT) (Koehn, Och, and Marcu 2003; Chiang 2007), which needs explicitly designed features to capture translation regularities, NMT aims to construct a unified encoder-decoder framework based on neural networks to model the entire translation process. Further, the introduction of the attention mechanism (Bahdanau, Cho, and Bengio 2015) enhances the capability of NMT in capturing long-distance dependencies. Despite being a relatively new framework, the attentional encoder-decoder NMT has quickly become the de facto method.

Copyright © 2018, Association for the Advancement of Artificial Intelligence. All rights reserved.

Source: rì fángwèitīng zhǎngguān : bú wàng jūnguó lìshǐ zūnzhòng línguó zūnyán
Reference: japan defense chief : never forget militaristic history, respect neighboring nations dignity
L2R: japan 's defense agency chief : death of militarism respects its neighbors dignity
R2L: japanese defense agency has never forgotten militarism 's history to respect the dignity of neighboring countries

Table 1: Translation examples of NMT systems with different decoding manners. L2R/R2L denotes the translation produced by the NMT system with left-to-right/right-to-left decoding. Texts highlighted in wavy/dashed lines are incorrect/correct translations, respectively.

Generally, most NMT decoders are based on recurrent neural networks (RNNs) and generate translations in a left-to-right manner. Thus, despite the advantage of encoding an unbounded number of previously predicted target words for the prediction at each time step, these decoders are incapable of capturing the reverse target-side context for translation. Once errors occur in previous predictions, the quality of subsequent predictions is undermined by the noisy forward-encoded target-side contexts.
Intuitively, the reverse target-side contexts are also crucial for translation predictions, since they not only provide complementary signals but also bring different biases to the NMT model (Hoang, Haffari, and Cohn 2017). Consider the example in Table 1. The latter half of the Chinese sentence, misinterpreted by the conventional NMT system, is accurately translated by the NMT system with right-to-left decoding. Therefore, it is important to investigate how to integrate reverse target-side contexts into the decoder to improve the translation performance of NMT.

To this end, many researchers have introduced bidirectional decoding into NMT (Liu et al. 2016; Sennrich, Haddow, and Birch 2016a; Hoang, Haffari, and Cohn 2017). Most of them re-rank candidate translations using the scores of both decoding directions together, in order to select a translation with both a proper prefix and a proper suffix. However, such methods come with drawbacks that limit the potential of bidirectional decoding in NMT. On the one hand,
due to the limited search space and search errors of beam search, the generated 1-best translation is often far from satisfactory and thus fails to provide sufficient information as a complement for the other decoder. On the other hand, because the bidirectional decoders are often independent from each other during translation, the unidirectional decoder is unable to fully exploit target-side contexts produced by the other decoder, and consequently the generated candidate translations are still undesirable. Therefore, how to effectively exert the influence of bidirectional decoding on NMT is still worthy of further study.

In this paper, we significantly extend the conventional attentional encoder-decoder NMT framework by introducing a backward decoder, for the purpose of fully exploiting reverse target-side contexts to improve NMT. As shown in Fig. 1, along with our novel asynchronous bidirectional decoders, the proposed model remains an end-to-end attentional NMT framework, which mainly consists of three components: 1) an encoder embedding the input source sentence into bidirectional hidden states; 2) a backward decoder that is similar to the conventional NMT decoder but performs translation in the right-to-left manner, where the generated hidden states encode the reverse target-side contexts; 3) a forward decoder that generates the final translation from left to right and introduces two attention models that simultaneously consider the source-side bidirectional and target-side reverse hidden state vectors for translation prediction.

Compared with the previous related NMT models, our model has the following advantages: 1) The backward decoder learns to produce hidden state vectors that essentially encode the semantics of potential hypotheses, allowing the following forward decoder to utilize richer target-side contexts for translation.
2) By integrating right-to-left target-side context modeling and left-to-right translation generation into an end-to-end joint framework, our model alleviates the error propagation of reverse target-side context modeling to some extent.

The major contributions of this paper are summarized as follows:
- We thoroughly analyze and point out the existing drawbacks of research on NMT with bidirectional decoding.
- We introduce a backward decoder to encode the right-to-left target-side contexts, as a supplement to the conventional context modeling mechanism of NMT. To the best of our knowledge, this is the first attempt to investigate the effectiveness of the end-to-end attentional NMT model with asynchronous bidirectional decoders.
- Experiments on Chinese-English and English-German translation show that our model achieves significant improvements over the conventional NMT model.

Our Model

As described above, our model mainly includes three components: 1) a neural encoder with parameter set $\theta_e$; 2) a neural backward decoder with parameter set $\theta_b$; and 3) a neural forward decoder with parameter set $\theta_f$, which will be elaborated in the following subsections. In particular, we choose the Gated Recurrent Unit (GRU) (Cho et al. 2014) to build the encoder and decoders, as it is widely used in the NMT literature and requires relatively few parameters. However, it should be noted that our model is also applicable to other RNNs, such as Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber 1997).

The Neural Encoder

The neural encoder of our model is identical to that of the dominant NMT model, which is modeled using a bidirectional RNN. The forward RNN reads a source sentence $x = x_1, x_2, \ldots, x_N$ in a left-to-right order. At each timestep, we apply a recurrent activation function $\phi(\cdot)$ to learn the semantic representation of the word sequence $x_{1:i}$ as $\overrightarrow{h}_i = \phi(\overrightarrow{h}_{i-1}, x_i)$.
Likewise, the backward RNN scans the source sentence in the reverse order and generates the semantic representation $\overleftarrow{h}_i$ of the word sequence $x_{i:N}$. Finally, we concatenate the hidden states of these two RNNs to form an annotation sequence $h = \{h_1, h_2, \ldots, h_i, \ldots, h_N\}$, where $h_i = [\overrightarrow{h}_i^T, \overleftarrow{h}_i^T]^T$ encodes information about the $i$-th word with respect to all the other surrounding words in the source sentence. In our model, these annotations provide source-side contexts not only for the forward decoder but also for the backward one, via different attention models.

The Neural Backward Decoder

The neural backward decoder of our model is also similar to the decoder of the dominant NMT model; the only difference is that it performs decoding in a right-to-left way. Given the source-side hidden state vectors of the encoder and all target words generated previously, the backward decoder models how to reversely produce the next target word. Using this decoder, we calculate the conditional probability of the reverse translation $\overleftarrow{y} = (y_0, y_1, y_2, \ldots, y_{M'})$ as follows:

$$P(\overleftarrow{y} \mid x; \theta_e, \theta_b) = \prod_{j'=0}^{M'} P(y_{j'} \mid y_{>j'}, x; \theta_e, \theta_b) = \prod_{j'=0}^{M'} \overleftarrow{g}(y_{j'+1}, \overleftarrow{s}_{j'}, m^{eb}_{j'}), \tag{1}$$

where $\overleftarrow{g}(\cdot)$ is a non-linear function, $\overleftarrow{s}_{j'}$ and $m^{eb}_{j'}$ denote the decoding state and the source-side context vector at the $j'$-th time step, respectively, and $M'$ indicates the length of the reverse translation. Among $\overleftarrow{s}_{j'}$ and $m^{eb}_{j'}$, $\overleftarrow{s}_{j'}$ is computed by the GRU activation function $f(\cdot)$: $\overleftarrow{s}_{j'} = f(\overleftarrow{s}_{j'+1}, y_{j'+1}, m^{eb}_{j'})$, and $m^{eb}_{j'}$ is defined by an encoder-backward decoder attention model as the weighted sum of the source annotations $\{h_i\}$:

$$m^{eb}_{j'} = \sum_{i=1}^{N} \alpha^{eb}_{j',i} \, h_i, \tag{2}$$

$$\alpha^{eb}_{j',i} = \frac{\exp(e^{eb}_{j',i})}{\sum_{i'=1}^{N} \exp(e^{eb}_{j',i'})}, \tag{3}$$

$$e^{eb}_{j',i} = (v^{eb}_a)^T \tanh(W^{eb}_a \overleftarrow{s}_{j'+1} + U^{eb}_a h_i), \tag{4}$$
where $v^{eb}_a$, $W^{eb}_a$ and $U^{eb}_a$ are the parameters of the encoder-backward decoder attention model. In doing so, the decoder is also able to automatically select the effective source words to reversely predict target words.

By introducing this backward decoder, our NMT model is able to better exploit target-side contexts for translation prediction. In addition to generating the target word sequence, and more importantly, our backward decoder produces target-side hidden states $\overleftarrow{s}$, which essentially capture richer reverse target-side contexts for further use by the forward decoder.

Figure 1: The architecture of the proposed NMT model. Note that the forward decoder directly attends to the reverse hidden state sequence $\overleftarrow{s} = \{\overleftarrow{s}_0, \overleftarrow{s}_1, \ldots, \overleftarrow{s}_{M'}\}$ rather than the word sequence produced by the backward decoder.

The Neural Forward Decoder

The neural forward decoder of our model is extended from the decoder of the dominant NMT model. It performs decoding in a left-to-right manner under the semantic guidance of source-side and reverse target-side contexts, which are separately captured by the encoder and the backward decoder.

The forward decoder is trained to sequentially predict the next target word given the source-side hidden state vectors of the encoder, the reverse target-side hidden state sequence generated by the backward decoder, and all target words generated previously. Formally, the conditional probability of the translation $y = (y_0, y_1, \ldots, y_M)$ is defined as follows:

$$P(y \mid x; \theta_e, \theta_b, \theta_f) = \prod_{j=0}^{M} P(y_j \mid y_{<j}, x; \theta_e, \theta_b, \theta_f) = \prod_{j=0}^{M} g(y_{j-1}, s_j, m^{ef}_j, m^{bf}_j), \tag{5}$$

where $g(\cdot)$ is a non-linear function, $s_j$ is the decoding state, and $m^{ef}_j$ and $m^{bf}_j$ denote the source-side and reverse target-side context vectors at the $j$-th timestep, respectively. As illustrated in Fig. 1, we use the first hidden state of the reverse encoder, denoted as $\overleftarrow{h}_1$, to initialize the first hidden state $s_0$ of the forward decoder.
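The additive attention of Eqs. (2)-(4) can be illustrated with a small sketch. The following is a minimal, dependency-free Python version under stated assumptions: vectors are plain lists, matrices are lists of rows, and the names `matvec` and `additive_attention` are hypothetical helpers introduced here for illustration, not part of the authors' released code.

```python
import math


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def additive_attention(query, annotations, W_a, U_a, v_a):
    """Return (context, weights) for one decoding step.

    query       -- the decoder state fed to the scorer (s_{j'+1} in Eq. 4)
    annotations -- the vectors being attended to (h_1..h_N in Eq. 2)
    W_a, U_a    -- projection matrices; v_a -- scoring vector
    """
    q_proj = matvec(W_a, query)
    scores = []
    for h in annotations:
        h_proj = matvec(U_a, h)
        hidden = [math.tanh(a + b) for a, b in zip(q_proj, h_proj)]
        scores.append(sum(v * t for v, t in zip(v_a, hidden)))  # Eq. (4)

    # Softmax over positions (Eq. 3); subtracting the max score is a
    # standard numerical-stability step and does not change the result.
    top = max(scores)
    exps = [math.exp(e - top) for e in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Context vector as the weighted sum of annotations (Eq. 2).
    dim = len(annotations[0])
    context = [sum(w * h[k] for w, h in zip(weights, annotations))
               for k in range(dim)]
    return context, weights
```

The forward decoder's two attention models have the same form: passing the previous forward state as `query` with the source annotations as `annotations` gives the encoder-forward decoder attention, and passing the backward decoder's hidden states as `annotations` gives the backward decoder-forward decoder attention, each with its own parameters.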
More importantly, we introduce two attention models to respectively capture the source-side and reverse target-side contexts: one is the encoder-forward decoder attention model that focuses on the source annotations, and the other is the backward decoder-forward decoder attention model that considers all reverse target-side hidden states. Specifically, we produce $m^{ef}_j$ from the hidden states $\{h_i\}$ of the encoder as follows:

$$m^{ef}_j = \sum_{i=1}^{N} \alpha^{ef}_{j,i} \, h_i, \tag{6}$$

$$\alpha^{ef}_{j,i} = \frac{\exp(e^{ef}_{j,i})}{\sum_{i'=1}^{N} \exp(e^{ef}_{j,i'})}, \tag{7}$$

$$e^{ef}_{j,i} = (v^{ef}_a)^T \tanh(W^{ef}_a s_{j-1} + U^{ef}_a h_i), \tag{8}$$

where $v^{ef}_a$, $W^{ef}_a$, and $U^{ef}_a$ are the parameters of the encoder-forward decoder attention model. Note that we directly choose the hidden state sequence rather than the word sequence to model the target-side contexts, because the former enables our model to better avoid, to some extent, the negative effect of translation prediction errors. Likewise, we define $m^{bf}_j$ as a weighted sum of the hidden states $\{\overleftarrow{s}_{j'}\}$ of the
backward decoder:

$$m^{bf}_j = \sum_{j'=0}^{M'} \alpha^{bf}_{j,j'} \, \overleftarrow{s}_{j'}, \tag{9}$$

$$\alpha^{bf}_{j,j'} = \frac{\exp(e^{bf}_{j,j'})}{\sum_{j''=0}^{M'} \exp(e^{bf}_{j,j''})}, \tag{10}$$

$$e^{bf}_{j,j'} = (v^{bf}_a)^T \tanh(W^{bf}_a s_{j-1} + U^{bf}_a \overleftarrow{s}_{j'}), \tag{11}$$

where $v^{bf}_a$, $W^{bf}_a$, and $U^{bf}_a$ are the parameters of the backward decoder-forward decoder attention model.

Then, we incorporate $m^{ef}_j$ and $m^{bf}_j$ into the GRU hidden unit of the forward decoder. Formally, the hidden state $s_j$ of the forward decoder is computed by

$$s_j = (1 - z^d_j) \odot s_{j-1} + z^d_j \odot \tilde{s}_j, \qquad \tilde{s}_j = \tanh\big(W^d v(y_{j-1}) + U^d [r^d_j \odot s_{j-1}] + C^{ef} m^{ef}_j + C^{bf} m^{bf}_j\big), \tag{12}$$

where $W^d$, $U^d$, $C^{ef}$, and $C^{bf}$ are the weight matrices, and $z^d_j$ and $r^d_j$ are the update and reset gates of the GRU, respectively, depending on $y_{j-1}$, $s_{j-1}$, $m^{ef}_j$ and $m^{bf}_j$. Finally, we further define the probability of $y_j$ as

$$p(y_j \mid y_{<j}, x; \theta_e, \theta_b, \theta_f) \propto \exp\big(g(y_{j-1}, s_j, m^{ef}_j, m^{bf}_j)\big), \tag{13}$$

where $y_{j-1}$, $s_j$, $m^{ef}_j$ and $m^{bf}_j$ are concatenated and fed through a single feed-forward layer.

Training and Testing

Given a training corpus $D = \{(x, y)\}$, we train the proposed model according to the following objective:

$$J(D; \theta_e, \theta_b, \theta_f) = \frac{1}{|D|} \arg\max_{\theta_e, \theta_b, \theta_f} \sum_{(x,y) \in D} \big\{ \lambda \log P(y \mid x; \theta_e, \theta_b, \theta_f) + (1 - \lambda) \log P(\overleftarrow{y} \mid x; \theta_e, \theta_b) \big\}, \tag{14}$$

where $\overleftarrow{y}$ is obtained by inverting $y$, and $\lambda$ is a hyper-parameter used to balance the preference between the two terms.

The first term $\log P(y \mid x; \theta_e, \theta_b, \theta_f)$ models the translation procedure illustrated in Figure 1. To ensure consistency between model training and testing, we perform beam search to generate the reverse hidden states $\overleftarrow{s}$ when optimizing $\log P(y \mid x; \theta_e, \theta_b, \theta_f)$. In addition, to guarantee that the $\overleftarrow{s}$ produced by beam search is of high quality, we further introduce the second term $\log P(\overleftarrow{y} \mid x; \theta_e, \theta_b)$ to maximize the conditional likelihood of $\overleftarrow{y}$. Note that beam search has high time complexity; therefore, we directly adopt greedy search to implement right-to-left decoding, which proves to be sufficiently effective in our experiments.
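To make the weighting in Eq. (14) concrete, the following minimal Python sketch evaluates the interpolated log-likelihood for a small batch. It assumes the two decoders have already assigned a probability to each reference token; the function name `joint_objective` and the input layout are hypothetical choices for illustration and say nothing about how the authors implemented training.

```python
import math


def joint_objective(batch, lam):
    """Average of lam*log P(y|x) + (1-lam)*log P(reversed y|x) over a batch.

    batch -- list of (forward_token_probs, backward_token_probs) pairs; each
             element is a list of per-token probabilities of the reference
             under the forward or the backward decoder
    lam   -- the interpolation weight (lambda in Eq. 14), in [0, 1]
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    total = 0.0
    for fwd_probs, bwd_probs in batch:
        # A sequence log-probability is the sum of token log-probabilities,
        # i.e. the log of the products in Eqs. (5) and (1).
        log_p_fwd = sum(math.log(p) for p in fwd_probs)
        log_p_bwd = sum(math.log(p) for p in bwd_probs)
        total += lam * log_p_fwd + (1.0 - lam) * log_p_bwd
    return total / len(batch)
```

With `lam = 1.0` the backward term vanishes and only the forward translation likelihood is scored; smaller values put more weight on the quality of the right-to-left decoder.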
Once the proposed model is trained, we adopt a two-phase scheme to translate an unseen input sentence $x$: First, we use the backward decoder with greedy search to sequentially generate $\overleftarrow{s}$ until the target-side start symbol ⟨s⟩ occurs with the highest probability. Then, we perform beam search on the forward decoder to find the best translation that approximately maximizes $\log P(y \mid x; \theta_e, \theta_b, \theta_f)$.

Experiments

We evaluated the proposed model on NIST Chinese-English and WMT English-German translation tasks.

Setup

For Chinese-English translation, the training data consists of 1.25M bilingual sentences with 27.9M Chinese words and 34.5M English words. These sentence pairs are mainly extracted from LDC2002E18, LDC2003E07, LDC2003E14, the Hansards portion of LDC2004T07, LDC2004T08 and LDC2005T06. We chose the NIST 2002 (MT02) dataset as our development set, and the NIST 2003 (MT03), 2004 (MT04), 2005 (MT05), and 2006 (MT06) datasets as our test sets. Finally, we evaluated the translations using BLEU (Papineni et al. 2002).

For English-German translation, we used the WMT 2015 training data, which contains 4.46M sentence pairs with 116.1M English words and 108.9M German words. In particular, we segmented words via byte pair encoding (BPE) (Sennrich, Haddow, and Birch 2016b). News-test 2013 was used as the development set and news-test 2015 as the test set.

To train NMT models efficiently, we trained each model with sentences of length up to 50 words. In doing so, 90.12% and 89.03% of the Chinese-English and English-German parallel sentences were covered in the experiments. In addition, we set the vocabulary size to 30K for Chinese-English translation and 50K for English-German translation, and mapped all out-of-vocabulary words in the Chinese-English corpus to a special token UNK. The resulting vocabularies covered 97.4% of the Chinese words and 99.3% of the English words of the Chinese-English corpus, and almost 100.0% of the English words and 98.2% of the German words of the English-German corpus, respectively.
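The vocabulary coverage figures above are token-level ratios. A minimal Python sketch of how such a figure can be computed is given below; it assumes the vocabulary is formed from the most frequent tokens, which is a common convention but is not stated in the text, and the names `vocabulary_coverage` and `replace_oov` are hypothetical.

```python
from collections import Counter


def vocabulary_coverage(sentences, vocab_size):
    """Return (vocab, coverage) for a tokenized corpus.

    sentences  -- iterable of token lists
    vocab_size -- number of most frequent token types to keep (assumption:
                  the vocabulary is chosen by frequency)
    coverage   -- fraction of running tokens that fall inside the vocabulary
    """
    counts = Counter(tok for sent in sentences for tok in sent)
    total = sum(counts.values())
    if total == 0:
        return set(), 0.0
    vocab = {tok for tok, _ in counts.most_common(vocab_size)}
    covered = sum(c for tok, c in counts.items() if tok in vocab)
    return vocab, covered / total


def replace_oov(sentence, vocab, unk="UNK"):
    """Map every out-of-vocabulary token to the special UNK token."""
    return [tok if tok in vocab else unk for tok in sentence]
```

For example, calling `vocabulary_coverage(corpus, 30000)` on the tokenized Chinese side of a corpus would return the kept vocabulary together with the share of running words it covers.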
We applied RMSprop (Graves 2013) (momentum = 0, ρ = 0.95, and ɛ = ) to train models for 5 epochs and selected the best model parameters according to the model performance on the development set. During this procedure, we set the following hyper-parameters: word embedding dimension as 620, hidden layer size as 1000, learning rate as , batch size as 80, gradient norm as 1.0, and dropout rate as 0.3. All the other settings are the same as in (Bahdanau, Cho, and Bengio 2015).

Baselines

We compared the proposed model against the following state-of-the-art SMT and NMT systems:
- Moses: an open source phrase-based translation system with default configuration and a 4-gram language model trained on the target portion of the training data. Note that we used all data to train Moses.
- RNNSearch: a re-implementation of the attention-based NMT system (Bahdanau, Cho, and Bengio 2015) with slight changes from the dl4mt tutorial.
- RNNSearch(R2L): a variant of RNNSearch that produces translations in a right-to-left direction.
- ATNMT: an attention-based NMT system with two directional decoders (Liu et al. 2016), which explores the agreement on target-bidirectional NMT. Using this model, we first run beam search for the forward and backward models independently to obtain two k-best lists, and then re-score the combination of these two lists using the joint model to find the best candidate. Following (Liu et al. 2016), we set the beam sizes of both decoders to 10. Note that we replaced the LSTM adopted in (Liu et al. 2016) with GRU to ensure a fair comparison.
- NSC(RT): a variant of the neural system combination framework proposed by Zhou et al. (2017). It first uses an attentional NMT model consisting of one standard encoder and one backward decoder to produce the best reverse translation. Then, another attentional NMT model generates the final output from its standard encoder and a reverse translation encoder which embeds the best reverse translation, in a way similar to the multi-source NMT model (Zoph and Knight 2016). This model differs from ours in two aspects: (1) it is not an end-to-end model, and (2) it considers the embedded hidden states of the reverse translation, while our model considers the hidden states produced by the backward decoder.
- NSC(HS): similar to NSC(RT), with the only difference that it directly considers the reverse hidden states produced by the backward decoder.

We set the beam sizes of all above-mentioned models to 10, and the beam sizes of the backward and forward decoders of our model to 1 and 10, respectively.

Results on Chinese-English Translation

Parameters. The RNNSearch, RNNSearch(R2L), ATNMT, NSC(RT), and NSC(HS) models have 85.6M, 85.6M, 171.2M, 120.0M and 130.0M parameters, respectively. By contrast, the parameter size of our model is about 130.0M.

Speed. We used a single 1080Ti GPU to train the models. In one hour, 6,500, 6,500, 6,500, 4,700 and 3,708 mini-batches are processed for the RNNSearch, RNNSearch(R2L), ATNMT, NSC(RT), and NSC(HS) models, respectively. The training speed of the proposed model is relatively slow: about 1,758 mini-batches are processed in one hour.

We first investigated the impact of the hyper-parameter λ (see Eq. (14)) on the development set. To this end, we gradually varied λ from 0.5 to 1.0 with an increment of 0.1 in each step. As shown in Fig. 2, our model achieved the best performance when λ=0.7. Therefore, we set λ=0.7 for all experiments thereafter.

Figure 2 (x-axis: λ): Experiment results on the development set using different λs.

The experimental results on Chinese-English translation are shown in Table 2. We also display the performance of some dominant individual models, such as COVERAGE (Tu et al. 2016), MemDec (Wang et al. 2016), DeepLAU (Wang et al. 2017) and DMAtten (Zhang et al. 2017), on the same data set. Specifically, the proposed model significantly outperforms Moses, RNNSearch, RNNSearch(R2L), ATNMT, NSC(RT) and NSC(HS) by 7.38, 3.14, 3.26, 1.86, 2.34, and 1.92 BLEU points, respectively. Even when compared with (Tu et al. 2016; Wang et al. 2016; 2017; Zhang et al. 2017), our model still has better performance in the same setting.

Table 2 (columns: SYSTEM, MT03, MT04, MT05, MT06, Average; rows: COVERAGE, MemDec, DeepLAU, DMAtten, Moses, RNNSearch, RNNSearch(R2L), ATNMT, NSC(RT), NSC(HS), Our Model): Evaluation of the NIST Chinese-English translation task using case-insensitive BLEU scores (λ=0.7). Here we display the experimental results of the first four models as reported in (Wang et al. 2017; Zhang et al. 2017). COVERAGE (Tu et al. 2016) is a basic NMT model with a coverage model. MemDec (Wang et al. 2016) improves translation quality with external memory. DeepLAU (Wang et al. 2017) reduces the gradient propagation length inside the recurrent unit of RNN-based NMT. DMAtten (Zhang et al. 2017) incorporates word reordering knowledge into attentional NMT.
Moreover, we draw the following conclusions: (1) In contrast to RNNSearch and RNNSearch(R2L), our model exhibits much better performance. These results support our hypothesis that the forward and backward decoders
are complementary to each other in target-side context modeling, and therefore, the simultaneous exploration of bidirectional decoders will lead to better translations. (2) On all test sets, our model outperforms ATNMT, which indicates that, compared with k-best hypotheses rescoring (Liu et al. 2016), joint modeling that attends to reverse hidden states is better at exploiting reverse target-side contexts. The underlying reason is that the reverse hidden states encode richer target-side contexts than a single translation. In addition, compared with k-best hypotheses rescoring, our model can refine the translation at a more fine-grained level via the attention mechanism. (3) In particular, the fact that NSC(HS) outperforms NSC(RT) reveals the advantage of the reverse hidden state representations of the backward decoder in overcoming data sparsity. Besides, our model performs better than NSC(HS), which accords with our intuition that, to some extent, the joint model is able to alleviate error propagation when encoding target-side contexts. (4) Note that the performance of our model is better than that of our model (RR). This result verifies our speculation that model training with the translations obtained by greedy search is superior due to the consistency between the training and testing procedures.

Figure 3 (x-axis: Sentence Length, in groups [1,10], [11,20], [21,30], [31,40], [41,50], [51,...]; y-axis: BLEU Score; systems: RNNSearch, RNNSearch(R2L), ATNMT, NSC(RT), NSC(HS), Our Model): BLEU scores on different translation groups divided according to source sentence length.

Finally, based on the length of source sentences, we divided our test sets into different groups and then compared the system performances in each group. Fig. 3 illustrates the BLEU scores on these groups of test sets. We observe that our model achieves the best performance in all groups, although the performance of all systems drops as the length of source sentences increases.
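The length-based grouping used for Fig. 3 can be sketched as follows. This minimal Python example assumes the bucket boundaries shown on the figure's axis ([1,10], [11,20], ..., [41,50], [51,...]) and that length is counted in source tokens; the function names `length_bucket` and `group_by_length` are hypothetical.

```python
def length_bucket(num_words, width=10, last_start=51):
    """Return the label of the length group a source sentence falls into."""
    if num_words < 1:
        raise ValueError("a sentence must contain at least one word")
    if num_words >= last_start:
        # Open-ended final group for all long sentences.
        return f"[{last_start},...]"
    low = ((num_words - 1) // width) * width + 1
    return f"[{low},{low + width - 1}]"


def group_by_length(source_sentences):
    """Map each group label to the indices of the sentences it contains."""
    groups = {}
    for idx, sent in enumerate(source_sentences):
        groups.setdefault(length_bucket(len(sent)), []).append(idx)
    return groups
```

Scoring each group separately then only requires selecting the hypotheses and references at the returned indices before running the BLEU evaluation.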
These results clearly demonstrate once again the effectiveness of our model.

Case Study

To better understand how our model outperforms the others, we studied the 1-best translations produced by different models. Table 3 provides a Chinese-English translation example. We find that RNNSearch produces a translation with a good prefix, while RNNSearch(R2L) generates a translation with a desirable suffix. Although there are various models with bidirectional decoding that could exploit bidirectional contexts, most of them are unable to translate the whole sentence precisely, and our model is currently the only one capable of producing a high-quality translation in this circumstance.

Results on English-German Translation

Table 4 (columns: SYSTEM, TEST; rows: BPEChar, RecAtten, ConvEncoder, Moses, RNNSearch, RNNSearch(R2L), ATNMT, NSC(RT), NSC(HS), Our Model): Evaluation of the WMT English-German translation task using case-sensitive BLEU scores (λ=0.8). We directly cite the experimental results of the first three models provided by (Gehring et al. 2017). BPEChar (Chung, Cho, and Bengio 2016) is an attentional NMT model with a character-level decoder. RecAtten (Yang et al. 2017) uses a recurrent attention model to explicitly model the dependence between attentions among target words. ConvEncoder (Gehring et al. 2017) introduces a convolutional encoder into NMT.

To make our experiments more convincing, we also report the results of some existing systems on the same data set, including BPEChar (Chung, Cho, and Bengio 2016), RecAtten (Yang et al. 2017), and ConvEncoder (Gehring et al. 2017). We determined the optimal λ as 0.8 according to the performance of our model on the development set.

Table 4 presents the results on English-German translation. Our model still significantly outperforms the others, including some dominant NMT systems with other improved techniques. We believe that our work can be applied to other architectures easily.
It should be noted that the BLEU score gaps between our model and the others on English-German translation are much smaller than those on Chinese-English translation. The underlying reasons lie in the following two aspects, which have also been mentioned in (Shen et al. 2016). First, the Chinese-English datasets contain four reference translations for each sentence, while the English-German dataset has only a single reference. Second, compared with German, Chinese is more distantly related to English, leading to the predominant advantage of utilizing target-side contexts in Chinese-English translation.

Related Work

In this work, we mainly focus on how to exploit bidirectional decoding to refine translation, which has always been a research focus in machine translation.

In SMT, many approaches based on a backward language model (BLM) or target-bidirectional decoding have been explored to capture right-to-left target-side contexts for translation. For example, Watanabe and Sumita (2002) explored
two decoding methods: one is right-to-left decoding based on the left-to-right beam search algorithm; the other decodes in both directions and merges the two hypothesized partial sentences into one. Finch and Sumita (2009) integrated both mono-directional approaches to reduce the effects caused by language specificity. In particular, they integrated the BLM into their reverse translation decoder. Beyond left-to-right decoding, Zhang et al. (2013) studied the effects of multiple decomposition structures as well as dynamic bidirectional decomposition on SMT.

Source: yīyuè kāishǐ, zǒngwùshěng jiāng yǒu liù míng zhíyuán yī zhōu zhìshǎo yī tiān bù xūyào jìn bàngōngshì, kěyǐ zài jiā lǐ, dàxué huò túshūguǎn tòuguò gāosù wǎnglùo fúwù gōngzuò.
Reference: starting from january, the ministry of internal affairs and communications will have six employees who do n't need to go to their offices at least one day a week ; instead they may work from home, universities or libraries through high - speed internet services.
Moses: since january, there will be six staff members a week for least one day in office, they can at home, university or through high - speed internet library services.
RNNSearch: as early as january, six staff members will not be required to enter office at least one day in one week, which can be done through high - speed internet services through high - speed internet services.
RNNSearch(R2L): beginning in january, least six staff members have to go to the office for least one week and can work at home, and university or library through high - speed internet services.
ATNMT: at the beginning of january, there will be six staff members to go to office least one week, which can be done through high - speed internet services at home and university or libraries.
NSC(RT): at least six staff members will leave office for least one week at least one week, and can work at home and university or library through high - speed internet services.
NSC(HS): in january, there will be six staff members who are required to enter offices for at least one day at least one day, and we can work at home, university or library through high - speed internet services.
Our Model: starting in january, six staff members will not need to enter the office at least one day in one week, and they can work at home, universities or libraries through high - speed internet services.

Table 3: Translation examples of different systems. Texts highlighted in wavy lines are incorrectly translated. Please note that the translations produced by RNNSearch and RNNSearch(R2L) are complementary to each other, and the translation generated by our model is the most accurate and complete.

When it comes to NMT, the dominant RNN-based NMT models also perform translation in a left-to-right manner, leading to the same drawback of underutilizing target-side contexts. To address this issue, Liu et al. (2016) first jointly train both directional LSTM models, and then, in testing, search for target-side translations which are supported by both models. Similarly, Sennrich et al. (2016a) attempted to re-rank the left-to-right decoding results by right-to-left decoding, leading to diversified translation results. Recently, Hoang et al. (2017) proposed an approximate inference framework based on continuous optimization that enables decoding with bidirectional translation models. Finally, it is noteworthy that our work is also related to pre-translation (Niehues et al. 2016; Zhou et al. 2017) and neural automatic post-editing (Pal et al. 2017; Dowmunt and Grundkiewicz 2017) for NMT, because our model involves two stages of translation.

Overall, the most relevant models include (Liu et al. 2016; Sennrich, Haddow, and Birch 2016a; Hoang, Haffari, and Cohn 2017; Zhou et al. 2017; Pal et al. 2017; Dowmunt and Grundkiewicz 2017).
Our model significantly differs from these works in the following aspects: 1) The motivation of our work differs from theirs. Specifically, in this work, we aim to fully exploit the reverse target-side contexts encoded by right-to-left hidden state vectors to improve NMT with left-to-right decoding. In contrast, Liu et al. (2016), Sennrich et al. (2016a), and Hoang et al. (2017) investigated how to exploit bidirectional decoding scores to produce better translations; both Niehues et al. (2016) and Zhou et al. (2017) intended to combine the advantages of NMT and SMT; and Pal et al. (2017) and Dowmunt and Grundkiewicz (2017) explored multiple neural architectures for the task of automatic post-editing of machine translation output. 2) Our model attends to right-to-left hidden state vectors, while (Niehues et al. 2016; Zhou et al. 2017; Pal et al. 2017; Dowmunt and Grundkiewicz 2017) considered the raw best output of a machine translation system instead. 3) Our model is an end-to-end NMT model, while the bidirectional decoders adopted in (Liu et al. 2016; Sennrich, Haddow, and Birch 2016a; Hoang, Haffari, and Cohn 2017) were independent from each other, and the component used to produce the raw translation was independent from the NMT model in (Niehues et al. 2016; Zhou et al. 2017; Pal et al. 2017; Dowmunt and Grundkiewicz 2017).

Conclusions and Future Work

In this paper, we have equipped the conventional attentional encoder-decoder NMT model with a backward decoder. In our model, the backward decoder first produces hidden state vectors encoding reverse target-side contexts. Then, the two individual hidden state sequences generated by the encoder and the backward decoder are simultaneously exploited by the forward decoder via attention mechanisms for translation. Compared with the previous models, ours is an end-to-end NMT model that fully utilizes reverse target-side contexts for translation.
Experimental results on Chinese-English and English-German translation tasks demonstrate the effectiveness of our model. Our approach is generally applicable to other models with an RNN-based decoder. Therefore, its effectiveness on other tasks related to RNN-based decoder modeling, such as image captioning, will be investigated in future research. Moreover, in our work, the attention mechanisms acting on the encoder and the backward decoder are independent of each other. However, intuitively, these two mechanisms should be closely associated with each other. Therefore, we are interested in exploring better ways of combining the attention mechanisms to further refine our model.

Acknowledgments

The authors were supported by National Natural Science Foundation of China (Nos , and ), Scientific Research Project of National Language Committee of China (Grant No. YB135-49), Natural Science Foundation of Fujian Province of China (No. 2016J05161), and National Key R&D Program of China (Nos. 2017YFC and 2016YFB ). We also thank the reviewers for their insightful comments.

References

Bahdanau, D.; Cho, K.; and Bengio, Y. 2015. Neural machine translation by jointly learning to align and translate. In Proc. of ICLR2015.
Chiang, D. Hierarchical phrase-based translation. Computational Linguistics 33.
Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proc. of EMNLP2014.
Chung, J.; Cho, K.; and Bengio, Y. 2016. A character-level decoder without explicit segmentation for neural machine translation. In Proc. of ACL2016.
Dowmunt, M. J., and Grundkiewicz, R. 2017. An exploration of neural sequence-to-sequence architectures for automatic post-editing. In arXiv.
Finch, A., and Sumita, E. 2009. Bidirectional phrase-based statistical machine translation. In Proc. of EMNLP2009.
Gehring, J.; Auli, M.; Grangier, D.; and Dauphin, Y. 2017. A convolutional encoder model for neural machine translation. In Proc. of ACL2017.
Graves, A. Generating sequences with recurrent neural networks. In arXiv.
Hoang, C. D. V.; Haffari, G.; and Cohn, T. 2017. Decoding as continuous optimization in neural machine translation. In arXiv.
Hochreiter, S., and Schmidhuber, J. Long short-term memory. Neural Computation.
Kalchbrenner, N., and Blunsom, P. 2013. Recurrent continuous translation models. In Proc. of EMNLP2013.
Koehn, P.; Och, F. J.; and Marcu, D. 2003. Statistical phrase-based translation. In Proc. of NAACL2003.
Liu, L.; Utiyama, M.; Finch, A.; and Sumita, E. 2016. Agreement on target-bidirectional neural machine translation. In Proc. of NAACL2016.
Niehues, J.; Cho, E.; Ha, T.-L.; and Waibel, A. 2016. Pre-translation for neural machine translation. In Proc. of COLING2016.
Pal, S.; Naskar, S. K.; Vela, M.; Liu, Q.; and van Genabith, J. 2017. Neural automatic post-editing using prior alignment and reranking. In Proc. of EACL2017.
Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W. 2002. Bleu: A method for automatic evaluation of machine translation. In Proc. of ACL2002.
Sennrich, R.; Haddow, B.; and Birch, A. 2016a. Edinburgh neural machine translation systems for WMT 16. In arXiv.
Sennrich, R.; Haddow, B.; and Birch, A. 2016b. Neural machine translation of rare words with subword units. In Proc. of ACL2016.
Shen, S.; Cheng, Y.; He, Z.; He, W.; Wu, H.; Sun, M.; and Liu, Y. 2016. Minimum risk training for neural machine translation. In Proc. of ACL2016.
Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In Proc. of NIPS2014.
Tu, Z.; Lu, Z.; Liu, Y.; Liu, X.; and Li, H. 2016. Modeling coverage for neural machine translation. In Proc. of ACL2016.
Wang, M.; Lu, Z.; Li, H.; and Liu, Q. 2016. Memory-enhanced decoder for neural machine translation. In Proc. of EMNLP2016.
Wang, M.; Lu, Z.; Zhou, J.; and Liu, Q. 2017. Deep neural machine translation with linear associative unit. In Proc. of ACL2017.
Watanabe, T., and Sumita, E. 2002. Bidirectional decoding for statistical machine translation. In Proc. of COLING2002.
Yang, Z.; Hu, Z.; Deng, Y.; Dyer, C.; and Smola, A. 2017. Neural machine translation with recurrent attention modeling. In Proc. of EACL2017.
Zhang, H.; Toutanova, K.; Quirk, C.; and Gao, J. 2013. Beyond left-to-right: Multiple decomposition structures for SMT. In Proc. of NAACL2013.
Zhang, J.; Wang, M.; Liu, Q.; and Zhou, J. 2017. Incorporating word reordering knowledge into attention-based neural machine translation. In Proc. of ACL2017.
Zhou, L.; Hu, W.; Zhang, J.; and Zong, C. 2017. Neural system combination for machine translation. In Proc. of ACL2017.
Zoph, B., and Knight, K. 2016. Multi-source neural translation. In Proc. of NAACL2016.
Asynchronous Bidirectional Decoding for Neural Machine Translation
The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18) Asynchronous Bidirectional Decoding for Neural Machine Translation Xiangwen Zhang, 1 Jinsong Su, 1 Yue Qin, 1 Yang Liu, 2 Rongrong