Minimum Risk Training For Neural Machine Translation. Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu

1 Minimum Risk Training For Neural Machine Translation. Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. ACL 2016, Berlin, Germany, August 2016

2 Machine Translation. MT: using computers to translate natural languages. Example: Bush held a talk with Sharon

3 Our Work: a new training criterion for NMT; eliminating the discrepancy between training and testing; significant improvement on the NIST03 dataset. Comparison table (columns: System, Training, BLEU, TER): SMT system Moses trained with MERT; NMT system RNNSearch trained with MLE and with MRT. (Koehn and Hoang, 2007; Bahdanau et al., 2015)

4 Outline: Introduction to Neural MT; Maximum Likelihood Estimation; Minimum Risk Training; Experiments; Conclusion

5 Modeling. Key problem: how to model the translation process?

6 Modeling. SMT: describing the translation process via latent structures (Brown et al., 1993)

7 Modeling. NMT: describing the translation process via neural networks (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Bahdanau et al., 2015)

8 Attentional NMT. Example: Bush held a talk with Sharon (Bahdanau et al., 2015)

10 Maximum Likelihood Estimation. MLE is the standard training criterion for NMT (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Bahdanau et al., 2015).
training data: $\{\langle x^{(s)}, y^{(s)} \rangle\}_{s=1}^{S}$
objective: $\mathcal{L}(\theta) = \sum_{s=1}^{S} \log P(y^{(s)} \mid x^{(s)}; \theta) = \sum_{s=1}^{S} \sum_{n=1}^{N^{(s)}} \log P(y_n^{(s)} \mid x^{(s)}, y_{<n}^{(s)}; \theta)$
optimization: $\hat{\theta}_{\mathrm{MLE}} = \operatorname{argmax}_{\theta} \{ \mathcal{L}(\theta) \}$
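
As a rough illustration of the MLE objective above, the sketch below sums word-level log-probabilities over a toy corpus. The probabilities are made-up numbers standing in for $P(y_n^{(s)} \mid x^{(s)}, y_{<n}^{(s)}; \theta)$, and the function names are hypothetical, not taken from the talk.

```python
import math

def sentence_log_likelihood(token_probs):
    """Sum of log P(y_n | x, y_<n; theta) over the target words of one sentence."""
    return sum(math.log(p) for p in token_probs)

def mle_objective(corpus_token_probs):
    """L(theta): sum of sentence log-likelihoods over the whole training corpus."""
    return sum(sentence_log_likelihood(probs) for probs in corpus_token_probs)

# Hypothetical probabilities a model assigns to the reference words of two
# training sentences (illustrative values only, not results from the talk).
corpus = [
    [0.5, 0.25],       # sentence 1: two target words
    [0.5, 0.5, 0.25],  # sentence 2: three target words
]
objective = mle_objective(corpus)
print(objective)
```

Training with MLE would adjust the parameters to push this quantity up; note that every term is a per-word score, which is the word-level character of the loss that the next slides criticise.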

11 Drawbacks: word-level loss function (Ranzato et al., 2015)

12 Drawbacks: word-level loss function; exposure bias. Training: generating target words is based on training data. Testing: generating target words is based on model predictions. (Ranzato et al., 2015)
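
The training/testing mismatch can be made concrete with a toy example. The sketch below uses a hypothetical lookup-table "model" (`toy_model`, invented for illustration) that predicts the next word from the previous one. During training the previous word always comes from the reference; at test time it comes from the model's own output, so a single early mistake changes every later context.

```python
def toy_model(prev_word):
    """Hypothetical bigram 'model': returns its most likely next word."""
    table = {
        "<s>": "bush", "bush": "held", "held": "a", "a": "meeting",
        "talk": "with", "meeting": "</s>", "with": "sharon", "sharon": "</s>",
    }
    return table[prev_word]

reference = ["bush", "held", "a", "talk", "with", "sharon", "</s>"]

def teacher_forced_contexts(ref):
    """Training: the context for word n is the reference prefix."""
    return ["<s>"] + ref[:-1]

def free_running_decode(max_len=10):
    """Testing: the context for word n is the model's own previous prediction."""
    prev, output = "<s>", []
    for _ in range(max_len):
        word = toy_model(prev)
        output.append(word)
        if word == "</s>":
            break
        prev = word
    return output

# One prediction per reference context; the wrong word "meeting" is never fed back.
train_predictions = [toy_model(c) for c in teacher_forced_contexts(reference)]
# Here the wrong word "meeting" becomes the next context and derails the output.
test_predictions = free_running_decode()
print(train_predictions)
print(test_predictions)
```

In this toy setting the teacher-forced run recovers after its one error because the reference word is supplied as the next context, while the free-running run ends early; the model was never trained on the contexts it actually meets at test time.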

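14 Minimum Risk Training. MRT aims to minimize the expected loss on the training data (Och, 2003; Smith and Eisner, 2006; He and Deng, 2012).
training data: $\{\langle x^{(s)}, y^{(s)} \rangle\}_{s=1}^{S}$
objective: $J(\theta) = \sum_{s=1}^{S} \mathbb{E}_{y \mid x^{(s)}; \theta} \left[ \Delta(y, y^{(s)}) \right] = \sum_{s=1}^{S} \sum_{y \in \mathcal{Y}(x^{(s)})} P(y \mid x^{(s)}; \theta) \, \Delta(y, y^{(s)})$
optimization: $\hat{\theta}_{\mathrm{MRT}} = \operatorname{argmin}_{\theta} \{ J(\theta) \}$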
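
A minimal sketch of the expected loss for a single source sentence, assuming a candidate space small enough to enumerate. The probabilities and loss values are invented for illustration; `expected_risk` is a hypothetical helper name.

```python
def expected_risk(candidates):
    """Risk for one sentence: sum over y of P(y | x; theta) * Delta(y, y_ref)."""
    return sum(prob * loss for prob, loss in candidates)

# Hypothetical candidate translations for one source sentence, each given as
# (model probability, loss against the reference). Illustrative values only.
before_training = [(0.5, 0.75), (0.25, 0.25), (0.25, 0.0)]
# The same candidates after probability mass has moved to the zero-loss one.
after_training = [(0.25, 0.75), (0.25, 0.25), (0.5, 0.0)]

risk_before = expected_risk(before_training)
risk_after = expected_risk(after_training)
print(risk_before, risk_after)
```

The risk drops when the model shifts probability toward low-loss candidates, which is the behaviour the argmin in the optimization line rewards. The loss is computed on whole sentences, so it need not decompose over words.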

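15 Challenge. It is intractable to calculate the gradient
$\frac{\partial J(\theta)}{\partial \theta_i} = \sum_{s=1}^{S} \mathbb{E}_{y \mid x^{(s)}; \theta} \left[ \Delta(y, y^{(s)}) \times \sum_{n=1}^{N^{(s)}} \frac{\partial P(y_n \mid x^{(s)}, y_{<n}; \theta) / \partial \theta_i}{P(y_n \mid x^{(s)}, y_{<n}; \theta)} \right]$
The search space is exponential. The loss function is usually non-decomposable. It is hard to design efficient DP algorithms. (Och, 2003; Smith and Eisner, 2006; He and Deng, 2012)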
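
To give a feel for why the expectation cannot be enumerated, the sketch below counts the candidate target strings for an assumed vocabulary size and maximum length. Both numbers are illustrative assumptions, not settings stated on this slide.

```python
def search_space_size(vocab_size, max_len):
    """Number of distinct target strings of length 1..max_len over a vocabulary."""
    return sum(vocab_size ** n for n in range(1, max_len + 1))

# Assumed sizes for illustration: a 30K-word vocabulary, outputs of up to 20 words.
size = search_space_size(30_000, 20)
digits = len(str(size))
print(digits)
```

Even at these modest sizes the count has dozens of decimal digits, so the expectation has to be approximated rather than summed exactly.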

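16 Approximation. We approximate the true distribution with a distribution over a sampled subset $S(x^{(s)})$ of the search space (Och, 2003):
$\tilde{J}(\theta) = \sum_{s=1}^{S} \sum_{y \in S(x^{(s)})} \frac{P(y \mid x^{(s)}; \theta)^{\alpha}}{\sum_{y' \in S(x^{(s)})} P(y' \mid x^{(s)}; \theta)^{\alpha}} \, \Delta(y, y^{(s)}) = \sum_{s=1}^{S} \sum_{y \in S(x^{(s)})} Q(y \mid x^{(s)}; \theta, \alpha) \, \Delta(y, y^{(s)}) = \sum_{s=1}^{S} \mathbb{E}_{y \mid x^{(s)}; \theta, \alpha} \left[ \Delta(y, y^{(s)}) \right]$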
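
The renormalised distribution over a sampled subset can be sketched as follows. The sketch assumes the sharpness exponent α suggested by the two-parameter Q on this slide; the log-probabilities and losses are invented, and the function names are hypothetical.

```python
import math

def q_distribution(log_probs, alpha):
    """Q(y | x; theta, alpha): P(y | x)^alpha renormalised over the sampled subset."""
    scaled = [alpha * lp for lp in log_probs]
    top = max(scaled)  # subtract the max before exponentiating, for stability
    weights = [math.exp(s - top) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def sampled_risk(log_probs, losses, alpha):
    """Approximate risk for one sentence: sum over sampled y of Q(y) * Delta(y)."""
    q = q_distribution(log_probs, alpha)
    return sum(qi * li for qi, li in zip(q, losses))

# Hypothetical sampled candidates for one source sentence (illustrative values).
log_probs = [math.log(0.04), math.log(0.02), math.log(0.02)]
losses = [0.75, 0.25, 0.0]

q_alpha_one = q_distribution(log_probs, 1.0)   # plain renormalisation
q_alpha_zero = q_distribution(log_probs, 0.0)  # uniform over the samples
q_alpha_five = q_distribution(log_probs, 5.0)  # sharper, mass on the top candidate
risk = sampled_risk(log_probs, losses, 1.0)
print(q_alpha_one, risk)
```

Only the sampled candidates enter the sum, so the cost per sentence depends on the sample size rather than on the full search space. Smaller α flattens Q and larger α concentrates it on the most probable sample.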

17 Advantages of MRT. Directly optimizes with respect to evaluation metrics: better correlation with the final objective. Allows arbitrary loss functions: both decomposable and non-decomposable. Allows arbitrary end-to-end architectures: any neural MT model, and other NLP tasks.

19 Setup. Language pairs: ZH-EN, 2.56M sentence pairs (67.5M + 74.8M words); EN-FR, 12M sentence pairs (348M + 304M words); EN-DE, 4M sentence pairs (91M + 87M words). Evaluation: BLEU, TER, and subjective evaluation.

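20 Effect of Loss Functions. Table (columns: criterion, loss, BLEU, TER, NIST): MLE with no sentence-level loss (N/A), compared with MRT using sBLEU, sTER, and sNIST as the loss. Caption: effect of loss functions on the Chinese-English validation set.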
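
For readers unfamiliar with sentence-level metrics as losses, here is a minimal sketch of a smoothed sentence-level BLEU turned into a loss. The add-one smoothing and the `1 - BLEU` form are assumptions chosen for illustration; the slide does not specify how sBLEU is smoothed or signed.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def smoothed_sentence_bleu(hyp, ref, max_n=4):
    """Sentence-level BLEU with add-one smoothing on each n-gram precision."""
    if not hyp:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clipped matches: each hypothesis n-gram counts at most as often
        # as it occurs in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        log_precision += math.log((matches + 1) / (total + 1)) / max_n
    brevity = min(1.0, math.exp(1 - len(ref) / len(hyp)))
    return brevity * math.exp(log_precision)

def bleu_loss(hyp, ref):
    """Assumed loss form: lower is better, zero for an exact match."""
    return 1.0 - smoothed_sentence_bleu(hyp, ref)

ref = "bush held a talk with sharon".split()
exact_loss = bleu_loss(ref, ref)
partial_loss = bleu_loss("bush held a meeting".split(), ref)
print(exact_loss, partial_loss)
```

Because the score depends on n-gram overlap and a length penalty over the whole sentence, it cannot be written as a sum of independent per-word terms, which is why it fits MRT but not a word-level criterion.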

21 Effect of α. Figure: effect of α on the Chinese-English validation set.

22 Chinese-English Translation. Evaluation: case-insensitive BLEU. Chart comparing Moses, RNNSearch+MLE, and RNNSearch+MRT on NIST06 (dev), NIST02, NIST03, NIST04, NIST05, and NIST08. RNNSearch+MRT compared to Moses: up to +8.6 points; compared to MLE: up to +7.2 points.

23 Chinese-English Translation. Evaluation: case-insensitive TER. Chart comparing Moses, RNNSearch+MLE, and RNNSearch+MRT on NIST06 (dev), NIST02, NIST03, NIST04, NIST05, and NIST08. RNNSearch+MRT compared to Moses: up to [...] points; compared to MLE: up to -8.3 points.

24 Subjective Evaluation. Chart: MLE vs. MRT (worse, equal, better). The two human evaluators made close judgements: around 54% of MLE translations are worse than MRT, 23% are equal, and 23% are better.

25 Example (Reference): the u.s. delegation includes a china expert from stanford university, two senate foreign policy aides and a former state department official in charge of dealing with pyongyang authorities

26 Example (Moses): the united states to members of the delegation include representatives from the stanford university, a chinese expert, two assistant senate foreign policy and a responsible for dealing with pyongyang before the officials of the state council.

27 Example (RNNSearch, MLE): the us delegation comprises a chinese expert from stanford university, a chinese foreign office assistant policy assistant and a former official who is responsible for dealing with the pyongyang authorities.

28 Example (RNNSearch, MRT): the us delegation included a chinese expert from the stanford university, two senate foreign policy assistants, and a former state department official who had dealings with the pyongyang authorities.

29 English-French Translation
System | Architecture | Training | Vocab | BLEU
Bahdanau et al. (2015) | gated RNN with search | MLE | 30K | 28.5
Jean et al. (2015) | gated RNN with search | MLE | 30K | 30.0
Jean et al. (2015) | gated RNN with search + PosUnk | MLE | 30K | 33.1
Sutskever et al. (2014) | LSTM with 4 layers | MLE | 80K | 30.6
Luong et al. (2015) | LSTM with 4 layers | MLE | 40K | 29.5
Luong et al. (2015) | LSTM with 4 layers + PosUnk | MLE | 40K | 31.8
Luong et al. (2015) | LSTM with 6 layers | MLE | 40K | 30.4
Luong et al. (2015) | LSTM with 6 layers + PosUnk | MLE | 40K | 32.7
this work | gated RNN with search | MLE | 30K | 29.9
this work | gated RNN with search | MRT | 30K | 31.3
this work | gated RNN with search + PosUnk | MRT | 30K | 34.2
Dev set: news-test 2012 & 2013. Test set: news-test 2014. Evaluation: case-sensitive BLEU.

30 English-German Translation
System | Architecture | Training | BLEU
Jean et al. (2015) | gated RNN with search | MLE | 16.5
Jean et al. (2015) | gated RNN with search + PosUnk | MLE | 19.0
Jean et al. (2015) | gated RNN with search + LV + PosUnk | MLE | 19.4
Luong et al. (2015b) | LSTM w/ 4 layers + dropout + local att. + PosUnk | MLE | 20.9
this work | gated RNN with search | MLE | 16.5
this work | gated RNN with search | MRT | 18.0
this work | gated RNN with search + PosUnk | MRT | 20.5
Dev set: news-test 2012 & 2013. Test set: news-test 2014. Evaluation: case-sensitive BLEU.

32 Conclusion. Neural MT has attracted rapidly growing interest over the past two years. Conventional maximum likelihood estimation (MLE) suffers from a word-level loss function and exposure bias. Minimum risk training (MRT) significantly improves NMT over MLE. MRT can be applied to other end-to-end architectures for NLP tasks.

33 Thank you!
