Deep Architectures for Neural Machine Translation

Size: px

Start display at page:

Download "Deep Architectures for Neural Machine Translation"

Bernadette Perkins
5 years ago
Views:

University of Edinburgh Faculty of Mathematics and Physics, Charles University

1 Deep Architectures for Neural Machine Translation Antonio Valerio Miceli Barone Jindřich Helcl Rico Sennrich Barry Haddow Alexandra Birch School of Informatics, University of Edinburgh Faculty of Mathematics and Physics, Charles University September 8, 2017 Miceli Barone, Helcl, Sennrich, Haddow, Birch Deep Architectures for NMT 1 / 16

2 Deep architectures What is the depth of a recurrent neural network? Miceli Barone, Helcl, Sennrich, Haddow, Birch Deep Architectures for NMT 1 / 16

3 Deep architectures What is the depth of a recurrent neural network? [Pascanu et al., 2014] Transition depth Used for LM [Zilly et al., 2016] Stacked depth Baidu [Zhou et al., 2016] Google [Wu et al., 2016] Miceli Barone, Helcl, Sennrich, Haddow, Birch Deep Architectures for NMT 1 / 16

4 Deep architectures What is the depth of a recurrent neural network? [Pascanu et al., 2014] Transition depth Used for LM [Zilly et al., 2016] Stacked depth Baidu [Zhou et al., 2016] Google [Wu et al., 2016] This work Provide a systematic comparison of deep architectures for NMT Investigate the effects of different kinds of depth Propose a combined "BiDeep" architecture Miceli Barone, Helcl, Sennrich, Haddow, Birch Deep Architectures for NMT 1 / 16

5 Deep transition encoder (x i, h i 1,Ls ) GRU 3 GRU 3 GRU 3 h i,k = GRU k (0, ) h i,k 1 h i,1 = GRU 1 GRU 2 GRU 2 GRU 2 for 1 < k L s GRU 1 GRU 1 GRU 1 x 1 x 2 x N Bidirectional encoder Compute the next state using a deep feed-forward network made of multiple GRU transition blocks GRU blocks not individually recurrent, recurrence at time-step level Miceli Barone, Helcl, Sennrich, Haddow, Birch Deep Architectures for NMT 3 / 16

6 Deep transition decoder GRU 3 GRU 3 GRU 3 s j,2 = GRU 2 (ATT(C, s j,1 ), s j,1 ) s j,1 = GRU 1 (y j 1, s j 1,Lt ) GRU 2 GRU 2 GRU 2 ATT ATT ATT s j,k = GRU k (0, s j,k 1 ) for 2 < k L t GRU 1 GRU 1 GRU 1 y 0 y 1 y M Attention mechanism between 1st and 2nd layers (Nematus) Minimum transition depth is 2 even in the baseline Miceli Barone, Helcl, Sennrich, Haddow, Birch Deep Architectures for NMT 4 / 16

7 Alternating stacked encoder x 1 x N Baidu [Zhou et al., 2016] Multiple levels of individually recurrent GRU cells Residual connections between levels Bidirectional: two columns with opposing scanning directions Each level inverts scanning direction of the previous one Miceli Barone, Helcl, Sennrich, Haddow, Birch Deep Architectures for NMT 5 / 16

8 Biunidirectional stacked encoder x 1 x N Google [Wu et al., 2016] Multiple levels of individually recurrent GRU cells Residual connections between levels First levels are bidirectional, then states are merged Higher levels are unidirectional left-to-right Miceli Barone, Helcl, Sennrich, Haddow, Birch Deep Architectures for NMT 6 / 16

9 Stacked decoder y 0 y 1 y M Multiple levels of individually recurrent GRU cells Residual connections between levels Different variations depending on how attention is used in the higher layers (details in the paper) Miceli Barone, Helcl, Sennrich, Haddow, Birch Deep Architectures for NMT 7 / 16

10 BiDeep Architectures Our proposal Combine the two kinds of depth Stacked levels of recurrent cells, each with multiple layers of transition depth Miceli Barone, Helcl, Sennrich, Haddow, Birch Deep Architectures for NMT 8 / 16

11 Experimental setup Data Training: WMT-2017 English-to-German Validation: newstest 2013 Test: newstest (we report averages) System: Nematus [Sennrich et al., 2017b] GRU sequence-to-sequence with attention [Bahdanau et al., 2015] Layer normalization [Ba et al., 2016] Training on single Titan X (Pascal) GPU Miceli Barone, Helcl, Sennrich, Haddow, Birch Deep Architectures for NMT 9 / 16

12 Results: Deep Encoders Bleu shallow ,900 biunidirectional ,500 alternating ,600 deep transition ,900 Test Bleu Train speed 800 1,500 2,200 2,900 3,600 tokens/sec Depth-4 encoders improve translation quality Deep transition fastest and most accurate Miceli Barone, Helcl, Sennrich, Haddow, Birch Deep Architectures for NMT 10 / 16

13 Results: Deep Decoders Bleu shallow ,900 rgru (reuse att.) ,150 cgru (multi att.) 1, deep transition ,200 Test Bleu Training speed 800 1,500 2,200 2,900 3,600 tokens/sec Depth-4 decoders improve translation quality Deep transition fastest and second most accurate Miceli Barone, Helcl, Sennrich, Haddow, Birch Deep Architectures for NMT 11 / 16

14 Results: Deep Encoders and Decoders Bleu deep tran. enc. alternating + rgru 1, , biunidir. + rgru deep tran. enc/dec ,280 1, Test Bleu Training speed BiDeep 1, ,500 2,200 2,900 3,600 tokens/sec Depth-4 on both encoder and decoder yields small improvement Biunidirectional+rGRU [Wu et al., 2016] performs the worst Alternating+rGRU [Zhou et al., 2016] and BiDeep are tied at this depth Miceli Barone, Helcl, Sennrich, Haddow, Birch Deep Architectures for NMT 12 / 16

15 Miceli Barone, Helcl, Sennrich, Haddow, Birch Deep Architectures for NMT 12 / 16

16 Results: Deep Encoders and Decoders (depth 8) Bleu shallow ,900 depth-4 altern. + rgru 1, depth-8 altern. + rgru depth-4 BiDeep 1, depth-8 BiDeep Test Bleu Training speed 800 1,500 2,200 2,900 3,600 tokens/sec Stacked-only plateaus BiDeep keeps improving Miceli Barone, Helcl, Sennrich, Haddow, Birch Deep Architectures for NMT 13 / 16

17 Error analysis Deep transition decoders have a longer information path In principle, might cause fading memory and vanishing gradients Does this affect long-distance dependencies? Miceli Barone, Helcl, Sennrich, Haddow, Birch Deep Architectures for NMT 14 / 16

18 Error analysis Deep transition decoders have a longer information path In principle, might cause fading memory and vanishing gradients Does this affect long-distance dependencies? Lingeval97 subject-verb agreement [Sennrich, 2017] Contrastive evaluation accuracy shallow GRU stacked GRU deep transition GRU BiDeep GRU distance Miceli Barone, Helcl, Sennrich, Haddow, Birch Deep Architectures for NMT 14 / 16

19 Conclusions Findings Depth improves translation BLEU especially in the encoder Alternating stacked encoders outperform Biunidirectional Deep transition encoders performs better or equal BiDeep architectures perform the best We validated these findings on the WMT-17 news translation task Miceli Barone, Helcl, Sennrich, Haddow, Birch Deep Architectures for NMT 15 / 16

20 Conclusions Findings Depth improves translation BLEU especially in the encoder Alternating stacked encoders outperform Biunidirectional Deep transition encoders performs better or equal BiDeep architectures perform the best We validated these findings on the WMT-17 news translation task Recommendations Use deep transition for speed and model size Use BiDeep for maximum quality Miceli Barone, Helcl, Sennrich, Haddow, Birch Deep Architectures for NMT 15 / 16

21 Conclusions Code in the main Nematus repository Scripts and paper: Thanks for your attention Miceli Barone, Helcl, Sennrich, Haddow, Birch Deep Architectures for NMT 16 / 16

Conclusions Code in the main Nematus repository Scripts and paper: https://git.

22 Conclusions Code in the main Nematus repository Scripts and paper: Thanks for your attention Questions? Miceli Barone, Helcl, Sennrich, Haddow, Birch Deep Architectures for NMT 16 / 16

23 Results: WMT 2017 news translation task ZH EN TR EN RU EN LV EN DE EN CS EN EN ZH EN TR EN RU EN LV EN DE EN CS shallow 20.5 depth Transition depth: Czech: stacked Improvement on all language pairs except Latvian English Bleu Miceli Barone, Helcl, Sennrich, Haddow, Birch Deep Architectures for NMT 16 / 16

Smaller, faster, deeper: University of Edinburgh MT submittion to WMT 2017

Smaller, faster, deeper: University of Edinburgh MT submittion to WMT 2017 Rico Sennrich, Alexandra Birch, Anna Currey, Ulrich Germann, Barry Haddow, Kenneth Heafield, Antonio Valerio Miceli Barone, Philip