On Training of Deep Neural Network. Lornechen



1 On Training of Deep Neural Network Lornechen

2 Outline
- Introduction
- Layer-wise Pre-training & Fine-tuning
- Activation Function
- Initialization Method
- Advanced Layers and Nets

3 Neural Network Components
- Layers: input layer + hidden layers + output layer
- Neurons: number in each layer and activation function
- Objective function: LogLoss, cross-entropy (CE), RMSE
- Optimization method: SGD, Nesterov, AdaDelta

4 Difficulties of Training DNN
- Availability of data: labeled data is often scarce, so the network is prone to overfitting
- Highly non-convex objective: full of bad local optima, so gradient descent no longer works well
- Diffusion of gradient: the gradients propagated backwards rapidly diminish in magnitude as the depth of the network increases

5 Difficulties of Training DNN
NNs with 2-3 hidden layers were commonly used in the past. However, NNs with tens, hundreds, or even a thousand hidden layers are (widely) used nowadays. How?


7 Supervised Version
First, train one layer at a time. (Figure labels: Fix, Fix, Train)

8 Supervised Version
First, train one layer at a time. (Figure labels: Fix, Train, Fix)

9 Supervised Version
First, train one layer at a time. (Figure labels: Train, Fix, Fix)

10 Supervised Version
Finally, fine-tune the whole network. (Figure labels: BP, Forward)

11 Unsupervised Version

12 The Reasoning
- Pre-training offers a better local optimum as a starting point than random initialization.
- Fine-tuning subsequently adjusts the weights according to the task at hand (guided by the objective function).

13 However
Layer-wise pre-training is rarely used nowadays:
- it requires more training time
- it may lead to a poorer local optimum
Just train the network as a whole from scratch.


15 Traditional Activation Function
- Sigmoid: $f(x) = \frac{1}{1 + e^{-x}}$; saturates and kills gradients
- Tanh: $f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$; similar to sigmoid
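To make "saturates and kills gradients" concrete, here is a minimal sketch that evaluates the derivatives of sigmoid and tanh at a few inputs. The function names and sample points are illustrative choices, not taken from the slides.

```python
import math


def sigmoid(x):
    """Logistic sigmoid f(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid: f(x) * (1 - f(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    t = math.tanh(x)
    return 1.0 - t * t


# Both derivatives peak at x = 0 and shrink towards zero as |x| grows.
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x={x:5.1f}  sigmoid'={sigmoid_grad(x):.6f}  tanh'={tanh_grad(x):.6f}")
```

Once a unit's input drifts far from zero, almost no gradient passes through it, which is the saturation problem named on the slide.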

16 Recap of Back Propagation (BP)
$\frac{\partial J(W)}{\partial W_{ij}^{(l)}} = \delta_i^{(l+1)} f(z_j^{(l)})$
$\delta_i^{(l)} = \Big( \sum_{j} W_{ji}^{(l)} \delta_j^{(l+1)} \Big) f'(z_i^{(l)})$
- $\delta_i^{(l)}$: error of the l-th layer
- $W_{ji}^{(l)}$: weight between the l-th and (l+1)-th layers
- $f(z_j^{(l)})$: activation of the l-th layer
- $f'(z_i^{(l)})$: derivative of the activation at the l-th layer
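The two BP equations can be checked on a tiny network. The sketch below assumes a 2-3-1 sigmoid network without biases, a squared-error loss, and arbitrary weights; all of these are illustrative assumptions, as are the helper names. `W[i][j]` connects unit j of layer l to unit i of layer l+1, and the input layer uses the raw input as its activation.

```python
import math


def f(z):
    """Sigmoid activation (assumed for this sketch)."""
    return 1.0 / (1.0 + math.exp(-z))


def f_prime(z):
    s = f(z)
    return s * (1.0 - s)


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def forward(W1, W2, x):
    z2 = matvec(W1, x)
    a2 = [f(z) for z in z2]
    z3 = matvec(W2, a2)
    a3 = [f(z) for z in z3]
    return z2, a2, z3, a3


def loss(W1, W2, x, y):
    """Squared error J = 0.5 * sum (a3 - y)^2 (assumed objective)."""
    a3 = forward(W1, W2, x)[3]
    return 0.5 * sum((a - t) ** 2 for a, t in zip(a3, y))


def backprop(W1, W2, x, y):
    z2, a2, z3, a3 = forward(W1, W2, x)
    # Output-layer error for the squared-error loss.
    delta3 = [(a - t) * f_prime(z) for a, t, z in zip(a3, y, z3)]
    # delta_i^(l) = (sum_j W_ji^(l) * delta_j^(l+1)) * f'(z_i^(l))
    delta2 = [
        sum(W2[j][i] * delta3[j] for j in range(len(delta3))) * f_prime(z2[i])
        for i in range(len(z2))
    ]
    # dJ/dW_ij^(l) = delta_i^(l+1) * activation_j^(l)
    grad_W2 = [[d * a for a in a2] for d in delta3]
    grad_W1 = [[d * xi for xi in x] for d in delta2]
    return grad_W1, grad_W2


W1 = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]
W2 = [[0.3, -0.1, 0.2]]
x, y = [1.0, 0.5], [1.0]

grad_W1, grad_W2 = backprop(W1, W2, x, y)

# Central finite difference on W1[0][0] as an independent check.
eps = 1e-6
W1_plus = [row[:] for row in W1]
W1_minus = [row[:] for row in W1]
W1_plus[0][0] += eps
W1_minus[0][0] -= eps
numeric = (loss(W1_plus, W2, x, y) - loss(W1_minus, W2, x, y)) / (2 * eps)
print("backprop:", grad_W1[0][0], "numeric:", numeric)
```

The analytic and numeric values should agree closely, which confirms the index placement in the delta recursion.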

17 Traditional Activation Function
$f'(z_j^{(l)}) < 1$ almost everywhere (at most $1/4$ for sigmoid; tanh reaches 1 only at $z = 0$): gradients diminish rapidly when propagated backwards (contractive)
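A back-of-the-envelope sketch of the effect: each layer multiplies the backpropagated error by one activation derivative, so with sigmoid the activation factor alone is bounded by 0.25 per layer. The sketch ignores the weight terms (an assumption made to isolate the activation's contribution), and `gradient_bound` is a hypothetical helper name.

```python
SIGMOID_MAX_SLOPE = 0.25  # maximum of the sigmoid derivative, reached at z = 0


def gradient_bound(depth, max_slope=SIGMOID_MAX_SLOPE):
    """Upper bound on the product of `depth` activation derivatives."""
    return max_slope ** depth


for depth in (2, 5, 10, 20):
    print(f"depth={depth:2d}  bound={gradient_bound(depth):.3e}")
```

Even in the best case the factor falls below one millionth after ten sigmoid layers.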

18 Rectified Linear Unit (ReLU)
ReLU: $f(x) = \max(0, x)$
- pros: faster convergence, minimal computation cost
- cons: "dead" units

19 ReLU Family
- Leaky ReLU: $f(x) = \max(0, x) + a \min(0, x)$, where a is fixed at a (very) small value, e.g., 0.01 or 0.3
- Parametric ReLU: similar to Leaky ReLU, but a is learnable
- Randomized Leaky ReLU: similar to Leaky ReLU, but a is sampled from a uniform distribution $U(l, u)$ at training time and fixed at $(l + u)/2$ at prediction time
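The ReLU variants differ only in how the negative slope a is chosen. A minimal sketch follows; the function names, the default slope, and the sampling bounds used in the usage lines are illustrative assumptions.

```python
import random


def relu(x):
    return max(0.0, x)


def leaky_relu(x, a=0.01):
    """f(x) = max(0, x) + a * min(0, x) with a fixed small slope a."""
    return max(0.0, x) + a * min(0.0, x)


def prelu_grad_a(x):
    """Parametric ReLU uses the same forward pass as leaky_relu but learns a;
    this is the derivative of the output with respect to a."""
    return min(0.0, x)


def rrelu(x, lower, upper, training, rng=random):
    """Randomized Leaky ReLU: sample a from U(lower, upper) during training,
    use the midpoint (lower + upper) / 2 at prediction time."""
    a = rng.uniform(lower, upper) if training else (lower + upper) / 2.0
    return leaky_relu(x, a)


print(relu(-2.0), leaky_relu(-2.0), rrelu(-2.0, 0.1, 0.3, training=False))
```

For positive inputs all four behave identically; only negative inputs expose the difference.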

20 ReLU Family

21 Exponential Linear Unit (ELU)
ELU: $f(x) = \max(0, x) + a \min(0, e^{x} - 1)$
a controls the value to which an ELU saturates for negative net inputs
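A short sketch of the ELU formula, written in the equivalent piecewise form so that large positive inputs do not overflow the exponential. The default a = 1.0 is an illustrative assumption.

```python
import math


def elu(x, a=1.0):
    """ELU: x for x > 0, a * (exp(x) - 1) otherwise.

    Equivalent to max(0, x) + a * min(0, exp(x) - 1).
    """
    return x if x > 0 else a * math.expm1(x)


# For strongly negative inputs the output approaches -a.
for x in (2.0, 0.0, -1.0, -10.0):
    print(f"x={x:6.1f}  elu={elu(x, a=1.5):.6f}")
```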

22 What Neuron Type Should I Use?
Use the ReLU, be careful with your learning rates, and possibly monitor the fraction of "dead" units in a network. If this concerns you, give Leaky ReLU or Maxout a try. Never use sigmoid. Try tanh, but expect it to work worse than ReLU or Maxout.


24 Traditional Filler
- Uniform Filler: U(-w, w)
- Gaussian Filler: N(0, std), e.g., std = 0.01
- difficult to converge for #layers > 8

25 Traditional Filler
Xavier Filler
- investigates the variance of responses in each layer, assuming a linear activation function
- N(0, std), where std = sqrt(1/n), n = (fan-in + fan-out) / 2
- difficult to converge for #layers > 30 (stall)

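26 MSRA Filler
Problem of Xavier Filler:
- assumes a linear activation function
- invalid for ReLU and PReLU
New derived filler: N(0, std), where std = sqrt(2 / ((1 + a^2) n)) and a is the negative slope of the (P)ReLU
- a = 0: it becomes the ReLU case
- a = 1: it becomes the linear case (the same as Xavier)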
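The two fillers differ only in the standard deviation of the Gaussian. The sketch below uses the Xavier formula from the slides and, for MSRA, the formula from the cited "Delving Deep into Rectifiers" paper, std = sqrt(2 / ((1 + a^2) n)), with n taken as the fan-in (an assumption of this sketch). The function names and `gaussian_fill` helper are hypothetical.

```python
import math
import random


def xavier_std(fan_in, fan_out):
    """std = sqrt(1 / n) with n = (fan_in + fan_out) / 2."""
    n = (fan_in + fan_out) / 2.0
    return math.sqrt(1.0 / n)


def msra_std(fan_in, a=0.0):
    """std = sqrt(2 / ((1 + a^2) * n)); a is the negative slope (0 for ReLU)."""
    return math.sqrt(2.0 / ((1.0 + a * a) * fan_in))


def gaussian_fill(rows, cols, std, seed=0):
    """Draw a rows x cols weight matrix from N(0, std)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


print("Xavier      :", xavier_std(256, 256))
print("MSRA (a = 0):", msra_std(256, a=0.0))
print("MSRA (a = 1):", msra_std(256, a=1.0))
weights = gaussian_fill(4, 256, msra_std(256))
```

With equal fan-in and fan-out, setting a = 1 reproduces the Xavier value, matching the linear case on the slide.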

27 MSRA Filler vs. Xavier Filler
Per layer, Xavier's std is sqrt(1/2) of the MSRA Filler's, so over L layers the scale becomes sqrt(1/2^L) of MSRA's and the signal diminishes. Xavier is difficult to converge for #layers >
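The sqrt(1/2^L) factor can be observed in a small forward-pass simulation. The sketch assumes a bias-free fully connected ReLU network with equal fan-in and fan-out, so Xavier uses std = sqrt(1/n) and MSRA uses std = sqrt(2/n); depth, width, and seed are arbitrary illustrative values. Both runs share the same seed, so their weights differ only by the constant factor sqrt(2).

```python
import math
import random


def relu_forward_rms(std, depth, width, seed=0):
    """Push a random input through `depth` ReLU layers with N(0, std) weights
    and return the root-mean-square of the final activations."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        W = [[rng.gauss(0.0, std) for _ in range(width)] for _ in range(width)]
        x = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in W]
    return math.sqrt(sum(v * v for v in x) / width)


DEPTH, WIDTH = 20, 64
xavier_rms = relu_forward_rms(math.sqrt(1.0 / WIDTH), DEPTH, WIDTH)
msra_rms = relu_forward_rms(math.sqrt(2.0 / WIDTH), DEPTH, WIDTH)
ratio = xavier_rms / msra_rms

print("Xavier RMS:", xavier_rms)
print("MSRA RMS  :", msra_rms)
print("ratio     :", ratio, " sqrt(1/2^L):", 2.0 ** (-DEPTH / 2.0))
```

Because ReLU is positively homogeneous, scaling every weight by sqrt(1/2) scales each layer's output by the same factor, which is why the ratio tracks sqrt(1/2^L).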

28 LSUV Filler


30 Intermediate Supervised Layer
overall loss = a * loss0 + b * loss1 + c * loss2

31 Deeply Supervised Nets
overall loss = a * loss0 + b * loss1 + c * loss2
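The combined objective is a weighted sum of the per-branch losses. A minimal sketch, with made-up loss values and weights chosen only for illustration:

```python
def overall_loss(losses, weights):
    """Weighted sum a * loss0 + b * loss1 + c * loss2 + ..."""
    if len(losses) != len(weights):
        raise ValueError("need exactly one weight per loss")
    return sum(w * l for w, l in zip(weights, losses))


# Hypothetical values: three supervised branches with weights a, b, c.
total = overall_loss(losses=[0.9, 1.4, 2.0], weights=[1.0, 0.3, 0.3])
print(total)
```

The weights set how strongly each supervised branch contributes to the gradient that reaches the shared lower layers.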

32 Batch Normalization Layer

33 Batch Normalization Layer
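As a rough sketch of what a batch normalization layer computes in training mode, following the cited Batch Normalization paper: each feature is normalized with the mini-batch mean and variance, then scaled by gamma and shifted by beta. The code handles a single feature, and the default values of gamma, beta, and eps are illustrative assumptions; the running statistics used at inference time are not shown.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift."""
    m = len(batch)
    mean = sum(batch) / m
    var = sum((x - mean) ** 2 for x in batch) / m  # biased batch variance
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


out = batch_norm([1.0, 2.0, 3.0, 4.0])
out_mean = sum(out) / len(out)
out_var = sum((v - out_mean) ** 2 for v in out) / len(out)
print(out, out_mean, out_var)
```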

34 Deep Residual Nets
Is learning better networks as easy as stacking more layers? A deeper model should produce no higher training error than its shallower counterpart.

35 Deep Residual Nets
Residual Learning Reformulation
- The desired underlying mapping: $H(x)$
- Let the stacked nonlinear layers fit another mapping $F(x) := H(x) - x$, i.e., the residual
- The original mapping $H(x)$ is recast into $F(x) + x$
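A minimal sketch of the recast mapping F(x) + x. The residual functions here are hypothetical stand-ins for the stacked nonlinear layers; the point is that a block whose residual is zero reduces to the identity, so adding such blocks cannot change what a shallower network computes.

```python
def residual_block(x, residual_fn):
    """Return F(x) + x, where residual_fn plays the role of F."""
    fx = residual_fn(x)
    return [f + xi for f, xi in zip(fx, x)]


def zero_residual(x):
    """Stand-in for stacked layers whose weights have been driven to zero."""
    return [0.0] * len(x)


def small_residual(x):
    """Stand-in for stacked layers that learn a small correction."""
    return [0.1 * v for v in x]


x = [1.0, -2.0, 0.5]
print(residual_block(x, zero_residual))   # identity mapping
print(residual_block(x, small_residual))  # input plus a small correction
```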

36 Deep Residual Nets

37 Deep Residual Nets

38 Reference
- Xavier Filler: Understanding the Difficulty of Training Deep Feedforward Neural Networks
- PReLU & MSRA Filler: Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
- RReLU: Empirical Evaluation of Rectified Activations in Convolutional Network
- ELU: Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)
- LSUV Filler: All You Need Is a Good Init
- GoogLeNet: Going Deeper with Convolutions
- DSN: Deeply-Supervised Nets
- BatchNorm: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
- ResNet: Deep Residual Learning for Image Recognition

39 Thanks
