On the Training of Deep Neural Networks
Lornechen
2016.04.20
Outline
- Introduction
- Layer-wise Pre-training & Fine-tuning
- Activation Function
- Initialization Method
- Advanced Layers and Nets
Neural Network Components
- Layers: input layer + hidden layers + output layer
- Neurons: the number per layer and the activation function
- Objective function: LogLoss, cross-entropy (CE), RMSE
- Optimization method: SGD, Nesterov momentum, AdaDelta
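To make these pieces concrete, here is a minimal sketch of a tiny NumPy MLP wiring the four components together; the layer sizes, toy data, and learning rate are illustrative choices, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layers & neurons: input (4) -> hidden (8, sigmoid) -> output (1, sigmoid).
W1, b1 = rng.normal(0, 0.1, (4, 8)), np.zeros(8)
W2, b2 = rng.normal(0, 0.1, (8, 1)), np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    h = sigmoid(x @ W1 + b1)            # hidden activations
    return h, sigmoid(h @ W2 + b2)      # output probability

# Objective function: binary cross-entropy (LogLoss).
def logloss(y, p):
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Optimization method: one step of plain SGD on a toy batch.
x = rng.normal(size=(32, 4))
y = rng.integers(0, 2, (32, 1)).astype(float)
lr = 0.5
h, p = forward(x)
g_out = (p - y) / len(x)                # dLoss/dz at the output
g_hid = (g_out @ W2.T) * h * (1 - h)    # backprop through the hidden layer
W2 -= lr * h.T @ g_out; b2 -= lr * g_out.sum(0)
W1 -= lr * x.T @ g_hid; b1 -= lr * g_hid.sum(0)
print("loss after one step:", logloss(y, forward(x)[1]))
```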
Difficulties of Training DNNs
- Availability of data: labeled data is often scarce, so deep models are prone to overfitting
- Highly non-convex objective: full of bad local optima; gradient descent no longer works well
- Diffusion of gradients: the gradients propagated backwards rapidly diminish in magnitude as the depth of the network increases
Difficulties of Training DNNs
- Networks with 2-3 hidden layers were common in the past.
- Nowadays, networks with tens, hundreds, even thousands of hidden layers are widely used. How?
Outline
- Introduction
- Layer-wise Pre-training & Fine-tuning
- Activation Function
- Initialization Method
- Advanced Layers and Nets
Supervised Version
First, train one layer at a time, keeping all other layers fixed:
[diagrams: train the first hidden layer (rest fixed), then the second, then the third]
Supervised Version
Finally, fine-tune the whole network end-to-end with a forward pass and backpropagation (BP).
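A sketch of the whole recipe, assuming PyTorch: greedy supervised layer-wise pre-training with a temporary classifier head, followed by end-to-end fine-tuning. The architecture, toy data, and hyperparameters are placeholders.

```python
import torch
import torch.nn as nn

# Toy stand-in data; a real loader would stream labeled batches.
x_all = torch.randn(256, 784)
y_all = torch.randint(0, 10, (256,))
loader = [(x_all[i:i + 32], y_all[i:i + 32]) for i in range(0, 256, 32)]

hidden = [nn.Linear(784, 256), nn.Linear(256, 128), nn.Linear(128, 64)]

def train(params, model, epochs=1):
    opt = torch.optim.SGD(params, lr=0.1)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for xb, yb in loader:
            opt.zero_grad()
            loss_fn(model(xb), yb).backward()
            opt.step()

# Pre-training: grow the stack one layer at a time; already-trained layers
# are frozen, and a temporary head provides the supervised signal.
for depth in range(1, len(hidden) + 1):
    layers = []
    for i, lin in enumerate(hidden[:depth]):
        lin.requires_grad_(i == depth - 1)   # only the newest layer trains
        layers += [lin, nn.Sigmoid()]
    head = nn.Linear(hidden[depth - 1].out_features, 10)
    model = nn.Sequential(*layers, head)
    train(list(hidden[depth - 1].parameters()) + list(head.parameters()), model)

# Fine-tuning: unfreeze everything and train the whole network with BP.
for lin in hidden:
    lin.requires_grad_(True)
model = nn.Sequential(*[m for lin in hidden for m in (lin, nn.Sigmoid())],
                      nn.Linear(64, 10))
train(list(model.parameters()), model)
```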
Unsupervised Version
[figure: unsupervised layer-wise pre-training, e.g., each layer trained as an autoencoder on the previous layer's output]
The Reasoning
- Pre-training offers a better starting point (a better local optimum to converge toward) than random initialization.
- Fine-tuning subsequently adjusts the weights to the task at hand, guided by the objective function.
However
Layer-wise pre-training is rarely used nowadays:
- it requires more training time
- it may lead to a poorer local optimum
Just train the network as a whole from scratch.
Outline
- Introduction
- Layer-wise Pre-training & Fine-tuning
- Activation Function
- Initialization Method
- Advanced Layers and Nets
Traditional Activation Functions
- Sigmoid: f(x) = 1 / (1 + e^(-x))
  - saturates and kills gradients
- Tanh: f(x) = (e^x - e^(-x)) / (e^x + e^(-x))
  - similar to sigmoid
Recap of Back Propagation (BP)
∂J(W)/∂W_ij^(l) = δ_i^(l+1) · f(z_j^(l))
δ_i^(l) = (Σ_j W_ji^(l) · δ_j^(l+1)) · f'(z_i^(l))
- δ_i^(l): error of the l-th layer
- W_ji^(l): weight between the l-th and (l+1)-th layers
- f(z_j^(l)): activation of the l-th layer
- f'(z_j^(l)): derivative of the activation in the l-th layer
Traditional Activation Functions
|f'(z_j^(l))| < 1 everywhere (for sigmoid, at most 0.25): each backward step shrinks the error signal, so gradients diminish rapidly when propagated backwards (the mapping is contractive everywhere).
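A small numerical illustration of this diffusion (not from the slides): pushing a unit error signal backwards through stacked sigmoid layers multiplies it by f'(z) ≤ 0.25 at every step, so its magnitude decays geometrically. The depth and toy inputs are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
a = rng.normal(size=1000)        # activations entering the stack
grad = np.ones(1000)             # pretend the top-level error signal is 1
for layer in range(1, 11):
    s = sigmoid(a)               # identity weights isolate the f'(z) factor
    grad = grad * s * (1 - s)    # chain rule: multiply by f'(z) <= 0.25
    a = s
    print(f"after layer {layer:2d}: mean |grad| = {np.abs(grad).mean():.2e}")
```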
Rectified Linear Unit (ReLU)
f(x) = max(0, x)
- pros: faster convergence, minimal computation cost
- cons: "dead" units
ReLU Family
- Leaky ReLU: f(x) = max(0, x) + a · min(0, x)
  - a is fixed at a (very) small value, e.g., 0.01 or 0.3
- Parametric ReLU (PReLU): like Leaky ReLU, but a is learnable
- Randomized Leaky ReLU (RReLU): like Leaky ReLU, but a is sampled from a uniform distribution U(l, u) during training and fixed at (l + u)/2 at prediction
ReLU Family
[figure: plots of the ReLU-family activations]
Exponential Linear Unit (ELU)
f(x) = max(0, x) + a · min(0, e^x - 1)
- a controls the value to which an ELU saturates for negative net inputs
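Forward-pass sketches of the activations above in NumPy; the slope and range parameters are illustrative defaults, and this RReLU samples one slope per call rather than per unit as in the paper.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, a=0.01):                  # a fixed and small
    return np.maximum(0, x) + a * np.minimum(0, x)

def prelu(x, a):                            # a is a learned parameter
    return np.maximum(0, x) + a * np.minimum(0, x)

def rrelu(x, l=0.1, u=0.3, training=True, rng=np.random.default_rng()):
    a = rng.uniform(l, u) if training else (l + u) / 2
    return np.maximum(0, x) + a * np.minimum(0, x)

def elu(x, a=1.0):                          # saturates to -a as x -> -inf
    return np.maximum(0, x) + a * np.minimum(0, np.expm1(x))

x = np.linspace(-3, 3, 7)
print(relu(x), elu(x), sep="\n")
```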
What Neuron Type Should I Use?
- Use ReLU, but be careful with your learning rates and possibly monitor the fraction of "dead" units in the network.
- If this concerns you, give Leaky ReLU or Maxout a try.
- Never use sigmoid. Try tanh, but expect it to work worse than ReLU or Maxout.
Outline
- Introduction
- Layer-wise Pre-training & Fine-tuning
- Activation Function
- Initialization Method
- Advanced Layers and Nets
Traditional Fillers
- Uniform filler: U(-w, w)
- Gaussian filler: N(0, std), e.g., std = 0.01
  - difficult to converge for #layers > 8
Traditional Fillers
- Xavier filler
  - derived by analyzing the variance of the responses in each layer, assuming a linear activation function
  - N(0, std), where std = sqrt(1/n) and n = (fan-in + fan-out) / 2
  - difficult to converge for #layers > 30 (training stalls)
MSRA Filler
- Problem with the Xavier filler: it assumes a linear activation function, which is invalid for ReLU and PReLU
- Newly derived filler: N(0, std) with std = sqrt(2 / ((1 + a^2) · n)), where a is the negative slope and n the fan-in
  - a = 0 gives the ReLU case
  - a = 1 gives the linear case (the same as Xavier)
MSRA Filler vs. Xavier Filler
- With L layers, Xavier's std is sqrt(1/2^L) of the MSRA filler's, so the signal diminishes exponentially as it propagates forward.
- This is why Xavier is difficult to converge for #layers > 30.
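Minimal NumPy sketches of the Xavier and MSRA fillers as described above, for a fully connected layer of shape (fan_in, fan_out); the layer size in the demo is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier(shape):
    # std = sqrt(1/n) with n = (fan_in + fan_out) / 2
    n = (shape[0] + shape[1]) / 2
    return rng.normal(0.0, np.sqrt(1.0 / n), shape)

def msra(shape, a=0.0):
    # std = sqrt(2 / ((1 + a^2) * fan_in)); a is the (P)ReLU negative slope:
    # a = 0 gives the ReLU case, a = 1 falls back to the linear/Xavier case
    return rng.normal(0.0, np.sqrt(2.0 / ((1 + a**2) * shape[0])), shape)

W = msra((256, 256))
print(W.std())   # ~ sqrt(2/256) ≈ 0.088
```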
LSUV Filler
Layer-Sequential Unit-Variance initialization (from "All You Need Is a Good Init"): pre-initialize each layer with orthonormal matrices, then, layer by layer, rescale the weights so that the variance of the layer's output on a data batch equals one.
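A sketch of that procedure under the description above, assuming NumPy, fully connected layers, and ReLU activations; the tolerance and iteration cap are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def orthonormal(shape):
    # Orthonormal pre-initialization via QR decomposition of a Gaussian matrix.
    q, _ = np.linalg.qr(rng.normal(size=shape))
    return q

def lsuv(weights, x, tol=0.05, max_iter=10):
    # Scale each layer in sequence until its pre-activation variance is ~1.
    for W in weights:
        for _ in range(max_iter):
            var = (x @ W).var()
            if abs(var - 1.0) < tol:
                break
            W /= np.sqrt(var)       # in-place rescale of this layer's weights
        x = np.maximum(0, x @ W)    # ReLU output feeds the next layer
    return weights

weights = [orthonormal((64, 64)) for _ in range(5)]
batch = rng.normal(size=(128, 64))
lsuv(weights, batch)
```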
Outline
- Introduction
- Layer-wise Pre-training & Fine-tuning
- Activation Function
- Initialization Method
- Advanced Layers and Nets
Intermediate Supervised Layers
Attach auxiliary classifiers to intermediate layers (as in GoogLeNet):
overall loss = a * loss0 + b * loss1 + c * loss2
Deeply Supervised Nets (DSN)
Companion objectives supervise the hidden layers directly:
overall loss = a * loss0 + b * loss1 + c * loss2
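A sketch of the weighted multi-loss setup, assuming PyTorch: auxiliary heads on hidden layers each contribute a cross-entropy term to the overall loss. The architecture and the weights a, b, c are placeholders.

```python
import torch
import torch.nn as nn

class DeeplySupervisedMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.block1 = nn.Sequential(nn.Linear(784, 256), nn.ReLU())
        self.block2 = nn.Sequential(nn.Linear(256, 128), nn.ReLU())
        self.head0 = nn.Linear(128, 10)   # final classifier
        self.head1 = nn.Linear(256, 10)   # auxiliary head on block1
        self.head2 = nn.Linear(128, 10)   # auxiliary head on block2

    def forward(self, x):
        h1 = self.block1(x)
        h2 = self.block2(h1)
        return self.head0(h2), self.head1(h1), self.head2(h2)

model = DeeplySupervisedMLP()
loss_fn = nn.CrossEntropyLoss()
x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
out0, out1, out2 = model(x)
a, b, c = 1.0, 0.3, 0.3               # illustrative loss weights
loss = a * loss_fn(out0, y) + b * loss_fn(out1, y) + c * loss_fn(out2, y)
loss.backward()                        # all heads supervise the trunk
```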
Batch Normalization Layer
Normalize each activation over the mini-batch, then scale and shift with learned parameters γ and β:
x̂ = (x - μ_B) / sqrt(σ_B^2 + ε),  y = γ · x̂ + β
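A minimal sketch of the training-time forward pass, assuming NumPy; eps and the toy batch are illustrative, and the running statistics used at inference are omitted.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                     # per-feature batch mean
    var = x.var(axis=0)                     # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * x_hat + beta             # learned scale and shift

x = np.random.default_rng(0).normal(3.0, 2.0, size=(64, 8))
y = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))
```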
Deep Residual Nets
- Is learning better networks as easy as stacking more layers?
- In principle, a deeper model should produce no higher training error than its shallower counterpart: it could copy the shallower model and make the extra layers identity mappings. In practice, plain deeper nets often do worse.
Deep Residual Nets
Residual learning reformulation:
- The desired underlying mapping: H(x)
- Let the stacked nonlinear layers fit another mapping, F(x) := H(x) - x, i.e., the residual
- The original mapping H(x) is recast as F(x) + x
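A sketch of a basic residual block under this F(x) + x reformulation, assuming PyTorch; a fully connected block stands in for the paper's convolutional one.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # Stacked nonlinear layers fit the residual F(x) = H(x) - x.
        self.f = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, dim),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.f(x) + x)   # H(x) = F(x) + x via the shortcut

x = torch.randn(4, 64)
print(ResidualBlock(64)(x).shape)   # torch.Size([4, 64])
```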
Deep Residual Nets
[figures: ResNet architectures and results]
References
- Xavier filler: "Understanding the Difficulty of Training Deep Feedforward Neural Networks"
- PReLU & MSRA filler: "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification"
- RReLU: "Empirical Evaluation of Rectified Activations in Convolutional Network"
- ELU: "Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)"
- LSUV filler: "All You Need Is a Good Init"
- GoogLeNet: "Going Deeper with Convolutions"
- DSN: "Deeply-Supervised Nets"
- BatchNorm: "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift"
- ResNet: "Deep Residual Learning for Image Recognition"
Thanks