On Training of Deep Neural Networks. Lornechen

On Training of Deep Neural Networks. Lornechen, 2016.04.20. 1

Outline: Introduction; Layer-wise Pre-training & Fine-tuning; Activation Function; Initialization Method; Advanced Layers and Nets. 2

Neural Network Components. Layers: input layer + hidden layers + output layer. Neurons: number in each layer and activation function. Objective function: LogLoss, CE, RMSE. Optimization method: SGD, Nesterov, AdaDelta. 3
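
To make these components concrete, here is a minimal sketch in PyTorch (an assumed framework; the slides name none) wiring together layers, activations, a cross-entropy objective, and SGD with Nesterov momentum:

    import torch
    import torch.nn as nn

    # Toy network: input layer -> two hidden layers -> output layer.
    model = nn.Sequential(
        nn.Linear(784, 256),   # input -> hidden 1
        nn.ReLU(),             # activation function of the hidden neurons
        nn.Linear(256, 128),   # hidden 1 -> hidden 2
        nn.ReLU(),
        nn.Linear(128, 10),    # hidden 2 -> output layer (10 classes)
    )

    criterion = nn.CrossEntropyLoss()                  # objective function (CE / LogLoss)
    optimizer = torch.optim.SGD(model.parameters(),    # optimization method
                                lr=0.01, momentum=0.9, nesterov=True)

    x = torch.randn(32, 784)               # a dummy mini-batch
    y = torch.randint(0, 10, (32,))        # dummy class labels
    loss = criterion(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()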

Difficulties of Training DNNs. Availability of data: labeled data is often scarce, so deep networks are prone to overfitting. Highly non-convex objective: full of bad local optima, so plain gradient descent no longer works well. Diffusion of gradients: the gradients that are propagated backwards rapidly diminish in magnitude as the depth of the network increases. 4

Difficulties of Training DNNs. NNs with 2-3 hidden layers were commonly used in the past. However, NNs with tens to hundreds, even thousands, of hidden layers are widely used nowadays. How? 5

Outline: Introduction; Layer-wise Pre-training & Fine-tuning; Activation Function; Initialization Method; Advanced Layers and Nets. 6

Supervised Version. First, train one layer at a time. (Diagram: Fix, Fix, Train.) 7

Supervised Version. First, train one layer at a time. (Diagram: Fix, Train, Fix.) 8

Supervised Version. First, train one layer at a time. (Diagram: Train, Fix, Fix.) 9

Supervised Version. Finally, fine-tune the whole network. (Diagram: forward pass and BP through all layers.) 10
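
A minimal sketch of this greedy supervised scheme, assuming a small fully-connected classifier in PyTorch (layer sizes, the temporary heads, and the train() helper are hypothetical, not the author's code): each new hidden layer is trained with a temporary output head while the earlier layers are fixed, then the whole stack is fine-tuned.

    import torch
    import torch.nn as nn

    def train(modules, params, steps=100):
        # Hypothetical helper: a few SGD steps on `params`, using dummy data.
        opt = torch.optim.SGD(params, lr=0.01)
        net = nn.Sequential(*modules)
        for _ in range(steps):
            x = torch.randn(32, 784)
            y = torch.randint(0, 10, (32,))
            loss = nn.functional.cross_entropy(net(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()

    sizes = [784, 256, 128, 64]
    hidden = []                               # hidden layers pre-trained so far

    # Greedy phase: add one hidden layer at a time; earlier layers stay fixed.
    for i in range(len(sizes) - 1):
        new_layer = nn.Sequential(nn.Linear(sizes[i], sizes[i + 1]), nn.ReLU())
        head = nn.Linear(sizes[i + 1], 10)    # temporary output head for this stage
        for layer in hidden:
            for p in layer.parameters():
                p.requires_grad_(False)       # "Fix" the already-trained layers
        train(hidden + [new_layer, head],
              list(new_layer.parameters()) + list(head.parameters()))
        hidden.append(new_layer)

    # Fine-tuning phase: unfreeze everything and train the whole network end to end.
    final_head = nn.Linear(sizes[-1], 10)
    all_modules = hidden + [final_head]
    all_params = [p for m in all_modules for p in m.parameters()]
    for p in all_params:
        p.requires_grad_(True)
    train(all_modules, all_params)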

Unsupervised Version 11

The Reasoning. Pre-training offers a better local optimum as a starting point compared to random initialization. Fine-tuning subsequently adjusts the weights according to the task at hand (guided by the objective function). 12

However, layer-wise pre-training is rarely used nowadays: it requires more training time and may lead to a poorer local optimum. Instead, just train the network as a whole from scratch. 13

Outline: Introduction; Layer-wise Pre-training & Fine-tuning; Activation Function; Initialization Method; Advanced Layers and Nets. 14

Traditional Activation Functions. Sigmoid: f(x) = 1 / (1 + e^(-x)); saturates and kills gradients. Tanh: f(x) = (e^x - e^(-x)) / (e^x + e^(-x)); similar to sigmoid. 15

Recap of Back Propagation (BP). ∂J(W)/∂W_ij^(l) = δ_i^(l+1) · f(z_j^(l)), and δ_i^(l) = (Σ_j W_ji^(l) · δ_j^(l+1)) · f'(z_i^(l)), where δ_i^(l) is the error of the l-th layer, W_ji^(l) is the weight between the l-th and (l+1)-th layers, f(z_j^(l)) is the activation of the l-th layer, and f'(z_j^(l)) is its derivative. 16
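
A small NumPy sketch of the two formulas for a toy fully-connected net with sigmoid activations (shapes and values are illustrative; for the sigmoid, f'(z) = f(z)(1 - f(z))):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    sizes = [4, 5, 3, 2]                      # widths of the 4 layers
    W = [rng.normal(0, 0.5, (sizes[l + 1], sizes[l])) for l in range(3)]

    # Forward pass: keep the activation a of every layer (a[0] is the input).
    a = [rng.normal(size=(4,))]
    for l in range(3):
        a.append(sigmoid(W[l] @ a[l]))

    # Backward pass: delta^(l) = (W^(l)^T delta^(l+1)) * f'(z^(l)), with f'(z) = a (1 - a).
    target = np.array([1.0, 0.0])
    delta = [None] * 4
    delta[3] = (a[3] - target) * a[3] * (1 - a[3])          # output-layer error
    for l in (2, 1):
        delta[l] = (W[l].T @ delta[l + 1]) * a[l] * (1 - a[l])

    # Gradients: dJ/dW_ij^(l) = delta_i^(l+1) * f(z_j^(l)), i.e. outer(delta^(l+1), a^(l)).
    grads = [np.outer(delta[l + 1], a[l]) for l in range(3)]
    print([g.shape for g in grads])           # matches the shapes of W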

Traditional Activation Functions. Since f'(z_j^(l)) < 1 everywhere (the function is contractive), gradients diminish rapidly when propagated backwards. 17
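
For the sigmoid the derivative is at most 0.25, so each additional layer multiplies the backpropagated error by at most that factor; a rough upper-bound calculation (ignoring the weights, an assumption made purely for illustration):

    # Each layer multiplies the backpropagated error by f'(z) <= 0.25 for the sigmoid.
    max_sigmoid_grad = 0.25
    for depth in (2, 5, 10, 20):
        print(depth, max_sigmoid_grad ** depth)
    # 2 -> 0.0625, 5 -> ~9.8e-4, 10 -> ~9.5e-7, 20 -> ~9.1e-13: the signal all but vanishes.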

Rectified Linear Unit (ReLU). f(x) = max(0, x). Pros: faster convergence, minimal computation cost. Cons: "dead" units. 18

ReLU Family. Leaky ReLU: f(x) = max(0, x) + a · min(0, x), where a is fixed at a (very) small value, e.g., 0.01 or 0.3. Parametric ReLU (PReLU): same form as Leaky ReLU, but a is learnable. Randomized Leaky ReLU (RReLU): same form as Leaky ReLU, but a is sampled from a uniform distribution U(l, u) during training and fixed at (l+u)/2 at prediction. 19
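
A small NumPy sketch of these variants (the slope values and sampling range are illustrative defaults, not prescribed by the slides):

    import numpy as np

    def leaky_relu(x, a=0.01):
        # f(x) = max(0, x) + a * min(0, x), with a fixed at a small value.
        return np.maximum(0, x) + a * np.minimum(0, x)

    def prelu(x, a):
        # Same form as Leaky ReLU, but `a` is a learned parameter (passed in here).
        return np.maximum(0, x) + a * np.minimum(0, x)

    def rrelu(x, lo=1/8, hi=1/3, training=True, rng=None):
        # Slope sampled from U(lo, hi) during training, fixed at (lo + hi) / 2 at prediction.
        rng = rng or np.random.default_rng()
        a = rng.uniform(lo, hi) if training else (lo + hi) / 2
        return np.maximum(0, x) + a * np.minimum(0, x)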

ReLU Family 20

Exponential Linear Unit (ELU). f(x) = max(0, x) + a · min(0, exp(x) - 1), where a controls the value to which an ELU saturates for negative net inputs. 21
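
A matching NumPy sketch of the ELU (assuming a > 0; written with np.where to avoid overflow for large positive inputs):

    import numpy as np

    def elu(x, a=1.0):
        # f(x) = x for x > 0 and a * (exp(x) - 1) otherwise; saturates to -a for very negative x.
        return np.where(x > 0, x, a * (np.exp(np.minimum(x, 0)) - 1))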

What Neuron Type Should I Use? Use ReLU; be careful with your learning rates and possibly monitor the fraction of "dead" units in the network. If this concerns you, give Leaky ReLU or Maxout a try. Never use sigmoid. Try tanh, but expect it to work worse than ReLU or Maxout. 22

Outline: Introduction; Layer-wise Pre-training & Fine-tuning; Activation Function; Initialization Method; Advanced Layers and Nets. 23

Traditional Fillers. Uniform filler: U(-w, w). Gaussian filler: N(0, std), e.g., std = 0.01; difficult to converge for #layers > 8. 24

Traditional Fillers. Xavier filler: investigates the variance of responses in each layer, assuming a linear activation function; N(0, std), where std = sqrt(1/n) and n = (fan-in + fan-out) / 2; difficult to converge for #layers > 30 (training stalls). 25
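
A NumPy sketch of these fillers (layer shapes and the uniform bound w are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)

    def uniform_fill(fan_out, fan_in, w=0.05):
        return rng.uniform(-w, w, size=(fan_out, fan_in))

    def gaussian_fill(fan_out, fan_in, std=0.01):
        return rng.normal(0.0, std, size=(fan_out, fan_in))

    def xavier_fill(fan_out, fan_in):
        # std = sqrt(1/n) with n = (fan_in + fan_out) / 2, derived for linear activations.
        n = (fan_in + fan_out) / 2
        return rng.normal(0.0, np.sqrt(1.0 / n), size=(fan_out, fan_in))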

MSRA Filler. Problem of the Xavier filler: it assumes a linear activation function, which is invalid for ReLU and PReLU. The newly derived filler uses std = sqrt(2 / ((1 + a^2) · fan-in)): with a = 0 it becomes the ReLU case, and with a = 1 it becomes the linear case (the same as Xavier). 26
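
A sketch of the MSRA (He) filler with the (P)ReLU slope a as a parameter; a = 0 gives the ReLU case:

    import numpy as np

    rng = np.random.default_rng(0)

    def msra_fill(fan_out, fan_in, a=0.0):
        # std = sqrt(2 / ((1 + a^2) * fan_in)); a is the negative slope of the (P)ReLU.
        # a = 0 gives the ReLU case; a = 1 recovers the linear (Xavier-like) case.
        std = np.sqrt(2.0 / ((1 + a ** 2) * fan_in))
        return rng.normal(0.0, std, size=(fan_out, fan_in))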

MSRA Filler vs. Xavier Filler. With L layers, the forward signal under Xavier initialization is scaled by sqrt(1/2^L) relative to MSRA initialization, so the signal keeps diminishing; this is why Xavier has difficulty converging for #layers > 30. 27
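
A quick NumPy experiment that illustrates the claim (a 30-layer ReLU MLP with illustrative widths; not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)

    def output_std(weight_std, width=512, depth=30, batch=256):
        # Push a random mini-batch through `depth` Linear+ReLU layers and report the std.
        x = rng.normal(size=(batch, width))
        for _ in range(depth):
            W = rng.normal(0.0, weight_std(width), size=(width, width))
            x = np.maximum(0, x @ W)
        return x.std()

    xavier = output_std(lambda n: np.sqrt(1.0 / n))   # here fan_in = fan_out = n
    msra = output_std(lambda n: np.sqrt(2.0 / n))
    print(xavier, msra, xavier / msra)
    # With 30 ReLU layers the Xavier output is roughly sqrt(1/2^30) ~ 3e-5 times the MSRA output.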

LSUV Filler (Layer-Sequential Unit-Variance initialization). 28
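
The slide gives only the name; as a reminder of the idea from the referenced paper (All You Need Is A Good Init), LSUV pre-initializes each weight matrix orthonormally and then iteratively rescales it so the layer's output has unit variance on a mini-batch. A rough NumPy sketch under those assumptions (square weight matrices and ReLU activations for simplicity):

    import numpy as np

    rng = np.random.default_rng(0)

    def orthonormal(n):
        # Orthonormal pre-initialization via QR decomposition (square matrices for simplicity).
        q, _ = np.linalg.qr(rng.normal(size=(n, n)))
        return q

    def lsuv_init(weights, x, tol=0.05, max_iter=10):
        # weights: list of square (n, n) matrices; x: a mini-batch of shape (batch, n).
        for W in weights:
            for _ in range(max_iter):
                pre = x @ W.T                    # layer pre-activations on the mini-batch
                if abs(pre.var() - 1.0) < tol:
                    break
                W /= np.sqrt(pre.var())          # rescale toward unit output variance
            x = np.maximum(0, x @ W.T)           # pass through ReLU to the next layer
        return weights

    weights = [orthonormal(128) for _ in range(3)]
    lsuv_init(weights, rng.normal(size=(64, 128)))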

Outline: Introduction; Layer-wise Pre-training & Fine-tuning; Activation Function; Initialization Method; Advanced Layers and Nets. 29

Intermediate Supervised Layers. Overall loss = a * loss0 + b * loss1 + c * loss2. 30

Deeply Supervised Nets. Overall loss = a * loss0 + b * loss1 + c * loss2. 31
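
Both slides reduce to attaching extra classifiers to intermediate layers and summing their weighted losses. A minimal PyTorch sketch of that idea (layer sizes, the weights a, b, c, and head placement are illustrative):

    import torch
    import torch.nn as nn

    class DeeplySupervisedMLP(nn.Module):
        def __init__(self):
            super().__init__()
            self.block1 = nn.Sequential(nn.Linear(784, 256), nn.ReLU())
            self.block2 = nn.Sequential(nn.Linear(256, 128), nn.ReLU())
            self.block3 = nn.Sequential(nn.Linear(128, 64), nn.ReLU())
            self.head0 = nn.Linear(64, 10)    # final classifier        -> loss0
            self.head1 = nn.Linear(256, 10)   # auxiliary head, block 1 -> loss1
            self.head2 = nn.Linear(128, 10)   # auxiliary head, block 2 -> loss2

        def forward(self, x):
            h1 = self.block1(x)
            h2 = self.block2(h1)
            h3 = self.block3(h2)
            return self.head0(h3), self.head1(h1), self.head2(h2)

    model = DeeplySupervisedMLP()
    ce = nn.CrossEntropyLoss()
    x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
    out0, out1, out2 = model(x)
    a, b, c = 1.0, 0.3, 0.3                                      # loss weights
    loss = a * ce(out0, y) + b * ce(out1, y) + c * ce(out2, y)   # overall loss
    loss.backward()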

Batch Normalization Layer 32

Batch Normalization Layer 33
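
The two slides present the batch-norm transform itself; for reference, a NumPy sketch of the training-time forward pass (per-feature statistics over the mini-batch, with learned scale gamma and shift beta; at test time, running averages replace the batch statistics):

    import numpy as np

    def batch_norm_forward(x, gamma, beta, eps=1e-5):
        # x: (batch, features). Normalize each feature over the batch, then scale and shift.
        mean = x.mean(axis=0)
        var = x.var(axis=0)
        x_hat = (x - mean) / np.sqrt(var + eps)
        return gamma * x_hat + beta

    x = np.random.randn(32, 8)
    out = batch_norm_forward(x, gamma=np.ones(8), beta=np.zeros(8))
    print(out.mean(axis=0).round(6), out.std(axis=0).round(3))   # ~0 mean and ~1 std per feature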

Deep Residual Nets. Is learning better networks as easy as stacking more layers? In principle, a deeper model should produce no higher training error than its shallower counterpart, since the extra layers could simply learn the identity mapping. 34

Deep Residual Nets. Residual learning reformulation: the desired underlying mapping is H(x). Let the stacked nonlinear layers fit another mapping F(x) := H(x) - x, i.e., the residual. The original mapping H(x) is recast as F(x) + x. 35
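
A minimal PyTorch sketch of a residual block built from fully-connected layers (the paper uses convolutional layers; the dimension here is illustrative):

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        def __init__(self, dim=64):
            super().__init__()
            # F(x): the stacked nonlinear layers fit the residual H(x) - x.
            self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
            self.relu = nn.ReLU()

        def forward(self, x):
            return self.relu(self.f(x) + x)      # H(x) = F(x) + x via the identity shortcut

    block = ResidualBlock()
    print(block(torch.randn(4, 64)).shape)       # torch.Size([4, 64])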

Deep Residual Nets 36

Deep Residual Nets 37

References.
Xavier Filler: Understanding the Difficulty of Training Deep Feedforward Neural Networks.
PReLU & MSRA Filler: Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification.
RReLU: Empirical Evaluation of Rectified Activations in Convolutional Network.
ELU: Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs).
LSUV Filler: All You Need Is a Good Init.
GoogLeNet: Going Deeper with Convolutions.
DSN: Deeply-Supervised Nets.
BatchNorm: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.
ResNet: Deep Residual Learning for Image Recognition. 38

Thanks 39