COMP9444 Neural Networks and Deep Learning 5. Convolutional Networks


Textbook: Sections 6.2.2, 6.3, 7.9, 7.11-7.13, 9.1-9.5

COMP9444 17s2 Convolutional Networks

Outline
- Geometry of Hidden Unit Activations
- Limitations of 2-layer networks
- Alternative transfer functions (6.3)
- Convolutional Networks (9.1-9.5)
- Dropout

Encoder Networks

Inputs   Outputs
10000    10000
01000    01000
00100    00100
00010    00010
00001    00001

- identity mapping through a bottleneck
- also called the N-M-N task
- used to investigate hidden unit representations

N-2-N Encoder
Hidden unit (HU) space: [figure]

8-3-8 Encoder
Exercise: Draw the hidden unit space for 2-2-2, 3-2-3, 4-2-4 and 5-2-5 encoders. Represent the input-to-hidden weights for each input unit by a point, and the hidden-to-output weights for each output unit by a line.
Now consider the 8-3-8 encoder with its 3-dimensional hidden unit space. What shape would be formed by the 8 points representing the input-to-hidden weights for the 8 input units? What shape would be formed by the planes representing the hidden-to-output weights for each output unit?
Hint: think of two Platonic solids which are dual to each other.

Hinton Diagrams
[Figure: network with a 30x32 Sensor Input Retina, 4 Hidden Units and 30 Output Units labelled from Sharp Left through Straight Ahead to Sharp Right]
- used to visualize higher dimensions
- white = positive, black = negative

Learning Face Direction


Symmetries
- swap any pair of hidden nodes; the overall function will be the same
- on any hidden node, reverse the sign of all incoming and outgoing weights (assuming a transfer function that is symmetric about the origin, such as tanh); the overall function will again be the same
- hidden nodes with identical input-to-hidden weights would, in theory, never separate; so they all have to begin with different (small) random weights
- in practice, all hidden nodes try to do a similar job at first, then gradually specialize
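
The first two symmetries can be checked numerically. The sketch below is only an illustration: the weights, the input and the helper `forward` are made up for the example, and it assumes tanh hidden units with linear outputs.

```python
import math

def forward(x, W1, b1, W2, b2):
    """Two-layer network: tanh hidden units followed by linear output units."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(W2, b2)]

# Hypothetical weights: 2 inputs, 3 hidden units, 1 output.
W1 = [[0.5, -1.0], [1.5, 0.3], [-0.7, 0.8]]
b1 = [0.1, -0.2, 0.05]
W2 = [[1.0, -2.0, 0.5]]
b2 = [0.3]
x = [0.4, -0.9]

# Symmetry 1: swap hidden nodes 0 and 1
# (rows of W1, entries of b1, and the matching columns of W2).
W1_swap = [W1[1], W1[0], W1[2]]
b1_swap = [b1[1], b1[0], b1[2]]
W2_swap = [[row[1], row[0], row[2]] for row in W2]

# Symmetry 2: reverse the sign of every weight into and out of hidden node 2
# (the bias counts as an incoming weight).
W1_flip = [W1[0], W1[1], [-w for w in W1[2]]]
b1_flip = [b1[0], b1[1], -b1[2]]
W2_flip = [[row[0], row[1], -row[2]] for row in W2]

y = forward(x, W1, b1, W2, b2)
y_swap = forward(x, W1_swap, b1_swap, W2_swap, b2)
y_flip = forward(x, W1_flip, b1_flip, W2_flip, b2)
```

All three outputs agree up to floating-point rounding. The sign flip works here because tanh(-z) = -tanh(z).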

Controlled Nonlinearity
- for small weights, each layer implements an approximately linear function, so multiple layers also implement an approximately linear function.
- for large weights, the transfer function approximates a step function, so computation becomes digital and learning becomes very slow.
- with typical weight values, a two-layer neural network implements a function which is close to linear, but takes advantage of a limited degree of nonlinearity.

Limitations of Two-Layer Neural Networks
Some functions cannot be learned with a 2-layer sigmoidal network.
[Figure: Twin Spirals problem, both axes running from -6 to 6]
For example, this Twin Spirals problem cannot be learned with a 2-layer network, but it can be learned using a 3-layer network if we include shortcut connections between non-consecutive layers.

Adding Hidden Layers
- Twin Spirals can be learned by a 3-layer network with shortcut connections
- first hidden layer learns linearly separable features
- second hidden layer learns convex features
- output layer combines these to produce concave features
- training the 3-layer network is delicate
- learning rate and initial weight values must be very small
- otherwise, the network will converge to a local optimum

Vanishing / Exploding Gradients
Training by backpropagation in networks with many layers is difficult.
When the weights are small, the differentials become smaller and smaller as we backpropagate through the layers, and end up having no effect.
When the weights are large, the activations in the higher layers will saturate to extreme values. As a result, the gradients at those layers will become very small, and will not be propagated to the earlier layers.
When the weights have intermediate values, the differentials will sometimes get multiplied many times in places where the transfer function is steep, causing them to blow up to large values.
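
The first two cases can be illustrated with a deliberately simplified toy model: a chain of one-unit sigmoid layers that all share the same scalar weight. The function `chain_gradient` and the chosen weights and depth are assumptions made for this sketch, not part of the lecture.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def chain_gradient(weight, depth, x=0.5):
    """Return d(output)/d(input) for `depth` stacked one-unit sigmoid layers
    sharing one scalar weight (a toy model, not a realistic network)."""
    activation, grad = x, 1.0
    for _ in range(depth):
        activation = sigmoid(weight * activation)
        # Chain rule: d sigmoid(w * a) / da = w * s * (1 - s)
        grad *= weight * activation * (1.0 - activation)
    return grad

small = chain_gradient(weight=0.1, depth=20)   # small weights: each factor is at most 0.1 * 0.25
large = chain_gradient(weight=20.0, depth=20)  # large weights: units saturate, so s * (1 - s) is tiny
```

In both cases the gradient reaching the input is many orders of magnitude smaller than 1, for the two different reasons described above.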

Solutions to Vanishing Gradients
- layerwise unsupervised pre-training
- long short term memory (LSTM)
- new activation functions
LSTM is specifically for recurrent neural networks. We will discuss unsupervised pre-training and LSTM later in the course.

Activation Functions (6.3)
[Figure: four plots with x from -4 to 4 and y from -2 to 4: Sigmoid, Rectified Linear Unit (ReLU), Hyperbolic Tangent, Scaled Exponential Linear Unit (SELU)]

Activation Functions (6.3)
- Sigmoid and hyperbolic tangent were traditionally used for 2-layer networks, but suffer from the vanishing gradient problem in deeper networks.
- Rectified Linear Units (ReLUs) are popular for deep networks, including convolutional networks. Gradients don't vanish for units with positive input. But their highly linear nature may cause other problems.
- Scaled Exponential Linear Units (SELUs) are a recent innovation which seems to work well for very deep networks.
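
As a minimal sketch, the four functions can be written out directly. The two SELU constants are the standard published values; the function names and the derivative helpers are choices made for this example.

```python
import math

# Standard SELU constants (alpha and scale).
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    return math.tanh(z)

def relu(z):
    return max(0.0, z)

def selu(z):
    # Scaled identity for z > 0, scaled exponential for z <= 0.
    return SELU_SCALE * (z if z > 0 else SELU_ALPHA * (math.exp(z) - 1.0))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # never exceeds 0.25 and shrinks quickly for large |z|

def d_relu(z):
    return 1.0 if z > 0 else 0.0   # constant slope 1 wherever the unit is active
```

Comparing `d_sigmoid(10.0)` with `d_relu(10.0)` shows why saturating functions pass back very little gradient while an active ReLU passes it through unchanged.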

Convolutional Networks
Suppose we want to classify an image as a bird, sunset, dog, cat, etc.
If we can identify features such as feather, eye, or beak which provide useful information in one part of the image, then those features are likely to also be relevant in another part of the image.
We can exploit this regularity by using a convolution layer which applies the same weights to different parts of the image.

Hubel and Wiesel: Visual Cortex
- cells in the visual cortex respond to lines at different angles
- cells in V2 respond to more sophisticated visual features
- Convolutional Neural Networks are inspired by this neuroanatomy
- CNNs can now be simulated with massive parallelism, using GPUs

Convolutional Network Components
- convolution layers: extract shift-invariant features from the previous layer
- subsampling or pooling layers: combine the activations of multiple units from the previous layer into one unit
- fully connected layers: collect spatially diffuse information
- output layer: choose between classes

MNIST Handwritten Digit Examples

CIFAR Image Examples

Convolutional Network Architecture
There can be multiple steps of convolution followed by pooling, before reaching the fully connected layers.
Note how pooling reduces the size of the feature map (usually, by half in each direction).

Softmax (6.2.2)
Consider a classification task with $N$ classes, and assume $z_j$ is the output of the unit corresponding to class $j$.
We assume the network's estimate of the probability of each class $j$ is proportional to $\exp(z_j)$. Because the probabilities must add up to 1, we need to normalize by dividing by their sum:

$$\mathrm{Prob}(i) = \frac{\exp(z_i)}{\sum_{j=1}^{N} \exp(z_j)}$$

$$\log \mathrm{Prob}(i) = z_i - \log \sum_{j=1}^{N} \exp(z_j)$$

If the correct class is $i$, we can treat $\log \mathrm{Prob}(i)$ as the quantity to maximize (equivalently, $-\log \mathrm{Prob}(i)$ as the cost function to minimize). The first term pushes up the correct class $i$, while the second term mainly pushes down the incorrect class $j$ with the highest activation (if $j \neq i$).
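
A minimal sketch of these formulas follows. Subtracting the maximum before exponentiating is a common numerical safeguard added here, not something stated on the slide, and the function names and example values are illustrative.

```python
import math

def log_softmax(z):
    """Return log Prob(j) = z_j - log(sum_k exp(z_k)) for every class j."""
    m = max(z)  # subtracting the max avoids overflow and leaves the result unchanged
    log_sum = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - log_sum for v in z]

def softmax(z):
    return [math.exp(lp) for lp in log_softmax(z)]

def grad_log_prob(z, i):
    """Gradient of log Prob(i) with respect to each z_j: (1 if j == i else 0) - Prob(j)."""
    return [(1.0 if j == i else 0.0) - p for j, p in enumerate(softmax(z))]

z = [2.0, 1.0, 0.1]           # hypothetical unit outputs
probs = softmax(z)
grad = grad_log_prob(z, i=0)  # suppose class 0 is the correct class
```

The gradient is positive for the correct class and negative for the others, with the largest push down on the incorrect class that has the highest activation, which matches the description above.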

Convolution Operator
Continuous convolution:

$$s(t) = (x * w)(t) = \int x(a)\, w(t-a)\, da$$

Discrete convolution:

$$s(t) = (x * w)(t) = \sum_{a=-\infty}^{\infty} x(a)\, w(t-a)$$

Two-dimensional convolution:

$$S(j,k) = (K * I)(j,k) = \sum_{m} \sum_{n} K(m,n)\, I(j+m,\, k+n)$$

Note: Theoreticians sometimes write $I(j-m,\, k-n)$ so that the operator is commutative. But, computationally, it is easier to write it with a plus sign.
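
The two-dimensional form with the plus sign can be sketched in a few lines of plain Python. The function name `conv2d` and the small example arrays are made up for illustration, and the sums run only over positions where the kernel fits entirely inside the image.

```python
def conv2d(image, kernel):
    """S(j, k) = sum over m, n of K(m, n) * I(j + m, k + n), single channel."""
    rows, cols = len(image), len(image[0])
    M, N = len(kernel), len(kernel[0])
    return [[sum(kernel[m][n] * image[j + m][k + n]
                 for m in range(M) for n in range(N))
             for k in range(cols + 1 - N)]
            for j in range(rows + 1 - M)]

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]   # adds each pixel to its lower-right diagonal neighbour
S = conv2d(image, kernel)   # [[6, 8], [12, 14]]
```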

Convolutional Neural Networks
[Figure: input array with axes j, k, l]
Assume the original image is $J \times K$, with $L$ channels.
We apply an $M \times N$ filter to these inputs to compute one hidden unit in the convolution layer. In this example $J = 6$, $K = 7$, $L = 3$, $M = 3$, $N = 3$.

Convolutional Neural Networks
[Figure: input array with axes labelled j, j+m and k, k+n]

$$Z^{i}_{j,k} = g\Big( b^{i} + \sum_{l} \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} K^{i}_{l,m,n}\, V^{l}_{j+m,\,k+n} \Big)$$

The same weights are applied to the next $M \times N$ block of inputs, to compute the next hidden unit in the convolution layer ("weight sharing").

Convolutional Neural Networks
If the original image size is $J \times K$ and the filter is size $M \times N$, the convolution layer will be $(J+1-M) \times (K+1-N)$.
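
The layer computation and the size formula can be checked together on the example dimensions J = 6, K = 7, L = 3, M = 3, N = 3. This is a sketch only: `conv_layer`, the all-ones input, the constant filter and the choice of tanh for g are assumptions made for the illustration.

```python
import math

def conv_layer(V, filt, bias, g=math.tanh):
    """One feature map: Z[j][k] = g(bias + sum over l, m, n of filt[l][m][n] * V[l][j+m][k+n]).
    V is indexed [channel][row][column]; the same filter is reused at every (j, k)."""
    L, J, K = len(V), len(V[0]), len(V[0][0])
    M, N = len(filt[0]), len(filt[0][0])
    return [[g(bias + sum(filt[l][m][n] * V[l][j + m][k + n]
                          for l in range(L) for m in range(M) for n in range(N)))
             for k in range(K + 1 - N)]
            for j in range(J + 1 - M)]

J, K, L, M, N = 6, 7, 3, 3, 3
V = [[[1.0] * K for _ in range(J)] for _ in range(L)]      # hypothetical all-ones input
filt = [[[0.1] * N for _ in range(M)] for _ in range(L)]   # hypothetical constant filter
Z = conv_layer(V, filt, bias=0.0)
```

The result has 6 + 1 - 3 = 4 rows and 7 + 1 - 3 = 5 columns, and every entry is tanh of the sum of 27 products of 0.1 and 1.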

Convolution with Zero Padding
Sometimes, we treat the off-edge inputs as zero (or some other value). This is known as Zero-Padding.
With Zero Padding, the convolution layer is the same size as the original image (or the previous layer).
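
A minimal sketch of zero padding, assuming an odd filter size M so that padding (M - 1) / 2 cells on every side preserves the image size. The helpers `pad2d` and `conv2d` and the 5 x 5 test image are made up for the example.

```python
def pad2d(image, pad, value=0.0):
    """Surround `image` (a list of rows) with `pad` rows and columns of `value`."""
    width = len(image[0]) + 2 * pad
    top = [[value] * width for _ in range(pad)]
    middle = [[value] * pad + list(row) + [value] * pad for row in image]
    bottom = [[value] * width for _ in range(pad)]
    return top + middle + bottom

def conv2d(image, kernel):
    """Valid convolution (plus-sign form): output is (J + 1 - M) x (K + 1 - N)."""
    rows, cols = len(image), len(image[0])
    M, N = len(kernel), len(kernel[0])
    return [[sum(kernel[m][n] * image[j + m][k + n]
                 for m in range(M) for n in range(N))
             for k in range(cols + 1 - N)]
            for j in range(rows + 1 - M)]

image = [[float(5 * r + c) for c in range(5)] for r in range(5)]   # 5 x 5 test image
identity = [[0.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0]]                                      # 3 x 3 filter that copies the centre
unpadded = conv2d(image, identity)                  # 3 x 3
padded = conv2d(pad2d(image, pad=1), identity)      # 5 x 5, same size as the input
```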

Pooling Layers
Pooling layers compute either the average, median or, more commonly, the maximum of the values in a neighborhood.
If the convolution layer detects a particular feature, then the pooling layer will indicate whether that feature is present in a slightly larger neighborhood.

Max Pooling
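
As a sketch, max pooling over non-overlapping 2 x 2 neighborhoods can be written as follows. The window size, the function name `max_pool` and the example feature map are assumptions for illustration.

```python
def max_pool(feature_map, size=2):
    """Take the maximum over each non-overlapping size x size block."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [[max(feature_map[r + dr][c + dc]
                 for dr in range(size) for dc in range(size))
             for c in range(0, cols - size + 1, size)]
            for r in range(0, rows - size + 1, size)]

feature_map = [[1, 2, 3, 4],
               [5, 6, 7, 8],
               [9, 10, 11, 12],
               [13, 14, 15, 16]]
pooled = max_pool(feature_map)   # [[6, 8], [14, 16]]
```

The 4 x 4 map is reduced to 2 x 2, halving the size in each direction.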

Example: LeNet trained on MNIST
The 5 × 5 window of the first convolution layer extracts from the original 32 × 32 image a 28 × 28 array of features. Subsampling then halves this size to 14 × 14. The second convolution layer uses another 5 × 5 window to extract a 10 × 10 array of features, which the second subsampling layer reduces to 5 × 5. These activations then pass through two fully connected layers into the 10 output units corresponding to the digits 0 to 9.
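
The size arithmetic in this example can be reproduced with two one-line helpers. The names `conv_size` and `pool_size` are made up for this sketch, which assumes unpadded convolution and subsampling by a factor of 2.

```python
def conv_size(size, window):
    """Side length after an unpadded convolution: size + 1 - window."""
    return size + 1 - window

def pool_size(size, factor=2):
    """Side length after subsampling by `factor`."""
    return size // factor

sizes = [32]
sizes.append(conv_size(sizes[-1], 5))   # first convolution
sizes.append(pool_size(sizes[-1]))      # first subsampling
sizes.append(conv_size(sizes[-1], 5))   # second convolution
sizes.append(pool_size(sizes[-1]))      # second subsampling
# sizes is now [32, 28, 14, 10, 5]
```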

Convolutional Filters
[Figure panels: First Layer, Second Layer, Third Layer]

Dropout (7.12)
Nodes are randomly chosen to not be used, with some fixed probability (usually, one half).
When training is finished and the network is deployed, all nodes are used, but their activations are multiplied by the probability that a node was kept during training (this equals the dropout probability in the usual case of one half). Thus, the activation received by each unit is the average value of what it would have received during training.
Dropout forces the network to achieve redundancy because it must deal with situations where some features are missing.
Another way to view dropout is that it implicitly (and efficiently) simulates an ensemble of different architectures.
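
A minimal sketch of the two phases follows. It scales test-time activations by the keep probability 1 - p, which is the same number as the dropout probability p when p is one half. The function names, the seed and the example activations are assumptions for illustration.

```python
import random

def dropout_train(activations, p_drop, rng):
    """Training: each node is zeroed independently with probability p_drop."""
    return [0.0 if rng.random() < p_drop else a for a in activations]

def dropout_test(activations, p_drop):
    """Deployment: keep every node, scaled by the keep probability 1 - p_drop."""
    keep = 1.0 - p_drop
    return [keep * a for a in activations]

rng = random.Random(0)       # fixed seed so the sketch is repeatable
activations = [1.0, 2.0]     # hypothetical hidden-unit activations
trials = 20000

# Average what downstream units would receive over many random training masks.
totals = [0.0, 0.0]
for _ in range(trials):
    for idx, value in enumerate(dropout_train(activations, 0.5, rng)):
        totals[idx] += value
train_mean = [t / trials for t in totals]

test_values = dropout_test(activations, 0.5)   # [0.5, 1.0]
```

The training-time average comes out close to the deployed values, which is the sense in which the scaled network reproduces the average activation seen during training.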

Ensembling
Ensembling is a method where a number of different classifiers are trained on the same task, and the final class is decided by voting among them.
In order to benefit from ensembling, we need to have diversity in the different classifiers.
For example, we could train three neural networks with different architectures, three Support Vector Machines with different dimensions and kernels, as well as two other classifiers, and ensemble all of them to produce a final result. (Kaggle Competition entries are often done in this way.)
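
Voting itself is simple to sketch. The function `majority_vote` and the example predictions below are hypothetical; how ties are broken is left to `Counter.most_common` and is not specified by the slide.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the class label predicted by the largest number of classifiers."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical predictions from eight different classifiers for one image.
predictions = ["cat", "dog", "cat", "bird", "cat", "dog", "cat", "dog"]
final_class = majority_vote(predictions)   # "cat" (4 of 8 votes)
```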

Bagging
Diversity can also be achieved by training on different subsets of data.
Suppose we are given N training items. Each time we train a new classifier, we choose N items from the training set with replacement. This means that some items will not be chosen, while others are chosen two or three times.
There will be diversity among the resulting classifiers because they have each been trained on a different subset of data. They can be ensembled to produce a more accurate result than a single classifier.
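
The sampling step can be sketched as below. The function `bootstrap_sample`, the seed and the choice of N = 1000 are assumptions for illustration. For large N, roughly 63% of the distinct items appear in each sample, because each item is missed with probability (1 - 1/N)^N, which is close to 1/e.

```python
import random

def bootstrap_sample(items, rng):
    """Draw len(items) items uniformly at random, with replacement."""
    return [rng.choice(items) for _ in items]

rng = random.Random(0)              # fixed seed so the sketch is repeatable
training_set = list(range(1000))    # stand-in for N = 1000 training items
sample = bootstrap_sample(training_set, rng)

unique_fraction = len(set(sample)) / len(training_set)
```

Each call produces a different multiset of training items, so classifiers trained on separate samples see different data.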

Dropout as an Implicit Ensemble
In the case of dropout, the same data are used each time but a different architecture is created by removing the nodes that are dropped.
The trick of multiplying the output of each node by the probability that it was kept during training (equal to the dropout probability when that is one half) implicitly averages the output over all of these different models.