Elad Hoffer*, Itay Hubara*, Daniel Soudry



Train longer, generalize better: closing the generalization gap in large batch training of neural networks
Elad Hoffer*, Itay Hubara*, Daniel Soudry (*equal contribution)

Better models: parallelization is crucial
Model parallelism: split the model (same data)
Data parallelism: split the data (same model)
AlexNet [Krizhevsky et al. 2012]: model split across two GPUs
SGD: weight update proportional to gradients averaged over the mini-batch
Can we increase the batch size and improve parallelization?
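The data-parallel SGD update described above can be sketched as follows. This is a minimal NumPy sketch, not the talk's implementation: the linear model, the loss, and the equal-sized worker shards are illustrative assumptions.

```python
import numpy as np

def grad(w, X, y):
    # Gradient of the mean squared error 0.5*||Xw - y||^2 / n for a linear model.
    return X.T @ (X @ w - y) / len(y)

def data_parallel_sgd_step(w, X, y, lr, n_workers):
    # Data parallelism: split the mini-batch across workers (same model),
    # compute per-worker gradients, then average them. With equal-sized
    # shards this equals one large-batch gradient on the full mini-batch.
    X_shards = np.array_split(X, n_workers)
    y_shards = np.array_split(y, n_workers)
    g = np.mean([grad(w, Xs, ys) for Xs, ys in zip(X_shards, y_shards)], axis=0)
    return w - lr * g

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))
y = X @ np.array([1.0, -2.0, 0.5])
w = np.zeros(3)
for _ in range(200):
    w = data_parallel_sgd_step(w, X, y, lr=0.1, n_workers=4)
print(w)  # close to the true weights [1., -2., 0.5]
```

In a real multi-GPU setup the averaging is an all-reduce across devices rather than a Python loop, but the arithmetic is the same.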

Large batch size hurts generalization?
Dataset: CIFAR10; architecture: ResNet44; training: SGD + momentum (+ gradient clipping); b = batch size
[Plot: error vs. epochs for several batch sizes; the sharp drop marks the learning-rate decrease]
Why? "Generalization gap persisted in models trained without any budget or limits, until the loss function ceased to improve" [Keskar et al. 2017]

Observation: weight distances from initialization increase logarithmically with the number of iterations
[Plot: ||w_t − w_0|| vs. epochs and vs. iterations (gradient updates), for increasing batch sizes]
Why logarithmic behavior? Theory later.
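The curve behind this observation is simple to measure: track the Euclidean distance of the weights from their initial values after every gradient update. A minimal sketch follows; the toy noisy-quadratic gradient is a hypothetical stand-in for a real network's gradients (on real loss surfaces the talk reports logarithmic growth).

```python
import numpy as np

def sgd_with_distance_log(w0, grad_fn, lr, steps, rng):
    # Run plain SGD and record ||w_t - w_0|| after each iteration.
    w = w0.copy()
    dists = []
    for _ in range(steps):
        w -= lr * grad_fn(w, rng)
        dists.append(np.linalg.norm(w - w0))
    return np.array(dists)

rng = np.random.default_rng(0)

def noisy_grad(w, rng):
    # Hypothetical noisy gradient of a quadratic loss centered at the origin.
    return w + 0.1 * rng.normal(size=w.shape)

dists = sgd_with_distance_log(np.ones(10), noisy_grad, lr=0.01, steps=1000, rng=rng)
print(dists[-1] > dists[0])  # distance from initialization grows with iterations
```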

Experimental details
We experiment with various datasets and models.
Optimizing using SGD + momentum + gradient clipping, which usually generalizes better than adaptive methods (e.g., Adam).
Gradient clipping effectively creates a warm-up phase.
Noticeable generalization gap between small and large batches.

Closing the generalization gap (2/4): adapt the learning rate
Idea: mimic small-batch gradient statistics (dataset dependent).
On CIFAR this noticeably improves generalization, but the gap remains.
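One way to mimic small-batch gradient statistics, and the rule the paper proposes, is to scale the learning rate with the square root of the batch-size ratio, so the covariance of the weight-update noise matches the small-batch regime. A minimal sketch (the base learning rate and batch sizes below are hypothetical):

```python
import math

def adapted_lr(base_lr, base_batch, large_batch):
    # Scale the learning rate by sqrt(batch ratio): the variance of the
    # averaged mini-batch gradient shrinks like 1/batch, so sqrt scaling
    # keeps the per-update noise covariance comparable across batch sizes.
    return base_lr * math.sqrt(large_batch / base_batch)

print(adapted_lr(0.1, 128, 2048))  # 0.4
```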

Closing the generalization gap (3/4): Ghost Batch Norm
Idea again: mimic small-batch statistics. Instead of normalizing over the full large batch,
x̂ = (x − μ_x) / σ_x,
split it into k "ghost" batches and normalize each one separately:
x̂_1 = (x_1 − μ_{x_1}) / σ_{x_1}, …, x̂_k = (x_k − μ_{x_k}) / σ_{x_k}
Also: reduces communication bandwidth.
Further improves generalization without incurring overhead.
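Ghost Batch Norm as described above can be sketched in a few lines of NumPy. This omits the learnable scale/shift and running statistics of a full BatchNorm layer, and assumes the batch divides evenly into ghost batches:

```python
import numpy as np

def ghost_batch_norm(x, ghost_size, eps=1e-5):
    # Split the large batch into "ghost" batches and normalize each with its
    # own per-feature mean/std, mimicking small-batch BatchNorm statistics.
    # x: (batch, features); batch must be divisible by ghost_size here.
    b, f = x.shape
    ghosts = x.reshape(b // ghost_size, ghost_size, f)
    mu = ghosts.mean(axis=1, keepdims=True)
    sigma = ghosts.std(axis=1, keepdims=True)
    return ((ghosts - mu) / (sigma + eps)).reshape(b, f)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(1024, 8))
out = ghost_batch_norm(x, ghost_size=128)
# each ghost batch is normalized to ~zero mean and ~unit std
print(np.abs(out[:128].mean(axis=0)).max() < 1e-6)  # True
```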

Graph indicates: not enough iterations?
Using these modifications, the distance from initialization is now better matched.
However, the graph indicates insufficient iterations with the large batch.
[Plot: distance from initialization vs. iterations]

Train longer, generalize better
With sufficient iterations in the plateau region, the generalization gap vanishes.
[Plot: error vs. iterations; the drop marks the learning-rate decrease]

Closing the generalization gap (4/4): regime adaptation
Train so that the number of iterations is fixed for all batch sizes (train longer, i.e., for proportionally more epochs).
Completely closes the generalization gap, e.g., on ImageNet (AlexNet).
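Regime adaptation reduces to a one-line schedule adjustment: since a larger batch means fewer gradient updates per epoch, train for proportionally more epochs so the total number of iterations stays fixed. A minimal sketch (the baseline of 100 epochs at batch 128 is a hypothetical example):

```python
def adapted_epochs(base_epochs, base_batch, large_batch):
    # Keep the number of gradient updates (iterations) fixed across batch
    # sizes: iterations per epoch shrink as the batch grows, so the epoch
    # count must grow by the same batch-size ratio.
    return base_epochs * large_batch // base_batch

# e.g., a 100-epoch small-batch (b=128) schedule becomes, at b=2048:
print(adapted_epochs(100, 128, 2048))  # 1600
```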

Why do weight distances increase logarithmically?
Hypothesis: during the initial high-learning-rate phase, training behaves like a "random walk on a random potential", where the random potential is the loss surface.
Marinari et al. (1983): such walks exhibit ultra-slow diffusion.

Ultra-slow diffusion: basic idea
To travel a given distance, the walker must pass the tallest barrier along the way; the time to pass a barrier grows exponentially with its height, so the distance covered grows only logarithmically with time.

Summary so far
Q: Is there an inherent generalization problem with large batches?
A: Observed: no; just adjust the training regime.
Q: What is the mechanism behind the training dynamics?
A: Hypothesis: a "random walk on a random potential".
Q: Can we reduce the total wall-clock time?
A: Yes, in some models.

Significant speed-ups possible
"Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour", Goyal et al. (Facebook whitepaper, two weeks after us): large-scale experiments (ResNet over ImageNet, 256 GPUs); similar methods, except the learning rate; ~29× faster than a single worker.
More followed:
"Large Batch Training of Convolutional Networks" (You et al.)
"ImageNet Training in Minutes" (You et al.)
"Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes" (Akiba et al.)

Why is overfitting good for generalization?
In contrast to common practice: good generalization results from many gradient updates in an overfitting regime.
[Figure: three panels vs. epochs (1 to 1000, log scale): loss (with an "early stopping?" mark), % misclassified examples (training/validation), and norm of the last layer (training/validation)]

Why is overfitting good for generalization?
This can be shown to happen for logistic regression on separable data.
The explanation there (proven): slow convergence to the max-margin solution.
"The Implicit Bias of Gradient Descent on Separable Data" (arXiv 2017), Daniel Soudry, Elad Hoffer, Nati Srebro
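The phenomenon is easy to reproduce numerically: run plain gradient descent on the logistic loss over linearly separable data, and the loss keeps decreasing while the weight norm keeps growing (roughly like log t), with the normalized direction drifting toward the max-margin separator. A minimal sketch with a hypothetical four-point dataset:

```python
import numpy as np

# Linearly separable toy data (two positives, two negatives).
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = np.zeros(2)
norms = []
for t in range(20000):
    margins = y * (X @ w)
    # Gradient of the mean logistic loss log(1 + exp(-margin)).
    g = -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
    w -= 1.0 * g
    norms.append(np.linalg.norm(w))

print(norms[-1] > norms[len(norms) // 2])  # the norm never stops growing
print(min(y * (X @ (w / norms[-1]))))      # margin of the normalized direction
```

The "overfitting" iterations past zero training error are exactly the ones that keep improving the margin, which is the mechanism the Soudry/Hoffer/Srebro paper proves.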

Thank you for your time! Questions?
Code: https://github.com/eladhoffer/bigbatch