Efficient Deep Model Selection


1 Efficient Deep Model Selection Jose Alvarez Researcher Data61, CSIRO, Australia GTC, May 9th

2

3 conv1 conv2 conv3 conv4 conv5 conv6 conv7 conv8 softmax prediction [? neurons per layer] Num Classes

4 Today's talk 1. It is possible to train more-efficient architectures without compromising accuracy. 2. We can jointly learn the architecture and the parameters, with additional benefits at train time.

5 Efficient Networks Convolutional Neural Networks (ConvNets)

6 Efficient Networks [Chart: number of parameters (in millions) per architecture over the years: LeNet, AlexNet, VGGNet-16 (2014)]

7 Efficient Networks [Chart repeated]

8 Efficient Networks [Chart repeated]

9 Efficient Networks [Chart extended with Residual Nets] More recent architectures? Residual Networks require 2 to 4 Titan-X GPUs (12 GB) at full capacity for training (memory requirements) and about 2 weeks of training time for a deep ResNet (e.g. ResNet-152).

10 Efficient Networks TitanX GPU

11 Efficient Networks Embedded Platforms?

12 Efficient Networks Embedded Platforms?

13 Embedded Platforms Embedded devices with limited resources / power: Jetson TK1 (2014), Jetson TX1.

14 Efficient Networks TRAINING TESTING

15 Efficient Networks

16 Efficient Networks Limited resources / power / time: Jetson TK1, Jetson TX1 (2016). TRAINING TESTING

17 Efficient Networks Larger data (2013), spatio-temporal data / video (2015), hyperspectral images, remote sensing (2016). TRAINING

18 Efficient Networks Larger data (2013), spatio-temporal data / video (2015), hyperspectral images, remote sensing (2016), rapid prototyping. TRAINING

19 Efficient Networks conv1 conv2 conv3 conv4 conv5 conv6 conv7 conv8 softmax prediction Num Classes TRAINING

20 Efficient Networks conv1 conv2 conv3 conv4 conv5 conv6 conv7 conv8 softmax prediction Num Classes TRAINING TEST (additional benefits)

21 Talk Road Map Related work Efficient Networks DecomposeMe Model Selection Next Steps

22 Compacting ConvNets (related work)

23 Compacting ConvNets Test time: network distillation, network pruning, low-rank approximations. Train time: learning constrained filters, Inception.

24 Compacting ConvNets: At test time Network distillation Network pruning Low rank approximations

25 Compacting ConvNets: At test time Network distillation: a large network learns from the data; labels are generated using the trained network; smaller nets are then trained on these outputs (soft labels). Geoffrey Hinton, Oriol Vinyals, Jeff Dean. Distilling the Knowledge in a Neural Network. NIPS Workshop 2015
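
As a concrete illustration of the recipe above, here is a minimal, hedged sketch of a distillation loss in the spirit of Hinton et al.; the temperature, mixing weight, and random tensors are illustrative placeholders, not values from the talk.

```python
# Minimal distillation sketch: the student imitates the teacher's softened outputs
# while still seeing the hard labels. T and alpha are illustrative choices.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)                                  # rescale gradients as in Hinton et al.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Usage with random stand-ins for one batch of teacher/student logits.
student_logits = torch.randn(8, 1000)
teacher_logits = torch.randn(8, 1000)
labels = torch.randint(0, 1000, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
```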

26 Compacting ConvNets: At test time Network distillation. Network pruning: directly remove unimportant parameters during training; (usually) requires second derivatives. Remove parameters + quantization 1. Good compression rates (orthogonal to other approaches). 1 S. Han, H. Mao, and W. J. Dally. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. ICLR 2016
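
A minimal sketch of magnitude-based pruning in the spirit of Deep Compression; the sparsity level is illustrative, and the quantization and Huffman-coding stages of the paper are omitted.

```python
# Zero out the smallest-magnitude weights of a layer; the surviving mask can be
# kept fixed while the remaining weights are fine-tuned.
import torch

def prune_by_magnitude(weight, sparsity=0.9):
    threshold = torch.quantile(weight.abs().flatten(), sparsity)
    mask = (weight.abs() > threshold).float()
    return weight * mask, mask

w = torch.randn(64, 64, 3, 3)
w_pruned, mask = prune_by_magnitude(w, sparsity=0.9)
print(f"kept {int(mask.sum())} of {mask.numel()} weights")
```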

27 Compacting ConvNets: At test time Network distillation Network pruning Low rank approximations

28 Compacting ConvNets: At test time Low-rank approximations: weights are approximated by a combination of rank-1 tensors. Max Jaderberg, Andrea Vedaldi, Andrew Zisserman. Speeding up Convolutional Neural Networks with Low Rank Expansions. BMVC 2014
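
For a single 2-D filter, the idea can be sketched with an SVD: keep the top rank-1 terms (outer products of a column and a row vector). This is only an illustration of the principle, not the scheme of Jaderberg et al.

```python
# Best rank-k approximation of a d x d spatial filter via the SVD; k = 1 gives a
# separable (column times row) filter.
import numpy as np

def rank_k_filter(filt, k=1):
    u, s, vt = np.linalg.svd(filt, full_matrices=False)
    return sum(s[i] * np.outer(u[:, i], vt[i, :]) for i in range(k))

f = np.random.randn(7, 7)
f1 = rank_k_filter(f, k=1)
print("relative error:", np.linalg.norm(f - f1) / np.linalg.norm(f))
```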

29 Compacting ConvNets: At test time Weak points: needs a fully trained, full-rank network; not all filters can be approximated; theoretical speed-ups come with a drop in performance. Emily Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, Rob Fergus. Exploiting Linear Structure Within Convolutional Networks for Efficient Evaluation. NIPS 2014

30 Compacting ConvNets: At train time

31 Compacting ConvNets: At train time Learning constrained filters: same receptive field but fewer parameters, 49C^2 (one 7x7 layer) vs. 3x(3x3)C^2 = 27C^2 (three stacked 3x3 layers). K. Simonyan, A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR, 2015

32 Compacting ConvNets: At train time Learning constrained filters: same receptive field but fewer parameters; deeper networks (more non-linearities). K. Simonyan, A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR, 2015
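
A quick check of the parameter counts quoted above, with an illustrative channel count (C = 64, not from the slides): one 7x7 layer versus a stack of three 3x3 layers with the same receptive field.

```python
import torch.nn as nn

C = 64
one_7x7 = nn.Conv2d(C, C, kernel_size=7, padding=3, bias=False)
three_3x3 = nn.Sequential(*[nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False)
                            for _ in range(3)])

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(one_7x7))    # 49 * C^2 = 200704
print(count(three_3x3))  # 3 * (3*3) * C^2 = 110592
```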

33 Compacting ConvNets: At train time Inception modules: fewer convolutions and an expansion layer. Szegedy et al., Going Deeper with Convolutions, CVPR 2015

34 Compacting ConvNets: At train time Inception v3. Szegedy et al., Rethinking the Inception Architecture for Computer Vision. CVPR 2016

35 Model Selection

36 Model Selection Common approach: empirical set-up. Empirically set the number of neurons; prune neurons as a post-processing step.

37 Model Selection Common approach: empirical set-up. Learning-based approaches (difficult to scale up): Optimal Brain Damage, LeCun et al., NIPS 1991; Learning Structured Sparsity in Deep Neural Networks, Wen, Wu, Wang, Chen, and Li, NIPS 2016; Convolutional Neural Fabrics, Saxena and Verbeek, NIPS 2016

38 DecomposeMe Filter Compositions for End-to-End Learning

39 Filter Compositions for End-to-End Learning F = v_1 h_1^T + v_2 h_2^T + ... + v_k h_k^T. Alvarez and Petersson, DecomposeMe: Simplifying ConvNets for End-to-End Learning. arXiv 2016

40 Filter Compositions for End-to-End Learning [Diagram: a convolution layer with F filters of size d x d is replaced by a layer of L vertical (d x 1) filters followed by horizontal (1 x d) filters]

41 Filter Compositions for End-to-End Learning L < F

42 Filter Compositions for End-to-End Learning Key properties: Filter restrictions during training (low-rank). Larger receptive fields. Deeper models (ReLU): increased capacity. Additional parameter sharing. Reduced within-filter parameter redundancy. Alvarez and Petersson, DecomposeMe: Simplifying ConvNets for End-to-End Learning. arXiv 2016
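
A minimal sketch of what such a decomposed layer could look like in PyTorch; the class name, the use of batch normalization, and the default kernel length are our assumptions, not the paper's code.

```python
# One "decomposed" layer: L vertical (d x 1) filters, a non-linearity, then
# F horizontal (1 x d) filters, instead of F full d x d filters.
import torch
import torch.nn as nn

class DecLayer(nn.Module):
    def __init__(self, in_channels, mid_channels, out_channels, d=7):
        super().__init__()
        self.vertical = nn.Conv2d(in_channels, mid_channels, kernel_size=(d, 1),
                                  padding=(d // 2, 0), bias=False)
        self.horizontal = nn.Conv2d(mid_channels, out_channels, kernel_size=(1, d),
                                    padding=(0, d // 2), bias=False)
        self.bn1 = nn.BatchNorm2d(mid_channels)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.vertical(x)))        # d x 1 convolution
        return self.relu(self.bn2(self.horizontal(x)))   # 1 x d convolution

y = DecLayer(3, 64, 96, d=7)(torch.randn(1, 3, 224, 224))  # -> (1, 96, 224, 224)
```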

43 Filter Compositions for End-to-End Learning What have we learned? DecomposeMe (without non-linearity) AlexNet Alvarez and Petersson, DecomposeMe: Simplifying ConvNets for End-to-End Learning. Arxiv 2016

44 Filter Compositions for End-to-End Learning What have we learned? DecomposeMe (without non-linearity) DecomposeMe Alvarez and Petersson, DecomposeMe: Simplifying ConvNets for End-to-End Learning. Arxiv 2016

45 Classification Results

46 The Architecture Dec1 Dec2 Dec3 Dec4 Dec5 Dec6 Dec7 Dec8 FC 1000

47 Quantitative Results: ImageNet ImageNet dataset: 1.2 million training images plus a validation set, split into 1000 categories. Between 5000 and ... training images per class. No data augmentation beyond random flips.

48 Quantitative Results: ImageNet ImageNet dataset: 1.2 million training images plus a validation set, split into 1000 categories. Between 5000 and ... training images per class. No data augmentation beyond random flips.
NETWORK / NUMBER OF PARAMETERS / NUMBER OF CONV. LAYERS / TOP-1 ACCURACY (CENTER CROP)
AlexNet OWT Bn / 61M / ... / ...%
B-Net (VGG-B) / 133M / ... / ...%
Ours* / 7.1M / ... / ...%
Alvarez and Petersson, DecomposeMe: Simplifying ConvNets for End-to-End Learning. arXiv 2016

49 Computational Cost

50 Computational Cost Number of parameters as a function of the input channels, output channels, intermediate channels and kernel dimension. Alvarez and Petersson, DecomposeMe: Simplifying ConvNets for End-to-End Learning. arXiv 2016
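
The expression itself did not survive the transcription. Assuming the usual decomposition notation (C input channels, L intermediate channels, F output filters, 1-D kernels of length d), the counts would be roughly as below; the concrete numbers are illustrative, not from the talk.

```python
# Standard layer: F filters of size d x d over C input channels.
# Decomposed layer: L vertical d x 1 filters, then F horizontal 1 x d filters.
def standard_params(C, F, d):
    return F * C * d * d

def decomposed_params(C, L, F, d):
    return L * C * d + F * L * d      # = L * d * (C + F)

print(standard_params(C=96, F=256, d=7))           # 1204224
print(decomposed_params(C=96, L=64, F=256, d=7))   # 157696
```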

51 Computational Cost (time) [Chart: forward (inference) time on two Titan X GPUs vs. batch size, for B-Net (VGG-B), AlexNet OWT Bn, and Ours]. Alvarez and Petersson, DecomposeMe: Simplifying ConvNets for End-to-End Learning. arXiv 2016

52 Computational Cost (time) [Chart: forward-backward time on two Titan X GPUs vs. batch size, for B-Net (VGG-B), AlexNet OWT Bn, and Ours]. Rapid prototyping: ~10 hours using 4 Tesla M GPUs. Alvarez and Petersson, DecomposeMe: Simplifying ConvNets for End-to-End Learning. arXiv 2016

53 Residual Networks

54 Quantitative Results: ImageNet Residual Net, decomposed model: 256-d input, 1x1, 64 / relu / 3x1, 64 / relu / 1x3, 64 / relu / 1x1, 256

55 Quantitative Results: ImageNet ImageNet dataset, decomposed block: 256-d input, 1x1, 64 / relu / 3x1, 64 / relu / 1x3, 64 / relu / 1x1, 256
NETWORK / TOP-1 ACCURACY (CENTER CROP) / TOP-5 ACCURACY (CENTER CROP)
ResNet-152 / ...% / 93.3%
ResNet-152-DEC / 77.7% / 93.7%
Alvarez and Petersson, DecomposeMe: Simplifying ConvNets for End-to-End Learning. arXiv 2016

56 Computational Cost (time) ResNet-101 (relative improvement), decomposed block: 256-d input, 1x1, 64 / relu / 3x1, 64 / relu / 1x3, 64 / relu / 1x1, 256
NETWORK / PARAMS / FWD TIME (BATCH 8) / TOP-5 ACC
ResNet-101_... / ...% / 8.45% / -0.7%
ResNet-101_... / ...% / 34.4% / -1.0%
ResNet-101_... / ...% / 40.56% / -1.5%
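
A hedged PyTorch sketch of the decomposed bottleneck drawn on these slides; the layer ordering (1x1, 3x1, 1x3, 1x1 with ReLUs and a skip connection) follows the slide, while the class name and bias-free convolutions are our assumptions.

```python
import torch
import torch.nn as nn

class DecBottleneck(nn.Module):
    def __init__(self, channels=256, width=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, width, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(width, width, kernel_size=(3, 1), padding=(1, 0), bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(width, width, kernel_size=(1, 3), padding=(0, 1), bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(width, channels, kernel_size=1, bias=False),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))   # residual (skip) connection

out = DecBottleneck()(torch.randn(1, 256, 56, 56))  # -> (1, 256, 56, 56)
```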

57 Generalization to other applications

58 Semantic Segmentation Building, Tree, Vehicle, Sidewalk, Road. Romera, Alvarez et al., Efficient ConvNet for Real-Time Semantic Segmentation. To appear in IEEE-IV 2017

59 Semantic Segmentation Romera, Alvarez et al., Efficient ConvNet for Real-Time Semantic Segmentation. To appear in IEEE-IV 2017

60 Model Selection prediction

61 Learning the Number of Neurons Our Approach: Pruning-aware training

62 Learning the Number of Neurons Our Approach: Pruning-aware training 2 Directly reduce (select the optimal) number of neurons. Significant memory reductions with performance improvements. Still start from an over-parameterized network to help training. 2 Alvarez and Salzmann, Learning the Number of Neurons in Deep Networks, NIPS 2016

63 Learning the Number of Neurons Our Approach: Pruning-aware training 2 Weight decay (prevents weights from taking large values). 2 Alvarez and Salzmann, Learning the Number of Neurons in Deep Networks, NIPS 2016

64 Learning the Number of Neurons Our Approach: Pruning-aware training 2 [Diagram: a neuron corresponds to one convolutional kernel within a convolutional layer, e.g. a 5x1x3x3 kernel tensor] Weight decay considers each parameter independently (prevents weights from taking large values). 2 Alvarez and Salzmann, Learning the Number of Neurons in Deep Networks, NIPS 2016

65 Learning the Number of Neurons Our Approach: Pruning-aware training 2 2 Alvarez and Salzmann, Learning the Number of Neurons in Deep Networks, NIPS 2016

66 Learning the Number of Neurons Our Approach: Pruning-aware training 2 [Diagram: neuron groups marked as removed vs. to be kept; the penalty on each group is weighted by the size of the group] 2 Alvarez and Salzmann, Learning the Number of Neurons in Deep Networks, NIPS 2016

67 Learning the Number of Neurons Our Approach: Pruning-aware training 2 [Diagram: removed vs. kept groups; size of the group] Direct benefits at test time (the complete kernel is removed). 2 Alvarez and Salzmann, Learning the Number of Neurons in Deep Networks, NIPS 2016

68 Learning the Number of Neurons Our Approach: Pruning-aware training 2 [Diagram: removed vs. kept groups] 2 Alvarez and Salzmann, Learning the Number of Neurons in Deep Networks, NIPS 2016

69 Learning the Number of Neurons Our Approach: Pruning-aware training 2 2 Alvarez and Salzmann, Learning the Number of Neurons in Deep Networks, NIPS 2016
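
The regularizer is not reproduced on these transcribed slides. A minimal sketch of a group-sparsity penalty in this spirit, where each output kernel of a layer forms one group and the penalty is a sum of group L2 norms scaled by group size, could look as follows (our notation; the exact weighting in the paper may differ).

```python
# Group-sparsity penalty over neurons: driving a whole group (one output kernel)
# to zero removes that neuron entirely, unlike per-parameter weight decay.
import math
import torch
import torch.nn as nn

def group_sparsity(conv, lam=1e-3):
    w = conv.weight                                 # (out_ch, in_ch, kH, kW)
    group_size = w[0].numel()                       # parameters per neuron
    norms = w.flatten(start_dim=1).norm(dim=1)      # one L2 norm per output kernel
    return lam * math.sqrt(group_size) * norms.sum()

layer = nn.Conv2d(96, 256, kernel_size=3, padding=1)
penalty = group_sparsity(layer)                     # added to the task loss
```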

70 Learning the Number of Neurons Training Process: Proximal Operator + SGD. Take a step with respect to the normal loss and then apply the proximal operator of the regularizer 3. Incremental learning (SGD) over the dataset (1 epoch). 2 Alvarez and Salzmann, Learning the Number of Neurons in Deep Networks, NIPS 2016. 3 Simon, Friedman, Hastie, Tibshirani. A sparse-group lasso. Journal of Computational and Graphical Statistics, 2013

71 Learning the Number of Neurons Training Process: Proximal Operator + SGD. Take a step with respect to the normal loss and then apply the proximal operator of the regularizer 3. Proximal operator for SGS. 2 Alvarez and Salzmann, Learning the Number of Neurons in Deep Networks, NIPS 2016. 3 Simon, Friedman, Hastie, Tibshirani. A sparse-group lasso. Journal of Computational and Graphical Statistics, 2013
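
A minimal sketch of this "SGD step + proximal operator" update, assuming the standard group-lasso proximal map (group soft-thresholding); this is a generic illustration, not the authors' exact code, and the step size and regularization weight are placeholders.

```python
# After a plain gradient step on the data loss, each neuron's kernel is shrunk by
# group soft-thresholding; kernels whose norm falls below the threshold become
# exactly zero and can be removed.
import math
import torch
import torch.nn as nn

@torch.no_grad()
def prox_group_l2(conv, step, lam):
    w = conv.weight                                 # (out_ch, in_ch, kH, kW)
    thresh = step * lam * math.sqrt(w[0].numel())
    norms = w.flatten(start_dim=1).norm(dim=1)      # per-kernel L2 norms
    scale = torch.clamp(1.0 - thresh / (norms + 1e-12), min=0.0)
    w.mul_(scale.view(-1, 1, 1, 1))                 # zeroed groups = removed neurons

# One training step: SGD on the data loss, then the proximal shrinkage.
layer = nn.Conv2d(96, 256, kernel_size=3, padding=1)
opt = torch.optim.SGD(layer.parameters(), lr=0.1)
loss = layer(torch.randn(2, 96, 32, 32)).pow(2).mean()   # stand-in for the task loss
loss.backward()
opt.step()
opt.zero_grad()
prox_group_l2(layer, step=0.1, lam=1e-3)
```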

72 Learning the Number of Neurons Our Approach: Pruning-aware training Projection effect during training

73 Classification Results

74 Learning the Number of Neurons Quantitative Results on ImageNet Train an over-complete architecture with up to 768 neurons per layer (Dec8-768): Dec1, Dec2, Dec3, Dec4, Dec5, Dec6, Dec7, Dec8, FC. 2 Alvarez and Salzmann, Learning the Number of Neurons in Deep Networks, NIPS 2016

75 Learning the Number of Neurons Quantitative Results on ImageNet 2 Alvarez and Salzmann, Learning the Number of Neurons in Deep Networks, NIPS 2016

76 Applicable to New DataSets?

77 Learning the Number of Neurons ICDAR2003: character recognition in camera-captured images. [Example images containing text such as TESCO Value Washing Up Liquid, PEPSI, The Rab Butler Building] 2 Alvarez and Salzmann, Learning the Number of Neurons in Deep Networks, NIPS 2016

78 Learning the Number of Neurons Quantitative Results on ICDAR2003 Train an over-complete architecture with up to 512 neurons per layer (Dec3): over-parameterization, Dec1, Dec2, Dec3, FC 36. 2 Alvarez and Salzmann, Learning the Number of Neurons in Deep Networks, NIPS 2016

79 Learning the Number of Neurons Results on ICDAR2003 Character Recognition Dataset 2 Alvarez and Salzmann, Learning the Number of Neurons in Deep Networks, NIPS 2016

80 Computational Cost (test time)

81 Learning the Number of Neurons Additional benefits at test time 2 : speed-ups and memory savings. 2 Alvarez and Salzmann, Learning the Number of Neurons in Deep Networks, NIPS 2016

82 Learning the Number of Neurons Additional benefits at test time 2 : Feature Extraction In the last layer 2 Alvarez and Salzmann, Learning the Number of Neurons in Deep Networks, NIPS 2016

83 Number of Layers

84 Learning the Layers Skip connection Dec1 Dec2 Dec3 Dec4 Dec5 Dec6 Dec7 Dec7-1 Dec7-2 Dec8 Dec8-1 Dec8-2 FC Skip connection 1000

85 Learning the Layers [Diagram: Dec1-Dec8 with additional Dec7-1, Dec7-2, Dec8-1, Dec8-2 layers, skip connections, and an FC-1000 output] [Chart: initial vs. learned number of neurons per layer (L1v, L1h, ..., L8-2v, L8-2h)]

86 Learning the Layers [Same diagram and chart] No impact on performance.

87 Training Efficiency

88 Improving Training Efficiency Projection effect during training: reload the model and change the learning rate. 70% training speed-up (ICDAR dataset).

89 Summary

90 Summary It is possible to train more-efficient architectures without compromising accuracy based on 1-D convolution kernels:

91 Summary It is possible to train more-efficient architectures without compromising accuracy based on 1-D convolution kernels. We can jointly learn the architecture and the parameters using structured-sparsity regularization: conv1 conv2 conv3 conv4 conv5 conv6 conv7 conv8 softmax prediction Num Classes TRAINING TEST (additional benefits)

92 Summary It is possible to train more-efficient architectures without compromising accuracy based on 1-D convolution kernels. We can jointly learn the architecture and the parameters using structured-sparsity regularization. Additional benefits at training time.

93 Next Steps

94 Next Steps Increase performance by unfreezing neurons: naively unfreezing all neurons at the end seems to be a bad idea.

95 Next Steps Increase performance by unfreezing neurons. Add post-processing steps to reduce computational cost: additional L1 pruning of the learned model shows further benefits.

96 Next Steps Increase performance by unfreezing neurons. Add post-processing steps to reduce computational cost: weight quantization; we are currently able to reduce weights to 3 bits with a minor loss in performance.
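
As an illustration of the idea (not necessarily the scheme used in the talk, which may rely on codebooks as in Deep Compression), a minimal uniform 3-bit quantizer could look like this:

```python
# Uniform per-tensor quantization: round weights to a small set of evenly spaced
# levels; with bits=3 the weights take at most 7 distinct values (-3..3 times scale).
import torch

def quantize_uniform(w, bits=3):
    n = 2 ** (bits - 1) - 1                 # 3 for bits=3
    scale = w.abs().max() / n
    return torch.round(w / scale).clamp(-n, n) * scale

w = torch.randn(256, 96, 3, 3)
wq = quantize_uniform(w, bits=3)
print("distinct values:", wq.unique().numel())   # at most 2 * n + 1
```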

97 Next Steps Increase performance by unfreezing neurons. Add post-processing steps to reduce computational cost. Design tree structures to learn more complex layer architectures.

98 Thank you Jose Alvarez Researcher Data61, CSIRO, Australia
