Towards The Deep Model: Understanding Visual Recognition Through Computational Models. Panqu Wang Dissertation Defense 03/23/2017


Summary [Diagram: the Model simulates, explains, and predicts Human Visual Recognition (face, object, scene)] 2

Outline Introduction Cognitive Modeling: Explaining cognitive phenomena using computational models Modeling the relationship between face and object recognition Modeling hemispheric asymmetry of face processing Modeling the contribution of central versus peripheral vision to scene processing Engineering: Improving semantic segmentation Deep learning in pixel-level semantic segmentation 3

Introduction: Ventral Temporal Cortex (VTC) Faces Body Scene Words Objects Grill-Spector, Kalanit, and Kevin S. Weiner. "The functional architecture of the ventral temporal cortex and its role in categorization." Nature Reviews Neuroscience 15.8 (2014): 536-548. 4

Introduction: The Model (TM) Focused on modeling and explaining cognitive phenomena. Preprocessing with - Gabor Filters: V1 complex cells - PCA: feature compression and dimensionality reduction 5
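As a rough illustration of this preprocessing stage, here is a minimal Python sketch of a Gabor-magnitude front end (complex-cell-style responses) followed by PCA; the filter-bank parameters and component count are illustrative assumptions, not the dissertation's exact values.

```python
# Sketch of TM's preprocessing: Gabor magnitudes (V1 complex cells) + PCA.
import numpy as np
from scipy.signal import fftconvolve
from skimage.filters import gabor_kernel
from sklearn.decomposition import PCA

def gabor_magnitudes(image, n_orientations=8, frequencies=(0.1, 0.2, 0.3)):
    """Magnitude of quadrature Gabor responses, as a complex-cell model."""
    responses = []
    for i in range(n_orientations):
        theta = i * np.pi / n_orientations
        for f in frequencies:
            k = gabor_kernel(frequency=f, theta=theta)
            re = fftconvolve(image, np.real(k), mode="same")
            im = fftconvolve(image, np.imag(k), mode="same")
            responses.append(np.hypot(re, im))   # complex-cell magnitude
    return np.stack(responses)                   # (n_filters, H, W)

def preprocess(images, n_components=50):
    """images: (n_samples, H, W) grayscale array -> compressed feature vectors.
    n_components must not exceed the number of samples."""
    feats = np.array([gabor_magnitudes(im).ravel() for im in images])
    return PCA(n_components=n_components).fit_transform(feats)
```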

Introduction: The Model (TM) The neural network learns task-specific features based on the output task. The hidden layer can correspond to: - Fusiform Face Area (FFA) for face recognition - Lateral Occipital Complex (LOC) for general object recognition. 6

Introduction: The Model (TM) A variant of TM (mixture-of-experts structure) 7
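For context, here is a minimal sketch of the mixture-of-experts idea (the shapes and the input-driven gate are simplifying assumptions, not TM's exact wiring): two expert modules each produce an output, and a softmax gating layer decides how much each expert contributes.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def moe_forward(x, experts, gate_weights):
    """Mix expert outputs with a softmax gate.

    x: input feature vector; experts: list of callables (the modules);
    gate_weights: (n_experts, dim) matrix for the gating layer.
    """
    outputs = np.stack([expert(x) for expert in experts])  # each module's output
    gate = softmax(gate_weights @ x)                       # gating node values
    return gate @ outputs, gate                            # mixture + gate values

# Example: two linear "expert" modules on a 4-d input (illustrative).
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
experts = [lambda x, W=W1: W @ x, lambda x, W=W2: W @ x]
y, gate = moe_forward(rng.normal(size=4), experts, rng.normal(size=(2, 4)))
```

The gating node values are the quantity tracked in the hemispheric-asymmetry and central/peripheral analyses later in the talk.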

Introduction: The Deep Model (TDM) Pathway 1 Pathway 2 TDM: TM gets deeper 8

Outline Introduction Cognitive Modeling: Explaining cognitive phenomena using computational models Modeling the relationship between face and object recognition Modeling hemispheric asymmetry of face processing Modeling the contribution of central versus peripheral vision to scene processing Engineering: Improving semantic segmentation Deep learning in pixel-level semantic segmentation 9

Modeling the relationship between face and object recognition 10

The Question Is face recognition independent from general object recognition? Kanwisher et al. (1997), Tsao et al. (2006): Faces are processed in a domain-specific module (FFA), which is not activated for non-face objects. Gauthier et al. (2000), McGugin et al. (2014), Xu (2005): FFA is activated by non-face object categories of expertise, such as birds and cars. Wilmer et al. (2010): Cambridge Face Memory Test (CFMT) scores shared very little variance (6.7%) with a test of visual memory for abstract art. Gauthier et al. (2014): Experience (E) modulates the domain-general ability (v) to discriminate face and non-face categories. When E is high for both categories, the shared variance is 0.59. Support for independence / Support for dependence 11

The Question Is face recognition independent from general object recognition? Gauthier et al. (2014): Experience (E) modulates the domain-general ability (v) that is shared between face and non-face object recognition: Performance_cat ∝ v × Experience_cat 12

Gauthier et al. (2014) [Figure: face recognition accuracy (CFMT) vs. object recognition accuracy, from low to high experience with non-face object categories (VET)] They showed that, as predicted, as experience (E) grows, the shared variance between face and object recognition performance increases monotonically. Can our model (TM) account for this? 13

Modeling Procedure In the model, v corresponds to the number of hidden units in the neural network, and Experience_cat to the number of distinct training samples for that object category: Performance_cat ∝ v × Experience_cat. We trained a different network for each subject, based on that subject's data. We obtained v from that subject's face recognition score, since human subjects have maximal E for faces: Performance_face ∝ v (Experience_face = 100%!). 14
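To make the v × E account concrete, here is a tiny illustrative simulation (every constant is invented for illustration, not fit to the dissertation's data): object performance is v scaled by experience plus noise, so as E grows, the variance object performance shares with face performance should rise.

```python
# Illustrative simulation of the experience-moderation prediction.
# All constants (subject count, noise level, E values) are assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_subjects = 500
v = rng.normal(1.0, 0.2, n_subjects)                # domain-general ability
perf_face = v + rng.normal(0.0, 0.1, n_subjects)    # E_face = 100%

for E in (0.2, 0.5, 1.0):                           # experience with non-face category
    perf_obj = v * E + rng.normal(0.0, 0.1, n_subjects)
    r = np.corrcoef(perf_face, perf_obj)[0, 1]
    print(f"E = {E:.1f}: shared variance r^2 = {r ** 2:.2f}")
```

Shared variance rises with E, mirroring the monotonic increase reported by Gauthier et al. (2014).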

Modeling Procedure [Diagram: training schedule: "train first", then "train together"] 15

Modeling Result (Subordinate-Level Categorization) [Figure: face accuracy (CFMT) vs. object accuracy (VET)] 16

Does Experience Have to Be at the Subordinate Level? How about basic-level categorization? Levels of visual categorization for "chair" 17

Modeling Result (Basic-Level Categorization) No obvious experience moderation effect. Because the task is easy, not much experience is needed to achieve good performance. The type of experience matters: fine-grained discrimination of homogeneous categories. 18

Analysis: Role of Experience [Figure: hidden-unit representations projected onto PC1 and PC2. Low-E network: faces 87.5%, non-faces 16.67% accuracy; high-E network: faces 83.33%, non-faces 87.5% accuracy.] More experience separates the objects to a greater extent, which helps the network achieve better performance. 19

Analysis: Role of Experience Entropy of hidden units as a function of training epochs. (Less experience means less diversity, not less training) 20

Conclusions Our experimental results support Gauthier et al.'s hypothesis that face and object recognition are not independent, given sufficient experience. Our experiment predicts that the source of the experience moderation effect lies not in brain regions sensitive only to the category level, but in regions associated with better individuation performance for objects and faces, such as the FFA. The reason for this moderation is that the FFA embodies a transform that amplifies the differences between homogeneous objects. 21

Outline Introduction Cognitive Modeling: Explaining cognitive phenomena using computational models Modeling the relationship between face and object recognition Modeling hemispheric asymmetry of face processing Modeling the contribution of central versus peripheral vision to scene processing Engineering: Improving semantic segmentation Deep learning in pixel-level semantic segmentation 22

Modeling the development of hemispheric asymmetry of face processing 23

Introduction Face processing is right hemisphere (RH) biased. Neural activation to faces is stronger in the RH [Kanwisher et al., 1997]. Lesions in the right FFA result in prosopagnosia [Bouvier & Engel, 2006]. Can TM account for this phenomenon? 24

Modeling Procedure Two constraints: RH develops its function earlier than LH and dominates in infancy. [Chiron et al., 1997] Visual acuity changes radically over time. [Peterzell et al., 1995] [Atkinson et al., 1997] 25

Modeling Procedure RH predominance Acuity change 26

Modeling Procedure Four object categories: faces, books, cups, and digits. For each category, perform a subordinate-level classification task on 10 different individuals, with basic-level classification of the other 3 categories. Record the weight of the gating layer for each model hemisphere after training converges. 27

Result Module 1: RH Module 2: LH Face expert (control) Face expert with different learning rates ( waves of plasticity ) 28

Result Module 1: RH Module 2: LH Cup expert (control) Cup expert (split learning rate) Digit expert (control) Digit expert (split learning rate) 29

Result: Gating Node Value vs. Training Time 30

Discussion TM can account for the development of RH lateralization of face processing. This lateralization arises naturally as a consequence of two realistic constraints: the early development of the RH, and the change in visual acuity over time. We hypothesize that low spatial frequency (LSF) information is the key contributor to this result: results show that LSF yields higher accuracy than HSF in face processing. 31

Outline Introduction Cognitive Modeling: Explaining cognitive phenomena using computational models Modeling the relationship between face and object recognition Modeling hemispheric asymmetry of face processing Modeling the contribution of central versus peripheral vision to scene processing Engineering: Improving semantic segmentation Deep learning in pixel-level semantic segmentation 32

Modeling the contribution of central versus peripheral vision to scene processing 33

Introduction: Central and Peripheral Vision [Figure: central vs. peripheral visual field] http://www.codeproject.com/articles/33648/polar-view-of-an-image 34

Introduction: Log Polar Transformation http://www.codeproject.com/articles/33648/polar-view-of-an-image 35
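For readers who want to reproduce the transformation, here is a minimal sketch using OpenCV's warpPolar with the log flag; the file names are placeholders, and this generic transform stands in for (rather than replicates) the dissertation's exact foveation pipeline.

```python
# Minimal log-polar transform sketch (simulating cortical magnification:
# central pixels get magnified, peripheral pixels get compressed).
import cv2

img = cv2.imread("scene.jpg")                 # placeholder input image
h, w = img.shape[:2]
center = (w / 2.0, h / 2.0)
max_radius = min(w, h) / 2.0                  # map out to the nearest border
log_polar = cv2.warpPolar(img, (w, h), center, max_radius,
                          cv2.INTER_LINEAR + cv2.WARP_POLAR_LOG)
cv2.imwrite("scene_logpolar.jpg", log_polar)
```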

Larson & Loschky (2009) Explored the contribution of central versus peripheral vision in scene recognition using the Window and Scotoma paradigm. 36

Larson & Loschky (2009) Central Peripheral 37

Modeling Larson & Loschky (2009) AlexNet VGG-16 GoogLeNet 38

Modeling Cortical Magnification Foveated + log-polar transformed The dataset 39

Modeling Larson & Loschky (2009) Behavioral Model (Foveated) Model (Log-polar) Modeling result of visual angle vs. classification accuracy 40

Analysis Why is peripheral vision better? Hypothesis 1: Peripheral vision contains better features. To test Hypothesis 1, we train a network using only central or only peripheral information, and test on the full image. If peripheral vision is better, the peripheral (Scotoma) condition's test accuracy at 5 degrees should be higher than the central (Window) condition's. 41

Analysis The network (a) contains only 2.3 million parameters, about 4% as many as AlexNet. The network is trained from scratch to avoid any interference from pre-training. Result: Table: classification accuracy at 5 degrees. Supports Hypothesis 1: peripheral vision contains better features. 42

Analysis Why is peripheral vision better? Hypothesis 2: Peripheral vision creates better internal representations for scene recognition. To test hypothesis 2, we train TDM and visualize its internal representations. Better features will result in higher gating node values in TDM. 43

The Deep Model (TDM) Central Peripheral TDM: TM gets deeper 44

Analysis: Gating Node Value Central/Peripheral Central/Central Peripheral/Peripheral Comparison of gating node value. 45

Analysis: Gating Node Value Gating node value as a function of training iterations. When given a choice, the network uses the peripheral pathway. 46

Analysis: Hidden Unit Activation Central Peripheral Development of representations of layer fc6 units 47

Analysis: Hidden Unit Activation The peripheral advantage emerges naturally through network training (or human development). It generates a spreading transform that benefits scene recognition, and is stable throughout learning. If we force equal gating values throughout training, the central pathway is totally turned off. 48

Conclusions TDM was successfully applied to simulate and explain the data from Larson and Loschky (2009), supporting their finding of a peripheral advantage in human scene recognition. We proposed and validated two hypotheses that explain the peripheral advantage: 1) peripheral vision contains better features; and 2) peripheral vision generates better internal representations. 49

Outline Introduction Cognitive Modeling: Explaining cognitive phenomena using computational models Modeling the relationship between face and object recognition Modeling hemispheric asymmetry of face processing Modeling the contribution of central versus peripheral vision to scene processing Engineering: Improving semantic segmentation Deep learning in pixel-level semantic segmentation 50

Deep learning in pixel-level semantic segmentation 51

Introduction: Semantic Segmentation Cordts, M., Omran, M., Ramos, S., Scharwächter, T., Enzweiler, M., Benenson, R., ... & Schiele, B. (2015). The Cityscapes dataset. In CVPR Workshop. 52

Method We design better convolutional operations for semantic segmentation. Decoding: Dense Upsampling Convolution (DUC) - upsampling through learning - achieves pixel-level decoding accuracy. Encoding: Hybrid Dilated Convolution (HDC) - motivated by the increasingly large receptive fields of the visual processing stream - solves the gridding problem caused by regular dilated convolution. 53

Dense Upsampling Convolution (DUC) Previous methods: 1. Bilinear upsampling - not learnable - loses details. 2. Deconvolution - inserts extra values (0s) - requires extra layers. 54

Dense Upsampling Convolution (DUC) Learns upsampling through a convolution layer. No extra zeros are inserted. Extremely easy to implement. Enables end-to-end training; compatible with all FCN structures. 55
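As a rough illustration of the idea, here is a minimal PyTorch sketch of a DUC-style layer (layer sizes are illustrative assumptions): a plain 3×3 convolution predicts d×d sub-pixel logits per class at the low resolution, and a channel-to-space rearrangement (PixelShuffle) assembles the full-resolution prediction, with no zero-insertion and no extra deconvolution layers.

```python
# Minimal DUC-style decoder sketch (shapes are illustrative assumptions).
import torch
import torch.nn as nn

class DUC(nn.Module):
    def __init__(self, in_channels, num_classes, d):
        super().__init__()
        # Predict d*d sub-pixel maps per class at the low resolution.
        self.conv = nn.Conv2d(in_channels, num_classes * d * d,
                              kernel_size=3, padding=1)
        # Rearrange channels into space: (C*d*d, H, W) -> (C, H*d, W*d).
        self.shuffle = nn.PixelShuffle(d)

    def forward(self, x):
        return self.shuffle(self.conv(x))

feat = torch.randn(1, 512, 64, 128)            # e.g., encoder output at stride 8
logits = DUC(512, num_classes=19, d=8)(feat)   # -> (1, 19, 512, 1024)
```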

Dense Upsampling Convolution (DUC) Input Image Ground Truth Bilinear Upsampling DUC 56

Dense Upsampling Convolution (DUC) Input Image Ground Truth Bilinear Upsampling DUC 57

Dense Upsampling Convolution (DUC) Performance on the Cityscapes validation set, measured using the mean Intersection-over-Union (mIoU) score, where per-class IoU = TP / (TP + FP + FN) and mIoU is the mean over classes:

Class          Bilinear Upsample  DUC
Avg            0.723              0.743
road           0.974              0.981
sidewalk       0.802              0.847
building       0.906              0.918
wall           0.475              0.509
fence          0.519              0.531
pole           0.523              0.62
traffic light  0.593              0.653
traffic sign   0.7                0.75
vegetation     0.907              0.92
terrain        0.591              0.626
sky            0.923              0.943
person         0.78               0.8
rider          0.581              0.585
car            0.933              0.944
truck          0.693              0.712
bus            0.829              0.822
train          0.66               0.586
motorcycle     0.615              0.615
bicycle        0.74               0.75

DUC benefits almost all categories, and is especially helpful for small objects. 58
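For reference, here is a generic sketch of how the per-class IoU and mIoU above can be computed from predicted and ground-truth label maps; the 255 ignore label is the usual Cityscapes convention, assumed here.

```python
# Generic mIoU computation sketch: IoU_c = TP_c / (TP_c + FP_c + FN_c).
import numpy as np

def mean_iou(pred, gt, num_classes=19, ignore_label=255):
    valid = gt != ignore_label
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c) & valid)
        fp = np.sum((pred == c) & (gt != c) & valid)
        fn = np.sum((pred != c) & (gt == c) & valid)
        if tp + fp + fn > 0:             # skip classes absent from both maps
            ious.append(tp / (tp + fp + fn))
    return float(np.mean(ious))
```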

Hybrid Dilated Convolution (HDC) The gridding effect in regular dilated convolution: using the same dilation rate in consecutive layers leads to holes in the information available. [Figure: receptive-field coverage for Layer 1 (r=2), Layer 2 (r=2), Layer 3 (r=2)] 59

Hybrid Dilated Convolution (HDC) The gridding effect in normal dilated convolution: Ground truth Regular dilated conv 60

Hybrid Dilated Convolution (HDC) Solution: vary the dilation rates so that consecutive rates share no common factor greater than 1. Regular dilated conv: r = 2,2,2. HDC: r = 1,2,3. 61
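A minimal PyTorch sketch of the contrast (channel counts are illustrative assumptions): the only change from the regular dilated block is the sequence of dilation rates, with padding set equal to the rate so spatial size is preserved.

```python
# Regular dilated block vs. HDC block: same layers, different dilation rates.
import torch.nn as nn

def dilated_block(rates, channels=256):
    layers = []
    for r in rates:
        layers += [nn.Conv2d(channels, channels, kernel_size=3,
                             padding=r, dilation=r),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

gridded = dilated_block([2, 2, 2])  # same rate everywhere -> gridding holes
hdc = dilated_block([1, 2, 3])      # varied rates -> dense receptive-field coverage
```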

Hybrid Dilated Convolution (HDC) Ground truth Regular dilated conv HDC 62

Hybrid Dilated Convolution (HDC) Another benefit of HDC: it enlarges the receptive field in place, so the network can see more and aggregate global information. Input GT DUC DUC+HDC 63

Experiments: Cityscapes Dataset 64

Experiments: KITTI Road Dataset Contains images of three categories of road scenes, including 289 training images and 290 test images. Performance: 65

Experiments: KITTI Road Examples 66

Experiments: Pascal VOC 2012 (mIoU and per-class IoU, %; Deep G-CRF = CentraleSupelec Deep G-CRF)

Class    DeepLabv2-CRF  Deep G-CRF  HikSeg_COCO  Ours
Ave.     79.7           80.2        81.4         83.1
plane    92.59          92.90       95.00        92.10
bike     59.71          61.20       64.20        64.57
bird     91.27          91.00       91.50        94.73
boat     65.04          66.30       79.00        71.04
bottle   76.45          77.70       78.70        80.98
bus      94.78          95.30       93.40        94.56
car      88.89          88.90       88.40        89.73
cat      92.16          92.40       94.30        94.88
chair    33.97          33.80       45.80        45.56
cow      88.39          88.40       89.60        93.72
table    67.65          69.10       65.20        74.39
dog      89.20          89.80       90.60        91.95
horse    92.14          92.90       92.80        95.07
moto     86.97          87.70       88.70        90.02
ppl.     87.07          87.50       87.50        88.68
plant    63.05          62.60       62.40        69.12
sheep    88.53          89.90       88.40        90.42
sofa     60.17          59.20       56.40        62.68
train    86.53          87.10       86.20        86.38
tv       74.58          74.20       75.30        78.15

67

Experiments: Pascal VOC 2012 Examples Input Ground truth Ours After CRF 68

Conclusions We proposed simple yet effective convolutional operations for improving semantic segmentation systems. We designed a new dense upsampling convolution (DUC) operation to enable pixel-level prediction on feature maps, and hybrid dilated convolution (HDC) to deal with the gridding problem, effectively enlarging the receptive fields of the network. Experimental results demonstrate the effectiveness of our framework by achieving state-of-the-art performance on various semantic segmentation benchmarks. 69

Summary [Diagram: TM and TDM simulate, explain, and predict Human Visual Recognition (face, object, scene)] 70

Publications
Wang, P., Chen, P., Yuan, Y., Liu, D., Huang, Z., Hou, X., & Cottrell, G. Understanding Convolution for Semantic Segmentation. Under review at the International Conference on Computer Vision (ICCV).
Wang, P., & Cottrell, G. Central and Peripheral Vision for Scene Recognition: A Neurocomputational Modeling Exploration. In press at Journal of Vision.
Wang, P., & Cottrell, G. W. (2016). Modeling the Contribution of Central Versus Peripheral Vision in Scene, Object, and Face Recognition. In Proceedings of the 38th Annual Conference of the Cognitive Science Society. Austin, TX: Cognitive Science Society.
Wang, P., Gauthier, I., & Cottrell, G. (2016). Are face and object recognition independent? A neurocomputational modeling exploration. Journal of Cognitive Neuroscience, 28(4), 558-574.
Wang, P., & Cottrell, G. W. (2016). Basic Level Categorization Facilitates Visual Object Recognition. In International Conference on Learning Representations (ICLR) Workshop.
Wang, P., Malave, V., & Cipollini, B. (2015). Encoding voxels with deep learning. The Journal of Neuroscience, 35(48), 15769-15771.
Wang, P., Cottrell, G., & Kanan, C. (2015). Modeling the Object Recognition Pathway: A Deep Hierarchical Model Using Gnostic Fields. In Proceedings of the 37th Annual Conference of the Cognitive Science Society. Austin, TX: Cognitive Science Society.
Wang, P., Gauthier, I., & Cottrell, G. (2014). Experience matters: Modeling the relationship between face and object recognition. In Proceedings of the 36th Annual Conference of the Cognitive Science Society. Austin, TX: Cognitive Science Society.
Wang, P., & Cottrell, G. W. (2013). A computational model of the development of hemispheric asymmetry of face processing. In Proceedings of the 35th Annual Conference of the Cognitive Science Society. Austin, TX: Cognitive Science Society.
71

Thank you! My dissertation committee: Dr. Gary Cottrell, Dr. Nuno Vasconcelos, Dr. Virginia de Sa, Dr. Truong Nguyen, Dr. Bhaskar Rao. My collaborators: Isabel Gauthier, Chris Kanan, Ben Cipollini, Vicente Malave, Pengfei Chen, Ye Yuan, Ding Liu, Zehua Huang, Xiaodi Hou. Gary's Unbelievable Research Unit (GURU) Perceptual Expertise Network (PEN) Temporal Dynamics of Learning Center (TDLC) My family and friends 72

Q&A 73