Bottom-up and top-down processing in visual perception
Transcription
1 Bottom-up and top-down processing in visual perception. Thomas Serre, Brown University, Department of Cognitive & Linguistic Sciences, Brown Institute for Brain Science, Center for Vision Research
2-5 The what problem: monkey electrophysiology (Willmore, Prenger & Gallant 10; Yamane, Carlson, Bowman, Wang & Connor 08; Hung, Kreiman et al 05)
6 The what problem: human psychophysics (Thorpe et al 96; VanRullen & Thorpe 01; Fei-Fei et al 02, 05; Evans & Treisman 05; Serre, Oliva & Poggio 07)
7 Vision as knowing what is where (Aristotle; Marr 82)
8-12 What and where cortical pathways (Ungerleider & Mishkin 84): a ventral "what" stream and a dorsal "where" stream. How does the visual system combine information about the identity and location of objects? Central thesis: visual attention (see also Van Der Velde & De Kamps 01; Deco & Rolls 04)
13-14 Part I: A Bayesian theory of attention that solves the what is where problem; predictions for neurophysiology and human eye movement data (Sharat Chikkerur, Cheston Tan, Tomaso Poggio). Part II: Readout of IT population activity; attention eliminates clutter (Ying Zhang, Ethan Meyers, Narcisse Bichot, Tomaso Poggio, Bob Desimone)
15-19 Perception as Bayesian inference (Mumford 92; Knill & Richards 96; Dayan & Zemel 99; Rao 02, 04; Kersten & Yuille 03; Kersten et al 04; Lee & Mumford 03; Dean 05; George & Hawkins 05, 09; Hinton 07; Epshtein et al 08; Murray & Kreutz-Delgado 07). S: visual scene description; I: image measurements. Bayes' rule gives P(S | I) ∝ P(I | S) P(S). The scene description collects object identities and locations, S = {O_1, O_2, ..., O_n, L_1, L_2, ..., L_n}, so the joint model is P(S, I) = P(O_1, L_1, O_2, L_2, ..., O_n, L_n, I)
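The Bayesian-inference step on this slide can be sketched numerically. This is a minimal toy illustration, not the talk's model: the scene hypotheses, prior, and likelihood values below are invented for the example.

```python
import numpy as np

# Toy version of perception as Bayesian inference: the scene description S
# is a discrete hypothesis (which object is present) and the image I is a
# noisy measurement. Posterior: P(S | I) ∝ P(I | S) P(S).
scenes = ["car", "pedestrian", "background"]
prior = np.array([0.3, 0.2, 0.5])        # P(S), illustrative values
likelihood = np.array([0.7, 0.2, 0.1])   # P(I | S) for one observed image

posterior = likelihood * prior
posterior /= posterior.sum()             # normalize over scene hypotheses

for s, p in zip(scenes, posterior):
    print(f"P(S={s} | I) = {p:.3f}")
```

Multiplying prior by likelihood and renormalizing is the whole inference; the model on the later slides applies the same step to a structured scene description.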
20 Assumption #1: attentional spotlight (Broadbent 52, 54; Treisman 60; Treisman & Gelade 80; Duncan & Desimone 95; Wolfe 97; and many others). To recognize and localize objects in the scene, the visual system selects objects one at a time, reducing the model to a single object and location: P(O, L, I)
21-22 Assumption #2: what and where pathways. Object location L and identity O are independent, so P(O, L, I) = P(O) P(L) P(I | L, O), with the object O carried by the ventral "what" stream and the location L by the dorsal "where" stream
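The factorization on this slide can be written out directly. A hedged sketch, with made-up table sizes and probabilities (none of the numbers come from the talk):

```python
import numpy as np

# Assumption #2: with identity O and location L independent a priori,
# the joint factorizes as P(O, L, I) = P(O) P(L) P(I | L, O).
n_objects, n_locations = 3, 4
rng = np.random.default_rng(0)

p_O = np.full(n_objects, 1.0 / n_objects)            # uniform identity prior
p_L = np.full(n_locations, 1.0 / n_locations)        # uniform location prior
p_I_given_OL = rng.random((n_objects, n_locations))  # likelihood of one image I

# Joint over (O, L) for the observed image, then the posterior by normalization.
joint = p_O[:, None] * p_L[None, :] * p_I_given_OL
posterior_OL = joint / joint.sum()

# "Attentional spotlight": the single most probable (object, location) pair.
o_hat, l_hat = np.unravel_index(posterior_OL.argmax(), posterior_OL.shape)
print(f"selected object {o_hat} at location {l_hat}")
```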
23-28 Assumption #3: universal features (see Riesenhuber & Poggio 99; Serre et al 05, 07). Objects are encoded by collections of universal features of intermediate complexity, either present or absent, and conditionally independent given the object and its location. The model variables: O, the object; L, its location; F_i, position- and scale-tolerant features; X_i (i = 1..N), retinotopic features; I, the image measurements. The feature hierarchy is consistent with V2/V4 data (Serre et al 05; David et al 06; Cadieu et al 07; Willmore et al 10) and IT data (Serre et al 07)
29 Predicting IT readout (Serre et al 07): a classifier is trained on responses to objects at one size and position (3.4° at center) and tested on transformed conditions (1.7° and 6.8° at center; 3.4° shifted 2° and 4° horizontally). Classification performance of the model (Serre et al 05) matches the experimental IT data (Hung*, Kreiman* et al 05)
30-33 Bayesian inference and attention: the feedforward (bottom-up) sweep through the model (I → X_i → F_i → O) accounts for rapid-categorization behavior (Serre, Oliva & Poggio 07)
34 Bayesian inference and attention. Goal of visual perception: to estimate posterior probabilities of visual features, objects, and their locations in an image. Attention corresponds to conditioning on high-level latent variables representing particular features or locations (as well as on sensory input), and doing inference over the other latent variables
35-39 Bayesian inference and attention in the model: L (dorsal / where) and F_i (ventral / what) jointly determine X_i, which receives the feedforward input from I. The posterior over a retinotopic feature is P(X_i | I) = P(I | X_i) Σ_{F_i,L} P(X_i | F_i, L) P(L) P(F_i) / Σ_{X_i} P(I | X_i) Σ_{F_i,L} P(X_i | F_i, L) P(L) P(F_i). Top-down spatial attention acts through P(L) and top-down feature-based attention through P(F_i); the priors in the numerator provide the attentional modulation of the feedforward input, and the sum over features X_k in the denominator acts as a suppressive drive
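The posterior on this slide is easy to compute for small discrete variables. A sketch with toy distributions (the table sizes and values are assumptions for illustration, not fitted to any data): sharpening P(L) plays the role of spatial attention, sharpening P(F_i) the role of feature-based attention.

```python
import numpy as np

# P(X_i | I) ∝ P(I | X_i) * sum_{F_i, L} P(X_i | F_i, L) P(L) P(F_i)
n_x, n_f, n_l = 5, 2, 5
rng = np.random.default_rng(1)

p_I_given_X = rng.random(n_x)              # feedforward (bottom-up) drive
p_X_given_FL = rng.random((n_x, n_f, n_l))
p_X_given_FL /= p_X_given_FL.sum(axis=0)   # normalize over X_i

def posterior(p_L, p_F):
    top_down = np.einsum("xfl,l,f->x", p_X_given_FL, p_L, p_F)
    unnorm = p_I_given_X * top_down
    return unnorm / unnorm.sum()           # denominator: sum over X_i

uniform_L = np.full(n_l, 1.0 / n_l)
uniform_F = np.full(n_f, 1.0 / n_f)
attended_L = np.eye(n_l)[2]                # spotlight on location 2

p_no_attention = posterior(uniform_L, uniform_F)
p_spatial = posterior(attended_L, uniform_F)
print("no attention:     ", np.round(p_no_attention, 3))
print("spatial attention:", np.round(p_spatial, 3))
```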
40 The posterior has the form of the normalization model of attention (Reynolds & Heeger, Neuron 2009): R(x, θ) = A(x, θ) E(x, θ) / (S(x, θ) + σ), with excitatory drive E, attention field A, and suppressive drive S mapping onto the corresponding terms of P(X_i | I)
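The normalization model can be sketched on a 1-D receptive-field axis. The Gaussian shapes, gains, and σ below are illustrative choices, not the parameters used by Reynolds & Heeger; the suppressive drive is modeled here (as an assumption) as the attention-weighted stimulus drive pooled over a broad neighborhood.

```python
import numpy as np

# R(x) = A(x) E(x) / (S(x) + sigma): E stimulus drive, A attention field,
# S suppressive drive (broadly pooled copy of A*E), sigma a constant.
def gaussian(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2)

x = np.linspace(-10, 10, 201)
E = gaussian(x, 0.0, 1.5)                 # stimulus drive centered at x = 0
A = 1.0 + 2.0 * gaussian(x, 0.0, 3.0)     # attention field peaked on stimulus
sigma = 0.1
kernel = gaussian(x, 0.0, 5.0)
kernel /= kernel.sum()                    # broad pooling for suppression

drive = A * E
S = np.convolve(drive, kernel, mode="same")
R_attended = drive / (S + sigma)

# Without attention the field is flat (A = 1 everywhere).
R_unattended = E / (np.convolve(E, kernel, mode="same") + sigma)
print("peak response, attended vs. unattended:",
      R_attended.max(), R_unattended.max())
```

With an attention field peaked on the stimulus, the peak response is enhanced relative to the unattended case, which is the qualitative effect the slide points to.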
41 Multiplicative scaling of tuning curves by spatial attention (McAdams & Maunsell 99; see also Reynolds & Heeger 09). In the model, P(X_i = x | I) ∝ Σ_{F_i,L} P(X_i = x | F_i, L) P(I | X_i) P(F_i) P(L); directing spatial attention sets P(L = x) → 1 in place of the uniform prior P(L = x) = 1/|L|, reproducing the multiplicative scaling seen in the data
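The multiplicative-scaling claim can be demonstrated in one line of algebra or code: replacing the uniform location prior by a concentrated one multiplies the tuning curve by a constant, leaving its width unchanged. The tuning-curve shape and the gain value below are toy assumptions.

```python
import numpy as np

# Toy orientation tuning curve (unattended response).
orientations = np.linspace(0, 180, 19)
tuning = np.exp(-0.5 * ((orientations - 90) / 20) ** 2)

# Spatial attention: P(L = x) -> 1 instead of 1/|L| contributes a constant
# factor of |L| to the posterior term (illustrative gain, |L| = 4 here).
n_locations = 4
gain = float(n_locations)
attended = gain * tuning

# The ratio is constant across orientation: pure multiplicative scaling,
# so preferred orientation and tuning width are unchanged.
ratio = attended / tuning
print("gain at all orientations:", np.unique(np.round(ratio, 6)))
```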
42 Contrast vs. response gain (Martinez-Trujillo & Treue 02; McAdams & Maunsell 99; see also Reynolds & Heeger 09), captured by the same posterior P(X_i | I) = P(I | X_i) Σ_{F_i,L} P(X_i | F_i, L) P(L) P(F_i) / Σ_{X_i} P(I | X_i) Σ_{F_i,L} P(X_i | F_i, L) P(L) P(F_i)
43 Feature-based attention (neural data from Bichot et al 05): P(X_i = x | I) ∝ Σ_{F_i,L} P(X_i = x | F_i, L) P(I | X_i) P(F_i) P(L). Normalized model responses match normalized spike rates across the four conditions (preferred/non-preferred stimulus × preferred/non-preferred cue)
44-47 Learning to localize cars and pedestrians: the model learns object priors on P(F_i) for each target category, and location priors on P(L) from global contextual cues via a Context node (see Torralba, Oliva et al)
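The combination of learned feature priors with a contextual location prior can be sketched as a product of maps. The maps below are synthetic placeholders (the "context" here is simply a preference for the lower part of the scene, an assumption standing in for learned global context, cf. Torralba & Oliva).

```python
import numpy as np

h, w = 8, 8
rng = np.random.default_rng(2)

# Bottom-up saliency modulated by feature priors (synthetic stand-in).
feature_map = rng.random((h, w))

# Global contextual location prior, e.g. "pedestrians appear near the
# ground plane": mass concentrated on the lower rows (toy assumption).
context_prior = np.zeros((h, w))
context_prior[5:, :] = 1.0
context_prior /= context_prior.sum()

# Posterior over locations: product of the two maps, renormalized.
posterior_L = feature_map * context_prior
posterior_L /= posterior_L.sum()

# Predicted fixation: most probable location under the combined model.
print("predicted fixation:", np.unravel_index(posterior_L.argmax(), (h, w)))
```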
48-61 The experiment: eye movements as a proxy for attention. Dataset: street-scenes images with cars & pedestrians and 20 without. Eight participants were asked to count the number of cars or pedestrians (block design for the two target categories), with eye movements recorded using an infra-red eye tracker. ROC area over the first three fixations, for cars and for pedestrians, compares four predictors: uniform priors (bottom-up), feature priors, feature + contextual (spatial) priors, and human fixations. The full model explains 92% of the inter-subject agreement! (*similar, independent results by Ehinger, Hidalgo, Torralba & Oliva 10)
62 Bottom-up saliency and free viewing (human eye data from Bruce & Tsotsos). ROC area by method: Bruce & Tsotsos %, Itti et al %, proposed model 77.9%
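The ROC-area metric used on these slides can be computed directly: saliency values at fixated pixels are scored against values at non-fixated pixels, and the area under the ROC curve equals the probability that a fixated pixel outscores a non-fixated one. The map and fixations below are synthetic, for illustration only.

```python
import numpy as np

def roc_area(saliency, fixation_mask):
    # Mann-Whitney formulation of the area under the ROC curve.
    pos = saliency[fixation_mask]        # saliency at fixated pixels
    neg = saliency[~fixation_mask]       # saliency elsewhere
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)

rng = np.random.default_rng(3)
saliency = rng.random((32, 32))
fixations = np.zeros((32, 32), dtype=bool)
fixations[10:14, 10:14] = True
saliency[fixations] += 0.5               # make the fixated region more salient

print(f"ROC area: {roc_area(saliency, fixations):.3f}")
```

A chance-level predictor scores 0.5; a map that ranks every fixated pixel above every non-fixated one scores 1.0.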
63-67 Summary I. Goal of vision: to solve the what is where problem. Key assumptions: an attentional spotlight (recognition done sequentially, one object at a time) and independent what and where pathways. Attention is the inference process implemented by the interaction between ventral and dorsal areas: it integrates bottom-up and top-down (feature-based and context-based) attentional mechanisms, and seems consistent with known physiology. The main attentional effects arise in the presence of clutter: spatial attention reduces uncertainty over location and improves object recognition performance over the first bottom-up (feedforward) pass
68 Part I: A Bayesian theory of attention that solves the what is where problem; predictions for neurophysiology and human eye movement data (Sharat Chikkerur, Cheston Tan, Tomaso Poggio). Part II: Readout of IT population activity; attention eliminates clutter (Ying Zhang, Ethan Meyers, Narcisse Bichot, Tomaso Poggio, Bob Desimone)
69-74 The readout approach: the responses of a recorded population (neuron 1, neuron 2, neuron 3, ..., neuron n) are fed to a pattern classifier, whose prediction is scored against the correct answer
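The readout approach can be sketched end to end with synthetic data. Everything below is a toy stand-in for the recordings: "neurons" are Gaussian response channels with object-selective mean rates, and the pattern classifier is a minimal nearest-class-centroid decoder rather than the classifier used in the actual study.

```python
import numpy as np

rng = np.random.default_rng(4)
n_neurons, n_trials_per_object, n_objects = 50, 40, 3

# Each object evokes a different mean firing-rate pattern plus trial noise.
means = rng.random((n_objects, n_neurons)) * 10
X = np.concatenate([m + rng.normal(0, 2, (n_trials_per_object, n_neurons))
                    for m in means])
y = np.repeat(np.arange(n_objects), n_trials_per_object)

# Split trials into train/test halves.
idx = rng.permutation(len(y))
train, test = idx[: len(y) // 2], idx[len(y) // 2:]

# Minimal "pattern classifier": nearest class centroid in population space.
centroids = np.stack([X[train][y[train] == c].mean(axis=0)
                      for c in range(n_objects)])
dists = np.linalg.norm(X[test][:, None, :] - centroids[None], axis=2)
accuracy = (dists.argmin(axis=1) == y[test]).mean()
print(f"readout accuracy: {accuracy:.2f} (chance = {1 / n_objects:.2f})")
```

Decoding accuracy well above chance on held-out trials is the evidence that the population carries the stimulus information.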
75-79 IT readout improves with attention: train the readout classifier on responses to an isolated object, then test generalization in clutter under three conditions: unattended, attended, and attend away
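The train/test logic above can be sketched with the same toy decoder. The clutter model here is an explicit assumption made for illustration, not the recorded data: the cluttered response is a weighted mix of the object and distractor patterns, and attention biases the mix toward the attended object.

```python
import numpy as np

rng = np.random.default_rng(5)
n_neurons, n_trials, n_objects = 60, 50, 2
means = rng.random((n_objects, n_neurons)) * 10   # object response patterns
distractor = rng.random(n_neurons) * 10           # clutter response pattern

def trials(obj, attn):
    # attn -> 1: response dominated by the attended object;
    # attn -> 0.5: object and clutter mixed equally (toy assumption).
    mix = attn * means[obj] + (1.0 - attn) * distractor
    return mix + rng.normal(0, 2, (n_trials, n_neurons))

# Train the readout on isolated objects (no clutter at all).
X_train = np.concatenate([trials(c, 1.0) for c in range(n_objects)])
y_train = np.repeat(np.arange(n_objects), n_trials)
centroids = np.stack([X_train[y_train == c].mean(axis=0)
                      for c in range(n_objects)])

def readout_accuracy(attn):
    X = np.concatenate([trials(c, attn) for c in range(n_objects)])
    y = np.repeat(np.arange(n_objects), n_trials)
    d = np.linalg.norm(X[:, None, :] - centroids[None], axis=2)
    return (d.argmin(axis=1) == y).mean()

acc_attended = readout_accuracy(0.9)    # attention suppresses most clutter
acc_unattended = readout_accuracy(0.5)  # object and clutter mixed equally
print(f"attended: {acc_attended:.2f}, unattended: {acc_unattended:.2f}")
```

Under this assumption, a classifier trained on isolated objects generalizes better to the attended condition, mirroring the slide's "attention eliminates clutter" result.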
80-92 IT readout improves with attention (Zhang, Meyers, Bichot, Serre, Poggio, Desimone, unpublished data)
93 Robert Desimone (MIT), Christof Koch (Caltech), Laurent Itti (USC), Tomaso Poggio (MIT), David Sheinberg (Brown)
94 Thank you!
More informationOptimal Coding Predicts Attentional Modulation of Activity in Neural Systems
LETTER Communicated by Emilio Salinas Optimal Coding Predicts Attentional Modulation of Activity in Neural Systems Santiago Jaramillo sjara@ieee.org Barak A. Pearlmutter barak@cs.nuim.ie Hamilton Institute,
More informationSearch for a category target in clutter
to appear: Perception, 2004. Search for a category target in clutter MARY J. BRAVO, 1 HANY FARID 2 An airport security worker searching a suitcase for a weapon is engaging in an especially difficult search
More informationSingle cell tuning curves vs population response. Encoding: Summary. Overview of the visual cortex. Overview of the visual cortex
Encoding: Summary Spikes are the important signals in the brain. What is still debated is the code: number of spikes, exact spike timing, temporal relationship between neurons activities? Single cell tuning
- The Encoding of Parts and Wholes in the Visual Cortical Hierarchy. Johan Wagemans, Laboratory of Experimental Psychology, University of Leuven.
- Scene understanding symposium program (MIT Campus, Building 46-3002, 43 Vassar Street, Cambridge, MA 02139; http://suns.mit.edu): 8:30 breakfast; 9:00 Building Scene Representations: Computer Vision; 10:20 coffee break and posters.
- Control of Selective Visual Attention: Modeling the "Where" Pathway. Ernst Niebur and Christof Koch, Computation and Neural Systems, California Institute of Technology.
- Overview of the visual cortex. Two streams: a ventral "what" pathway (V1, V2, V4, IT) for form recognition and object representation, and a dorsal "where" pathway (V1, V2, MT, MST, LIP, VIP, 7a) for motion, location, and control of eyes and arms.
- Toward a More Biologically Plausible Model of Object Recognition. Minjoon Kouh. PhD thesis, Department of Physics, MIT.
- Methods for comparing scanpaths and saliency maps: strengths and weaknesses. O. Le Meur, University of Rennes 1, and T. Baccino.
- A Neurally-Inspired Model for Detecting and Localizing Simple Motion Patterns in Image Sequences. Marc Pomplun, Yueju Liu, Julio Martinez-Trujillo, Evgueni Simine, and John K. Tsotsos.
- Attention and Scene Understanding. Vidhya Navalpakkam, Michael Arbib, and Laurent Itti, University of Southern California.
- Computational Saliency Models. Cheston Tan and Sharat Chikkerur. Covers bottom-up saliency, including Itti, Koch, and Niebur's model of saliency-based visual attention for rapid scene analysis.
- Position invariant recognition in the visual system with cluttered environments. S. M. Stringer and E. T. Rolls, Oxford University. Neural Networks 13 (2000) 305-315.
- The role of perceptual load in orientation perception, visual masking and contextual integration. Moritz J. K. Stolte. PhD thesis, Institute of Cognitive Neuroscience, University College London.
- Edge Detection (ICS 280: Visual Perception). Convolution and feature detection; first-derivative edge detectors are direction-dependent, so many detectors are needed to cover all orientations; second-order derivatives.
- The role of early visual cortex in visual integration: a neural model of recurrent interaction. European Journal of Neuroscience, 2004.
- The Integration of Features in Visual Awareness: The Binding Problem. Andrew Laguna, S.J.
- A coincidence detector neural network model of selective attention. Kleanthis Neokleous, Maria Koushiou, Marios N. Avraamides, et al.
- Cortical architecture for the organization of spatial and object attention. Manuscript draft.
- Neurobiological Models of Visual Attention. John K. Tsotsos, Department of Computer Science and Centre for Vision Research, York University.
- On the role of context in probabilistic models of visual saliency. Neil Bruce and Pierre Kornprobst, NeuroMathComp Project Team, INRIA Sophia Antipolis.
- Human Learning of Contextual Priors for Object Search: Where does the time go? Barbara Hidalgo-Sotelo, Aude Oliva, and Antonio Torralba, Department of Brain and Cognitive Sciences and CSAIL, MIT.
- Attentive Stereoscopic Object Recognition. Frederik Beuth, Jan Wiltschut, and Fred H. Hamker, Chemnitz University of Technology.
- Neurally Inspired Mechanisms for the Dynamic Visual Attention Map Generation Task. Maria T. Lopez, Miguel A. Fernandez, Antonio Fernandez-Caballero, and Ana E. Delgado.
- The "Bayesian" approach to perception, cognition and disease. A Bayesian theory of the brain (1990s): the purpose of the brain is to infer the state of the world from noisy and incomplete data (G. Hinton, P. Dayan, A. Pouget, R. Zemel, R. Rao, et al.).
- The visual system: overview of a large-scale neural architecture. Daniel Kersten, Computational Vision Lab, Psychology Department, University of Minnesota.
- Lateral Geniculate Nucleus (LGN). What happens beyond the retina: roughly 90% of the information flow passes through the LGN to visual cortex, with about 10% going to the superior colliculus.
- Time Course of Attention Reveals Different Mechanisms for Spatial and Feature-Based Attention in Area V4. Neuron 47, 637-643, September 1, 2005, doi:10.1016/j.neuron.2005.07.020.
- Connectionist Models in Cognitive Neuroscience: Proceedings of the 5th Neural Computation and Psychology Workshop (NCPW'98). Eds. D. Heinke, G. W. Humphreys, and A. Olson. University of Birmingham, England.
- Visual Attention (computational saliency models and saliency detection). International Journal of Advanced Research in Computer Science and Software Engineering, 3(9), September 2013.
- Preparatory attention in visual cortex. Elisa Battistoni, Timo Stein, et al. Annals of the New York Academy of Sciences, The Year in Cognitive Neuroscience.
- Serial visual search from a parallel model. Seth A. Herd and Randall C. O'Reilly, Department of Psychology, University of Colorado Boulder. Vision Research 45 (2005) 2987-2992.
- Adventures into terra incognita. Preliminary notes for a textbook chapter on visual object recognition beyond primary visual cortex.
- Visual Attention and the Role of Normalization. Harvard University.
- Top-down Attention Signals in Saliency: works by Vidhya Navalpakkam. Presented by 邓凝旖, 2014.
- Oscillatory Neural Network for Image Segmentation with Biased Competition for Attention. Tapani Raiko and Harri Valpola, Aalto University (formerly Helsinki University of Technology).
- Changing priority maps in 12- to 18-month-olds: An emerging role for object properties. Adam Sheya and Linda B. Smith, Indiana University. Psychonomic Bulletin & Review, 2010, 17(1).
- Visual attention: the where, what, how and why of saliency. Stefan Treue. Attention influences the processing of visual information even in the earliest areas of primate visual cortex.
- Object and Gist Perception in a Dual Task Paradigm: Is Attention Important? Maria Koushiou and Elena Constantinou, Department of Psychology.
- Cortical Algorithms for Perceptual Grouping. Pieter R. Roelfsema, The Netherlands Ophthalmic Research Institute. Annual Review of Neuroscience, 2006.
- Visual Context. Dan O'Shea; Prof. Fei-Fei Li, COS 598B. Discusses Moshe Bar and Elissa Aminoff, Cortical Analysis of Visual Context, Neuron 38(2):347-358, 2003.
- Modeling Search for People in 900 Scenes: A combined source model of eye guidance. Visual Cognition, August 2009, 17(6-7):945-978, doi:10.1080/13506280902834720.
- Face Neurons (Chapter 4). Edmund T. Rolls. Recounts single-neuron recordings in the awake behaving non-human primate, begun in 1976 to investigate reward systems, that led to the discovery of face neurons.
- A framework for describing the effects of attention on visual responses. Vision Research 49 (2009).
- Attention Narrows Position Tuning of Population Responses in V1. Jason Fischer et al. Current Biology 19, 1356-1361, August 25, 2009, doi:10.1016/j.cub.2009.06.059.
- On the Role of Shape Prototypes in Hierarchical Models of Vision. Michael D. Thomure, Melanie Mitchell, and Garrett T. Kenyon. Proceedings of the International Joint Conference on Neural Networks.
- Attention, uncertainty, and free-energy. Harriet Feldman and Karl J. Friston, The Wellcome Trust Centre for Neuroimaging. Frontiers in Human Neuroscience, December 2010, doi:10.3389/fnhum.2010.00215.
- Chapter 3: Two visual systems. On the significance of the "turn to the brain" in cognitive science and Mishkin and Ungerleider's hypothesis that there are two distinct visual systems.
- Object detection in natural scenes: Independent effects of spatial and category-based attention. Timo Stein and Marius V. Peelen, 2017, doi:10.3758/s13414-017-1279-8.
- A Source for Feature-Based Attention in the Prefrontal Cortex. MIT.
- The probabilistic approach. Introduction: each waking moment, the body's sensory receptors convey a vast amount of information about the surrounding environment to the brain.
- Trade-Off between Object Selectivity and Tolerance in Monkey Inferotemporal Cortex. Davide Zoccolan, Minjoon Kouh, et al. Journal of Neuroscience, November 7, 2007, 27(45):12292-12307.
- Neural Mechanisms Underlying Visual Object Recognition. Arash Afraz, Daniel L. K. Yamins, and James J. DiCarlo, Department of Brain and Cognitive Sciences and McGovern Institute for Brain Research, MIT.