
ELL 788 Computational Perception & Cognition, July-November 2015. Module 8: Audio and Multimodal Attention

Audio Scene Analysis

Two-stage process:
- Segmentation: decomposition into time-frequency segments
- Grouping and segregation: based on the perceived source

Cues:
- Auditory cues: feature-based (pitch, loudness, ...) and spatial (location, trajectory)
- Cross-modal cues (visual)

Masking:
- Energy masking (lower-level process): competing sounds interfering in frequency and in time
- Information masking (higher-level process): difficulty in detecting a target that cannot be explained by energy masking

Grouping (streaming)

Primitive grouping (bottom-up, data-driven): Gestalt principles of perceptual organization
- Frequency / temporal proximity
- Good continuation: slowly varying frequency
- Common fate: common start and stop times, common amplitude / frequency modulation

Schema-based grouping (top-down, knowledge-driven), e.g. human word recognition
- A schema is activated by adequate sensory information regarding one particular regularity in the environment
- Schemas can activate or suppress other schemas
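A minimal, hypothetical Python sketch (not from the slides) of primitive grouping by common fate: spectral components whose onsets fall within a small tolerance are grouped into one stream. The component list and the 20 ms tolerance are illustrative assumptions.

```python
def group_by_onset(components, tol=0.02):
    """Group (onset_time, frequency) pairs whose onsets differ by < tol seconds."""
    groups = []
    for onset, freq in sorted(components):
        # Same stream if this component starts close to the previous one.
        if groups and onset - groups[-1][-1][0] < tol:
            groups[-1].append((onset, freq))
        else:
            groups.append([(onset, freq)])
    return groups

# Two tones starting together form one stream; a later tone starts another.
print(group_by_onset([(0.100, 440.0), (0.105, 880.0), (0.500, 600.0)]))
```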

Detection vs. identification of sound

Detection: asserting the presence of a sound
- Signal-to-noise ratio
- Saliency: contextual incongruency with the background

Identification: associating a meaning with the sound
- Familiarity (knowledge) of the listener
- Physical components (unique features): standard deviation of the spectrum, number of bursts or peaks
- Information content: not readily expected within a given environment

Audio attention

Selective reception of an audio group (stream) perceived as coming from the same source
- Multiple cues: auditory feature cues, auditory spatial cues, and other cues
- Involuntary attention: a sudden gunshot (saliency-driven)
- Voluntary attention: the cocktail party effect (task-driven), exploiting location, lip-reading, mean pitch differences, different speaking speeds, male/female speaking voices, and distinctive accents

Computational models

- Attention has been studied more in the visual domain
- Analogies between auditory and visual perception are established in the neuroscience literature
- Common models have been proposed for the audio and visual domains
- Bottom-up (saliency-driven) vs. top-down (task-driven) attention:
  - Attention to a specific sound pattern (what is being said?)
  - Attention to specific sound source(s) (a bird singing amidst noise)
  - Attention to the direction of a sound (where did that loud bang come from?)

Experimental methods

Psychological experiments
- Subjects listen to tones, originating at different spatial locations, amidst white background noise
- They are asked to identify the tones they heard
- Accuracy and response time are recorded

Neurological experiments
- Measure brain activity using EEG
- Identify / characterize the activations

Some observations

- Both short and long tones on a noisy background are salient, as are gaps (the absence of a tone)
- Long tones / gaps accumulate more saliency than short tones
- A time gap (~1-1.5 s) between two tones leads to faster responses
- Temporally modulated tones are more salient than stationary tones
- In a sequence of two closely spaced tones (within a critical band), the second is less salient
- We can selectively attend to one conversation among many overlapping conversations (the cocktail party effect)
- In a noisy environment, people respond when their own name is uttered in an unattended stream

Audio saliency model (Kalinli & Narayanan)

Mimics cochlear processing
- The audio scene is split into 20 ms frames, shifted by 10 ms
- 128 filter banks; the STFT is used to compute the spectrogram

Feature extraction
- 8 features: one each for intensity (I), frequency contrast (F) and temporal contrast (T); two for orientation O(θ); three for pitch (P)
- 8-level Gaussian pyramids (scales 1:1 down to 1:128), computed when the duration exceeds 1.28 s
- Center-surround differences between pyramid scales of each feature map yield 48 (6 x 8) feature maps
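A rough Python sketch of the pyramid and center-surround stage. The frame parameters follow the slide (20 ms frames, 10 ms shift); the random input signal, the particular center-surround scale pairs, and the use of a plain STFT intensity map (rather than the model's full cochlear feature set) are simplifying assumptions.

```python
import numpy as np
from scipy.signal import stft
from scipy.ndimage import gaussian_filter, zoom

def gaussian_pyramid(img, levels=8):
    """Successively smooth and downsample a 2-D map by a factor of 2."""
    pyr = [img]
    for _ in range(levels - 1):
        pyr.append(gaussian_filter(pyr[-1], sigma=1.0)[::2, ::2])
    return pyr

def center_surround(pyr, pairs=((1, 4), (1, 5), (2, 5), (2, 6), (3, 6), (3, 7))):
    """Feature maps = |center level - surround level upsampled to center size|."""
    maps = []
    for c, s in pairs:
        up = zoom(pyr[s], np.array(pyr[c].shape) / np.array(pyr[s].shape), order=1)
        maps.append(np.abs(pyr[c] - up))
    return maps

fs = 16000
x = np.random.randn(2 * fs)                          # stand-in for a 2 s audio scene
_, _, Z = stft(x, fs=fs, nperseg=320, noverlap=160)  # 20 ms frames, 10 ms shift
intensity = np.abs(Z) ** 2                           # intensity (I) feature map
maps = center_surround(gaussian_pyramid(intensity))  # 6 maps for this feature
print([m.shape for m in maps])
```

With six center-surround pairs per feature and eight features, this yields the 48 feature maps mentioned above.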

Gist features

- A low-resolution representation of the feature maps for quick understanding of the audio scene
- Each feature map is divided into an m x n grid (m = 4, n = 5)
- Gist feature values are computed as the average over each cell of the grid, where v and h are the width and height of a cell and i ∈ {1..48} is the feature-map index
- This gives 20 (4 x 5) gist values for each of the 48 feature maps
- The values are combined into a single vector; PCA is used for dimensionality reduction
- Saliency is expressed in terms of the auditory gist feature vector F = {f_i}, i = 1..d
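A minimal sketch of gist extraction, assuming each feature map is summarized by per-cell means over a 4 x 5 grid and the concatenated vector is reduced with PCA (here via SVD). The map sizes, the training batch, and the 40-dimensional output are illustrative assumptions.

```python
import numpy as np

def gist(feature_map, m=4, n=5):
    """Average the map over an m x n grid of cells -> m*n gist values."""
    h, w = feature_map.shape
    rows = np.array_split(np.arange(h), m)
    cols = np.array_split(np.arange(w), n)
    return np.array([feature_map[np.ix_(r, c)].mean() for r in rows for c in cols])

rng = np.random.default_rng(0)
maps = [rng.random((64, 100)) for _ in range(48)]   # stand-ins for 48 feature maps
F = np.concatenate([gist(fm) for fm in maps])       # 48 maps x 20 cells = 960 values

# PCA via SVD on a batch of gist vectors (one per audio scene).
X = rng.random((200, F.size))                       # stand-in training set
mu = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
F_reduced = (F - mu) @ Vt[:40].T                    # keep 40 principal components
print(F.shape, F_reduced.shape)
```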

Task-dependent biasing of auditory cues

Given a task:
- Enhance the dimensions of the gist feature that are related to the task
- Attenuate the dimensions of the gist feature that are not related to the task
- The feature dimensions to enhance / attenuate are learnt with supervised training
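One plausible way to realize such biasing (a hypothetical sketch, not the paper's exact procedure) is to train a linear classifier on labelled gist vectors and use the magnitudes of its weights as per-dimension enhancement factors:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 40))                   # gist vectors (after PCA)
y = (X[:, 3] + 0.5 * X[:, 7] > 0).astype(float)  # synthetic task labels

w = np.zeros(40)
for _ in range(500):                             # plain logistic regression
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    w -= 0.1 * X.T @ (p - y) / len(y)

bias = np.abs(w) / np.abs(w).max()               # learned relevance per dimension
biased_gist = X[0] * bias                        # enhance relevant, attenuate the rest
print(np.argsort(bias)[-3:])                     # dimensions the task emphasizes
```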

Example in speech understanding

- Here an audio scene corresponds to the utterance of a syllable
- The probability (prominence) of a syllable being uttered can be found from the gist features of an audio scene: p(c_i | F), c_i ∈ {0, 1}
- Lexical knowledge comes into play: the probabilities of sequences of syllable utterances are learnt with bi-gram / tri-gram models
- A probabilistic model adjusts the weights of the gist feature dimensions, tuning attention to the next expected syllables based on the syllables already uttered
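A toy bigram sketch (the corpus and syllable inventory are made-up assumptions): the distribution over the next syllable, given the previous one, is the quantity used to tune the attention weights.

```python
from collections import Counter, defaultdict

corpus = ["ba", "na", "na", "ba", "na", "ba", "ba", "na", "na"]
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def p_next(prev):
    """Bigram probability of each possible next syllable."""
    total = sum(bigrams[prev].values())
    return {s: c / total for s, c in bigrams[prev].items()}

print(p_next("na"))  # expected next syllables given the previous utterance
```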

Multimodal attention model (audio + visual) [Ruesch] [Schauerte]

- Implemented on the iCub humanoid robot (six-DoF head)
- Ego-sphere in (θ, φ): fixed with respect to the torso, so the reference does not change unless the robot moves
- The sphere is tessellated into hexagonal and pentagonal cells
- The audio-visual saliency of a cell on the ego-sphere determines where the robot should look (a bottom-up saliency model)

Schematic overview

Saliency features

Visual saliency: similar to Itti, et al.
- Intensity, color and orientation features
- The (θ, φ) representation is mapped to (x, y), with some distortion

Audio saliency
- An STFT is used to create a spectrogram for each ear
- The Interaural Time Difference (ITD) and Interaural Spectral Difference (ISD) are used for locating sound; the design of the pinna (outer ear) creates the ISD
- Spatio-temporal clustering eliminates outliers (noise)
- Saliency is determined from a surprise element [Schauerte], computed with respect to the current head orientation of the robot
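A minimal sketch of ITD estimation by cross-correlating the two ear signals; the lag of the correlation peak approximates the interaural time difference. The sample rate and the simulated 0.5 ms delay are illustrative assumptions.

```python
import numpy as np

fs = 16000
rng = np.random.default_rng(2)
src = rng.normal(size=fs)                                # 1 s noise source
delay = 8                                                # 8 samples = 0.5 ms ITD
left = src
right = np.concatenate([np.zeros(delay), src[:-delay]])  # right ear lags

# Cross-correlate over lags of +/- 20 samples; the peak lag is the ITD.
lags = np.arange(-20, 21)
xcorr = [np.dot(left[20:-20], right[20 + k:len(right) - 20 + k]) for k in lags]
itd = lags[int(np.argmax(xcorr))] / fs
print(itd * 1000, "ms")                                  # ~0.5 ms: source is off to the left
```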

Surprise element (audio)

Following Itti & Baldi (2005):
- Spectrogram: G(t, ω) = |STFT(t, ω)|²
- Assume that G(t, ω) is generated by a GMM with parameters g
- Prior probability: P_prior(ω) = P(g | G(t-1, ω), G(t-2, ω), ..., G(t-N, ω))
- Posterior probability: P_posterior(ω) = P(g | G(t, ω), G(t-1, ω), ..., G(t-N, ω))
- Surprise element: S_A(t, ω) = D_KL(P_posterior(ω), P_prior(ω)), where D_KL is the Kullback-Leibler divergence between the two GMMs
- Overall audio surprise element: S_A(t) = (1 / |Ω|) Σ_{ω ∈ Ω} S_A(t, ω)
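A simplified runnable sketch: the slide fits a GMM per frequency channel; here a single Gaussian with exponentially updated mean and variance is used instead, so the Kullback-Leibler divergence has a closed form. All constants and the synthetic spectrogram are illustrative assumptions.

```python
import numpy as np

def kl_gauss(m0, v0, m1, v1):
    """KL( N(m0, v0) || N(m1, v1) ) for scalar Gaussians, element-wise."""
    return 0.5 * (np.log(v1 / v0) + (v0 + (m0 - m1) ** 2) / v1 - 1.0)

rng = np.random.default_rng(3)
G = rng.normal(0.0, 1.0, size=(200, 64)) ** 2   # stand-in for |STFT|^2
G[120] += 25.0                                  # a sudden loud event at frame 120

alpha = 0.05                                    # model update rate
m, v = G[0].copy(), np.ones(64)                 # prior mean / variance per channel
surprise = []
for frame in G[1:]:
    m_new = (1 - alpha) * m + alpha * frame             # posterior after the frame
    v_new = (1 - alpha) * v + alpha * (frame - m) ** 2
    s = kl_gauss(m_new, v_new, m, v)                    # D_KL(posterior, prior)
    surprise.append(s.mean())                           # average over channels in Omega
    m, v = m_new, v_new

print(int(np.argmax(surprise)) + 1)                     # ~120, the surprising frame
```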

Ego-centric saliency determination: basic steps

1. Convert the stimulus orientation to torso-based, head-centered coordinates using the head kinematics
2. Represent the orientation in spherical coordinates and project the stimulus (saliency) intensity onto a modality-specific rectangular egocentric map
3. Aggregate the multimodal sensory information
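A small sketch of steps 1-2, assuming the stimulus direction is already a Cartesian unit vector in torso-based, head-centered coordinates; it is converted to spherical angles and then to indices of an equirectangular egocentric map. The grid resolution is an illustrative assumption.

```python
import numpy as np

def to_map(x, y, z, n_theta=90, n_phi=180):
    """Cartesian direction -> (row, col) cell of an equirectangular ego map."""
    r = np.sqrt(x * x + y * y + z * z)
    theta = np.degrees(np.arccos(z / r))          # polar angle, 0..180 deg
    phi = np.degrees(np.arctan2(y, x)) % 360.0    # azimuth, 0..360 deg
    return int(theta / 180.0 * (n_theta - 1)), int(phi / 360.0 * (n_phi - 1))

print(to_map(1.0, 0.0, 0.0))   # straight ahead -> middle row, first column
```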

Multimodal saliency aggregation

- Saliencies are projected from (θ, φ) coordinates to (x, y) coordinates for visualization, with some distortion
- The union of the visual and audio saliencies is taken: S(θ, φ; t) = max(S_V(θ, φ; t), S_A(θ, φ; t))
- Proto-object regions are determined by analysis of isophote curvatures (an isophote is a contour of equal saliency on the saliency map)
- The centers (θ₀, φ₀) of the proto-object regions are taken as the proto-object locations
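The aggregation step itself is a point-wise maximum over the two modality maps, as in this one-line sketch (map sizes are illustrative assumptions):

```python
import numpy as np

S_V = np.random.default_rng(4).random((90, 180))  # visual saliency on the ego map
S_A = np.random.default_rng(5).random((90, 180))  # auditory saliency on the ego map
S = np.maximum(S_V, S_A)                          # S(theta, phi; t)
print(S.shape, float(S.max()))
```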

Attention and exploration

- The focus of attention (FoA) goes to the proto-object region with the highest saliency
- Habituation map: initialized to zero and updated recursively with a Gaussian weighting function G_h (σ ≈ 6°) that favours regions close to the current FoA (θ₀, φ₀):
  H(θ, φ; t) = (1 - d_h) H(θ, φ; t-1) + d_h G_h(θ - θ₀, φ - φ₀; t)
- While attending to a salient point, the habituation map at that location tends asymptotically to 1; the rate of convergence depends on d_h
- Whenever the habituation value exceeds a predefined level, the system becomes attracted to novel salient points

Inhibition map

- Initialized to 1
- When the habituation at the current FoA (θ₀, φ₀) exceeds a threshold t_h = 0.85, the inhibition map is modified by adding a scaled Gaussian G_a with amplitude 1, centered at the FoA, with variance σ ≈ 6°:
  A(θ, φ; t) = (1 - d_a) A(θ, φ; t-1) + d_a G_a(θ - θ₀, φ - φ₀; t)
- The resulting effect is that G_a carves a smooth cavity at (θ₀, φ₀) in the inhibition map
- For attention selection, multiply S(θ, φ; t) · A(θ, φ; t), combining the instantaneous saliency with the memory of recently attended locations
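A compact sketch of the habituation / inhibition dynamics on a 1-D slice of the ego-sphere. The updates follow the two recursions above; since the slide describes G_a as carving a cavity, it is implemented here as 1 minus a Gaussian (an interpretation, not necessarily the paper's exact form), and all constants besides t_h = 0.85 and σ = 6° are illustrative assumptions.

```python
import numpy as np

theta = np.linspace(-180, 180, 361)

def gauss(center, sigma=6.0):
    return np.exp(-0.5 * ((theta - center) / sigma) ** 2)

S = gauss(-40) + 0.8 * gauss(60)        # two salient locations
H = np.zeros_like(theta)                # habituation map, starts at 0
A = np.ones_like(theta)                 # inhibition map, starts at 1
d_h, d_a, t_h = 0.2, 0.2, 0.85

for step in range(60):
    foa_idx = int(np.argmax(S * A))     # attention selection: saliency x inhibition
    foa = theta[foa_idx]
    H = (1 - d_h) * H + d_h * gauss(foa)             # habituate around the FoA
    if H[foa_idx] > t_h:                             # habituated: carve a cavity
        A = (1 - d_a) * A + d_a * (1 - gauss(foa))
    if step % 20 == 0:
        print(step, foa)                # the FoA shifts once habituation saturates
```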

References

- Wrigley (2000). A Model of Auditory Attention.
- Kayser, et al. (2005). Mechanisms for Allocating Auditory Attention: An Auditory Saliency Map.
- Kalinli & Narayanan (2009). Prominence Detection Using Auditory Attention Cues and Task-Dependent High Level Information.
- Oldoni, et al. (2013). A Computational Model of Auditory Attention...
- Ruesch, et al. Multimodal Saliency-Based Bottom-Up Attention: A Framework for the Humanoid Robot iCub.
- Schauerte, et al. Multimodal Saliency-Based Attention for Object-Based Scene Analysis.
- Itti & Baldi (2005). Bayesian Surprise Attracts Human Attention (visual).
- Lichtenauer, et al. Isophote Properties as Features for Object Detection.