Auditory Scene Analysis: phenomena, theories and computational models

Auditory Scene Analysis: phenomena, theories and computational models
July 1998
Dan Ellis
International Computer Science Institute, Berkeley CA <dpwe@icsi.berkeley.edu>
Outline
1 The computational theory of ASA
2 Cues & grouping
3 Expectations & inference
4 Big issues
ASA - Dan Ellis 1998jul11-1

Auditory Scene Analysis
What does our sense of hearing do?
- recover useful information...
  about objects of interest...
  in a wide range of circumstances
Measuring objects in an auditory scene: [figure]

Subjective analysis of auditory scenes
[Figure: spectrogram of the "City" sound example, f/Hz against time/s (0-9 s), with the events named by subjects S1-S10]
Horn1 (10/10): S9 horn 2; S10 car horn; S4 horn1; S6 double horn; S2 first double horn; S7 horn; S7 horn2; S3 1st horn; S5 Honk; S8 car horns; S1 honk, honk
Crash (10/10): S7 gunshot; S8 large object crash; S6 slam; S9 door Slam?; S2 crash; S4 crash; S10 door slamming; S5 Trash can; S3 crash (not car); S1 slam
Horn2 (5/10): S9 horn 5; S8 car horns; S2 horn during crash; S6 doppler horn; S7 horn3
Truck (7/10): S8 truck engine; S2 truck accelerating; S5 Acceleration; S1 rev up/passing; S6 acceleration; S3 closeup car; S10 wheels on road
Horn3 (5/10): S7 horn4; S9 horn 3; S8 car horns; S3 2nd horn; S10 car horn
Subjects identify structures in dense scenes with high agreement

Outline
1 The computational theory of ASA
  - ASA and CASA
  - The grouping paradigm
  - Marr's three levels of explanation
2 Cues & grouping
3 Expectations & inference
4 Big issues

Auditory Scene Analysis (ASA)
"The organization of sound scenes according to their inferred sources"
Real-world sounds rarely occur in isolation
  a useful sense of hearing must be able to segregate mixtures
- people (and...) do this very well; unexpectedly difficult to model
- depends on:
  subjective definition of relevant sources
  regularity/constraints of real-world sounds
Studied via experimental psychology
- characterize rules for organizing simple pieces (tones, noise bursts, clicks), i.e. a reductive approach

Computational Auditory Scene Analysis (CASA)
Psychological rules suggest computer implementation
- ... but many practical problems arise!
Motivations:
Practical applications
- real-world interactive systems
- indexing of media databases
- hearing prostheses
Crossover opportunities
- unknown signal/information processing principles?
Benefits for theory
- implementations are very revealing

The grouping paradigm
Standard theory of ASA (Bregman, Darwin &c):
- sound mixture is broken up into small elements, e.g. time-frequency cells
- each element has a number of feature dimensions (amplitude, ITD, period)
- elements are grouped together according to their features to form larger structures
- resulting groups have overall attributes (pitch, location)
[Figure: from Darwin 1996]
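A minimal sketch of the elements-then-grouping idea, not any published model. The `Element` fields, the tolerances and the greedy matching rule are all assumptions chosen for illustration; the only point is that elements carry features, similar features put them in one group, and the group then has its own attribute (here pitch).

```python
from dataclasses import dataclass

@dataclass
class Element:
    freq_hz: float    # centre frequency of the time-frequency element
    onset_s: float    # onset time (a feature dimension)
    period_s: float   # dominant periodicity (a feature dimension)

def group_elements(elements, onset_tol=0.02, period_tol=0.0005):
    """Greedy grouping: an element joins the first group whose first
    member matches it in both onset and period (assumed tolerances)."""
    groups = []
    for el in elements:
        for g in groups:
            ref = g[0]
            if (abs(el.onset_s - ref.onset_s) <= onset_tol
                    and abs(el.period_s - ref.period_s) <= period_tol):
                g.append(el)
                break
        else:
            groups.append([el])  # no match: start a new group
    return groups

def group_pitch_hz(group):
    """Overall attribute of a group: pitch from the mean period."""
    mean_period = sum(e.period_s for e in group) / len(group)
    return 1.0 / mean_period

# Two hypothetical sources: harmonics of 100 Hz starting near 0.1 s,
# and harmonics of 125 Hz starting near 0.5 s.
elements = [
    Element(200.0, 0.10, 0.010),
    Element(300.0, 0.11, 0.010),
    Element(400.0, 0.10, 0.010),
    Element(250.0, 0.50, 0.008),
    Element(500.0, 0.51, 0.008),
]
groups = group_elements(elements)
```

With these made-up elements the first three land in one group and the last two in another.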

Marr's levels-of-explanation of information processing
Three distinct aspects to info. processing
Computational Theory | "what" and "why"; the overall goal | Sound source organization
Algorithm | "how"; an approach to meeting the goal | Auditory grouping
Implementation | practical realization of the process | Feature calculation & binding
Why bother?
- to help organize understanding
- avoid confusion/wasted effort
- use as an analysis tool...

Level 1: Computational theory
The underlying regularities that make the problem possible
- i.e. the ecological facts
Implicit definition of "what is a source?":
- Independence of attributes between sources
- Continuity of attributes for each source
- + other source-specific constraints

Level 2: Algorithm
A particular approach to exploiting the constraints of the computational theory
- both process & representation
Audition: the elements-then-grouping approach
- could have been otherwise, e.g. templates
Often the focus of analysis
- but: debate is muddled without a clear computational theory

Level 3: Implementation
A specific realization of the algorithm
- computer programs
- neurons
- ...
Can be analyzed separately?
- provided epiphenomena are correctly assigned
Needs context of algorithm, computational theory
"You cannot understand stereopsis simply by thinking about neurons"

The advantage of the appropriate level
Computational theory
- determines the purpose of the process; provides focus necessary for analysis
- e.g. biosonar: benefit of hyperresolution
Algorithm
- abstraction that is still specific, transferable
- e.g. autocorrelation for pitch
Implementation
- explain epiphenomena
- e.g. subjective octave from refractory period

An example: Neural inhibition
Computational theory: Frequency-domain processing [figure: X(f) against f]
Algorithm: Discrete-time filtering (subtraction)
Implementation: Neurons with GABAergic inhibitions
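One way to read the three levels together, as a sketch under assumptions: treat inhibition as subtracting a delayed copy of the input, y[n] = x[n] - g x[n-d]. The delay `d`, the gain `g` and the sample rate below are illustrative values, not taken from the slide. The subtraction in time then has a frequency-domain consequence, a notch at fs/d.

```python
import math

def inhibit(x, delay, gain=1.0):
    """y[n] = x[n] - gain * x[n - delay]; samples before the start count as zero."""
    return [x[n] - (gain * x[n - delay] if n >= delay else 0.0)
            for n in range(len(x))]

def gain_at(freq_hz, fs_hz, delay, gain=1.0):
    """Magnitude response |1 - gain * exp(-j 2 pi f delay / fs)| of the filter above."""
    w = 2.0 * math.pi * freq_hz * delay / fs_hz
    return math.hypot(1.0 - gain * math.cos(w), gain * math.sin(w))

fs = 8000    # assumed sample rate in Hz
delay = 8    # samples, so the first notch sits at fs / delay = 1000 Hz
tone = [math.sin(2.0 * math.pi * 1000.0 * n / fs) for n in range(80)]
out = inhibit(tone, delay)

# Largest output sample once the delayed copy is available.
residual = max(abs(v) for v in out[delay:])
```

A 1000 Hz tone is cancelled once the delayed copy arrives, while `gain_at(500, fs, delay)` shows the response doubling halfway between notches.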

Summary 1
Acoustic scenes are very complex
- ... but the auditory system extracts useful information
Grouping is the main focus of Auditory Scene Analysis
- ... but it fits into a larger Marrian framework

Outline
1 The computational theory of ASA
2 Cues & grouping
  - Cue analysis
  - Simple scenes
  - Models
  - Complications: interaction, ambiguity, time
3 Expectations & inference
4 Big issues

Cues to grouping
Common onset/offset/modulation ("fate")
Common periodicity ("pitch")
Level | Common onset | Periodicity
Computational theory | Acoustic consequences tend to be synchronized | (Nonlinear) cyclic processes are common?
Algorithm | Group elements that start in a time range | Place patterns? Autocorrelation?
Implementation | Onset detector cells; Synchronized osc's? | Delay-and-mult? Modulation spect
Spatial location (ITD, ILD, spectral cues)
Sequential cues...
Source-specific cues...
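As an illustration of the autocorrelation option for the periodicity cue, here is a minimal sketch that picks the lag with the largest mean lagged product. The test signal, sample rate and lag range are assumptions for the example; real models typically work per frequency channel rather than on the raw waveform.

```python
import math

def autocorr_period(x, min_lag, max_lag):
    """Return the lag (in samples) in [min_lag, max_lag] that maximises
    the mean of x[n] * x[n - lag]."""
    best_lag, best_val = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        total = sum(x[n] * x[n - lag] for n in range(lag, len(x)))
        val = total / (len(x) - lag)  # normalise by the number of terms
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag

fs = 8000     # assumed sample rate in Hz
f0 = 200.0    # fundamental of the synthetic test signal
# Three harmonics with 1/h amplitudes, 0.1 s long.
x = [sum(math.sin(2.0 * math.pi * f0 * h * n / fs) / h for h in (1, 2, 3))
     for n in range(800)]

lag = autocorr_period(x, min_lag=20, max_lag=60)
pitch_hz = fs / lag
```

The search range is deliberately narrower than two periods; a wider range would also find peaks at multiples of the period, which is the usual octave ambiguity of this method.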

Simple grouping
E.g. isolated tones [figure: freq against time]
Computational theory: common onset; common period (harmonicity)
Algorithm: locate elements (tracks); group by shared features?
Implementation: exhaustive search; evolution in time
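A sketch of grouping isolated tone tracks by harmonicity with an exhaustive search over candidate fundamentals. The 3% tolerance, the candidate grid and the tie-break toward the higher fundamental are assumptions made for this example, not part of the slide.

```python
def harmonic_members(freqs_hz, f0_hz, tol=0.03):
    """Frequencies lying within a relative tolerance of an integer multiple of f0."""
    members = []
    for f in freqs_hz:
        h = round(f / f0_hz)  # nearest harmonic number
        if h >= 1 and abs(f - h * f0_hz) <= tol * h * f0_hz:
            members.append(f)
    return members

def best_fundamental(freqs_hz, candidates_hz, tol=0.03):
    """Exhaustive search: the candidate f0 that accounts for the most tracks.
    Ties go to the higher f0, which avoids picking a subharmonic."""
    return max(candidates_hz,
               key=lambda f0: (len(harmonic_members(freqs_hz, f0, tol)), f0))

# Hypothetical tone tracks: four harmonics of 200 Hz plus one unrelated tone.
tracks = [200.0, 400.0, 600.0, 800.0, 930.0]
candidates = [float(f) for f in range(80, 401, 10)]

f0 = best_fundamental(tracks, candidates)
group = harmonic_members(tracks, f0)
leftover = [f for f in tracks if f not in group]
```

The leftover track is what a second pass, or a different cue such as onset, would have to explain.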

Computer models of grouping
"Bregman at face value" (e.g. Brown 1992):
[Block diagram: input mixture -> Front end -> signal features (maps) -> Object formation -> discrete objects -> Grouping rules -> Source groups; feature maps over freq and time: onset, period, frq.mod]
- feature maps
- periodicity cue
- common-onset boost
- resynthesis

Grouping model results
Able to extract voiced speech:
[Figure: spectrograms of brn1h.aif and brn1h.fi.aif, frq/Hz (100-3000) against time/s (0.2-1.0)]
Periodicity is the primary cue
- how to handle aperiodic energy?
Limitations
- resynthesis via filter-mask
- only periodic targets
- robustness of discrete objects

Complications for grouping: 1: Cues in conflict
Mistuned harmonic (Moore, Darwin..): [figure: freq against time]
- harmonic usually groups by onset & periodicity
- can alter frequency and/or onset time
- degree of grouping from overall pitch match
Gradual, various results: [figure: pitch shift; 3% mistuning]
- heard as separate tone, still affects pitch

Complications for grouping: 2: The effect of time
Added harmonics: [figure: freq against time]
- onset cue initially segregates; periodicity eventually fuses
The effect of time
- some cues take time to become apparent
- onset cue becomes increasingly distant...
What is the impetus for fission?
- e.g. double vowels
- depends on what you expect..?

Summary 2
Known grouping cues make sense
Simple examples are straightforward
Models can be implemented directly
- ... but problematic situations abound

Outline
1 The computational theory of ASA
2 Cues & grouping
3 Expectations & inference
  - Old-plus-new
  - Streaming
  - Restoration & illusions
  - Top-down models
4 Big issues

The effect of context
Context can create an "expectation", i.e. a bias towards a particular interpretation
e.g. Bregman's "old-plus-new" principle:
  A change in a signal will be interpreted as an added source whenever possible
[Figure: spectrogram, freq/kHz (0-2) against time/s (0.0-1.2)]
- a different division of the same energy depending on what preceded it

Streaming
Successive tone events form separate streams
[Figure: tone sequence, freq. against time; TRT: 60-150 ms; 1 kHz; f: ±2 octaves]
Order, rhythm &c within, not between, streams
Computational theory: Consistency of properties for successive source events
Algorithm: "expectation window" for known streams (widens with time)
Implementation: competing time-frequency affinity weights...
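The "expectation window" idea can be sketched as follows: each known stream accepts a new tone only if its frequency lies within a window around the stream's last tone, and that window widens with the time elapsed since that tone. The window size `base_octaves` and growth rate `widen_per_s` are invented parameters for illustration, not measured values.

```python
import math

def assign_streams(tones, base_octaves=0.25, widen_per_s=2.0):
    """tones: list of (time_s, freq_hz) in time order.
    Returns one stream index per tone."""
    streams = []  # each entry: [last_time_s, last_freq_hz]
    labels = []
    for t, f in tones:
        best, best_dist = None, None
        for i, (t_last, f_last) in enumerate(streams):
            dist = abs(math.log2(f / f_last))                   # distance in octaves
            window = base_octaves + widen_per_s * (t - t_last)  # widens with time
            if dist <= window and (best_dist is None or dist < best_dist):
                best, best_dist = i, dist
        if best is None:
            streams.append([t, f])  # no stream expects this tone: start a new one
            best = len(streams) - 1
        else:
            streams[best] = [t, f]
        labels.append(best)
    return labels

# Alternating tones two octaves apart, at a fast and a slow repetition time.
fast = [(0.1 * i, 400.0 if i % 2 == 0 else 1600.0) for i in range(6)]
slow = [(1.0 * i, 400.0 if i % 2 == 0 else 1600.0) for i in range(6)]

fast_labels = assign_streams(fast)
slow_labels = assign_streams(slow)
```

With these assumed parameters the fast sequence splits into two streams and the slow one stays in a single stream, which is the qualitative pattern the slide describes.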

Restoration & illusions
Direct evidence may be masked or distorted
- make best guess using available information
E.g. the "continuity illusion":
[Figure: spectrogram "ptshort", f/Hz (1000-4000) against time/s (0.0-1.4)]
- tone alternates with noise bursts
- noise is strong enough to mask tone... so listener cannot discriminate presence
- continuous tone distinctly perceived for gaps ~100s of ms
Inference acts at low, preconscious level

Speech restoration
Speech provides very strong bases for inference (coarticulation, grammar, semantics):
- Phonemic restoration
- Temporal compounds
- Sinewave speech (duplex?)
[Figures: spectrograms labelled nsoffee.aif, "Temporal compound (1998jul10)" and "S1 env.pf:0"; axes frq/Hz, f/Bark, time/s, time/ms]

Models of top-down processing
Perception as a search for plausible explanations
"Prediction-driven" CASA (PDCASA):
[Block diagram with blocks Front end, Compare & reconcile, Hypothesis management, Predict & combine; signals: input mixture, signal features, prediction errors, hypotheses (Noise components, Periodic components), predicted features]
An approach as well as an implementation...
Key features:
- "complete explanation" of all scene energy
- vocabulary of periodic/noise/transient elements
- multiple hypotheses
- explanation hierarchy

PDCASA for old-plus-new
Incremental analysis [figure: input signal with times t1, t2, t3 marked]
- Time t1: initial element created
- Time t2: additional element required
- Time t3: second element finished
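A toy version of the incremental analysis, as a sketch only: one energy value per time step, elements whose energies add, and a rule that unexplained extra energy creates a new element while missing energy retires the most recently added elements first. The threshold and the retire-newest-first rule are assumptions for this example and are not claimed to match the PDCASA system.

```python
def explain(frames, threshold=1.0):
    """frames: observed energy per time step.
    Returns elements as dicts with start, end (None while active) and level."""
    elements = []
    for t, observed in enumerate(frames):
        active = [e for e in elements if e["end"] is None]
        predicted = sum(e["level"] for e in active)
        error = observed - predicted
        if error > threshold:
            # More energy than predicted: keep the old elements and add a
            # new one for the difference (old-plus-new).
            elements.append({"start": t, "end": None, "level": error})
        elif error < -threshold:
            # Less energy than predicted: retire the newest elements until
            # the prediction is consistent with the observation again.
            for e in reversed(active):
                if predicted - observed <= threshold:
                    break
                e["end"] = t
                predicted -= e["level"]
    return elements

# A steady sound (level 2) with a second sound (level 3) added in the middle.
frames = [0, 0, 2, 2, 5, 5, 2, 2, 0, 0]
elements = explain(frames)
summary = [(e["start"], e["end"], e["level"]) for e in elements]
```

The rise at step 4 is explained by a second element on top of the first, rather than by the first element changing level.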

PDCASA for the continuity illusion
Subjects hear the tone as continuous... if the noise is a plausible masker
[Figure: spectrogram "ptshort", f/Hz (1000-4000) against time/s (0.0-1.4)]
Data-driven analysis gives just visible portions: [figure]
Prediction-driven can infer masking: [figure]
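The contrast between the two analyses can be sketched with per-frame flags. Here a tone that was present in the previous frame is carried through a frame where it is not visible, provided the noise in that frame is at least as strong as the tone and so could be hiding it. The frame representation, the levels and the simple `noise >= tone_level` masking test are assumptions for illustration.

```python
def tone_segments(tone_visible, noise_level, tone_level, prediction_driven):
    """Return (start, end) frame ranges in which the tone is reported."""
    present = []
    for visible, noise in zip(tone_visible, noise_level):
        masked = noise >= tone_level  # the noise is a plausible masker
        if visible:
            present.append(True)
        elif prediction_driven and masked and present and present[-1]:
            present.append(True)      # prediction is not contradicted: continue
        else:
            present.append(False)
    segments, start = [], None
    for i, p in enumerate(present + [False]):  # sentinel closes the last segment
        if p and start is None:
            start = i
        elif not p and start is not None:
            segments.append((start, i))
            start = None
    return segments

T, F = True, False
tone_visible = [T, T, F, F, T, T, F, F, T, T]
noise_level = [0, 0, 9, 9, 0, 0, 0, 0, 0, 0]   # first gap noisy, second gap silent
tone_level = 5

segs_data = tone_segments(tone_visible, noise_level, tone_level, False)
segs_pred = tone_segments(tone_visible, noise_level, tone_level, True)
```

The data-driven pass returns three separate tone bursts; the prediction-driven pass bridges only the gap filled by noise and leaves the silent gap as a real interruption.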

PDCASA analysis of a complex scene
[Figure: spectrogram of the "City" sound example (f/Hz against time/s, 0-9 s, level in dB) with the subject-identified events Horn1 (10/10), Horn2 (5/10), Horn3 (5/10), Horn4 (8/10), Horn5 (10/10), Crash (10/10), Squeal (6/10), Truck (7/10), and panels for the extracted elements Wefts1-4, Weft5, Wefts6,7, Weft8, Wefts9-12, Noise1, and Noise2,Click1]

Marrian analysis of PDCASA
Marr invoked to separate high-level function from low-level details
Computational theory:
- Objects persist predictably
- Observations interact irreversibly
Algorithm:
- Build hypotheses from generic elements
- Update by prediction-reconciliation
Implementation: ???
"It is not enough to be able to describe the response of single cells, nor predict the results of psychophysical experiments. Nor is it enough even to write computer programs that perform approximately in the desired way: One has to do all these things at once, and also be very aware of the computational theory..."

Summary 3
Perceptual processing is highly context-dependent
Auditory system will use prior knowledge to fill in gaps (subconsciously)
Prediction-reconciliation models can encompass this behavior

Outline
1 The computational theory of ASA
2 Cues & grouping
3 Expectations & inference
4 Big issues
  - the state of ASA and CASA
  - outstanding issues
  - discussion points

The current state of ASA and CASA
ASA
- detailed descriptions of in vitro tests
- some quite subtle effects explained (DV beats)
- but: how to extend to complex scenarios?
CASA
- numerous models, some convergence (mainly periodicity-based)
- best results sound impressive (least plausible systems!)
- applications in speech recognition?
- but: domains limited, poor robustness

Big issues in CASA:
Plausibility
- correct level for human correspondence?
- which phenomena are important to match?
- how to implement symbolic-style processing?
Top-down vs. bottom-up
- different approaches to ambiguity, latency
- how far down for top-down?
- how far up for high level?
- choice between extraction & inference?
Integrating multiple cues (e.g. binaural)
Other debates:
- what is the real goal?
- resynthesis
- evaluation

Big issues in ASA & CASA:
Knowledge: how to acquire, represent & store...
- short-term: context
- long-term: memories
- abstract: classes, generalities
Attention:
- what does it mean in these models?
- limitation or important principle?

Conclusions
Real-world sounds are complex; scene-analysis is required
We know certain cues & some rules, but real situations raise contradictions
Current models handle obvious cases; robustness & generality are hard
Many issues remain

Discussion points
Are Marr's levels important? Useful? Can you study levels in isolation?
What do restoration phenomena imply about internal representations?
Do we have an adequate account of an ASA algorithm? e.g. where do hypotheses come from?
How important/challenging are phenomena like duplex perception, sinewave speech etc.?