Computational Auditory Scene Analysis: An overview and some observations. CASA survey. Other approaches

Similar documents
Automatic audio analysis for content description & indexing

Computational Auditory Scene Analysis: Principles, Practice and Applications

Auditory Scene Analysis: phenomena, theories and computational models

Sound, Mixtures, and Learning

Recognition & Organization of Speech and Audio

Recognition & Organization of Speech and Audio

Recognition & Organization of Speech and Audio

Recognition & Organization of Speech & Audio

Computational Perception /785. Auditory Scene Analysis

Sound, Mixtures, and Learning

Using Source Models in Speech Separation

Modulation and Top-Down Processing in Audition

Robustness, Separation & Pitch

USING AUDITORY SALIENCY TO UNDERSTAND COMPLEX AUDITORY SCENES

Auditory gist perception and attention

Auditory scene analysis in humans: Implications for computational implementations.

! Can hear whistle? ! Where are we on course map? ! What we did in lab last week. ! Psychoacoustics

HCS 7367 Speech Perception

General Soundtrack Analysis

Auditory Scene Analysis. Dr. Maria Chait, UCL Ear Institute

Acoustics, signals & systems for audiology. Psychoacoustics of hearing impairment

Auditory Scene Analysis

Binaural Hearing. Why two ears? Definitions

Lecture 4: Auditory Perception. Why study perception?

Sensation and Perception -- Team Problem Solving Assignments

AUDL GS08/GAV1 Signals, systems, acoustics and the ear. Pitch & Binaural listening

The role of periodicity in the perception of masked speech with simulated and real cochlear implants

HCS 7367 Speech Perception

group by pitch: similar frequencies tend to be grouped together - attributed to a common source.

Using Speech Models for Separation

Hearing II Perceptual Aspects

Chapter 16: Chapter Title 221

Binaural processing of complex stimuli

ID# Exam 2 PS 325, Fall 2009

Infant Hearing Development: Translating Research Findings into Clinical Practice. Auditory Development. Overview

INTRODUCTION J. Acoust. Soc. Am. 103 (2), February /98/103(2)/1080/5/$ Acoustical Society of America 1080

Role of F0 differences in source segregation

Auditory Physiology PSY 310 Greg Francis. Lecture 30. Organ of Corti

Sound Waves. Sensation and Perception. Sound Waves. Sound Waves. Sound Waves

= + Auditory Scene Analysis. Week 9. The End. The auditory scene. The auditory scene. Otherwise known as

Hearing in the Environment

Categorical Perception

Cognitive Processes PSY 334. Chapter 2 Perception

Musical Instrument Classification through Model of Auditory Periphery and Neural Network

SENSATION AND PERCEPTION KEY TERMS


Issues faced by people with a Sensorineural Hearing Loss

Lecture 3: Perception

Providing Effective Communication Access

2/25/2013. Context Effect on Suprasegmental Cues. Supresegmental Cues. Pitch Contour Identification (PCI) Context Effect with Cochlear Implants

Building Skills to Optimize Achievement for Students with Hearing Loss

Does Wernicke's Aphasia necessitate pure word deafness? Or the other way around? Or can they be independent? Or is that completely uncertain yet?

What s New in Primus

21/01/2013. Binaural Phenomena. Aim. To understand binaural hearing Objectives. Understand the cues used to determine the location of a sound source

using deep learning models to understand visual cortex

Multimodal interactions: visual-auditory

Effects of Cochlear Hearing Loss on the Benefits of Ideal Binary Masking

Frequency refers to how often something happens. Period refers to the time it takes something to happen.

iclicker+ Student Remote Voluntary Product Accessibility Template (VPAT)

What is mid level vision? Mid Level Vision. What is mid level vision? Lightness perception as revealed by lightness illusions

MULTI-CHANNEL COMMUNICATION

SUPPRESSION OF MUSICAL NOISE IN ENHANCED SPEECH USING PRE-IMAGE ITERATIONS. Christina Leitner and Franz Pernkopf

Auditory System & Hearing

Lyric3 Programming Features

Pattern Playback in the '90s

Product Model #:ASTRO Digital Spectra Consolette W7 Models (Local Control)

Transfer of Sound Energy through Vibrations

Jitter, Shimmer, and Noise in Pathological Voice Quality Perception

CROS System Initial Fit Protocol

J Jeffress model, 3, 66ff

Sound localization psychophysics

Product Model #: Digital Portable Radio XTS 5000 (Std / Rugged / Secure / Type )

whether or not the fundamental is actually present.

Auditory Scene Analysis in Humans and Machines

The functional importance of age-related differences in temporal processing

HOW TO USE THE SHURE MXA910 CEILING ARRAY MICROPHONE FOR VOICE LIFT

AA-M1C1. Audiometer. Space saving is achieved with a width of about 350 mm. Audiogram can be printed by built-in printer

Sound as a communication medium. Auditory display, sonification, auralisation, earcons, & auditory icons

ClaroTM Digital Perception ProcessingTM

Hearing. istockphoto/thinkstock

What Is the Difference between db HL and db SPL?

Auditory nerve. Amanda M. Lauer, Ph.D. Dept. of Otolaryngology-HNS

Gender Based Emotion Recognition using Speech Signals: A Review

Definition Slides. Sensation. Perception. Bottom-up processing. Selective attention. Top-down processing 11/3/2013

The effect of wearing conventional and level-dependent hearing protectors on speech production in noise and quiet

= add definition here. Definition Slide

Acoustic Sensing With Artificial Intelligence

Hearing Lectures. Acoustics of Speech and Hearing. Auditory Lighthouse. Facts about Timbre. Analysis of Complex Sounds

Real-World Sound Recognition: A Recipe

Frequency Tracking: LMS and RLS Applied to Speech Formant Estimation

Sound Interfaces Engineering Interaction Technologies. Prof. Stefanie Mueller HCI Engineering Group

Errol Davis Director of Research and Development Sound Linked Data Inc. Erik Arisholm Lead Engineer Sound Linked Data Inc.

9/29/14. Amanda M. Lauer, Dept. of Otolaryngology- HNS. From Signal Detection Theory and Psychophysics, Green & Swets (1966)

Outline.! Neural representation of speech sounds. " Basic intro " Sounds and categories " How do we perceive sounds? " Is speech sounds special?

A Consumer-friendly Recap of the HLAA 2018 Research Symposium: Listening in Noise Webinar

IN EAR TO OUT THERE: A MAGNITUDE BASED PARAMETERIZATION SCHEME FOR SOUND SOURCE EXTERNALIZATION. Griffin D. Romigh, Brian D. Simpson, Nandini Iyer

Voice Detection using Speech Energy Maximization and Silence Feature Normalization

Date: April 19, 2017 Name of Product: Cisco Spark Board Contact for more information:

iclicker2 Student Remote Voluntary Product Accessibility Template (VPAT)

Transcription:

CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-1 The importance of auditory illusions for artificial listeners 1 Dan Ellis International Computer Science Institute, Berkeley CA <dpwe@icsi.berkeley.edu> Outline Computational Auditory Scene Analysis Computational Auditory Scene Analysis: An overview and some observations 1 Dan Ellis International Computer Science Institute, Berkeley CA <dpwe@icsi.berkeley.edu> Outline Modeling Auditory Scene Analysis 1 Auditory Scene Analysis The organization of complex sound scenes according to their inferred sources Sounds rarely occur in isolation - getting useful information from real-world sound requires auditory organization Human audition is very effective - unexpectedly difficult to model 2 3 4 A survey of CASA Illusions & prediction-driven CASA CASA and speech recognition 2 3 4 A survey of CASA Prediction-driven CASA CASA and speech recognition Correct analysis defined by goal - human beings have particular interests... - (in)dependence as the key attribute of a source - ecological constraints enable organization 5 Implications for duplex perception 5 Implications for other domains 6 Conclusions 6 Conclusions CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-2 CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-3 Computational Auditory Scene Analysis (CASA) Automatic sound organization? - convert an undifferentiated signal into a description in terms of different sources Psychoacoustics defines grouping rules - e.g. [Bregman 1990] - translate into computer programs? Motivations & Applications - it s a puzzle: new processing principles? - real-world interactive systems (speech, robots) - hearing prostheses (enhancement, description) - advanced processing (remixing) - multimedia indexing (movies etc.) 2 CASA survey Early work on co-channel speech - listeners benefit from pitch difference - algorithms for separating periodicities Utterance-sized signals need more - cannot predict number of signals (0, 1, 2...) - birth/death processes Ultimately, more constraints needed - nonperiodic signals - masked cues - ambiguous signals CASA1: Periodic pieces Weintraub 1985 - separate male & female voices - find periodicities in each frequency channel by auto-coincidence - number of voices is hidden state Cooke & Brown (1991-3) - divide time-frequency plane into elements - apply grouping rules to form sources - pull single periodic target out of noise CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-4 CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-5 CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-6 CASA2: Hypothesis systems Okuno et al. (1994-) - tracers follow each harmonic + noise agent - residue-driven: account for whole signal Klassner 1996 - search for a combination of templates - high-level hypotheses permit front-end tuning CASA3: Other approaches Blind source separation (Bell & Sejnowski) - find exact separation parameters by maximizing statistic e.g. signal independence HMM decomposition (RK Moore) - recover combined source states directly Neural models (Malsburg, Wang & Brown) - avoid implausible AI methods (search, lists) - oscillators substitute for iteration? 3 Prediction-driven CASA Perception is not direct but a search for plausible hypotheses Data-driven... vs. Prediction-driven Ellis 1996 - model for events perceived in dense scenes - prediction-driven: observations - hypotheses Novel features - reconcile complete explanation to input - vocabulary of noise/transient/periodic - multiple hypotheses - sufficient detail for reconstruction - explanation hierarchy CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-7 CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-8 CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-9

CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-10 Analyzing the continuity illusion Interrupted tone heard as continuous -.. if the interruption could be a masker PDCASA example: Construction-site ambience 4 CASA for speech recognition Speech recognition is very fragile - lots of motivation to use source separation Recognize combined states? (Moore) - state becomes very complex Data-driven just sees gaps Data-driven: CASA as preprocessor - problems with holes (Cooke, Okuno) - doesn t exploit knowledge of speech structure Prediction-driven can accommodate Prediction-driven: speech as component - same reconciliation of speech hypotheses - need to express predictions in signal domain - special case or general principle? Problems - error allocation - rating hypotheses - source hierarchy - resynthesis CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-11 CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-12 Example of speech & nonspeech 5 Prediction-driven analysis and duplex perception Single element 2 percepts? - e.g. contralateral formant transition - doesn t fit into exclusive support hierarchy But: two elements at same position - hypotheses suggest overlap - predictions combine - reconciliation is OK Order debate is sidestepped -.. not a left-to-right data path Duplex perception as masking & restoration Account for masking could work for duplex - bilateral masking levels? - masking spread? - tolerable colorations? Sinewave speech as a plausible masker? - formants hiding under each whistle? - greedy speech hypothesis generator Problems: - where do hypotheses come from? (priming) - what limits on illusory speech? Problems: - undoing classification & normalization - finding a starting hypothesis - granularity of integration CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-13 CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-14 CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-15 5 Lessons for other domains Problem: inadequate signal data - hearing: masking - vision: occlusion - other sensor domains: noise/limits General answer: employ constraints - high-level prior expectations - mid-level regularities - low-level continuity Hearing is a admirable solution Prediction-driven approach suggests priorities Essential features of PDCASA Prediction-reconciliation of hypotheses - specific hypotheses are pursued - lack-of-refutation standard Provide a complete explanation - keeping track of the obstruction can help in compensating for its effects Hierarchic representation - useful constraints occur at many levels: want to be able to apply where appropriate Preserve detail - even when resynthesis is not a goal - helps gauge goodness-of-fit 6 Conclusions Auditory organization is indispensable in real environments We don t know how listeners do it! - plenty of modeling interest Prediction-reconciliation can account for illusions - use knowledge when signal is inadequate - important in a wider range of circumstances? Speech recognizers are a good source of knowledge Wider implications of the prediction-driven approach - understanding perceptual paradoxes - applications in other domains CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-16 CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-17 CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-18

The importance of auditory illusions for artificial listeners Dan Ellis International Computer Science Institute, Berkeley CA <dpwe@icsi.berkeley.edu> Outline 1 2 3 4 5 6 Computational Auditory Scene Analysis A survey of CASA Illusions & prediction-driven CASA CASA and speech recognition Implications for duplex perception Conclusions CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-1

Computational Auditory Scene Analysis: An overview and some observations Dan Ellis International Computer Science Institute, Berkeley CA <dpwe@icsi.berkeley.edu> Outline 1 2 3 4 5 6 Modeling Auditory Scene Analysis A survey of CASA Prediction-driven CASA CASA and speech recognition Implications for other domains Conclusions CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-2

1 Auditory Scene Analysis The organization of complex sound scenes according to their inferred sources Sounds rarely occur in isolation - getting useful information from real-world sound requires auditory organization Human audition is very effective - unexpectedly difficult to model Correct analysis defined by goal - human beings have particular interests... - (in)dependence as the key attribute of a source - ecological constraints enable organization CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-3

Computational Auditory Scene Analysis (CASA) Automatic sound organization? - convert an undifferentiated signal into a description in terms of different sources Psychoacoustics defines grouping rules - e.g. [Bregman 1990] - translate into computer programs? Motivations & Applications - it s a puzzle: new processing principles? - real-world interactive systems (speech, robots) - hearing prostheses (enhancement, description) - advanced processing (remixing) - multimedia indexing (movies etc.) CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-4

2 CASA survey Early work on co-channel speech - listeners benefit from pitch difference - algorithms for separating periodicities Utterance-sized signals need more - cannot predict number of signals (0, 1, 2...) - birth/death processes Ultimately, more constraints needed - nonperiodic signals - masked cues - ambiguous signals CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-5

CASA1: Periodic pieces Weintraub 1985 - separate male & female voices - find periodicities in each frequency channel by auto-coincidence - number of voices is hidden state Cooke & Brown (1991-3) - divide time-frequency plane into elements - apply grouping rules to form sources - pull single periodic target out of noise frq/hz 3000 0 1500 brn1h.aif frq/hz 3000 0 1500 brn1h.fi.aif 600 300 150 100 0.2 0.4 0.6 0.8 1.0 time/s 600 300 150 100 0.2 0.4 0.6 0.8 1.0 time/s CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-6

CASA2: Hypothesis systems Okuno et al. (1994-) - tracers follow each harmonic + noise agent - residue-driven: account for whole signal Klassner 1996 - search for a combination of templates - high-level hypotheses permit front-end tuning 3760 Hz 2540 Hz 2230 Hz 1475 Hz 500 Hz 460 Hz 420 Hz Buzzer-Alarm Glass-Clink Phone-Ring Siren-Chirp 2350 Hz 1675 Hz 950 Hz 1.0 2.0 3.0 4.0 sec TIME (a) 1.0 2.0 3.0 4.0 sec TIME (b) Ellis 1996 - model for events perceived in dense scenes - prediction-driven: observations - hypotheses CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-7

CASA3: Other approaches Blind source separation (Bell & Sejnowski) - find exact separation parameters by maximizing statistic e.g. signal independence HMM decomposition (RK Moore) - recover combined source states directly Neural models (Malsburg, Wang & Brown) - avoid implausible AI methods (search, lists) - oscillators substitute for iteration? CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-8

3 Prediction-driven CASA Perception is not direct but a search for plausible hypotheses Data-driven... input mixture Front end signal features Object formation discrete objects Grouping rules Source groups vs. Prediction-driven input mixture Front end signal features Hypothesis management prediction errors Compare & reconcile hypotheses Noise components Periodic components predicted features Predict & combine Novel features - reconcile complete explanation to input - vocabulary of noise/transient/periodic - multiple hypotheses - sufficient detail for reconstruction - explanation hierarchy CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-9

Analyzing the continuity illusion Interrupted tone heard as continuous -.. if the interruption could be a masker f/hz ptshort 0 0 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 time/s Data-driven just sees gaps Prediction-driven can accommodate - special case or general principle? CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-10

PDCASA example: Construction-site ambience f/hz Construction 0 0 100 50 0 1 2 3 4 5 6 7 8 9 Saw (10/10) Voice (6/10) Wood hit (7/10) Metal hit (8/10) Wood drop (10/10) Clink1 (4/10) Clink2 (7/10) f/hz Noise1 0 0 f/hz Noise2 0 0 Click1 f/hz Clicks2,3 Click4 Clicks5,6 Clicks7,8 0 0 f/hz Wefts1 6 0 0 100 50 f/hz Wefts8,10 0 0 100 50 Wefts7,9 30 40 50 60 0 1 2 3 4 5 6 7 8 9 time/s db Problems - error allocation - rating hypotheses - source hierarchy - resynthesis CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-11

4 CASA for speech recognition Speech recognition is very fragile - lots of motivation to use source separation Recognize combined states? (Moore) - state becomes very complex Data-driven: CASA as preprocessor - problems with holes (Cooke, Okuno) - doesn t exploit knowledge of speech structure Prediction-driven: speech as component - same reconciliation of speech hypotheses - need to express predictions in signal domain Speech components Hypothesis management Noise components Predict & combine input mixture Front end Compare & reconcile Periodic components CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-12

Example of speech & nonspeech f/hz 223cl 0 0 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 Speech1 Click5 20 Clicks1 7 10 0 10 db 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 time/s Problems: - undoing classification & normalization - finding a starting hypothesis - granularity of integration CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-13

5 Prediction-driven analysis and duplex perception Single element 2 percepts? - e.g. contralateral formant transition - doesn t fit into exclusive support hierarchy But: two elements at same position - hypotheses suggest overlap - predictions combine - reconciliation is OK Order debate is sidestepped -.. not a left-to-right data path CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-14

Duplex perception as masking & restoration Account for masking could work for duplex - bilateral masking levels? - masking spread? - tolerable colorations? Sinewave speech as a plausible masker? - formants hiding under each whistle? - greedy speech hypothesis generator Problems: - where do hypotheses come from? (priming) - what limits on illusory speech? f/bark S1 env.pf:0 15 80 60 40 10 5 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-15

5 Lessons for other domains Problem: inadequate signal data - hearing: masking - vision: occlusion - other sensor domains: noise/limits General answer: employ constraints - high-level prior expectations - mid-level regularities - low-level continuity Hearing is a admirable solution Prediction-driven approach suggests priorities CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-16

Essential features of PDCASA Prediction-reconciliation of hypotheses - specific hypotheses are pursued - lack-of-refutation standard Provide a complete explanation - keeping track of the obstruction can help in compensating for its effects Hierarchic representation - useful constraints occur at many levels: want to be able to apply where appropriate Preserve detail - even when resynthesis is not a goal - helps gauge goodness-of-fit CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-17

6 Conclusions Auditory organization is indispensable in real environments We don t know how listeners do it! - plenty of modeling interest Prediction-reconciliation can account for illusions - use knowledge when signal is inadequate - important in a wider range of circumstances? Speech recognizers are a good source of knowledge Wider implications of the prediction-driven approach - understanding perceptual paradoxes - applications in other domains CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-18