Computational Auditory Scene Analysis: An overview and some observations. CASA survey. Other approaches

CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-1 The importance of auditory illusions for artificial listeners 1 Dan Ellis International Computer Science Institute, Berkeley CA <dpwe@icsi.berkeley.edu> Outline Computational Auditory Scene Analysis Computational Auditory Scene Analysis: An overview and some observations 1 Dan Ellis International Computer Science Institute, Berkeley CA <dpwe@icsi.berkeley.edu> Outline Modeling Auditory Scene Analysis 1 Auditory Scene Analysis The organization of complex sound scenes according to their inferred sources Sounds rarely occur in isolation - getting useful information from real-world sound requires auditory organization Human audition is very effective - unexpectedly difficult to model 2 3 4 A survey of CASA Illusions & prediction-driven CASA CASA and speech recognition 2 3 4 A survey of CASA Prediction-driven CASA CASA and speech recognition Correct analysis defined by goal - human beings have particular interests... - (in)dependence as the key attribute of a source - ecological constraints enable organization 5 Implications for duplex perception 5 Implications for other domains 6 Conclusions 6 Conclusions CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-2 CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-3 Computational Auditory Scene Analysis (CASA) Automatic sound organization? - convert an undifferentiated signal into a description in terms of different sources Psychoacoustics defines grouping rules - e.g. [Bregman 1990] - translate into computer programs? Motivations & Applications - it s a puzzle: new processing principles? - real-world interactive systems (speech, robots) - hearing prostheses (enhancement, description) - advanced processing (remixing) - multimedia indexing (movies etc.) 2 CASA survey Early work on co-channel speech - listeners benefit from pitch difference - algorithms for separating periodicities Utterance-sized signals need more - cannot predict number of signals (0, 1, 2...) - birth/death processes Ultimately, more constraints needed - nonperiodic signals - masked cues - ambiguous signals CASA1: Periodic pieces Weintraub 1985 - separate male & female voices - find periodicities in each frequency channel by auto-coincidence - number of voices is hidden state Cooke & Brown (1991-3) - divide time-frequency plane into elements - apply grouping rules to form sources - pull single periodic target out of noise CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-4 CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-5 CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-6 CASA2: Hypothesis systems Okuno et al. (1994-) - tracers follow each harmonic + noise agent - residue-driven: account for whole signal Klassner 1996 - search for a combination of templates - high-level hypotheses permit front-end tuning CASA3: Other approaches Blind source separation (Bell & Sejnowski) - find exact separation parameters by maximizing statistic e.g. signal independence HMM decomposition (RK Moore) - recover combined source states directly Neural models (Malsburg, Wang & Brown) - avoid implausible AI methods (search, lists) - oscillators substitute for iteration? 3 Prediction-driven CASA Perception is not direct but a search for plausible hypotheses Data-driven... vs. Prediction-driven Ellis 1996 - model for events perceived in dense scenes - prediction-driven: observations - hypotheses Novel features - reconcile complete explanation to input - vocabulary of noise/transient/periodic - multiple hypotheses - sufficient detail for reconstruction - explanation hierarchy CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-7 CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-8 CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-9

CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-10 Analyzing the continuity illusion Interrupted tone heard as continuous -.. if the interruption could be a masker PDCASA example: Construction-site ambience 4 CASA for speech recognition Speech recognition is very fragile - lots of motivation to use source separation Recognize combined states? (Moore) - state becomes very complex Data-driven just sees gaps Data-driven: CASA as preprocessor - problems with holes (Cooke, Okuno) - doesn t exploit knowledge of speech structure Prediction-driven can accommodate Prediction-driven: speech as component - same reconciliation of speech hypotheses - need to express predictions in signal domain - special case or general principle? Problems - error allocation - rating hypotheses - source hierarchy - resynthesis CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-11 CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-12 Example of speech & nonspeech 5 Prediction-driven analysis and duplex perception Single element 2 percepts? - e.g. contralateral formant transition - doesn t fit into exclusive support hierarchy But: two elements at same position - hypotheses suggest overlap - predictions combine - reconciliation is OK Order debate is sidestepped -.. not a left-to-right data path Duplex perception as masking & restoration Account for masking could work for duplex - bilateral masking levels? - masking spread? - tolerable colorations? Sinewave speech as a plausible masker? - formants hiding under each whistle? - greedy speech hypothesis generator Problems: - where do hypotheses come from? (priming) - what limits on illusory speech? Problems: - undoing classification & normalization - finding a starting hypothesis - granularity of integration CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-13 CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-14 CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-15 5 Lessons for other domains Problem: inadequate signal data - hearing: masking - vision: occlusion - other sensor domains: noise/limits General answer: employ constraints - high-level prior expectations - mid-level regularities - low-level continuity Hearing is a admirable solution Prediction-driven approach suggests priorities Essential features of PDCASA Prediction-reconciliation of hypotheses - specific hypotheses are pursued - lack-of-refutation standard Provide a complete explanation - keeping track of the obstruction can help in compensating for its effects Hierarchic representation - useful constraints occur at many levels: want to be able to apply where appropriate Preserve detail - even when resynthesis is not a goal - helps gauge goodness-of-fit 6 Conclusions Auditory organization is indispensable in real environments We don t know how listeners do it! - plenty of modeling interest Prediction-reconciliation can account for illusions - use knowledge when signal is inadequate - important in a wider range of circumstances? Speech recognizers are a good source of knowledge Wider implications of the prediction-driven approach - understanding perceptual paradoxes - applications in other domains CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-16 CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-17 CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-18

The importance of auditory illusions for artificial listeners Dan Ellis International Computer Science Institute, Berkeley CA <dpwe@icsi.berkeley.edu> Outline 1 2 3 4 5 6 Computational Auditory Scene Analysis A survey of CASA Illusions & prediction-driven CASA CASA and speech recognition Implications for duplex perception Conclusions CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-1

Computational Auditory Scene Analysis: An overview and some observations Dan Ellis International Computer Science Institute, Berkeley CA <dpwe@icsi.berkeley.edu> Outline 1 2 3 4 5 6 Modeling Auditory Scene Analysis A survey of CASA Prediction-driven CASA CASA and speech recognition Implications for other domains Conclusions CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-2

1 Auditory Scene Analysis The organization of complex sound scenes according to their inferred sources Sounds rarely occur in isolation - getting useful information from real-world sound requires auditory organization Human audition is very effective - unexpectedly difficult to model Correct analysis defined by goal - human beings have particular interests... - (in)dependence as the key attribute of a source - ecological constraints enable organization CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-3

Computational Auditory Scene Analysis (CASA) Automatic sound organization? - convert an undifferentiated signal into a description in terms of different sources Psychoacoustics defines grouping rules - e.g. [Bregman 1990] - translate into computer programs? Motivations & Applications - it s a puzzle: new processing principles? - real-world interactive systems (speech, robots) - hearing prostheses (enhancement, description) - advanced processing (remixing) - multimedia indexing (movies etc.) CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-4

2 CASA survey Early work on co-channel speech - listeners benefit from pitch difference - algorithms for separating periodicities Utterance-sized signals need more - cannot predict number of signals (0, 1, 2...) - birth/death processes Ultimately, more constraints needed - nonperiodic signals - masked cues - ambiguous signals CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-5

CASA1: Periodic pieces Weintraub 1985 - separate male & female voices - find periodicities in each frequency channel by auto-coincidence - number of voices is hidden state Cooke & Brown (1991-3) - divide time-frequency plane into elements - apply grouping rules to form sources - pull single periodic target out of noise frq/hz 3000 0 1500 brn1h.aif frq/hz 3000 0 1500 brn1h.fi.aif 600 300 150 100 0.2 0.4 0.6 0.8 1.0 time/s 600 300 150 100 0.2 0.4 0.6 0.8 1.0 time/s CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-6

CASA2: Hypothesis systems Okuno et al. (1994-) - tracers follow each harmonic + noise agent - residue-driven: account for whole signal Klassner 1996 - search for a combination of templates - high-level hypotheses permit front-end tuning 3760 Hz 2540 Hz 2230 Hz 1475 Hz 500 Hz 460 Hz 420 Hz Buzzer-Alarm Glass-Clink Phone-Ring Siren-Chirp 2350 Hz 1675 Hz 950 Hz 1.0 2.0 3.0 4.0 sec TIME (a) 1.0 2.0 3.0 4.0 sec TIME (b) Ellis 1996 - model for events perceived in dense scenes - prediction-driven: observations - hypotheses CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-7

CASA3: Other approaches Blind source separation (Bell & Sejnowski) - find exact separation parameters by maximizing statistic e.g. signal independence HMM decomposition (RK Moore) - recover combined source states directly Neural models (Malsburg, Wang & Brown) - avoid implausible AI methods (search, lists) - oscillators substitute for iteration? CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-8

3 Prediction-driven CASA Perception is not direct but a search for plausible hypotheses Data-driven... input mixture Front end signal features Object formation discrete objects Grouping rules Source groups vs. Prediction-driven input mixture Front end signal features Hypothesis management prediction errors Compare & reconcile hypotheses Noise components Periodic components predicted features Predict & combine Novel features - reconcile complete explanation to input - vocabulary of noise/transient/periodic - multiple hypotheses - sufficient detail for reconstruction - explanation hierarchy CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-9

Analyzing the continuity illusion Interrupted tone heard as continuous -.. if the interruption could be a masker f/hz ptshort 0 0 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 time/s Data-driven just sees gaps Prediction-driven can accommodate - special case or general principle? CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-10

PDCASA example: Construction-site ambience f/hz Construction 0 0 100 50 0 1 2 3 4 5 6 7 8 9 Saw (10/10) Voice (6/10) Wood hit (7/10) Metal hit (8/10) Wood drop (10/10) Clink1 (4/10) Clink2 (7/10) f/hz Noise1 0 0 f/hz Noise2 0 0 Click1 f/hz Clicks2,3 Click4 Clicks5,6 Clicks7,8 0 0 f/hz Wefts1 6 0 0 100 50 f/hz Wefts8,10 0 0 100 50 Wefts7,9 30 40 50 60 0 1 2 3 4 5 6 7 8 9 time/s db Problems - error allocation - rating hypotheses - source hierarchy - resynthesis CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-11

4 CASA for speech recognition Speech recognition is very fragile - lots of motivation to use source separation Recognize combined states? (Moore) - state becomes very complex Data-driven: CASA as preprocessor - problems with holes (Cooke, Okuno) - doesn t exploit knowledge of speech structure Prediction-driven: speech as component - same reconciliation of speech hypotheses - need to express predictions in signal domain Speech components Hypothesis management Noise components Predict & combine input mixture Front end Compare & reconcile Periodic components CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-12

Example of speech & nonspeech f/hz 223cl 0 0 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 Speech1 Click5 20 Clicks1 7 10 0 10 db 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 time/s Problems: - undoing classification & normalization - finding a starting hypothesis - granularity of integration CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-13

5 Prediction-driven analysis and duplex perception Single element 2 percepts? - e.g. contralateral formant transition - doesn t fit into exclusive support hierarchy But: two elements at same position - hypotheses suggest overlap - predictions combine - reconciliation is OK Order debate is sidestepped -.. not a left-to-right data path CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-14

Duplex perception as masking & restoration Account for masking could work for duplex - bilateral masking levels? - masking spread? - tolerable colorations? Sinewave speech as a plausible masker? - formants hiding under each whistle? - greedy speech hypothesis generator Problems: - where do hypotheses come from? (priming) - what limits on illusory speech? f/bark S1 env.pf:0 15 80 60 40 10 5 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-15

5 Lessons for other domains Problem: inadequate signal data - hearing: masking - vision: occlusion - other sensor domains: noise/limits General answer: employ constraints - high-level prior expectations - mid-level regularities - low-level continuity Hearing is a admirable solution Prediction-driven approach suggests priorities CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-16

Essential features of PDCASA Prediction-reconciliation of hypotheses - specific hypotheses are pursued - lack-of-refutation standard Provide a complete explanation - keeping track of the obstruction can help in compensating for its effects Hierarchic representation - useful constraints occur at many levels: want to be able to apply where appropriate Preserve detail - even when resynthesis is not a goal - helps gauge goodness-of-fit CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-17

6 Conclusions Auditory organization is indispensable in real environments We don t know how listeners do it! - plenty of modeling interest Prediction-reconciliation can account for illusions - use knowledge when signal is inadequate - important in a wider range of circumstances? Speech recognizers are a good source of knowledge Wider implications of the prediction-driven approach - understanding perceptual paradoxes - applications in other domains CASA talk - Haskins/NUWC - Dan Ellis 1997oct24/5-18