Automatic audio analysis for content description & indexing


Automatic audio analysis for content description & indexing
Dan Ellis
International Computer Science Institute, Berkeley CA

Outline
- Auditory Scene Analysis (ASA)
- Computational ASA (CASA)
- Prediction-driven CASA
- Speech recognition & sound mixtures
- Implications for content analysis

1 Auditory Scene Analysis
The organization of complex sound scenes according to their inferred sources
Sounds rarely occur in isolation
- organization required for useful information
Human audition is very effective
- unexpectedly difficult to model
Correct analysis defined by goal
- source shows independence, continuity
- ecological constraints enable organization
[Figure: spectrogram of a city-street scene (f/Hz vs. time/s, level in dB)]

Psychology of ASA
Extensive experimental research
- organization of simple pieces (sinusoids & white noise)
- streaming, pitch perception, double vowels
Auditory Scene Analysis [Bregman 1990]: grouping rules
- common onset/offset/modulation, harmonicity, spatial location
Debated... (Darwin, Carlyon, Moore, Remez)
[Figure from Darwin 1996]
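
To make the grouping rules concrete, here is a minimal, hypothetical sketch (not from the talk) that groups sinusoidal tracks using two of Bregman's cues, common onset and harmonicity. The track data, tolerances, and function name are invented for illustration.

```python
import numpy as np

# Hypothetical sketch: group sinusoidal "tracks" by common onset and harmonicity.
# Tracks that start at the same time and sit near integer multiples of a group's
# lowest frequency are assigned to the same source.

def group_tracks(onsets_s, freqs_hz, onset_tol_s=0.03, harmonic_tol=0.03):
    """Return a group label per track."""
    order = np.argsort(freqs_hz)
    groups = []                                  # each: {'onset', 'f0', 'members'}
    labels = np.zeros(len(freqs_hz), dtype=int)
    for i in order:
        placed = False
        for g_idx, g in enumerate(groups):
            ratio = freqs_hz[i] / g['f0']
            harmonic = abs(ratio - round(ratio)) < harmonic_tol * round(ratio)
            synchronous = abs(onsets_s[i] - g['onset']) < onset_tol_s
            if harmonic and synchronous:
                g['members'].append(i)
                labels[i] = g_idx
                placed = True
                break
        if not placed:
            groups.append({'onset': onsets_s[i], 'f0': freqs_hz[i], 'members': [i]})
            labels[i] = len(groups) - 1
    return labels

# Two overlapping complex tones: 200 Hz harmonics starting at t=0,
# 310 Hz harmonics starting at t=0.5 s.
onsets = np.array([0.0, 0.0, 0.0, 0.5, 0.5, 0.5])
freqs  = np.array([200., 400., 600., 310., 620., 930.])
print(group_tracks(onsets, freqs))   # -> [0 0 0 1 1 1]
```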

2 Computational Auditory Scene Analysis (CASA)
Automatic sound organization?
- convert an undifferentiated signal into a description in terms of different sources
[Figure: city-street spectrogram labeled with events: horn, car noise, door, crash, horn, yell]
Translate psych. rules into programs?
- representations to reveal common onset, harmonicity...
Motivations & Applications
- it's a puzzle: new processing principles?
- real-world interactive systems (speech, robots)
- hearing prostheses (enhancement, description)
- advanced processing (remixing)
- multimedia indexing...

CASA survey
Early work on co-channel speech
- listeners benefit from pitch difference
- algorithms for separating periodicities
Utterance-sized signals need more
- cannot predict number of signals (0, 1, 2...)
- birth/death processes
Ultimately, more constraints needed
- nonperiodic signals
- masked cues
- ambiguous signals

CASA1: Periodic pieces
Weintraub: separate male & female voices
- find periodicities in each frequency channel by auto-coincidence
- number of voices is a hidden state
Cooke & Brown (1991-3)
- divide time-frequency plane into elements
- apply grouping rules to form sources
- pull a single periodic target out of noise
[Figure: spectrograms of the mixture (brn1h.aif) and the extracted target (brn1h.fi.aif)]
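
As a rough illustration of per-channel periodicity detection in the spirit of auto-coincidence (not the original systems), the sketch below filters a signal into a few bands and autocorrelates each band to find its dominant period. The band edges, pitch range, and test mixture are illustrative assumptions.

```python
import numpy as np
from scipy.signal import butter, lfilter

def band_periodicities(x, sr, bands=((80, 400), (400, 1600), (1600, 3600)),
                       fmin=60.0, fmax=400.0):
    """Return (dominant_f0, salience) per band from autocorrelation of the
    rectified band signal (a crude stand-in for auto-coincidence)."""
    results = []
    for lo, hi in bands:
        b, a = butter(4, [lo / (sr / 2), hi / (sr / 2)], btype='band')
        env = np.abs(lfilter(b, a, x))           # crude envelope: rectified band output
        env = env - env.mean()
        ac = np.correlate(env, env, mode='full')[len(env) - 1:]
        lag_lo, lag_hi = int(sr / fmax), int(sr / fmin)
        lag = lag_lo + np.argmax(ac[lag_lo:lag_hi])
        results.append((sr / lag, ac[lag] / (ac[0] + 1e-12)))
    return results

sr = 8000
t = np.arange(0, 0.5, 1 / sr)
# Mixture of two voiced-like sources with 120 Hz and 220 Hz fundamentals;
# each band may lock onto either voice's period.
x = (np.sign(np.sin(2 * np.pi * 120 * t)) + 0.7 * np.sign(np.sin(2 * np.pi * 220 * t)))
for f0, sal in band_periodicities(x, sr):
    print(f"band pitch estimate: {f0:6.1f} Hz  (salience {sal:.2f})")
```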

CASA2: Hypothesis systems
Okuno et al. (1994-)
- tracers follow each harmonic, plus a noise agent
- residue-driven: account for the whole signal
Klassner: search for a combination of templates
- high-level hypotheses permit front-end tuning
[Figure: template tracks for Buzzer-Alarm, Glass-Clink, Phone-Ring and Siren-Chirp at fixed frequencies (420 Hz - 3760 Hz) vs. time]
Ellis: model for events perceived in dense scenes
- prediction-driven: reconcile observations with hypotheses

CASA3: Other approaches
Blind source separation (Bell & Sejnowski)
- find exact separation parameters by maximizing a statistic, e.g. signal independence
HMM decomposition (R.K. Moore)
- recover combined source states directly
Neural models (Malsburg, Wang & Brown)
- avoid implausible AI methods (search, lists)
- oscillators substitute for iteration?
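
For the blind-source-separation item, here is a toy sketch of an Infomax-style rule (a tanh-nonlinearity, natural-gradient variant in the spirit of Bell & Sejnowski, not their exact algorithm): the unmixing matrix W is adapted so the outputs become statistically independent. The sources, mixing matrix, learning rate, and iteration count are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
s = rng.laplace(size=(2, n))                       # two sparse (super-Gaussian) sources
A = np.array([[1.0, 0.6], [0.4, 1.0]])             # unknown instantaneous mixing
x = A @ s
x = x - x.mean(axis=1, keepdims=True)

W = np.eye(2)                                      # unmixing matrix to be learned
lr = 0.02
for _ in range(2000):
    u = W @ x                                      # current source estimates
    grad = (np.eye(2) - np.tanh(u) @ u.T / n) @ W  # natural-gradient independence step
    W = W + lr * grad

print("W @ A (should approach a scaled permutation):")
print(W @ A)
```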

3 Prediction-driven CASA
Perception is not direct but a search for plausible hypotheses
Data-driven:
  input mixture -> Front end -> signal features -> Object formation (grouping rules) -> discrete objects -> source groups
vs. Prediction-driven:
  input mixture -> Front end -> signal features -> Compare & reconcile -> prediction errors -> Hypothesis management -> hypotheses (periodic & noise components) -> Predict & combine -> predicted features (fed back to Compare & reconcile)
Motivations
- detect non-tonal events (noise & clicks)
- support restoration illusions
- hooks for high-level knowledge
+ complete explanation, multiple hypotheses, resynthesis
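
The skeleton below is a structural sketch of that predict-compare-reconcile loop, not the actual system: hypotheses predict the observed features, the residual error flags unexplained energy, and the existing hypotheses absorb the rest. The NoiseHypothesis class, threshold, and feature frames are all illustrative assumptions.

```python
import numpy as np

class NoiseHypothesis:
    """Models a slowly varying noise floor as a per-channel level estimate."""
    def __init__(self, n_channels):
        self.level = np.zeros(n_channels)

    def predict(self):
        return self.level

    def update(self, frame, alpha=0.1):
        self.level += alpha * (frame - self.level)

def prediction_driven(frames, new_event_threshold=3.0):
    """frames: iterable of nonnegative spectral-energy vectors."""
    hypotheses = [NoiseHypothesis(len(frames[0]))]
    events = []
    for t, frame in enumerate(frames):
        predicted = np.sum([h.predict() for h in hypotheses], axis=0)
        error = frame - predicted                      # compare prediction with observation
        if error.max() > new_event_threshold:          # unexplained energy: record an event
            events.append((t, int(np.argmax(error))))  # (a full system would spawn a new hypothesis)
        for h in hypotheses:                           # reconcile: existing hypotheses track the input
            h.update(frame)
    return events

# Synthetic input: low noise floor with a burst in channel 2 at frame 10
frames = np.full((30, 4), 0.5)
frames[10, 2] = 6.0
print(prediction_driven(list(frames)))   # -> [(10, 2)]
```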

Analyzing the continuity illusion
Interrupted tone heard as continuous
- ...if the interruption could be a masker
[Figure: spectrogram of a tone interrupted by noise bursts (ptshort)]
Data-driven just sees gaps
Prediction-driven can accommodate
- special case or general principle?
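
A toy version of the "could be a masker" test: the interrupted tone may be inferred as continuous only if the interrupting sound carries at least as much energy at the tone's frequency as the tone itself. The band levels, margin, and function name below are invented for illustration.

```python
import numpy as np

def gap_could_be_masked(tone_level_db, gap_spectrum_db, tone_bin, margin_db=0.0):
    """True if the energy at the tone's frequency during the gap would have
    masked a tone of the given level (so continuity is a plausible hypothesis)."""
    return gap_spectrum_db[tone_bin] >= tone_level_db + margin_db

tone_level = 55.0                          # dB, level of the tone before/after the gap
loud_noise = np.full(32, 65.0)             # broadband burst well above the tone
quiet_gap  = np.full(32, 20.0)             # near-silent gap
print(gap_could_be_masked(tone_level, loud_noise, tone_bin=10))  # True  -> hear continuity
print(gap_could_be_masked(tone_level, quiet_gap,  tone_bin=10))  # False -> hear a gap
```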

Phonemic Restoration (Warren 1970)
Another illusion instance
Inference relies on high-level semantics
[Figure: spectrogram of noise-interrupted speech, 0-3500 Hz (nsoffee.aif)]
Incorporating knowledge into models?

Subjective ground-truth in mixtures?
Listening tests collect perceived events; consistent answers:
- Horn1 (10/10): S9 "horn 2", S10 "car horn", S4 "horn1", S6 "double horn", S2 "first double horn", S7 "horn", S7 "horn2", S3 "1st horn", S5 "Honk", S8 "car horns", S1 "honk, honk"
- Crash (10/10): S7 "gunshot", S8 "large object crash", S6 "slam", S9 "door Slam?", S2 "crash", S4 "crash", S10 "door slamming", S5 "Trash can", S3 "crash (not car)", S1 "slam"
- Horn2 (5/10): S9 "horn 5", S8 "car horns", S2 "horn during crash", S6 "doppler horn", S7 "horn3"
- Truck (7/10): S8 "truck engine", S2 "truck accelerating", S5 "Acceleration", S1 "rev up/passing", S6 "acceleration", S3 "closeup car", S10 "wheels on road"

PDCASA example: City-street ambience
[Figure: city spectrogram and the extracted elements — wefts matching the horns (Horn1 10/10, Horn2 5/10, Horn3 5/10, Horn4 8/10, Horn5 10/10), noise/click elements matching Crash (10/10), Squeal (6/10) and Truck (7/10), plus a background noise element]
Problems
- error allocation
- rating hypotheses
- source hierarchy
- resynthesis

4 Speech recognition & sound mixtures
Conventional speech recognition:
  signal -> Feature extraction -> low-dim. features -> Phoneme classifier -> phoneme probabilities -> HMM decoder -> words
- signal assumed entirely speech
- find valid labelling by discrete labels
- class models from training data
Some problems:
- need to ignore lexically-irrelevant variation (microphone, voice pitch etc.)
- compact feature space makes everything speech-like
Very fragile to nonspeech, background
- scene-analysis methods very attractive...
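
To illustrate the decoder stage of that pipeline, here is a minimal Viterbi search over per-frame phoneme probabilities under a simple left-to-right HMM. The three-state model and the probability matrix are invented; a real recognizer uses word/phone networks and a language model.

```python
import numpy as np

def viterbi(log_obs, log_trans):
    """log_obs: (T, S) per-frame log state scores; log_trans: (S, S) log transitions."""
    T, S = log_obs.shape
    delta = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    delta[0] = log_obs[0]
    delta[0, 1:] = -np.inf                          # force starting in state 0
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans  # score[i, j]: best path ending i -> j
        back[t] = np.argmax(scores, axis=0)
        delta[t] = scores[back[t], np.arange(S)] + log_obs[t]
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# 3-state left-to-right model with self-loops (think of three phones in a word)
log_trans = np.log(np.array([[0.7, 0.3, 0.0],
                             [0.0, 0.7, 0.3],
                             [0.0, 0.0, 1.0]]) + 1e-12)
# Fake classifier output: state 0 likely early, state 1 mid, state 2 late
probs = np.array([[.8, .1, .1]] * 4 + [[.1, .8, .1]] * 4 + [[.1, .1, .8]] * 4)
print(viterbi(np.log(probs), log_trans))   # -> [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
```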

CASA for speech recognition
Data-driven: CASA as preprocessor
- problems with holes (but: Cooke, Okuno)
- doesn't exploit knowledge of speech structure
Prediction-driven: speech as a component
- same reconciliation of speech hypotheses
- need to express predictions in the signal domain
[Diagram: input mixture -> Front end -> Compare & reconcile <-> Hypothesis management over speech, periodic and noise components -> Predict & combine]

Example of speech & nonspeech
[Figure panels:
(a) Clap (clap8k env.pf)
(b) Speech plus clap (223cl env.pf)
(c) Recognizer output: phone labels ("h# w n ay n tclt uw f ay ah s ay ow h# v s eh v ah n h#" and "h# n ay n t uw f ay v ow h# s eh v ax n tcl"), decoded as "<SIL> nine two five oh <SIL> seven"
(d) Reconstruction from labels alone (223cl renvg.pf)
(e) Slowly varying portion of original (223cl envg.pf)
(f) Predicted speech element ( = (d)+(e) ) (223cl renv.pf)
(g) Click5 from nonspeech analysis (justclick.pf)
(h) Spurious elements from nonspeech analysis (nonclicks.pf)]
Problems:
- undoing classification & normalization
- finding a starting hypothesis
- granularity of integration

5 Implications for content analysis: Using CASA to index soundtracks
[Figure: city-street spectrogram labeled with events: horn, car noise, door, crash, horn, yell]
What are the objects in a soundtrack?
- subjective definition: need an auditory model
Segmentation vs. classification (see the sketch below)
- low-level cues locate events
- higher-level learned knowledge gives the semantic label (footstep, crash)... AI-complete?
But: hard to separate
- illusion phenomena suggest auditory organization depends on interpretation
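
A toy sketch of the segmentation-then-classification split: locate candidate events from a low-level cue (sharp rises in frame energy), then attach a semantic label by nearest match to learned class templates. The feature choice, templates, thresholds, and event features are all invented for the example.

```python
import numpy as np

def segment_events(frame_energy_db, onset_jump_db=10.0):
    """Return frame indices where energy rises sharply (candidate events)."""
    d = np.diff(frame_energy_db)
    return np.where(d > onset_jump_db)[0] + 1

def classify(event_feature, templates):
    """Label by nearest class template (Euclidean distance)."""
    names = list(templates)
    dists = [np.linalg.norm(event_feature - templates[n]) for n in names]
    return names[int(np.argmin(dists))]

# Invented "learned" templates: [duration_s, spectral centroid in kHz]
templates = {'footstep': np.array([0.1, 0.5]),
             'crash':    np.array([0.8, 2.5]),
             'horn':     np.array([1.0, 0.9])}

energy = np.array([40, 41, 40, 62, 61, 45, 44, 70, 69, 68], dtype=float)
for idx in segment_events(energy):
    # In a real system these features would be measured around each onset.
    feat = np.array([0.7, 2.2]) if idx == 7 else np.array([0.12, 0.45])
    print(f"event at frame {idx}: {classify(feat, templates)}")
# -> event at frame 3: footstep / event at frame 7: crash
```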

Using speech recognition for indexing
Active research area: access to news broadcast databases
- e.g. Informedia (CMU), ThisL (BBC+...)
- use LVCSR to transcribe, then text retrieval to find items
- still works OK despite the word error rate
Several systems at the NIST TREC workshop
Tricks to ignore nonspeech / poor speech
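
The transcribe-then-retrieve idea can be sketched as treating each recognizer transcript as a text document, building an inverted index, and ranking stories by query-term overlap. The transcripts and query below are invented; a real system would weight terms (e.g. TF-IDF) and allow for recognition errors.

```python
from collections import defaultdict

transcripts = {
    "story1": "the prime minister announced new spending on hospitals",
    "story2": "storm warnings issued for the north coast tonight",
    "story3": "minister resigns after hospital funding report",
}

# Build an inverted index: word -> set of stories containing it
index = defaultdict(set)
for doc_id, text in transcripts.items():
    for word in text.lower().split():
        index[word].add(doc_id)

def search(query):
    """Rank stories by the number of query terms they contain."""
    scores = defaultdict(int)
    for word in query.lower().split():
        for doc_id in index.get(word, ()):
            scores[doc_id] += 1
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(search("hospital funding minister"))
# -> story3 first (matches all three terms), story1 behind it
```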

Open issues in automatic indexing
How to do ASA?
Explanation/description hierarchy
- PDCASA: generic primitives + constraining hierarchy
- subjective & task-dependent
Classification
- connecting subjective & objective properties: finding subjective invariants, prominence
- representation of sound-object classes
Resynthesis?
- a good description should be adequate
- provided in PDCASA, but low quality
- requires good knowledge-based constraints

6 Conclusions
Auditory organization is required in real environments
We don't know how listeners do it!
- plenty of modeling interest
Prediction-reconciliation can account for illusions
- use knowledge when the signal is inadequate
- important in a wider range of circumstances?
Speech recognizers are a good source of knowledge
Automatic indexing implies a synthetic listener
- need to solve a lot of modeling issues
