Recognition & Organization of Speech and Audio

Size: px

Start display at page:

Download "Recognition & Organization of Speech and Audio"

Thomas Page
5 years ago
Views:

1 Recognition & Organization of Speech and Audio Dan Ellis Electrical Engineering, Columbia University Outline Sound organization Background & related work Existing projects Future projects Summary & conclusions talk - Dan Ellis

1 Organization of sound mixtures frq/hz 4000 2000 0 0 2 4 6 8 10 12 time/s Analysis Voice (evil) Rumble Stab Voice (pleasant)

2 1 Organization of sound mixtures frq/hz time/s Analysis Voice (evil) Rumble Stab Voice (pleasant) Strings Choir Core operation: Converting continuous, scalar signal into discrete, symbolic representation talk - Dan Ellis

3 Positioning sound organization Speech recognition Sound organization Audio processing Image/video analysis Signal processing Pattern recognition Machine learning Artificial intelligence Psychoacoustics & physiology Draws on many techniques Abuts/overlaps various areas talk - Dan Ellis

4 About auditory perception Received waveform is a mixture - two sensors, N signals... - need knowledge-based constraints Psychoacoustics: the study of human sound organization - auditory scene analysis (Bregman 90) Auditory perception is ecologically grounded - scene analysis is preconscious ( illusions) - perceived organization: real-world objects + events (transient) - subjective not canonical (ambiguity) talk - Dan Ellis

5 Key themes for Sound organization - recovering/constructing abstraction hierarchy - at an instant (sources) - along time (segmentation) Scene analysis - need to find attributes according to objects - use attributes to form objects -... plus constraints of knowledge Exploiting large data sets (the ASR lesson) - supervised/labelled: pattern recognition - unsupervised: structure discovery, clustering Special-purpose cases: - speech recognition - source-specific recognizers... within a complete explanation talk - Dan Ellis

6 Applications for sound organization What do people do with their ears? Robots - intelligence requires awareness - Sony s AIBO: dog-hearing Human-computer interface -.. includes knowing when (& why) you ve failed Archive indexing & retrieval - pure audio archives - true multimedia content analysis Content understanding - intelligent classification & summarization Autonomous monitoring Broader structure discovery algorithms talk - Dan Ellis

7 Outline Sound organization Background & related work - Audio coding & compression - Automatic Speech Recognition - Computational Auditory Scene Analysis - Multimedia information retrieval Existing projects Future projects Summary & conclusions talk - Dan Ellis

8 Audio coding & compression Goal is reconstruction, not abstraction But criteria are subjective : want same percept, not same waveform MPEG-Audio: - filterbanks - information-theoretic coding - psychoacoustic masking of quantization noise MPEG-4 Structured Audio - computer music synthesis model - instrument definition + control stream - automatic analysis? talk - Dan Ellis

Automatic Speech Recognition (ASR) Standard speech recognition structure: Feature calculation sound D A T A Acoustic model parameters Word models s ah t Language model p("sat" "the","cat") p("saw"

9 Automatic Speech Recognition (ASR) Standard speech recognition structure: Feature calculation sound D A T A Acoustic model parameters Word models s ah t Language model p("sat" "the","cat") p("saw" "the","cat") Acoustic classifier HMM decoder feature vectors phone probabilities phone / word sequence Understanding/ application... State of the art word-error rates (WERs): - 2% (dictation) - 30% (telephone conversations) Segmentation of speech & nonspeech -... recognizer wouldn t notice! talk - Dan Ellis

10 Spoken document retrieval Text-based IR on ASR transcripts - e.g. news broadcasts (CMU s Informedia, Thisl) Recognition errors are not the limiting factor - TREC-98 results: average precision Weak at word level, but OK over paragraphs - replay the audio, don t show the text! talk - Dan Ellis

11 Computational Auditory Scene Analysis (CASA) Implement psychoacoustic theory? (Brown 92) input mixture Front end signal features (maps) Object formation discrete objects Grouping rules Source groups freq time onset period frq.mod - what are the features? how are they used? Top down constraints are needed (Klassner 96) 3760 Hz 2540 Hz 2230 Hz 1475 Hz 500 Hz 460 Hz 420 Hz Buzzer-Alarm Glass-Clink Phone-Ring Siren-Chirp 2350 Hz 1675 Hz 950 Hz sec TIME sec TIME talk - Dan Ellis

12 Audio Information Retrieval Searching in a database of audio - speech.. use ASR - text annotations.. search them - sound effects library? e.g. Muscle Fish SoundFisher browser - define multiple perceptual feature dimensions - search by proximity in (weighted) feature space Segment feature analysis Sound segment database Feature vectors Seach/ comparison Results Segment feature analysis Query example - features are global for each soundfile, no attempt to separate mixtures - segmentation... talk - Dan Ellis

13 Music analysis Automatic transcription (score recovery) - classic hard problem : can people do it even? - recent success in reduced forms e.g. melody, drum track (Goto 00) Instrument identification - ideas from speaker identification (basic PR) + instrument family hierarchies (Martin 99) Fingerprinting - spot recordings despite noise, distortion - relies on perceptual invariants Music clustering - e.g. music recommendation based on signal - correlate objective features with user ratings? talk - Dan Ellis

14 Multimedia description MPEG-7 Metadata - MPEG is known for audio/video compression standards; - also develop standards for search and indexing MPEG-7 is a standard format for metadata: Well-defined categories for content description Focus is on framework & infrastructure Audio descriptor categories: - from ASR - from computer music community - uses still to emerge talk - Dan Ellis

15 Outline Sound organization Background & related work Existing projects - Acoustic change detection - Robust speech recognition - Nonspeech event detection - Prediction-driven CASA Future projects Summary & conclusions talk - Dan Ellis

frq/hz Spectrogram Acoustic change detection (with Williams/Sheffield, Ferreiros/UPMadrid) Approaches: - metric : find instants of maximal change - model-based : best alignment of model set -

16 frq/hz Spectrogram Acoustic change detection (with Williams/Sheffield, Ferreiros/UPMadrid) Approaches: - metric : find instants of maximal change - model-based : best alignment of model set - bayesian : generate models when warranted Dynamism vs. Entropy for 2.5 second segments of speecn/music Speech Music Speech+Music speech music speech+music Entropy time/s Posteriors Dynamism Typically agnostic about underlying problem - use any features, find any changes Good for ASR adaptation, otherwise... talk - Dan Ellis

Speech feature combination (with Bilmes/UW, Hermansky/OGI, ICSI) Multistream approaches Input sound Feature 1 calculation Feature 2 calculation Acoustic classifier Acoustic classifier Speech features

17 Speech feature combination (with Bilmes/UW, Hermansky/OGI, ICSI) Multistream approaches Input sound Feature 1 calculation Feature 2 calculation Acoustic classifier Acoustic classifier Speech features Posterior combination Phone probabilities HMM decoder Word hypotheses - streams can correct each other big gains Which feature streams to combine? - low mutual information between classifiers indicates complementary streams Conditional Mutual Information between feature streams PLPa PLPb MSGa MSGb PLPa PLPb MSGa MSGb talk - Dan Ellis CMI/bits

18 C0 C1 C2 tn+w h# pcl bcl tcl dcl C0 C1 C2 tn+w h# pcl bcl tcl dcl Tandem speech recognition (with Hermansky, Sharma & Sivadas/OGI, Singh/CMU) Neural net estimates phone posteriors; but Gaussian mixtures model finer detail Combine them! Hybrid Connectionist-HMM ASR Conventional ASR (HTK) Feature calculation Neural net classifier Noway decoder Feature calculation Gauss mix models HTK decoder Ck tn s ah t s ah t Input sound Speech features Phone probabilities Words Input sound Speech features Subword likelihoods Words Tandem modeling Feature calculation Neural net classifier Gauss mix models HTK decoder Ck tn s ah t Input sound Speech features Phone probabilities Subword likelihoods Words 50% relative improvement over GMMs alone - different statistical modeling schemes get different info from same training data talk - Dan Ellis

19 Alarm sound detection Deconstructing sound mixtures freq / Hz time / sec representation of energy in time-frequency - formation of atomic elements - grouping by common properties (onset &c.) 0 Alarm sounds have particular structure - people know them when they hear them - build a generic detector? talk - Dan Ellis

20 input mixture Prediction-driven CASA Data-driven (bottom-up) fails for noisy, ambiguous sounds (most mixtures!) Front end signal features (maps) Object formation discrete objects Grouping rules Source groups input mixture Need top-down constraints: Front end signal features Hypothesis management prediction errors Compare & reconcile hypotheses Noise components Periodic components predicted features Predict & combine - fit vocabulary of generic elements to sound... bottom of a hierarchy? - account for entire scene - driven by prediction failures - pursue alternative hypotheses talk - Dan Ellis

21 f/hz City PDCASA example Horn1 (10/10) Horn2 (5/10) Horn3 (5/10) Horn4 (8/10) Horn5 (10/10) f/hz Wefts f/hz Noise2,Click Weft5 Wefts6,7 Weft8 Wefts9 12 Crash (10/10) f/hz Noise Squeal (6/10) Truck (7/10) db time/s talk - Dan Ellis

22 Outline somewhat concrete provisional Sound organization Background & related work Existing projects Future projects - Meeting Recorder - Missing-data recognition & CASA for ASR - Structure from audio-video features - Speech & speaker recognition - Music organization - Audio archive structure discovery Summary & conclusions talk - Dan Ellis

Meeting recorder (with ICSI, UW, SRI, IBM) Microphones in conventional meetings - for transcription/summarization/retrieval - informal, overlapped

23 Meeting recorder (with ICSI, UW, SRI, IBM) Microphones in conventional meetings - for transcription/summarization/retrieval - informal, overlapped speech Data collection (ICSI and...): Research: ASR, nonspeech, organization - unprecedented data, new applications talk - Dan Ellis

noise Mask based on stationary noise estimate Mask split into subbands `Grouping' applied within bands: Spectro-Temporal Proximity

24 Missing data recognition & CASA (with Barker, Cooke, Green/Sheffield) Missing-data recognition - integrate across don t-know values - perfect mask excellent performance in noise Multi-source decoder - Viterbi search of sound-fragment interpretations "1754" + noise Mask based on stationary noise estimate Mask split into subbands `Grouping' applied within bands: Spectro-Temporal Proximity Common Onset/Offset Multisource Decoder "1754" CASA for masks/fragments - larger fragments quicker search talk - Dan Ellis

25 Structure from audio-video features (Peng Xu) HMM modeling of sports video inter replay start middle end Distribution of camera motion labels, color features - also need within-state sequential structure Add features from audio - could be orthogonal/complementary Audio feature toolkit? - simple feature vectors, boundaries, classes - wealth of potential applications! talk - Dan Ellis

26 Speech & speaker recognition Words are not enough; Confidence-tagged alternate word hypotheses Other useful information: - speaker change detection - speaker characterization - phrasing & timing - prosodic cues to dialog state - laughter, pauses, etc. Integration with other analyses - segmentation for adaptation - nonspeech events to ignore - video-derived information... talk - Dan Ellis

27 Music organization Music is a special case - lots of structure - highly significant Trick is to find meaningful, tractable questions - boundary between speech and music? New (counter-intuitive) approaches? - perceive as whole, not by voice (Scheirer 00) global features for chord structures - generic event cues + local feature classification - more provisional notion of instruments/voices talk - Dan Ellis

28 CASA for audio retrieval Muscle Fish system uses global features: Segment feature analysis Sound segment database Feature vectors Seach/ comparison Results Segment feature analysis Query example Mixtures need elements & objects: Generic element analysis Object formation Continuous audio archive Element representations Objects + properties Seach/ comparison Results Generic element analysis (Object formation) Query example rushing water Word-to-class mapping Symbolic query Properties alone - features calculated on grouped subsets talk - Dan Ellis

29 Audio archive structure discovery What can you do with a large unlabeled training set (e.g. multimedia clips from the web)? - bootstrap learning: look for common patterns - have to learn generalizations in parallel: e.g. self-organizing maps, EM HMMs - post-filtering by humans may find meaning in clusters Associated text annotations provide a very small amount of labeling -.. but for a very large number of examples sufficient to obtain purchase? - maximize label utility through NLP-type operations (expansion, disambiguation etc.) - goal is automatic term-to-feature mapping for term-based content queries talk - Dan Ellis

30 Outline Sound organization Background & related work Existing projects Future projects Summary & conclusions talk - Dan Ellis

31 Summary DOMAINS Broadcast Movies Lectures Meetings Personal recordings Location monitoring Object-based structure discovery & learning Speech recognition Speech characterization Nonspeech recognition Scene analysis Audio-visual integration Music analysis APPLICATIONS Structuring Search Summarization Awareness Understanding talk - Dan Ellis

32 Conclusions Sound is more than just speech! - speech is a special case - most auditory perceivers don t understand speech Object-based analysis is critical - it s what people do - the world presents acoustic mixtures Whole-scene representation is the way - it s what people do - provides mutual constraints of overlap Broad range of approaches for a broad range of phenomena talk - Dan Ellis

Recognition & Organization of Speech and Audio

Recognition & Organization of Speech and Audio Dan Ellis Electrical Engineering, Columbia University http://www.ee.columbia.edu/~dpwe/ Outline 1 2 3 4 5 Introducing Tandem modeling