Sound, Mixtures, and Learning
Dan Ellis <dpwe@ee.columbia.edu>
Laboratory for Recognition and Organization of Speech and Audio (LabROSA)
Electrical Engineering, Columbia University
http://labrosa.ee.columbia.edu/

Outline
1 Auditory Scene Analysis
2 Speech Recognition & Mixtures
3 Fragment Recognition
4 Alarm Sound Detection
5 Future Work

NCARAI Seminar - Dan Ellis - 2002-11-04 - 1/34
1 Auditory Scene Analysis
[Figure: spectrogram (freq/Hz vs. time/s, level/dB) annotated with high-level events: Voice (evil), Rumble, Stab, Voice (pleasant), Strings, Choir]
Auditory Scene Analysis: describing a complex sound in terms of high-level sources/events
- ... like listeners do
Hearing is ecologically grounded
- reflects natural scene properties
- subjective, not absolute
Sound, mixtures, and learning
[Figure: spectrograms (freq/kHz vs. time/s): Speech + Noise = Speech + Noise]
Sound
- carries useful information about the world
- complements vision
Mixtures
- ... are the rule, not the exception
- medium is transparent, sources are many
- must be handled!
Learning
- the speech recognition lesson: let the data do the work
- ... like listeners
The problem with recognizing mixtures
"Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?" (after Bregman 1990)
Received waveform is a mixture
- two sensors, N signals ... underconstrained
Disentangling mixtures as the primary goal?
- perfect solution is not possible
- need experience-based constraints
Human Auditory Scene Analysis (Bregman 1990)
People hear sounds as separate sources. How?
- break mixture into small elements (in time-frequency)
- elements are grouped into sources using cues
- sources have aggregate attributes
Grouping rules (Darwin, Carlyon, ...):
- cues: common onset/offset/modulation, harmonicity, spatial location, ...
[Diagram: frequency analysis feeding onset, harmonicity, and position maps into a grouping mechanism that yields source properties (after Darwin, 1996)]
Cues to simultaneous grouping
[Figure: spectrogram (freq/Hz vs. time/s) segmented into elements + attributes]
Common onset
- simultaneous energy has common source
Periodicity
- energy in different bands with same cycle
Other cues
- spatial (ITD/IID), familiarity, ...
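The periodicity cue above can be illustrated in code: a minimal sketch of detecting a common repetition cycle via autocorrelation. The function and parameter names are illustrative, not from the talk, and a real system would operate per frequency band rather than on the full-band waveform.

```python
import numpy as np

def estimate_period(signal, sr, fmin=50.0, fmax=500.0):
    """Estimate the fundamental frequency of a signal by autocorrelation.

    Sketch of the periodicity cue: pick the lag (within a plausible
    pitch range) where the signal best matches a shifted copy of itself.
    """
    x = signal - np.mean(signal)
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]   # non-negative lags
    lo = int(sr / fmax)            # shortest plausible period in samples
    hi = int(sr / fmin)            # longest plausible period in samples
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag                # fundamental frequency estimate in Hz

# A 100 Hz harmonic complex: energy in several bands with a common cycle.
sr = 8000
t = np.arange(2000) / sr
x = sum(np.sin(2 * np.pi * 100 * k * t) / k for k in (1, 2, 3))
f0 = estimate_period(x, sr)
```

All three harmonics share the 10 ms cycle, so the autocorrelation peaks at the common period even though their individual frequencies differ.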
Context and Restoration
Context can create an expectation: i.e. a bias towards a particular interpretation
- e.g. auditory illusions = hearing what's not there
The continuity illusion
[Figure: spectrograms (freq/Hz and freq/Bark vs. time/s) of the continuity illusion and sine-wave speech (SWS)]
- duplex perception
How to model these effects?
Computational Auditory Scene Analysis (CASA)
[Diagram: CASA system producing Object 1 description, Object 2 description, Object 3 description, ...]
Goal: automatic sound organization; systems to pick out sounds in a mixture
- ... like people do
E.g. voice against a noisy background
- to improve speech recognition
Approach:
- psychoacoustics describes grouping rules
- ... just implement them?
The Representational Approach (Brown & Cooke 1993)
Implement psychoacoustic theory:
[Diagram: input mixture → front end (signal feature maps: freq/time, onset, period, freq. modulation) → object formation (discrete objects) → grouping rules → source groups]
- bottom-up processing
- uses common onset & periodicity cues
Able to extract voiced speech:
[Figure: spectrograms (freq/Hz vs. time/s) of the input mixture and the extracted voiced speech]
Adding top-down constraints
Perception is not direct but a search for plausible hypotheses
Data-driven (bottom-up) ...
[Diagram: input mixture → front end (signal features) → object formation (discrete objects) → grouping rules → source groups]
- objects irresistibly appear
vs. Prediction-driven (top-down)
[Diagram: input mixture → front end → compare & reconcile with predicted features; hypothesis management (prediction errors) over noise components and periodic components → predict & combine]
- match observations with parameters of a world-model
- need world-model constraints ...
Approaches to sound mixture recognition
Recognize combined signal
- multicondition training
- combinatorics ...
Separate signals
- e.g. CASA, ICA
- nice, if you can do it
Segregate features into fragments
- then missing-data recognition
Outline
1 Auditory Scene Analysis
2 Speech Recognition & Mixtures
- standard ASR
- approaches to speech + noise
3 Fragment Recognition
4 Alarm Sound Detection
5 Future Work
2 Speech recognition & mixtures
Speech recognizers are the most successful and sophisticated acoustic recognizers to date
[Diagram: sound → feature calculation (feature vectors) → acoustic classifier (phone probabilities, from acoustic model parameters) → HMM decoder (word models: s-ah-t; language model: p("sat"|"the","cat"), p("saw"|"the","cat")) → phone/word sequence → understanding/application ...]
State of the art word-error rates (WERs):
- 2% (dictation)
- 30% (phone conversations)
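The HMM decoder box in the pipeline above can be sketched as a Viterbi search: a minimal, assumed implementation over per-frame log observation probabilities. Real decoders also weave in word models and the language model, which are omitted here; all names are illustrative.

```python
import numpy as np

def viterbi(log_obs, log_A, log_pi):
    """Most-likely HMM state path given per-frame log observation
    probabilities log_obs (T x N), log transition matrix log_A (N x N),
    and log initial distribution log_pi (N)."""
    T, N = log_obs.shape
    delta = log_pi + log_obs[0]            # best score ending in each state
    back = np.zeros((T, N), dtype=int)     # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_A    # scores[i, j]: come from i, go to j
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(N)] + log_obs[t]
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):          # trace back the best path
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy example: 2 states; the classifier favours state 0, 0, then 1.
log_obs = np.log(np.array([[0.9, 0.1], [0.9, 0.1], [0.1, 0.9]]))
log_A = np.log(np.array([[0.8, 0.2], [0.2, 0.8]]))
log_pi = np.log(np.array([0.5, 0.5]))
path = viterbi(log_obs, log_A, log_pi)
```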
Learning acoustic models
Goal: describe labeled training examples {x_n, ω_xn} with e.g. GMMs
Sort according to class, then estimate the conditional pdf p(x|ω_1) for each class ω_1
- separate models for each class
- generalization as blurring
Training data labels from:
- manual annotation
- best path from earlier classifier (Viterbi)
- EM: joint estimation of labels & pdfs
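The "sort according to class, estimate conditional pdf" step can be sketched directly. The talk uses GMMs; for brevity this assumed illustration fits a single diagonal Gaussian per class and classifies by maximum likelihood (uniform priors). All function names are made up for the example.

```python
import numpy as np

def fit_class_gaussians(X, labels):
    """Fit one diagonal Gaussian p(x|class) per class label."""
    models = {}
    for w in set(labels):
        Xw = X[labels == w]                       # sort according to class
        models[w] = (Xw.mean(axis=0), Xw.var(axis=0) + 1e-6)
    return models

def log_likelihood(x, mean, var):
    """Diagonal-Gaussian log p(x | class)."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def classify(x, models):
    """Maximum-likelihood class decision (uniform priors assumed)."""
    return max(models, key=lambda w: log_likelihood(x, *models[w]))

# Two well-separated synthetic classes in 2-D feature space.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)),
               rng.normal(5.0, 1.0, (100, 2))])
labels = np.array([0] * 100 + [1] * 100)
models = fit_class_gaussians(X, labels)
pred = classify(np.array([4.8, 5.1]), models)
```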
Speech + noise mixture recognition
Background noise: biggest problem for current ASR?
Feature invariance approach: design features to reflect only speech
- e.g. normalization, mean subtraction
- one model for clean and noisy speech
Alternative: more complex models of the signal
- separate models for speech and noise
HMM decomposition (e.g. Varga & Moore 1991, Roweis 2000)
Total signal model has independent state sequences for 2+ component sources
[Diagram: model 1 and model 2 state sequences against observations/time]
New combined state space q' = {q1, q2}
- new observation pdfs p(x|q1, q2) for each combination
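The combined state space can be made concrete: since the two sources evolve independently, the joint transition matrix over q' = {q1, q2} is the Kronecker product of the component transition matrices. A minimal sketch with assumed 2-state toy models (the per-combination observation pdfs p(x|q1, q2) are not shown):

```python
import numpy as np

def combined_transitions(A1, A2):
    """Joint transition matrix for two independent HMMs.

    State (i1, i2) -> (j1, j2) has probability A1[i1, j1] * A2[i2, j2],
    which is exactly the Kronecker product of A1 and A2.
    """
    return np.kron(A1, A2)

A1 = np.array([[0.9, 0.1], [0.2, 0.8]])   # source-1 dynamics
A2 = np.array([[0.7, 0.3], [0.4, 0.6]])   # source-2 dynamics
A = combined_transitions(A1, A2)          # 4 combined states
```

The combinatorics mentioned earlier are visible here: N sources with K states each give a K^N combined state space.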
Outline
1 Auditory Scene Analysis
2 Speech Recognition & Mixtures
3 Fragment Recognition
- separating signals vs. separating features
- missing data recognition
- recognizing multiple sources
4 Alarm Sound Detection
5 Future Work
3 Fragment Recognition (Jon Barker & Martin Cooke, Sheffield)
Signal separation is too hard! Instead:
- segregate features into partially-observed sources
- then classify
Made possible by missing data recognition
- integrate over uncertainty in observations for optimal posterior distribution
Goal: relate clean speech models P(X|M) to speech-plus-noise mixture observations
- ... and make it tractable
Comparing different segregations
Standard classification chooses between models M to match source features X:
M* = argmax_M P(M|X) = argmax_M P(X|M) P(M) / P(X)
Mixtures: observed features Y and segregation S, related to the source by P(X|Y,S)
[Diagram: source X(f) combined under segregation S into observation Y(f)]
- spectral features allow clean relationship
Joint classification of model and segregation:
P(M,S|Y) = P(M) [∫ P(X|M) P(X|Y,S) / P(X) dX] P(S|Y)
- integral collapses in several cases ...
Calculating fragment matches
P(M,S|Y) = P(M) [∫ P(X|M) P(X|Y,S) / P(X) dX] P(S|Y)
P(X|M) - the clean-signal feature model
P(X|Y,S)/P(X) - is X visible given the segregation?
- integration collapses some channels ...
P(S|Y) - segregation inferred from observation
- just assume uniform, find S for most likely M
- use extra information in Y to distinguish S's, e.g. harmonicity, onset grouping
Result:
- probabilistically-correct relation between clean-source models P(X|M) and inferred contributory source P(M,S|Y)
Speech fragment decoder results
[Figure: "1754" + noise → SNR mask → fragments → fragment decoder → "1754"]
Simple P(S|Y) model forces contiguous regions to stay together
- big efficiency gain when searching S space
[Figure: AURORA 2 Test Set A, WER (%) vs. SNR (dB, clean to -5 dB) for the missing-data soft-SNR decoder vs. HTK clean training and HTK multicondition training]
Clean-models-based recognition rivals trained-in-noise recognition
Multi-source decoding
Search for more than one source
[Diagram: observations Y(t) with mutually-dependent data masks S1(t), S2(t) and state sequences q1(t), q2(t)]
Use e.g. CASA features to propose masks
- locally coherent regions
Theoretical vs. practical limits
Outline
1 Auditory Scene Analysis
2 Speech Recognition & Mixtures
3 Fragment Recognition
4 Alarm Sound Detection
- sound
- mixtures
- learning
5 Future Work
4 Alarm sound detection
[Figure: spectrogram (freq/kHz vs. time/s, level/dB) of alarm sounds hrn1, bfr2, buz1 in noise]
Alarm sounds have particular structure
- people know them when they hear them
- clear even at low SNRs
Why investigate alarm sounds?
- they're supposed to be easy
- potential applications ...
Contrast two systems:
- standard, global features, P(X|M)
- sinusoidal model, fragments, P(M,S|Y)
Alarms: Sound (representation)
Standard system: Mel cepstra
- have to model alarms in noise context: each cepstral element depends on whole signal
Contrast system: sinusoid groups
- exploit sparse, stable nature of alarm sounds
- 2D-filter spectrogram to enhance harmonics
- simple magnitude threshold, track growing
- form groups based on common onset
[Figure: spectrograms (freq/Hz vs. time/s) of the enhancement, thresholding, and tracking stages]
Sinusoid representation is already fragmentary
- does not record non-peak energies
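The "simple magnitude threshold, track growing" step can be sketched as linking per-frame spectral peaks into sinusoid tracks. This is an assumed minimal version: a peak extends an existing track if its frequency bin is close to the track's most recent peak, otherwise it starts a new track; the 2-D harmonic-enhancement filter and onset-based grouping are omitted, and all names are illustrative.

```python
def grow_tracks(peaks_per_frame, max_jump=2):
    """Link per-frame spectral peak bins into sinusoid tracks.

    peaks_per_frame: list over frames of lists of peak bin indices
    (already magnitude-thresholded). A peak joins the first track whose
    last peak was in the previous frame and within max_jump bins;
    otherwise it seeds a new track.
    """
    tracks = []
    for t, peaks in enumerate(peaks_per_frame):
        for bin_ in peaks:
            for tr in tracks:
                t_last, b_last = tr[-1]
                if t_last == t - 1 and abs(bin_ - b_last) <= max_jump:
                    tr.append((t, bin_))
                    break
            else:
                tracks.append([(t, bin_)])
    return tracks

# Two stable tones: one drifting from bin 10 to 11, one fixed at bin 40.
frames = [[10, 40], [10, 40], [11, 40], [11, 40]]
tracks = grow_tracks(frames)
```

The drifting tone stays on one track because the bin-to-bin jump is within tolerance, which is what makes the representation stable under added noise.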
Alarms: Mixtures
Effect of varying SNR on representations:
- sinusoid peaks have ~ invariant properties
[Figure: sine track groups vs. (normalized) cepstra at 60 dB, 10 dB, and 0 dB SNR]
Alarms: Learning
Standard: train MLP on noisy examples
[Diagram: sound mixture → feature extraction (PLP cepstra) → neural net acoustic classifier (alarm probability) → median filtering → detected alarms]
Alternate: learn distributions of group features
- duration, frequency deviation, amp. modulation ...
[Figure: alarm vs. non-alarm scatter plots of spectral moment vs. spectral centroid, and magnitude SD vs. inverse frequency SD]
- underlying models are clean (isolated)
- recognize in different contexts ...
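The median-filtering stage of the standard pipeline can be sketched as a running median over the classifier's per-frame alarm probabilities, suppressing isolated single-frame detections before thresholding. This is an assumed illustration; the window width is illustrative, not the talk's setting.

```python
import numpy as np

def median_smooth(p, width=5):
    """Median-filter a sequence of per-frame alarm probabilities.

    Edge-pads the sequence so the output has the same length, then
    takes the median over a sliding window of `width` frames.
    """
    half = width // 2
    padded = np.pad(p, half, mode="edge")
    return np.array([np.median(padded[i:i + width]) for i in range(len(p))])

# One isolated false-alarm frame, then a sustained detection.
p = np.array([0.1, 0.1, 0.9, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9, 0.9])
smoothed = median_smooth(p, width=5)
```

The isolated spike at frame 2 is removed while the sustained run survives, which is exactly the behaviour wanted before a detection threshold.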
Alarms: Results
[Figure: spectrograms (freq/kHz vs. time/s) of restaurant noise + alarms, with MLP classifier output and sound object classifier output]
Both systems commit many insertions at 0 dB SNR, but in different circumstances:

Noise     | Neural net system        | Sinusoid model system
          | Del      Ins    Tot      | Del      Ins    Tot
1 (amb)   | 7/25     2      36%      | 14/25    1      60%
2 (bab)   | 5/25     63     272%     | 15/25    2      68%
3 (spe)   | 2/25     68     280%     | 12/25    9      84%
4 (mus)   | 8/25     37     180%     | 9/25     135    576%
Overall   | 22/100   170    192%     | 50/100   147    197%
Alarms: Summary
Sinusoid domain
- feature components belong to one source
- simple segregation (grouping) model
- alarm model as properties of group
- robust to partial feature observation
Future improvements
- more complex alarm class models
- exploit repetitive structure of alarms
Outline
1 Auditory Scene Analysis
2 Speech Recognition & Mixtures
3 Fragment Recognition
4 Alarm Sound Detection
5 Future Work
- generative models & inference
- model acquisition
- ambulatory audio
5 Future work
CASA as generative model parameterization:
[Diagram, generation model: source models M1, M2 → source signals X1, X2 → channel parameters C1, C2 (with model dependence Θ) → received signals Y1, Y2 → observations O]
[Diagram, analysis structure: observations O → fragment formation → mask allocation {Ki} → model fitting {Mi} → p(X|Mi,Ki) → likelihood evaluation]
Learning source models
The speech recognition lesson: use the data as much as possible
- what can we do with unlimited data feeds?
Data sources
- clean data corpora
- identify near-clean segments in real sound
- build up clean views from partial observations?
Model types
- templates
- parametric/constraint models
- HMMs
Hierarchic classification vs. individual characterization ...
Personal Audio Applications
Smart PDA records everything
Only useful if we have index, summaries
- monitor for particular sounds
- real-time description
Scenarios
- personal listener: summary of your day
- future prosthetic hearing device
- autonomous robots
Meeting data, ambulatory audio
Summary
Sound
- carries important information
Mixtures
- need to segregate different source properties
- fragment-based recognition
Learning
- information extracted by classification
- models guide segregation
Alarm sounds
- simple example of fragment recognition
General sounds
- recognize simultaneous components
- acquire classes from training data
- build index, summary of real-world sound