Sound, Mixtures, and Learning


1 Sound, Mixtures, and Learning
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio (LabROSA)
Electrical Engineering, Columbia University
http://labrosa.ee.columbia.edu/
Outline:
- Human sound organization
- Computational Auditory Scene Analysis
- Speech models and knowledge
- Sound mixture recognition
- Learning opportunities

2 1 Human sound organization
[Figure: spectrogram of a complex scene, with labeled events: voice (evil), rumble, stab, voice (pleasant), strings, choir]
Analyzing and describing complex sounds:
- a continuous sound mixture → distinct events
Hearing is ecologically grounded:
- reflects natural scene properties
- subjective, not canonical (ambiguity)
- mixture analysis as the primary goal

3 Sound mixtures
The sound scene is almost always a mixture:
- there is always stuff going on
- sound is transparent
- but it spans a big energy range
[Figure: spectrograms of Speech + Noise = Speech + Noise]
We need information related to our world model:
- i.e. separate objects
- a wolf howling in a blizzard is the same as a wolf howling in a rainstorm
- whole-signal statistics won't do this
Separateness is similar to independence:
- objects/sounds that change in isolation
- but it depends on the situation, e.g. a passing car vs. a mechanic's diagnosis

4 Auditory scene analysis (Bregman 1990)
How do people analyze sound mixtures?
- break the mixture into small elements (in time-frequency)
- elements are grouped into sources using cues
- sources have aggregate attributes
Grouping rules (Darwin, Carlyon, ...):
- cues: common onset/offset/modulation, harmonicity, spatial location, ...
[Figure: frequency analysis feeding onset, harmonicity, and position maps into a grouping mechanism that yields source properties (after Darwin, 1996)]

5 Cues to simultaneous grouping
[Figure: time-frequency elements and their attributes]
Common onset:
- energy that starts simultaneously has a common source
Periodicity:
- energy in different bands sharing the same cycle
Other cues:
- spatial (ITD/IID), familiarity, ...

6 The effect of context
Context can create an expectation: i.e. a bias towards a particular interpretation.
E.g. Bregman's "old-plus-new" principle: a change in a signal will be interpreted as an added source whenever possible.
[Figure: spectrograms showing a different division of the same energy depending on what preceded it]

7 Outline
- Human sound organization
- Computational Auditory Scene Analysis
  - sound source separation
  - bottom-up models
  - top-down constraints
- Speech models and knowledge
- Sound mixture recognition
- Learning opportunities

8 2 Computational Auditory Scene Analysis (CASA)
[Figure: a sound mixture entering a CASA system that emits Object 1, 2, 3, ... descriptions]
Goal: automatic sound organization; systems to pick out the sounds in a mixture
- ... like people do
- e.g. a voice against a noisy background, to improve speech recognition
Approach:
- psychoacoustics describes the grouping rules
- ... so just implement them?

9 CASA front-end processing
Correlogram: loosely based on known/possible physiology
[Figure: sound → cochlea filterbank → frequency channels → envelope follower → delay line → short-time autocorrelation, giving a correlogram slice over frequency and lag]
- linear filterbank as a cochlear approximation
- static nonlinearity
- the zero-delay slice is like a spectrogram
- periodicity from delay-and-multiply detectors
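
A minimal Python sketch of the correlogram described above (not part of the original slides; the filterbank, window, and lag ranges are illustrative assumptions):

```python
# Correlogram sketch: cochlea-like filterbank -> rectified envelopes ->
# short-time autocorrelation ("delay-and-multiply") per channel.
import numpy as np
from scipy.signal import butter, lfilter

def bandpass_filterbank(x, sr, center_freqs, q=4.0):
    """Crude bank of 2nd-order bandpass filters as a cochlear stand-in."""
    channels = []
    for fc in center_freqs:
        bw = fc / q
        lo, hi = max(fc - bw / 2, 1.0), min(fc + bw / 2, sr / 2 - 1.0)
        b, a = butter(2, [lo / (sr / 2), hi / (sr / 2)], btype="band")
        channels.append(lfilter(b, a, x))
    return np.array(channels)                    # (n_chan, n_samples)

def correlogram_slice(x, sr, t, max_lag_ms=20.0, win_ms=25.0):
    """One frequency-vs-lag correlogram slice at time t (seconds)."""
    center_freqs = np.geomspace(100.0, 4000.0, 32)
    env = np.maximum(bandpass_filterbank(x, sr, center_freqs), 0.0)
    n0, win = int(t * sr), int(win_ms * 1e-3 * sr)
    max_lag = int(max_lag_ms * 1e-3 * sr)
    frame = env[:, n0:n0 + win + max_lag]
    # delay-and-multiply: correlate each channel's envelope with delayed copies
    return np.array([[np.dot(ch[:win], ch[lag:lag + win])
                      for lag in range(max_lag)] for ch in frame])
```

The lag-0 column of the returned slice is the spectrogram-like energy profile; ridges at a common nonzero lag across channels indicate shared periodicity.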

10 The representational approach (Brown & Cooke 1993)
Implement psychoacoustic theory directly:
[Figure: input mixture → front end (signal feature maps: onset, period, freq. modulation) → object formation (discrete objects) → grouping rules (source groups)]
- bottom-up processing
- uses common onset & periodicity cues
Able to extract voiced speech:
[Figure: spectrograms of the mixture brn1h.aif and the extracted voice brn1h.fi.aif]

11 Problems with bottom-up CASA
Circumscribing time-frequency elements:
- regions are needed, but hard to find
Periodicity is the primary cue:
- how to handle aperiodic energy?
Resynthesis via masked filtering:
- cannot separate within a single time-frequency element
Bottom-up processing leaves no room for ambiguity or context:
- how to model illusions?

12 Restoration in sound perception
Auditory illusions = hearing what's not there:
- the continuity illusion
- sine-wave speech (SWS)
- duplex perception
[Figure: spectrogram examples of each illusion]

13 Adding top-down constraints
Perception is not direct but a search for plausible hypotheses.
Data-driven (bottom-up):
[Figure: input mixture → front end (signal features) → object formation (discrete objects) → grouping rules → source groups]
- objects irresistibly appear
vs. prediction-driven (top-down):
[Figure: input mixture → front end → compare & reconcile, feeding prediction errors to hypothesis management over noise and periodic components, which predicts & combines features]
- match observations against the parameters of a world model
- need world-model constraints ...

14 Aside: optimal techniques (ICA, ABF) (Bell & Sejnowski etc.)
General idea: drive a parameterized separation algorithm to maximize the independence of its outputs, e.g. a 2x2 instantaneous mixture

m = A s, A = [[a11, a12], [a21, a22]]

adapted via the gradient δMutInfo/δa.
Attractions:
- mathematically rigorous, minimal assumptions
Problems:
- limitations of the separation algorithm (N x N)
- essentially bottom-up
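
A hedged sketch of the idea (standard natural-gradient ICA with a tanh score function; the learning rate and iteration count are assumptions, and this is not Bell & Sejnowski's exact algorithm):

```python
# Natural-gradient infomax ICA for an instantaneous N x N mixture m = A s.
import numpy as np

def infomax_ica(mixtures, n_iter=200, lr=0.01):
    """mixtures: (n_sources, n_samples). Returns unmixed signals and W."""
    n, T = mixtures.shape
    W = np.eye(n)                        # unmixing estimate, y = W m
    for _ in range(n_iter):
        y = W @ mixtures
        g = np.tanh(y)                   # score function (super-Gaussian sources)
        # gradient ascent on output independence: dW ~ (I - E[g(y) y^T]) W
        W += lr * (np.eye(n) - (g @ y.T) / T) @ W
    return W @ mixtures, W

# Toy usage: mix two independent Laplacian sources, then unmix.
rng = np.random.default_rng(0)
s = rng.laplace(size=(2, 5000))
A = np.array([[1.0, 0.6], [0.4, 1.0]])
recovered, W = infomax_ica(A @ s)
```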

15 Outline
- Human sound organization
- Computational Auditory Scene Analysis
- Speech models and knowledge
  - automatic speech recognition
  - subword states
  - cepstral coefficients
- Sound mixture recognition
- Learning opportunities

16 3 Speech models & knowledge
Standard speech recognition:
[Figure: sound → feature calculation → acoustic classifier (feature vectors → phone probabilities) → HMM decoder (phone/word sequence) → understanding/application, drawing on trained acoustic model parameters, word models (e.g. "s ah t"), and a language model p("sat" | "the","cat") vs. p("saw" | "the","cat")]
State-of-the-art word-error rates (WERs):
- 2% (dictation)
- 30% (telephone conversations)

17 Speech units
Speech is highly variable:
- simple templates won't do
- spectral variation (voice quality)
- time-warp problems
Match short segments (states), and allow repeats:
- model with pseudo-stationary slices of ~10 ms
[Figure: spectrogram segmented into states: sil g w eh n sil]
Speech models are distributions p(x | q)

18 Speech features: cepstra
Idea: decorrelate & summarize spectral slices:

X_m[l] = IDFT{ log |S[mh, k]| }

- easier to model
[Figure: auditory spectrum vs. cepstral coefficients; feature covariance matrices and an example joint distribution of frames (10,15)]
- C_0 normalizes out the average log energy
- decorrelated pdfs fit diagonal Gaussians
- the DCT is close to PCA for log spectra
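
A short sketch of the computation above (frame length, hop, and coefficient count are assumed, generic values):

```python
# Cepstra per the slide: X_m[l] = IDFT{ log |S[mh, k]| }.
import numpy as np

def cepstra(x, frame_len=400, hop=160, n_ceps=13):
    feats = []
    for i in range(0, len(x) - frame_len, hop):
        frame = x[i:i + frame_len] * np.hanning(frame_len)
        log_mag = np.log(np.abs(np.fft.rfft(frame)) + 1e-10)  # log |S[mh, k]|
        ceps = np.fft.irfft(log_mag)      # IDFT of the even log spectrum
        feats.append(ceps[:n_ceps])       # low quefrencies; c[0] ~ avg log energy
    return np.array(feats)                # (n_frames, n_ceps)
```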

19 Acoustic model training
Goal: describe p(x | q) with e.g. GMMs.
[Figure: labeled training examples {x_n, ω_{x_n}} sorted according to class; the conditional pdf p(x | ω_1) is estimated for each class ω_1]
Training-data labels come from:
- manual phonetic annotation
- the best path from an earlier classifier (Viterbi)
- EM: joint estimation of labels & pdfs
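
For concreteness, a minimal per-class GMM trainer (scikit-learn is an assumed stand-in; the original system's trainer is not specified here):

```python
# Fit one diagonal-covariance GMM p(x | class) per labeled phone class.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_acoustic_models(features, labels, n_components=8):
    """features: (n_frames, n_dims); labels: (n_frames,) phone class ids."""
    models = {}
    for c in np.unique(labels):
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
        models[c] = gmm.fit(features[labels == c])
    return models

def log_likelihoods(models, x):
    """log p(x | class) for a single feature vector x, per class."""
    return {c: m.score_samples(x[None, :])[0] for c, m in models.items()}
```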

20 HMM decoding
Feature vectors cannot be reliably classified into phonemes in isolation.
[Figure: spectrogram of "WE DO HAVE JUST A FEW DAYS LEFT", with the true phone sequence and a noisy frame-level phone classification]
Use top-down constraints to get good results:
- allowable phonemes
- a dictionary of known words
- a grammar of possible sentences
The decoder searches all possible state sequences:
- at least notionally; pruning makes it feasible
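
A generic log-domain Viterbi sketch of that state-sequence search (schematic, not the actual decoder):

```python
# Viterbi: best state path through frame log-likelihoods under an HMM.
import numpy as np

def viterbi(log_obs, log_trans, log_init):
    """log_obs: (T, N) log p(x_t | q=j); log_trans: (N, N); log_init: (N,)."""
    T, N = log_obs.shape
    delta = log_init + log_obs[0]            # best score ending in each state
    back = np.zeros((T, N), dtype=int)       # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_trans  # (from-state, to-state)
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(N)] + log_obs[t]
    path = [int(np.argmax(delta))]           # trace back the best sequence
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(np.max(delta))
```

In a real recognizer the transition structure encodes the dictionary and grammar constraints listed above, and beam pruning keeps the search tractable.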

21 Outline
- Human sound organization
- Computational Auditory Scene Analysis
- Speech models and knowledge
- Sound mixture recognition
  - feature invariance
  - mixtures including speech
  - general mixtures
- Learning opportunities

22 4 Sound mixture recognition
The biggest problem in speech recognition is interference from background noise.
Feature-invariance approach:
- use features that reflect only the speech
- e.g. normalization, mean subtraction
- but: what about non-static noise?
Or: more complex models of the signal:
- HMM decomposition
- missing-data recognition
Generalize to other, multiple sounds.

23 Feature normalization
Idea: use feature variations, not absolute level.
Hence: calculate the average level & subtract it:

X[k] = S[k] - mean{ S[k] }

This factors out a fixed channel frequency response, since s[n] = h[n] * e[n] implies

log S[k] = log H[k] + log E[k]

Normalize the variance too, to handle added noise?
[Figure: cepstra of a clean utterance (clean/mgg95958) vs. a noisy one (N2_SNR5/MGG)]
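
The normalization itself is a one-liner; a sketch including the slide's optional variance step:

```python
# Cepstral mean (and optional variance) normalization:
# X[k] = S[k] - mean{S[k]} removes a fixed channel's log H[k] term.
import numpy as np

def mean_variance_normalize(feats, normalize_variance=False):
    """feats: (n_frames, n_dims) log-spectral or cepstral features."""
    out = feats - feats.mean(axis=0, keepdims=True)
    if normalize_variance:                # the slide's open question
        out = out / (feats.std(axis=0, keepdims=True) + 1e-10)
    return out
```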

24 HMM decomposition (e.g. Varga & Moore 1991, Roweis 2000)
The total signal model has independent state sequences for two or more component sources.
[Figure: two models' state sequences evolving against a common observation timeline]
New combined state space q' = {q1, q2}:
- new observation pdfs p(x | q1^i, q2^j) for each state combination
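
A sketch of building the combined state space (the log-max observation combination is a common assumption for log-spectral features, not necessarily the cited papers' exact rule):

```python
# HMM decomposition: combine two source HMMs into one over q' = {q1, q2}.
import itertools
import numpy as np

def combine_hmms(means1, means2, trans1, trans2):
    """means*: (N*, D) per-state log-spectral means; trans*: (N*, N*)."""
    pairs = list(itertools.product(range(len(means1)), range(len(means2))))
    # log-max approximation: mixture log-spectrum ~ elementwise max of sources
    means = np.array([np.maximum(means1[i], means2[j]) for i, j in pairs])
    trans = np.array([[trans1[i1, j1] * trans2[i2, j2]
                       for (j1, j2) in pairs] for (i1, i2) in pairs])
    return means, trans   # N1*N2 states: the blow-up the next slide laments
```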

25 Problems with HMM decomposition
The combined state space, O(|q|^N) for N sources, is exponentially large ...
Normalization no longer holds!
- each source has a different gain: model it at various SNRs?
- models typically don't use the overall energy C_0
- each source has a different channel H[k]
Modeling every possible sub-state combination is inefficient, inelegant and impractical.

26 Missing-data recognition (Cooke, Green, ... at Sheffield)
Energy overlaps in time-frequency hide features:
- some observations are effectively missing
Use missing-feature theory:
- integrate over the missing data x_m under model M:

p(x | M) = ∫ p(x_p | x_m, M) p(x_m | M) dx_m

where x_p is the present data.
Effective in speech recognition:
[Figure: "1754" + noise → mask based on a stationary noise estimate → missing-data recognition → "1754"; WER on AURORA Test Set A across SNRs, comparing discrete- and soft-mask missing data against HTK clean and multi-condition training]
Problem: finding the missing-data mask.
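
For a diagonal-Gaussian state model the marginalization integral is analytic; a minimal sketch:

```python
# Missing-data log-likelihood: with a diagonal Gaussian, integrating out
# the missing dimensions x_m just drops them from the evaluation.
import numpy as np

def missing_data_loglik(x, mask, mean, var):
    """x: (D,) features; mask: (D,) bool, True = present/reliable;
    mean, var: (D,) diagonal-Gaussian parameters of one model state."""
    xp, mu, v = x[mask], mean[mask], var[mask]
    return -0.5 * np.sum(np.log(2 * np.pi * v) + (xp - mu) ** 2 / v)
```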

27 Maximum-likelihood data mask (Jon Barker, Sheffield)
Search over sound-fragment interpretations:
[Figure: "1754" + noise → mask from a stationary noise estimate → mask split into subbands → grouping applied within bands (spectro-temporal proximity, common onset/offset) → multisource decoder → "1754"]
The decoder searches over data masks K:

p(M, K | x) ∝ p(x | K, M) p(K | M) p(M)

- how to estimate p(K)?
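
One way to picture the search (a brute-force sketch reusing missing_data_loglik above; the mask prior log_p_K is exactly the open question on the slide):

```python
# Score candidate masks K against models M via
# p(M, K | x) ~ p(x | K, M) p(K | M) p(M), in the log domain.
import numpy as np

def best_mask_and_model(x, candidate_masks, models, log_p_K, log_p_M):
    """candidate_masks: list of (D,) bool arrays; models: {name: (mean, var)}."""
    best = (None, None, -np.inf)
    for k, K in enumerate(candidate_masks):
        for name, (mean, var) in models.items():
            score = (missing_data_loglik(x, K, mean, var)
                     + log_p_K[k] + log_p_M[name])
            if score > best[2]:
                best = (k, name, score)
    return best   # (mask index, model name, joint log score)
```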

28 Multi-source decoding
Search for more than one source at once:
[Figure: observations O(t) explained by two mask/state pairs {K1(t), q1(t)} and {K2(t), q2(t)}]
- mutually dependent data masks
Use CASA processing to propose masks:
- locally coherent regions
- p(K | q)
Theoretical vs. practical limits.

29 General sound mixtures
Search for a generative explanation:
[Figure: generation model, with model dependence Θ over source models M1, M2 producing source signals Y1, Y2, passed through channel parameters C1, C2 to give received signals X1, X2 and observations O]
[Figure: analysis structure, with observations O feeding fragment formation, mask allocation {Ki}, model fitting {Mi}, and likelihood evaluation p(x | Mi, Ki)]

30 Outline
- Human sound organization
- Computational Auditory Scene Analysis
- Speech models and knowledge
- Sound mixture recognition
- Opportunities for learning
  - learnable aspects of modeling
  - tractable decoding
  - some examples

31 5 Opportunities for learning
Per-model feature distributions P(Y | M):
- e.g. by analyzing isolated-sound databases
Channel modifications P(X | Y):
- e.g. by comparing multi-mic recordings
Signal combinations P(O | {X_i}):
- determined by acoustics
Patterns of model combinations P({M_i}):
- loose dependence between sources
Search for the most likely explanation:

P({M_i} | O) ∝ P(O | {X_i}) P({X_i} | {M_i}) P({M_i})

Short-term structure: repeating events.

32 Source models
The speech-recognition lesson: use the data as much as possible.
- what could we do with unlimited data feeds?
Data sources:
- clean-data corpora
- near-clean segments identified in real-world sound
Model types:
- templates
- parametric/constraint models
- HMMs

33 What are the HMM states?
There are no sub-units defined for nonspeech sounds.
The final states depend on the EM initialization:
- labels
- clusters
- transition matrix
[Figure: spectrogram of dogbarks2 with its learned state sequence, e.g. s3 s2 s4 s2 s3 s7 ...]
We have ideas of what we'd like to get:
- investigate features/initialization to get there
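
One hedged way to explore this: fit an HMM by EM and inspect the learned state sequence (hmmlearn is an assumed tool here; the state count and features are exactly the open design choices on the slide):

```python
# Learn sub-units for a nonspeech sound by EM, then Viterbi-label the frames.
from hmmlearn import hmm

def learn_states(features, n_states=8, seed=0):
    """features: (n_frames, n_dims), e.g. cepstra of the dogbarks2 clip."""
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag",
                            n_iter=25, random_state=seed)
    model.fit(features)                  # EM: outcome depends on initialization
    _, states = model.decode(features)   # e.g. s3 s2 s4 s2 s3 s7 ...
    return model, states
```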

34 Tractable decoding
The speech decoder notionally searches all states; parametric models give an infinite space:
- need closed-form partial explanations
- examine the residual, iterate, converge
Need general cues to get started; return to auditory scene analysis:
- onsets
- harmonic patterns
- then parametric fitting
Need multiple-hypothesis search, pruning, and efficiency tricks.
Learning? Parameters for new source events:
- e.g. from artificial (hence labeled) mixtures

35 Example: alarm sound detection
Alarm sounds have a particular structure:
- people know them when they hear them
Isolate alarms in sound mixtures:
- sinusoid peaks have invariant properties
[Figure: spectrogram of speech + alarm, with the detected Pr(alarm) trace over time]
Learn the model parameters from examples.
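
A crude sketch of the sinusoid-peak idea (peak persistence as an alarm cue; the thresholds are assumptions, and a real detector would learn its parameters from examples as the slide says):

```python
# Flag frames whose spectral peaks persist at near-constant frequency.
import numpy as np
from scipy.signal import stft, find_peaks

def alarm_score(x, sr, tol_bins=1):
    f, t, Z = stft(x, fs=sr, nperseg=512)
    mag = np.abs(Z)
    peak_bins = [find_peaks(mag[:, i], height=mag[:, i].max() * 0.5)[0]
                 for i in range(mag.shape[1])]
    score = np.zeros(len(t))
    # count peaks that reappear (within tol_bins) in the next frame
    for i in range(len(t) - 1):
        for p in peak_bins[i]:
            if np.any(np.abs(peak_bins[i + 1] - p) <= tol_bins):
                score[i] += 1
    return t, score   # sustained high score ~ steady sinusoids ~ alarm-like
```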

36 Example: music transcription (e.g. Masataka Goto)
High-quality training material: synthesizer sample kits.
Ground truth available: musical scores.
Find maximum-likelihood explanations as scores:
- guided by multiple-pitch tracking (hypothesis search)
Applications in similarity matching.

37 Summary
Sound contains lots of information ... but it's always mixed up.
Psychologists describe ASA ... but bottom-up computer models don't work.
Speech recognition works for isolated speech ... by exploiting top-down context constraints.
Speech in mixtures via multiple-source models ... practical combinatorics are the main problem.
Generalize this idea to all sounds:
... need models of all sounds
... plus models of channel modification
... plus ways to propose segmentations
... plus missing-data recognition

38 Further reading
[BarkCE00] J. Barker, M.P. Cooke & D. Ellis (2000). "Decoding speech in the presence of other sound sources," Proc. ICSLP-2000, Beijing. ftp://ftp.icsi.berkeley.edu/pub/speech/papers/icslp00-msd.pdf
[Breg90] A.S. Bregman (1990). Auditory Scene Analysis: The Perceptual Organization of Sound, MIT Press.
[Chev00] A. de Cheveigné (2000). "The auditory system as a separation machine," Proc. Intl. Symposium on Hearing.
[CookE01] M. Cooke & D. Ellis (2001). "The auditory organization of speech and other sources in listeners and computational models," Speech Communication (accepted for publication).
[Ellis99] D.P.W. Ellis (1999). "Using knowledge to organize sound: The prediction-driven approach to computational auditory scene analysis ...," Speech Communication.
[Roweis00] S. Roweis (2000). "One microphone source separation," Proc. NIPS.
