Using Speech Models for Separation

Size: px

Start display at page:

Download "Using Speech Models for Separation"

Derick Beasley
5 years ago
Views:

Using Speech Models for Separation Dan Ellis Comprising the work of

Organization of Speech and Audio Dept. Electrical Eng.

Eigenvoice Speaker Models. Spatial Parameter Models in Reverb 3.

1 Using Speech Models for Separation Dan Ellis Comprising the work of Michael Mandel and Ron Weiss Laboratory for Recognition and Organization of Speech and Audio Dept. Electrical Eng., Columbia Univ., NY USA 1. Eigenvoice Speaker Models. Spatial Parameter Models in Reverb 3. Combining Source + Spatial Speech Models for Separation - Dan Ellis /19

"bin white by R again" task: report letter + number for white Separation depends on source constraints the more the better - ASR model

2 1. Speech Separation Using Models Cooke & Lee s Monaural Speech Separation Task pairs of short, grammatically-constrained utterances: <command:><color:><preposition:><letter:5><number:1><adverb:> e.g. "bin white by R again" task: report letter + number for white Separation depends on source constraints the more the better - ASR model mix separation identify target energy t-f masking + resynthesis ASR speech models words vs. mix identify speech models find best words model words source knowledge Speech Models for Separation - Dan Ellis /19

3 Speech Mixture Recognition Kristjansson, Hershey et al. Speech recognizers contain speech models ASR is just argmax P(W X) Recognize mixtures with Factorial HMM one model+state sequence for each voice exploit sequence constraints, speaker differences model model 1 observations / time separation relies on detailed speaker model Speech Models for Separation - Dan Ellis /19

4 Eigenvoices Idea: Find speaker model parameter space Kuhn et al. 9, Weiss & Ellis 7,, 9 generalize without losing detail? Eigenvoice model: µ = µ + U w + B h Speaker models Speaker subspace bases adapted mean eigenvoice weights channel channel model voice bases bases weights Speech Models for Separation - Dan Ellis /19

Eigenvoice Bases Mean Voice 1 Mean model states x 3 bins = 9, dimensions Frequency

ax Eigenvoice dimension 1 3 5 Eigencomponents shift formants/ coloration additional

Eigenvoice dimension b d g p t k jh ch s z f th v dh m n l r w y iy ih eh ey ae aa

y iy ih eh ey ae aa aw ay ah ao owuw ax Speech Models for Separation - Dan Ellis 1--

5 Eigenvoice Bases Mean Voice 1 Mean model states x 3 bins = 9, dimensions Frequency (khz) b d g p t k jh ch s z f th v dh m n l r w y iy ih eh ey ae aa aw ay ah ao owuw ax Eigenvoice dimension Eigencomponents shift formants/ coloration additional components for acoustic channel Frequency (khz) Frequency (khz) Frequency (khz) b d g p t k jh ch s z f th v dh m n l r w y iy ih eh ey ae aa aw ay ah ao owuw ax Eigenvoice dimension b d g p t k jh ch s z f th v dh m n l r w y iy ih eh ey ae aa aw ay ah ao owuw ax Eigenvoice dimension 3 b d g p t k jh ch s z f th v dh m n l r w y iy ih eh ey ae aa aw ay ah ao owuw ax Speech Models for Separation - Dan Ellis /19

6 Speaker-Adapted Separation Factorial HMM analysis with tuning of source model parameters = eigenvoice speaker adaptation Weiss & Ellis Speech Models for Separation - Dan Ellis /19

7 Speaker-Adapted Separation Speech Models for Separation - Dan Ellis /19

8 Speaker-Adapted Separation Eigenvoices for Speech Separation task speaker adapted (SA) performs midway between speaker-dependent (SD) & speaker-indep (SI) SI SA SD Speech Models for Separation - Dan Ellis /19

9 . Spatial Models & Reverb or 3 sources in reverberation assume just ears Mandel & Ellis 7 Model interaural spectrum of each source as stationary level and time differences: Speech Models for Separation - Dan Ellis /19

10 IPD, ILD Distributions Source at 75 in reverberation IPD IPD residual ILD IPD residual offsets phase by constant ωτ IPD can be fit by single Gaussian ILD needs frequency-dependence Speech Models for Separation - Dan Ellis /19

11 Model-Based EM Source Separation and Localization (MESSL) Re-estimate source parameters Mandel & Ellis 9 Assign spectrogram points to sources can model more sources than sensors flexible initialization Speech Models for Separation - Dan Ellis /19

12 MESSL Results Modeling uncertainty improves results tradeoff between constraints & noisiness.5 db. db.77 db 1.35 db 9.1 db -.7 db Speech Models for Separation - Dan Ellis /19

13 MESSL Results Speech recognizer (Digits) 1 Ground truth masks 1 Algorithmic masks Percent correct Human DP Oracle Oracle OracleAllRev Mixes Target to masker ratio (db) Human Sawada Mouba MESSL G MESSL ΩΩ DUET Mixes Target to masker ratio (db) Speech Models for Separation - Dan Ellis /19

signal) Can combine into one big EM framework.

14 3. Combining Spatial + Speech Models Weiss, Mandel & Ellis Interaural parameters give ILDi(ω), ITDi, Pr(X(t, ω) = Si(t, ω)) Speech source model can give Pr(S i (t, ω) is speech signal) Can combine into one big EM framework... E-step M-step u is: Pr(cell from source i) phoneme sequence Θ is: interaural params speaker params Speech Models for Separation - Dan Ellis /19

15 Parameter estimation and source separation MESSL-SP (Source Prior) Ron Weiss Underdetermined Source Separation Using Speaker Subspace Models Speech Models for Separation - Dan Ellis May, 9 7 / /19

16 MESSL-SP Results Source models function as priors Interaural parameter spatial separation source model prior improves spatial estimate Frequency (khz) Frequency (khz) Ground truth (1. db) MESSL (5. db) Time (sec) DUET (3. db) MESSL SP (1.1 db) Time (sec) D FD BSS (5.1 db) Speech Models for Separation - Dan Ellis /19 MESSL EV (1.37 db) Time (sec) 1....

17 MESSL-SP Results SNR improvement vs. source angle separation sources (anechoic) sources (reverb) SNR improvement (db) SNR improvement (db) sources (anechoic) sources (reverb) Ground truth MESSL!EV MESSL!SP MESSL S!FD!BSS DUET Separation (degrees) Separation (degrees) Speech Models for Separation - Dan Ellis /19

18 Future Work Better parametric speaker models limitations of eigenvoices varying style Understanding reverb & ASR early echoes what spoils ASR? Models of other sources eigeninstruments? Speech Models for Separation - Dan Ellis /19

19 Summary & Conclusions Source models provide the constraints to make scene analysis possible Eigenvoices (model subspace) can be used to provide detailed models that generalize Spatial parameters can identify more sources than models in reverb (MESSL) Can combine source + spatial models Speech Models for Separation - Dan Ellis /19

Using Source Models in Speech Separation

Using Source Models in Speech Separation Dan Ellis Laboratory for Recognition and Organization of Speech and Audio Dept. Electrical Eng., Columbia Univ., NY USA dpwe@ee.columbia.edu http://labrosa.ee.columbia.edu/