LabROSA Research Overview


1 LabROSA Research Overview
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
1. Real-World Sound
2. Speech Separation
3. Environmental Audio Classification
4. Music Audio Analysis
LabROSA Overview - Dan Ellis /17

2 LabROSA Overview
Getting information from sound
[Diagram labels: Information Extraction; Machine Learning; Signal Processing; Music, Speech, Environment; Recognition, Separation, Retrieval]

3 1. Real-World Sound
[Figure: spectrogram of a sound mixture (freq / Hz vs. time / s, level / dB) and its analysis into Voice (evil), Rumble, Stab, Voice (pleasant), Strings, Choir]
Sounds rarely occur in isolation...
... so analyzing mixtures ("scenes") is a problem...
... for humans and machines

4 Auditory Scene Analysis
"Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?" (after Bregman '90)
Received waveform is a mixture: 2 sensors, N sources - underconstrained
Use prior knowledge (models) to constrain

5 2. Speech Separation
Given models for sources, find best (most likely) states for spectra:
  combination model: $p(x \mid i_1, i_2) = \mathcal{N}(x;\, c_{i_1} + c_{i_2},\, \sigma)$
  inference of source state: $\{i_1(t), i_2(t)\} = \arg\max_{i_1, i_2} p(x(t) \mid i_1, i_2)$
can include sequential constraints...
Roweis '01, '03; Kristjansson
E.g. stationary noise:
[Figure: mel spectrograms (freq / mel bin vs. time / s), each annotated with mel-magnitude SNR: original speech; in speech-shaped noise; VQ inferred states]
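The state inference on this slide can be illustrated with a minimal NumPy sketch. It assumes two small codebooks of spectra (one per source), an isotropic Gaussian with standard deviation `sigma`, and no sequential constraints; the function name `infer_source_states` and the array shapes are illustrative choices, not taken from the slides.

```python
import numpy as np

def infer_source_states(x, codebook1, codebook2, sigma=1.0):
    """Per frame, pick the pair (i1, i2) maximizing N(x; c_i1 + c_i2, sigma^2 I).

    x: (T, D) observed spectra; codebook1: (K1, D); codebook2: (K2, D).
    Assumption: isotropic Gaussian, so the argmax is the nearest combined codeword.
    """
    # All pairwise combinations c_i1 + c_i2, shape (K1, K2, D).
    combined = codebook1[:, None, :] + codebook2[None, :, :]
    # Log-likelihood up to an additive constant, shape (T, K1, K2).
    diff = x[:, None, None, :] - combined[None, :, :, :]
    loglik = -0.5 * np.sum(diff ** 2, axis=-1) / sigma ** 2
    # Joint argmax over (i1, i2) for every frame.
    flat = loglik.reshape(x.shape[0], -1).argmax(axis=1)
    i1, i2 = np.unravel_index(flat, (codebook1.shape[0], codebook2.shape[0]))
    return i1, i2
```

The exhaustive search costs K1 x K2 likelihood evaluations per frame, which is one reason sequential constraints and pruning matter for realistic codebook sizes.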


6 Eigenvoices
Idea: find speaker model parameter space; generalize without losing detail?
Weiss & Ellis '09, '10
Speaker models -> mean voice + speaker subspace bases
Eigenvoice model: the (states x frequency bins) speaker model is described by a small number of eigenvoice dimensions
$\mu = \bar{\mu} + U w$ (adapted model = mean voice + eigenvoice bases x weights)
[Figure: mean voice and eigenvoice dimensions 1 to 3, each shown as frequency (kHz) vs. phone state: b d g p t k jh ch s z f th v dh m n l r w y iy ih eh ey ae aa aw ay ah ao ow uw ax]
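A minimal sketch of the eigenvoice idea, assuming each training speaker's model has already been flattened into one mean vector and that the subspace is found by plain PCA (via SVD). The function names and the projection-based weight estimate are illustrative assumptions; the slides do not specify how the weights are inferred from a mixture.

```python
import numpy as np

def train_eigenvoices(speaker_means, n_dims):
    """speaker_means: (n_speakers, D) stacked model means. Returns (mean_voice, U)."""
    mean_voice = speaker_means.mean(axis=0)
    centered = speaker_means - mean_voice
    # Rows of vt are the principal directions of speaker variation.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    U = vt[:n_dims].T  # (D, n_dims) eigenvoice bases
    return mean_voice, U

def adapt(mean_voice, U, w):
    """Adapted model mean: mu = mean_voice + U w."""
    return mean_voice + U @ w

def project(mean_voice, U, target_mean):
    """Least-squares weights for a known target mean (U has orthonormal columns)."""
    return U.T @ (target_mean - mean_voice)
```

With `n_dims` much smaller than `D`, a new speaker is described by only a handful of weights `w`, which is what makes adaptation from limited data feasible.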

7 Speaker-Adapted Separation

8 Speaker-Adapted Separation
Eigenvoices for Speech Separation task
speaker-adapted (SA) performs midway between speaker-dependent (SD) and speaker-independent (SI)
[Figure labels: Mix, SA]

9 3. Soundtrack Classification
Short video clips as the evolution of snapshots: one location, no editing
browsing?
Need information for indexing...
video + audio
foreground + background


10 MFCC Covariance Representation
Each clip/segment: fixed-size statistics
similar to speaker ID and music genre classification
Full covariance matrix of MFCCs maps the kinds of spectral shapes present
Clip-to-clip distances for SVM classifier by KL or 2nd Gaussian model
[Figure: video soundtrack -> spectrogram (freq / kHz vs. time / sec, level / dB) -> MFCC features (MFCC bin vs. time / sec) -> MFCC covariance matrix (MFCC dimension vs. MFCC dimension)]
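The clip-level representation can be sketched as follows, assuming MFCC frames have already been computed elsewhere and arrive as a `(n_frames, n_mfcc)` array. The small diagonal regularizer and the use of a symmetrized KL divergence between single full-covariance Gaussians are assumptions made for this illustration; how the distance is turned into an SVM kernel is not shown.

```python
import numpy as np

def clip_stats(mfcc_frames, reg=1e-6):
    """Fixed-size summary of a clip: mean vector and full covariance of its MFCCs."""
    mu = mfcc_frames.mean(axis=0)
    cov = np.cov(mfcc_frames, rowvar=False)
    cov = cov + reg * np.eye(mfcc_frames.shape[1])  # keep the matrix invertible
    return mu, cov

def gaussian_kl(mu_p, cov_p, mu_q, cov_q):
    """KL(p || q) between two multivariate Gaussians."""
    d = mu_p.shape[0]
    cov_q_inv = np.linalg.inv(cov_q)
    diff = mu_q - mu_p
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)
    return 0.5 * (np.trace(cov_q_inv @ cov_p) + diff @ cov_q_inv @ diff
                  - d + logdet_q - logdet_p)

def symmetric_kl(stats_a, stats_b):
    """Symmetrized KL divergence, usable as a clip-to-clip distance."""
    return gaussian_kl(*stats_a, *stats_b) + gaussian_kl(*stats_b, *stats_a)
```

Because every clip reduces to the same-sized `(mu, cov)` pair regardless of duration, clips of different lengths can be compared directly.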

11 Classification Results
Chang, Ellis et al. '07; Lee & Ellis '10
All classifiers vs. all labels
Mutual Information Proportion: $\mathrm{MIP} = I(\text{classifier}; \text{label}) / H(\text{label})$
some concepts are more audio-related
[Figure: two classifier-vs-label matrices, "CCV: Average Precision" and "Mutual Info Prop", over the concepts RAND, Playground, Beach, Parade, NonMusicPerf, MusicPerf, WedDance, WedCerem, WedRecep, Birthday, Graduation, Bird, Dog, Cat, Biking, Swimming, Skiing, IceSkating, Soccer, Baseball, Basketball]
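The Mutual Information Proportion on this slide can be computed from binary classifier decisions and binary ground-truth labels as in the sketch below. Thresholding classifier scores into 0/1 decisions is an assumption of this example, and the function names are illustrative.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of a probability vector (zero entries ignored)."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information_proportion(decisions, labels):
    """MIP = I(classifier; label) / H(label) for binary (0/1) sequences.

    Assumes the labels are not all identical, otherwise H(label) is zero.
    """
    decisions = np.asarray(decisions, dtype=int)
    labels = np.asarray(labels, dtype=int)
    # 2x2 joint distribution of (decision, label).
    joint = np.zeros((2, 2))
    np.add.at(joint, (decisions, labels), 1)
    joint /= joint.sum()
    h_label = entropy_bits(joint.sum(axis=0))
    h_decision = entropy_bits(joint.sum(axis=1))
    h_joint = entropy_bits(joint.ravel())
    mutual_info = h_decision + h_label - h_joint
    return mutual_info / h_label
```

A value of 1 means the classifier output fully determines the label, while 0 means the two are statistically independent.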

12 Matching Videos via Fingerprints
Landmark pairs are a noise-robust fingerprint
Cotton & Ellis '10
Use to match distinct videos with same sound ambience
[Figure: spectrograms (freq / kHz vs. time / sec) of video IMpLQaiHWbE and video Yi1hkNkqHBc with matching landmark pairs marked]
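A rough sketch of landmark-pair fingerprinting, under simplifying assumptions: peaks are taken as the strongest bin in each spectrogram frame (a crude stand-in for real peak picking), each hash is the triple (f1, f2, dt), and two recordings are matched by voting on the time offset implied by shared hashes. All names and parameters here are hypothetical and not taken from the cited work.

```python
import numpy as np
from collections import Counter

def find_peaks(spec, per_frame=1):
    """Return (time, freq) of the strongest bin(s) in each frame of spec (F, T)."""
    peaks = []
    for t in range(spec.shape[1]):
        for f in np.argsort(spec[:, t])[-per_frame:]:
            peaks.append((t, int(f)))
    return peaks

def landmark_hashes(peaks, max_dt=5):
    """Pair each peak with later peaks up to max_dt frames away; hash = (f1, f2, dt)."""
    hashes = []
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1:]:
            dt = t2 - t1
            if dt > max_dt:
                break  # peaks are ordered by time, so no later peak qualifies
            if dt > 0:
                hashes.append(((f1, f2, dt), t1))
    return hashes

def best_offset(hashes_a, hashes_b):
    """Vote on t_a - t_b over all shared hashes; return (offset, votes)."""
    index = {}
    for h, t_a in hashes_a:
        index.setdefault(h, []).append(t_a)
    votes = Counter()
    for h, t_b in hashes_b:
        for t_a in index.get(h, []):
            votes[t_a - t_b] += 1
    return votes.most_common(1)[0] if votes else (None, 0)
```

A strong vote at a single offset suggests the two soundtracks share the same sound; chance hash collisions spread thinly across many offsets.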


13 4. Music Audio Analysis
... at all levels, from notes to genres
[Figure: "Let it Be" (final verse) analyzed at several levels: signal spectrogram (freq / kHz vs. time / s, level / dB); melody and piano transcription; onsets & beats; per-frame chroma; per-beat normalized chroma (time / beats, intensity)]
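The step from per-frame chroma to per-beat normalized chroma can be sketched as below. It assumes a `(12, n_frames)` chroma matrix and a strictly increasing list of beat frame indices are already available from earlier analysis stages; averaging within each beat and scaling each column to a peak of 1 are choices made for this illustration.

```python
import numpy as np

def beat_sync_chroma(chroma, beat_frames):
    """Average chroma between consecutive beats, then normalize each beat column.

    chroma: (12, n_frames); beat_frames: increasing frame indices below n_frames.
    """
    bounds = list(beat_frames) + [chroma.shape[1]]
    columns = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        segment = chroma[:, start:end].mean(axis=1)
        peak = segment.max()
        # Normalize so the strongest pitch class in each beat has value 1.
        columns.append(segment / peak if peak > 0 else segment)
    return np.stack(columns, axis=1)
```

Putting the time axis in beats rather than seconds makes the representation largely independent of tempo, which helps when comparing performances.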

14 Polyphonic Transcription
Apply the eigenvoice idea to music: eigeninstruments?
Subspace NMF
Grindlay & Ellis '09

15 Melodic-Harmonic Mining
Million Song Dataset as Echo Nest Analyze
Bertin-Mahieux et al. '10, '11
Frequent clusters of binarized event-chroma patches
Pipeline: Music audio -> Beat tracking -> Chroma features -> Key normalization -> Landmark identification -> Locality Sensitive Hash Table
[Figure: most frequent patch clusters, numbered with occurrence counts; original vs. reconstruction]

16 Results - Beatles
Over Beatles tracks, all beat offsets
LSH run time: approx. N log N in patches?
High-pass along time to avoid sustained notes
Song filter: remove hits in same track
[Figure: matched beat-chroma patches (chroma bin vs. beat) from I Should Have Known Better, Martha My Dear, Here There And Everywhere, Piggies]
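The patch-matching steps on this slide can be sketched as follows. The slides do not say which LSH family is used, so this example swaps in random-hyperplane hashing as an assumption; the high-pass along time is approximated by a first difference, and the song filter discards candidate pairs from the same track. All names and parameters are hypothetical.

```python
import numpy as np
from collections import defaultdict

def highpass_time(patch):
    """First difference along the beat axis, which suppresses sustained notes."""
    return np.diff(patch, axis=1)

def lsh_buckets(patches, n_bits=16, seed=0):
    """Hash each (12, n_beats) patch to a bucket via signs of random projections."""
    rng = np.random.default_rng(seed)
    feats = np.stack([highpass_time(p).ravel() for p in patches])
    planes = rng.normal(size=(feats.shape[1], n_bits))
    bits = (feats @ planes) > 0
    buckets = defaultdict(list)
    for idx, row in enumerate(bits):
        buckets[row.tobytes()].append(idx)
    return buckets

def cross_track_hits(buckets, track_ids):
    """Candidate matches: patches sharing a bucket but coming from different tracks."""
    hits = []
    for members in buckets.values():
        for pos, a in enumerate(members):
            for b in members[pos + 1:]:
                if track_ids[a] != track_ids[b]:
                    hits.append((a, b))
    return hits
```

Only patches that land in the same bucket are ever compared, which is how hashing avoids the quadratic cost of testing every pair of patches.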

17 Summary
LabROSA: getting information from sound
Speech: monaural separation using eigenvoices; binaural + reverb using MESSL
Environmental: classification of consumer video; landmark-based events and matching
Music: transcription of notes, chords, ...; large corpus mining
