Recognition & Organization of Speech & Audio

1 Recognition & Organization of Speech & Audio
Dan Ellis
Outline:
- Introducing
- Projects in speech, music & audio
- Summary
overview - Dan Ellis

2 Sound organization
[Figure: spectrogram (freq/Hz vs. time/s) with analysis labels: Voice (evil), Rumble, Stab, Voice (pleasant), Strings, Choir]
Central operation:
- continuous sound mixture -> distinct objects & events
Perceptual impression is very strong
- but hard to see in signal

3 Bregman's lake
"Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?" (after Bregman 9)
Received waveform is a mixture
- two sensors, N signals...
Disentangling mixtures as primary goal
- perfect solution is not possible
- need knowledge-based constraints

4 The information in sound
[Figure: spectrograms "Steps 1" and "Steps 2" (freq / Hz vs. time / s)]
A sense of hearing is evolutionarily useful
- gives organisms relevant information
Auditory perception is ecologically grounded
- scene analysis is preconscious (illusions)
- special-purpose processing reflects natural scene properties
- subjective not canonical (ambiguity)

5 Key themes
Sound organization: construct hierarchy
- at an instant (sources)
- along time (segmentation)
Scene analysis
- find attributes according to objects
- use attributes to form objects
- ... plus constraints of knowledge
Exploiting large data sets (the ASR lesson)
- supervised/labeled: pattern recognition
- unsupervised: structure discovery, clustering
Special cases:
- speech recognition
- other source-specific recognizers
- ... within a complete explanation

6 Outline
- Introducing
- Projects in speech, music & audio
  - Tandem speech recognition
  - Meeting recorder speech analysis
  - Musical information extraction
  - Alarm sound detection
- Summary

7 Automatic Speech Recognition (ASR)
Standard speech recognition structure:
[Figure: block diagram. sound -> Feature calculation -> feature vectors -> Acoustic classifier -> phone probabilities -> HMM decoder -> phone / word sequence -> Understanding/application...]
Trained from DATA:
- Acoustic model parameters
- Word models (e.g. "s ah t")
- Language model (e.g. p("sat" | "the","cat"), p("saw" | "the","cat"))
State of the art word-error rates (WERs):
- 2% (dictation)
- 3% (telephone conversations)
Can use multiple streams...
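The WER figures on this slide are conventionally computed as the word-level edit distance between the recognizer output and a reference transcript (substitutions + deletions + insertions), divided by the number of reference words. The sketch below illustrates that standard definition only; the function name `word_error_rate` and the example sentences are illustrative assumptions, not taken from the slides.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences (hypothetical): one substitution in three words.
example_wer = word_error_rate("the cat sat", "the cat saw")
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is one reason it is usually reported alongside the test condition.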

8 Tandem speech recognition (with Manuel Reyes, ICSI, OGI, CMU)
Neural net estimates phone posteriors; but Gaussian mixtures model finer detail
Combine them!
[Figure: three processing chains]
- Hybrid Connectionist-HMM ASR: Input sound -> Feature calculation -> Speech features -> Neural net classifier -> Phone probabilities -> Noway decoder -> Words
- Conventional ASR (HTK): Input sound -> Feature calculation -> Speech features -> Gauss mix models -> Subword likelihoods -> HTK decoder -> Words
- Tandem modeling: Input sound -> Feature calculation -> Speech features -> Neural net classifier -> Phone probabilities -> Gauss mix models -> Subword likelihoods -> HTK decoder -> Words
Train net, then train GMM on net output
- GMM is ignorant of net output meaning

9 Tandem system results: Aurora digits
[Figure: WER as a function of SNR for various Aurora99 systems; WER / % (log scale) vs. SNR / dB (averaged over 4 noises), up to clean; curves: HTK GMM baseline, Hybrid connectionist, Tandem, Tandem + PC]
Average WER ratio to baseline:
- HTK GMM: 1%
- Hybrid: 84.6%
- Tandem: 64.5%
- Tandem + PC: 47.2%
[Figure: Aurora 2 Eurospeech 21 Evaluation, avg. rel. improvement % (multicondition), by site: Columbia, Philips, UPC Barcelona, Bell s, IBM, Motorola 1, Motorola 2, Nijmegen, ICSI/OGI/Qualcomm, ATR/Griffith, AT&T, Alcatel, Siemens, UCLA, Microsoft, Slovenia, Granada]

10 The Meeting Recorder project (with ICSI, UW, SRI, IBM)
Microphones in conventional meetings
- for summarization/retrieval/behavior analysis
- informal, overlapped speech
Data collection (ICSI, UW,...):
- 1 hours collected, ongoing transcription
- headsets + tabletop + PDA

11 Crosstalk cancellation
Baseline speaker activity detection is hard:
[Figure: level/dB vs. time / secs for channels Spkr A to Spkr E and Table top; annotations: speaker active, speaker B cedes floor, interruptions, breath noise, crosstalk]
Noisy crosstalk model: m = C s + n
Estimate subband C_Aa from A's peak energy
- ... including pure delay (1 ms frames)
- ... then linear inversion
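The "linear inversion" step can be illustrated for the simplest case of two microphones and two speakers in a single subband: if the coupling matrix C in m = C s + n has been estimated and the noise n is neglected, the source levels follow from s = C^-1 m. This is a minimal sketch under those assumptions, not the project's implementation; the function name `invert_crosstalk_2x2` and all numeric values are hypothetical.

```python
def invert_crosstalk_2x2(C, m):
    """Solve m = C s for s with a 2x2 coupling matrix C, ignoring the noise term."""
    (a, b), (c, d) = C
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("coupling matrix is singular; sources cannot be separated")
    m1, m2 = m
    # Closed-form inverse of a 2x2 matrix applied to the mixture vector.
    s1 = (d * m1 - b * m2) / det
    s2 = (-c * m1 + a * m2) / det
    return s1, s2


# Hypothetical coupling: each headset picks up a fraction of the other speaker.
C = [[1.0, 0.2],
     [0.3, 1.0]]
s_true = (2.0, -1.0)
# Noise-free mixtures generated from the model m = C s.
m = (C[0][0] * s_true[0] + C[0][1] * s_true[1],
     C[1][0] * s_true[0] + C[1][1] * s_true[1])
s_est = invert_crosstalk_2x2(C, m)
```

With noise present the inversion is only approximate, and a near-singular C (microphones that hear both speakers almost equally) amplifies that noise, which is why the determinant check matters.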

12 PDA-based speaker change detection
Goal: small conference-tabletop device
Speaker turns from PDA mock-up signals?
[Figure: pda.aif: excerpt with 512-pt xcorr, 8% max thresh; lag/ms vs. time/s, annotated with speaker turns (Morgan, Adam, Dan)]
Transcript excerpt from the figure: "what's this one here this this one has a couple cheesy electrets" / "oh that's connected too or that's connected two- both" / "yep" / "yeah both channels" / "yeah that's that's the dummy dummy P.D.A." / "yeah it is"
SCD algo on spectral + interaural features
- average spectral + per-channel ITD, φ
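The lag plotted in the figure comes from cross-correlating the two PDA microphone channels; the lag at which the correlation peaks is an estimate of the interaural time difference (ITD) for the active speaker. The sketch below shows that idea on a single short frame using a direct (non-FFT) cross-correlation. The function names, the 16 kHz sample rate, and the test signal are assumptions for illustration, and the thresholding mentioned in the figure caption is not reproduced.

```python
def best_lag(left, right, max_lag):
    """Return the lag in samples (positive: right is delayed) maximizing the cross-correlation."""
    n = len(left)
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for t in range(n):
            u = t + lag
            if 0 <= u < n:
                score += left[t] * right[u]
        if score > best_score:
            best, best_score = lag, score
    return best


def lag_to_ms(lag_samples, sample_rate_hz):
    """Convert a lag in samples to milliseconds."""
    return 1000.0 * lag_samples / sample_rate_hz


# Hypothetical frame: the right channel is the left channel delayed by 3 samples.
left = [0.0] * 16
left[5], left[6] = 1.0, 0.5
right = [0.0] * 3 + left[:-3]

lag = best_lag(left, right, max_lag=8)
lag_ms = lag_to_ms(lag, sample_rate_hz=16000)  # assumed sample rate
```

A change in the estimated lag from one frame to the next suggests a different talker position, which is what makes it usable as a speaker-change cue alongside spectral features.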

13 Music analysis: Lyrics extraction (with Adam Berenzweig)
Vocal content is highly salient, useful for retrieval
Can we find the singing?
Use an ASR classifier:
[Figure: spectrogram (freq / kHz) and posteriogram (phone class) vs. time / sec for three examples: speech (trnset #58), music (no vocals #1), singing (vocals #17 + ...)]
Frame error rate ~2% for segmentation based on posterior-feature statistics
Lyric segmentation + transcribed lyrics -> training data for lyrics ASR...

14 Music analysis: Structure recovery (with Rob Turetsky)
Structure recovery by similarity matrices (after Foote)
- similarity distance measure?
- segmentation & repetition structure
- interpretation at different scales: notes, phrases, movements
- incorporating musical knowledge: theme similarity
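A similarity matrix compares every analysis frame of a piece with every other frame, so repeated sections show up as off-diagonal stripes of high similarity. The slide leaves the choice of distance measure open; the sketch below assumes cosine similarity between per-frame feature vectors, which is one common choice, and uses toy two-dimensional "features" in an A-B-A pattern. The function names and data are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def self_similarity_matrix(frames):
    """Return S where S[i][j] is the similarity between frame i and frame j."""
    n = len(frames)
    return [[cosine_similarity(frames[i], frames[j]) for j in range(n)]
            for i in range(n)]


# Toy feature frames (hypothetical): section A, section B, then A again.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
S = self_similarity_matrix(frames)
```

In this toy case the first and third frames match, so the high value at S[0][2] marks the repetition; with real audio the frames would be spectral feature vectors and the structure would be read off at several time scales.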

15 Alarm sound detection
Alarm sounds have particular structure
- people know them when they hear them
Isolate alarms in sound mixtures
[Figure: spectrograms, freq / Hz vs. time / sec]
- representation of energy in time-frequency
- formation of atomic elements
- grouping by common properties (onset &c.)
- classify by attributes...
Key: recognize despite background

16 The Machine listener
Goal: An auditory system for machines
- use same environmental information as people
Aspects:
- recognize spoken commands (but not others)
- track acoustic channel quality (for responses)
- categorize environment (conversation, crowd...)
Scenarios
- personal listener -> summary of your day
- autonomous robots: need awareness

17 Outline
- Introducing
- Projects in speech, music & audio
- Summary

18 Summary
[Diagram: domains, central techniques, applications]
DOMAINS: Broadcast, Movies, Lectures, Meetings, Personal recordings, Location monitoring
Object-based structure discovery & learning: Speech recognition, Speech characterization, Nonspeech recognition, Scene analysis, Audio-visual integration, Music analysis
APPLICATIONS: Structuring, Search, Summarization, Awareness, Understanding
