Sound, Mixtures, and Learning

Sound, Mixtures, and Learning
Dan Ellis <dpwe@ee.columbia.edu>
Laboratory for Recognition and Organization of Speech and Audio (LabROSA)
Electrical Engineering, Columbia University
http://labrosa.ee.columbia.edu/

Outline:
1. Auditory Scene Analysis
2. Speech Recognition & Mixtures
3. Fragment Recognition
4. Alarm Sound Detection
5. Future Work

1. Auditory Scene Analysis
[Spectrogram of a complex sound mixture (freq/Hz vs. time/s, level/dB), with analyzed events labeled: Voice (evil), Rumble, Stab, Voice (pleasant), Strings, Choir]
Auditory Scene Analysis: describing a complex sound in terms of high-level sources/events
- ... like listeners do
Hearing is ecologically grounded
- reflects natural scene properties
- subjective, not absolute

Sound, mixtures, and learning
[Spectrograms (freq/kHz vs. time/s): Speech + Noise = Speech + Noise]
Sound
- carries useful information about the world
- complements vision
Mixtures
- ... are the rule, not the exception
- the medium is transparent, but the sources are many
- must be handled!
Learning
- the speech recognition lesson: let the data do the work
- ... like listeners do
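The "Speech + Noise" panels are just additive superposition in the waveform domain. A minimal sketch of reproducing them (the filenames speech.wav and noise.wav are hypothetical placeholders; any two mono recordings at a common sample rate will do):

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import stft

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so the mixture has the requested speech-to-noise ratio."""
    noise = noise[:len(speech)]
    gain = np.sqrt(np.mean(speech**2) /
                   (np.mean(noise**2) * 10 ** (snr_db / 10.0)))
    return speech + gain * noise

fs, speech = wavfile.read("speech.wav")   # hypothetical filenames
_, noise = wavfile.read("noise.wav")
speech, noise = speech.astype(float), noise.astype(float)

mix = mix_at_snr(speech, noise, snr_db=0.0)
# log-magnitude spectrogram, as in the slide's freq-vs-time panels
f, t, Z = stft(mix, fs=fs, nperseg=512)
log_spec = 20 * np.log10(np.abs(Z) + 1e-10)
```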

The problem with recognizing mixtures
"Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake, and where are they?" (after Bregman 1990)
The received waveform is a mixture
- two sensors, N signals ... underconstrained
Is disentangling mixtures the primary goal?
- a perfect solution is not possible
- need experience-based constraints

Human Auditory Scene Analysis (Bregman 1990)
People hear sounds as separate sources. How?
- break the mixture into small elements (in time-frequency)
- elements are grouped into sources using cues
- sources have aggregate attributes
Grouping rules (Darwin, Carlyon, ...):
- cues: common onset/offset/modulation, harmonicity, spatial location, ...
[Diagram (after Darwin 1996): frequency analysis feeds onset, harmonicity, and position maps into a grouping mechanism that outputs source properties]

Cues to simultaneous grouping
[Spectrogram (freq/Hz vs. time/s): elements + attributes]
Common onset
- simultaneous energy has a common source
Periodicity
- energy in different bands with the same cycle
Other cues
- spatial (ITD/IID), familiarity, ...
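Both cues can be read off a time-frequency analysis. A rough sketch under stated assumptions (the filterbank producing a band envelope is taken as given, and the 80-400 Hz search range and rectified-flux onset measure are illustrative choices, not the talk's):

```python
import numpy as np
from scipy.signal import stft

def onset_map(x, fs, nperseg=512):
    """Per-band onset strength: half-wave-rectified frame-to-frame rise
    in log energy -- bands that jump together share a 'common onset'."""
    _, _, Z = stft(x, fs=fs, nperseg=nperseg)
    loge = np.log(np.abs(Z) + 1e-10)
    return np.maximum(np.diff(loge, axis=1), 0.0)

def band_periodicity(band_env, fs, fmin=80.0, fmax=400.0):
    """Dominant period of one band's amplitude envelope (at audio rate),
    via autocorrelation; bands sharing a period are candidates for
    grouping into a single harmonic source."""
    env = band_env - band_env.mean()
    ac = np.correlate(env, env, mode="full")[len(env) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return fs / lag        # estimated periodicity in Hz
```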

Context and Restoration
Context can create an expectation, i.e. a bias towards a particular interpretation
- e.g. auditory illusions: hearing what's not there
The continuity illusion
[Spectrogram (f/Hz vs. time/s): a tone perceived as continuous through interrupting noise bursts]
Sine-wave speech (SWS)
[Spectrogram (f/Bark vs. time/s): sinusoidal formant tracks]
- duplex perception
How to model these effects?

Computational Auditory Scene Analysis (CASA)
[Block diagram: mixture → CASA → Object 1 description, Object 2 description, Object 3 description, ...]
Goal: automatic sound organization; systems that pick out the sounds in a mixture
- ... like people do
- e.g. a voice against a noisy background, to improve speech recognition
Approach:
- psychoacoustics describes the grouping rules
- ... so just implement them?

The Representational Approach (Brown & Cooke 1993)
Implement psychoacoustic theory:
input mixture → Front end (signal feature maps: time-frequency energy, onset, period, freq. modulation) → Object formation (discrete objects) → Grouping rules (source groups)
- bottom-up processing
- uses the common-onset & periodicity cues
Able to extract voiced speech:
[Spectrograms (freq/Hz vs. time/s): original mixture vs. extracted voiced speech]

Adding top-down constraints
Perception is not direct, but a search for plausible hypotheses.
Data-driven (bottom-up):
input mixture → Front end (signal features) → Object formation (discrete objects) → Grouping rules (source groups)
- objects irresistibly appear
vs. Prediction-driven (top-down):
input mixture → Front end (signal features) → Compare & reconcile (prediction errors) → Hypothesis management (noise components, periodic components) → Predict & combine (predicted features, fed back for comparison)
- match observations against the parameters of a world-model
- need world-model constraints ...

Approaches to sound mixture recognition
Recognize the combined signal
- multicondition training
- combinatorics ...
Separate the signals
- e.g. CASA, ICA
- nice, if you can do it
Segregate features into fragments
- then missing-data recognition

Outline
1. Auditory Scene Analysis
2. Speech Recognition & Mixtures
- standard ASR
- approaches to speech + noise
3. Fragment Recognition
4. Alarm Sound Detection
5. Future Work

2. Speech recognition & mixtures
Speech recognizers are the most successful and sophisticated acoustic recognizers to date.
[Block diagram: sound → Feature calculation → feature vectors → Acoustic classifier (acoustic model parameters) → phone probabilities → HMM decoder (word models, e.g. "s ah t"; language model, e.g. p("sat"|"the","cat"), p("saw"|"the","cat")) → phone/word sequence → Understanding/application ...]
State-of-the-art word-error rates (WERs):
- ~2% (dictation)
- ~30% (telephone conversations)
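WER, the figure quoted above, is the word-level edit distance between reference and hypothesis divided by the reference length. A minimal sketch:

```python
def word_error_rate(ref, hyp):
    """WER = (substitutions + deletions + insertions) / len(ref),
    via standard Levenshtein alignment over word lists."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # delete all of ref[:i]
    for j in range(len(h) + 1):
        d[0][j] = j                      # insert all of hyp[:j]
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i][j] = min(d[i - 1][j - 1] + (r[i - 1] != h[j - 1]),
                          d[i - 1][j] + 1,     # deletion
                          d[i][j - 1] + 1)     # insertion
    return d[len(r)][len(h)] / len(r)

print(word_error_rate("the cat sat", "the cat saw"))  # 0.33: 1 sub / 3 words
```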

Learning acoustic models
Labeled training examples {x_n, ω(x_n)}
Goal: describe each class with, e.g., a GMM
- sort the examples according to class
- estimate the conditional pdf for each class ω_i: p(x | ω_i)
Separate models for each class
- generalization as "blurring"
Training-data labels come from:
- manual annotation
- the best path from an earlier classifier (Viterbi)
- EM: joint estimation of labels & pdfs
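A minimal sketch of the "sort by class, fit a pdf per class" recipe using scikit-learn GMMs (the component count and diagonal covariances are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_class_gmms(X, labels, n_components=8):
    """One GMM per class: an estimate of p(x | omega_i)."""
    labels = np.asarray(labels)
    return {c: GaussianMixture(n_components, covariance_type="diag")
               .fit(X[labels == c])
            for c in np.unique(labels)}

def classify(gmms, X, log_priors=None):
    """Per-frame argmax_c [ log p(x | omega_c) + log P(omega_c) ]."""
    classes = sorted(gmms)
    ll = np.stack([gmms[c].score_samples(X) for c in classes], axis=1)
    if log_priors is not None:
        ll += np.array([log_priors[c] for c in classes])
    return np.array(classes)[np.argmax(ll, axis=1)]
```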

Speech + noise mixture recognition
Background noise: the biggest problem for current ASR?
Feature-invariance approach: design features that reflect only the speech
- e.g. normalization, mean subtraction
- one model serves for both clean and noisy speech
Alternative: more complex models of the signal
- separate models for speech and noise
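Cepstral mean subtraction, the canonical invariance trick, exploits the fact that a fixed channel multiplies the spectrum and therefore adds a constant in the log/cepstral domain. A sketch:

```python
import numpy as np

def cepstral_mean_norm(cepstra):
    """Subtract each coefficient's per-utterance mean.
    cepstra: (n_frames, n_ceps) array from any cepstral front end.
    A stationary channel adds a constant per coefficient, so removing
    the mean removes the channel (and some speaker coloration)."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)
```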

HMM decomposition (e.g. Varga & Moore 1991, Roweis 2000)
The total signal model has independent state sequences for two or more component sources
[Diagram: state lattices for model 1 and model 2 evolving jointly against the observations over time]
New combined state space: q' = {q1, q2}
- new observation pdfs for each combination: p(x | q1, q2)
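A sketch of the combined state space, assuming two independent Markov chains and (an assumption here, not from the talk) the common log-max approximation for log-spectral observations:

```python
import numpy as np

def combine_hmms(A1, A2, means1, means2):
    """Product state space for two independent HMMs.
    Transitions factorize, P(q1',q2'|q1,q2) = A1[q1,q1'] * A2[q2,q2'],
    which is exactly the Kronecker product of the transition matrices.
    Combined-state observation means use the elementwise max of the
    component means (log-max approximation; an assumption here)."""
    A = np.kron(A1, A2)                    # shape (N1*N2, N1*N2)
    N1, N2 = len(means1), len(means2)
    means = np.array([np.maximum(means1[i], means2[j])
                      for i in range(N1) for j in range(N2)])
    return A, means                        # combined state q' = i*N2 + j
```

The catch the slide points at: the state space, and hence the decoding cost, grows multiplicatively with every added source.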

Outline
1. Auditory Scene Analysis
2. Speech Recognition & Mixtures
3. Fragment Recognition
- separating signals vs. separating features
- missing-data recognition
- recognizing multiple sources
4. Alarm Sound Detection
5. Future Work

3. Fragment Recognition (Jon Barker & Martin Cooke, Sheffield)
Signal separation is too hard! Instead:
- segregate the features into partially-observed sources
- then classify
Made possible by missing-data recognition
- integrate over the uncertainty in the observations to get the optimal posterior distribution
Goal: relate clean-speech models P(X|M) to speech-plus-noise mixture observations
- ... and make it tractable

Comparing different segregations
Standard classification chooses between models M to match the source features X:
  M* = argmax_M P(M|X) = argmax_M P(X|M) P(M) / P(X)
In a mixture, the observed features Y, the segregation S, and the source features X are related by P(X|Y,S)
[Illustration: observation Y(f), segregation mask S, source spectrum X(f)]
- spectral features allow a clean relationship
Joint classification of model and segregation:
  P(M,S|Y) = P(M) ∫ [ P(X|M) P(X|Y,S) / P(X) ] dX · P(S|Y)
- the integral collapses in several cases ...
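For reference, the slide's two formulas in display form; the comment in the middle (treating the clean features X as hidden and integrating them out) is my gloss on the derivation, not wording from the talk:

```latex
\begin{align}
M^* &= \arg\max_M P(M \mid X)
     = \arg\max_M \frac{P(X \mid M)\,P(M)}{P(X)} \\
% In a mixture, both the segregation S and the clean features X are
% hidden; integrating out X gives the joint model/segregation posterior:
P(M, S \mid Y) &= P(M)
  \int P(X \mid M)\,\frac{P(X \mid Y, S)}{P(X)}\,\mathrm{d}X
  \;\cdot\; P(S \mid Y)
\end{align}
```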

Calculating fragment matches
  P(M,S|Y) = P(M) ∫ [ P(X|M) P(X|Y,S) / P(X) ] dX · P(S|Y)
P(X|M)
- the clean-signal feature model
P(X|Y,S) / P(X)
- is X visible given the segregation? Integration collapses some channels ...
P(S|Y)
- the segregation inferred from the observation
- just assume uniform, and find the S for the most likely M
- or use extra information in Y to distinguish the S's, e.g. harmonicity, onset grouping
Result:
- a probabilistically correct relation between clean-source models P(X|M) and the inferred contributory source, P(M,S|Y)
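In the simplest "missing data" collapse, channels marked as target-dominated are scored against the clean model and the rest integrate out, which for a diagonal-covariance GMM just means dropping them. A sketch (the full fragment decoder additionally bounds the hidden source by the observed mixture energy; that refinement is omitted here):

```python
import numpy as np

def missing_data_loglik(x, mask, log_weights, means, variances):
    """log p(x_reliable | M) under a diagonal-covariance GMM.
    x: (D,) feature vector; mask: (D,) bool, True = reliable channel;
    means, variances: (K, D); log_weights: (K,)."""
    xp, mu, var = x[mask], means[:, mask], variances[:, mask]
    logp = -0.5 * (np.log(2 * np.pi * var)
                   + (xp - mu) ** 2 / var).sum(axis=1)
    return np.logaddexp.reduce(log_weights + logp)
```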

Speech fragment decoder results
["1754" + noise → SNR mask → fragments → fragment decoder → "1754"]
A simple P(S|Y) model forces contiguous regions to stay together
- big efficiency gain when searching the space of segregations S
[Plot: WER (%) vs. SNR (dB), AURORA 2 Test Set A: missing-data soft-SNR masks vs. HTK clean training vs. HTK multicondition training]
Clean-models-based recognition rivals trained-in-noise recognition

Multi-source decoding
Search for more than one source at a time
[Diagram: observations Y(t) jointly explained by masks S1(t), S2(t) with state sequences q1(t), q2(t)]
- mutually-dependent data masks
Use e.g. CASA features to propose masks
- locally coherent regions
Theoretical vs. practical limits

Outline
1. Auditory Scene Analysis
2. Speech Recognition & Mixtures
3. Fragment Recognition
4. Alarm Sound Detection
- sound
- mixtures
- learning
5. Future Work

4. Alarm sound detection
[Spectrogram (freq/kHz vs. time/s, level/dB): alarm sounds hrn1, bfr2, buz1 embedded in noise]
Alarm sounds have a particular structure
- people "know them when they hear them"
- clear even at low SNRs
Why investigate alarm sounds?
- they're supposed to be easy
- potential applications ...
Contrast two systems:
- standard: global features, P(X|M)
- sinusoidal model: fragments, P(M,S|Y)

Alarms: Sound (representation)
Standard system: Mel cepstra
- have to model the alarms in their noise context: each cepstral element depends on the whole signal
Contrast system: sinusoid groups
- exploit the sparse, stable nature of alarm sounds
- 2D-filter the spectrogram to enhance harmonics
- simple magnitude threshold, then track growing
- form groups based on common onset
[Spectrograms (freq/Hz vs. time/s): original, harmonic-enhanced, and extracted sinusoid tracks]
The sinusoid representation is already fragmentary
- it does not record non-peak energies
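A rough sketch of the "threshold, then track growing" step (the 2D harmonic-enhancement filtering is omitted, and the threshold and frequency-jump limits are illustrative choices, not the talk's values):

```python
import numpy as np
from scipy.signal import stft

def sinusoid_tracks(x, fs, thresh_db=-40.0, max_jump_hz=30.0):
    """Pick per-frame spectral peaks above a magnitude threshold, then
    chain peaks across consecutive frames when frequencies are close.
    Returns a list of tracks, each a list of (time, freq) points."""
    f, t, Z = stft(x, fs=fs, nperseg=1024)
    mag = 20 * np.log10(np.abs(Z) + 1e-10)
    mag -= mag.max()                         # dB relative to peak
    done, active = [], []
    for j in range(mag.shape[1]):
        col = mag[:, j]
        freqs = [f[i] for i in range(1, len(col) - 1)
                 if col[i] > thresh_db
                 and col[i] >= col[i - 1] and col[i] >= col[i + 1]]
        grown = []
        for tr in active:                    # try to extend each track
            near = [q for q in freqs if abs(q - tr[-1][1]) < max_jump_hz]
            if near:
                q = min(near, key=lambda v: abs(v - tr[-1][1]))
                tr.append((t[j], q)); freqs.remove(q); grown.append(tr)
            else:
                done.append(tr)              # no continuation: close it
        grown += [[(t[j], q)] for q in freqs]  # leftover peaks start tracks
        active = grown
    return done + active
```

Grouping would then cluster the resulting tracks by onset time, per the "common onset" bullet above.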

Alarms: Mixtures
Effect of varying SNR on the two representations:
- the sinusoid peaks have roughly invariant properties
[Panels at decreasing SNR (down to 0 dB): sine-track groups vs. normalized cepstra over time]

Alarms: Learning
Standard: train an MLP on noisy examples
- sound mixture → feature extraction (PLP cepstra) → neural-net acoustic classifier → alarm probability → median filtering → detected alarms
Alternative: learn distributions of group features (see the sketch below)
- duration, frequency deviation, amplitude modulation, ...
[Scatter plots of alarm vs. non-alarm groups: spectral moment vs. spectral centroid; magnitude SD vs. inverse frequency SD]
- the underlying models are clean (isolated)
- so they can be recognized in different contexts ...
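One plausible reading of "learn distributions of group features" is a likelihood-ratio test between per-class densities over a few scalars per sinusoid group; a hedged sketch (single Gaussians here, purely for illustration):

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_group_model(feats):
    """Full-covariance Gaussian over per-group features, e.g. columns
    (duration, frequency deviation, amplitude modulation).
    feats: (n_groups, n_features) array."""
    return multivariate_normal(mean=feats.mean(axis=0),
                               cov=np.cov(feats, rowvar=False))

def is_alarm(g, alarm_model, other_model, threshold=0.0):
    """Log-likelihood-ratio test for one group's feature vector g."""
    return alarm_model.logpdf(g) - other_model.logpdf(g) > threshold
```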

Alarms: Results
[Spectrograms (freq/kHz vs. time/s): restaurant noise + alarms, with the MLP classifier output vs. the sound-object classifier output]
Both systems commit many insertions at 0 dB SNR, but under different circumstances:

Noise        | Neural net system        | Sinusoid model system
             | Del      Ins    Tot      | Del      Ins    Tot
1 (ambient)  | 7/25     2      36%      | 14/25    1      60%
2 (babble)   | 5/25     63     272%     | 15/25    2      68%
3 (speech)   | 2/25     68     280%     | 12/25    9      84%
4 (music)    | 8/25     37     180%     | 9/25     135    576%
Overall      | 22/100   170    192%     | 50/100   147    197%

Alarms: Summary
Sinusoid domain
- feature components belong to one source
- simple segregation (grouping) model
- the alarm model is just properties of the group
- robust to partial feature observation
Future improvements
- more complex alarm-class models
- exploit the repetitive structure of alarms

Outline
1. Auditory Scene Analysis
2. Speech Recognition & Mixtures
3. Fragment Recognition
4. Alarm Sound Detection
5. Future Work
- generative models & inference
- model acquisition
- ambulatory audio

5. Future work
CASA as a generative-model parameterization:
[Generation model: source models M1, M2 (with model dependence Θ) generate source signals X1, X2; channel parameters C1, C2 shape them into received signals Y1, Y2, which combine into the observations O]
[Analysis structure: observations O → fragment formation → mask allocation {Ki} → model fitting {Mi} → p(X|Mi,Ki) → likelihood evaluation]

Learning source models
The speech recognition lesson: use the data as much as possible
- what can we do with unlimited data feeds?
Data sources
- clean data corpora
- identify near-clean segments in real sound
- build up clean views from partial observations?
Model types
- templates
- parametric/constraint models
- HMMs
Hierarchic classification vs. individual characterization ...

Personal Audio Applications
A smart PDA records everything
It is only useful if we have an index and summaries
- monitor for particular sounds
- real-time description
Scenarios
- personal listener: a summary of your day
- a future prosthetic hearing device
- autonomous robots
Meeting data, ambulatory audio

Summary
Sound
- carries important information
Mixtures
- need to segregate the different sources' properties
- fragment-based recognition
Learning
- information is extracted by classification
- models guide segregation
Alarm sounds
- a simple example of fragment recognition
General sounds
- recognize simultaneous components
- acquire classes from training data
- build an index and summary of real-world sound