Using Source Models in Speech Separation


Transcription:

Using Source Models in Speech Separation
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
dpwe@ee.columbia.edu http://labrosa.ee.columbia.edu/
Outline: 1. Mixtures, Separation, and Models. 2. Monaural Speech Separation. 3. Binaural Speech Separation. 4. Conclusions.
Source Models in Separation - Dan Ellis 2007-11-09 - 1/17

LabROSA Overview
Information extraction from music, speech, and environmental audio, combining machine learning and signal processing for recognition, separation, and retrieval.

1. Mixtures, Separation, and Models
Sounds rarely occur in isolation, so analyzing mixtures is a problem for humans and machines alike.
[Figure: multi-channel spectrograms of a recorded sound mixture (freq/kHz vs. time/sec, level/dB)]

Mixture Organization Scenarios
Interactive voice systems: human-level understanding is expected.
Speech prostheses: crowds are the #1 complaint of hearing aid users.
Archive analysis: identifying and isolating sound events.
Unmixing / remixing / enhancement...
[Figure: spectrogram of an everyday sound mixture (freq/kHz vs. time/sec, level/dB)]

Separation vs. Inference
Ideal separation is rarely possible: in many situations the overlaps cannot be removed.
Overlaps create ambiguity; scene analysis = find the most reasonable explanation.
Ambiguity can be expressed probabilistically, i.e. the posterior of sources {Si} given observations X:
P({Si} | X) ∝ P(X | {Si}) · P({Si})
where P(X | {Si}) is the combination physics, P({Si}) comes from the source models, and inference is a search over {Si}.
Better source models give better inference... learn from examples?
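The posterior rule above can be made concrete with a toy sketch: score a few hypothesized source pairs against an observed mixture under a Gaussian likelihood and a prior. Everything here (the additive mixing, the candidate spectra, sigma) is an illustrative assumption, not the talk's actual system.

```python
import numpy as np

def posterior_over_sources(x, hypotheses, prior, sigma=1.0):
    """P({Si} | X) ∝ P(X | {Si}) * P({Si}) over a discrete hypothesis set.

    hypotheses: list of (s1, s2) candidate source spectra
    prior: prior probability of each hypothesis
    Combination "physics" here is simple addition: x ≈ s1 + s2.
    """
    scores = []
    for (s1, s2), p in zip(hypotheses, prior):
        resid = x - (s1 + s2)                            # mixing model: sources add
        loglik = -0.5 * np.sum(resid ** 2) / sigma ** 2  # Gaussian log P(X | {Si})
        scores.append(loglik + np.log(p))                # + log prior P({Si})
    scores = np.asarray(scores)
    post = np.exp(scores - scores.max())
    return post / post.sum()                             # normalized posterior

# Two hypotheses about what produced the observed spectrum x
x = np.array([2.0, 1.0])
hyps = [(np.array([1.0, 0.0]), np.array([1.0, 1.0])),   # explains x exactly
        (np.array([0.0, 0.0]), np.array([0.0, 0.0]))]   # explains nothing
post = posterior_over_sources(x, hyps, prior=[0.5, 0.5])
```

The first hypothesis reconstructs x exactly, so it wins the posterior mass; with richer source models, the same scoring extends to much larger hypothesis spaces.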

Approaches to Separation
ICA: multi-channel, fixed filtering, perfect separation - maybe!
CASA: single-channel, time-varying filter, approximate separation (mix → STFT → mask → ISTFT)
Model-based: any domain, parameter search, synthetic output (mix → parameter fit against source models → synthesis)
... or combinations.
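The single-channel masking route (mix → STFT → mask → ISTFT) can be sketched with an oracle (ideal binary) mask; the two sinusoidal "sources", sample rate, and frame size are illustrative assumptions, and a real system would have to estimate the mask rather than peek at the sources.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 8000
t = np.arange(fs) / fs
target = np.sin(2 * np.pi * 440 * t)          # hypothetical target source
interference = np.sin(2 * np.pi * 1000 * t)   # hypothetical interference
mix = target + interference

# mix -> STFT
f, frames, X = stft(mix, fs, nperseg=256)
_, _, T = stft(target, fs, nperseg=256)
_, _, N = stft(interference, fs, nperseg=256)

# oracle binary mask: keep cells where the target dominates
mask = (np.abs(T) > np.abs(N)).astype(float)

# mask -> ISTFT resynthesis
_, est = istft(mask * X, fs, nperseg=256)

# masked resynthesis should be much closer to the target than the raw mix
err_est = np.mean((est[:len(target)] - target) ** 2)
err_mix = np.mean((mix - target) ** 2)
```

Because the two tones occupy disjoint frequency bins, the binary mask removes nearly all the interference; overlapping sources are exactly where this approximation degrades, motivating the model-based route.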

EM for Model-based Separation
The Expectation-Maximization algorithm solves partially-unknown problems (only local optimality is guaranteed).
E-step: find the distribution of the unknowns p(u) given the current model parameters Θ and observations x.
M-step: optimize Θ to maximize the fit to x given the current p(u).
u can be: the GMM mixture assignment, time-frequency cell dominance, the current phone of voice i, ...
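The slide's first example of u (GMM mixture assignment) makes a minimal EM sketch: the E-step computes p(u) as responsibilities under the current Θ, and the M-step re-fits Θ to maximize the expected fit. This is a generic 1-D, two-component toy with unit variances, not the talk's separation system.

```python
import numpy as np

# Toy data: two well-separated clusters
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(3, 0.5, 200)])

mu = np.array([-1.0, 1.0])   # Θ: component means (variances fixed at 1)
for _ in range(20):
    # E-step: p(u) -- posterior probability each point came from each component
    ll = -0.5 * (x[:, None] - mu[None, :]) ** 2
    r = np.exp(ll - ll.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)
    # M-step: update Θ to maximize the expected fit under p(u)
    mu = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)
```

The means converge to the cluster centers despite the "wrong" initialization; the local-optimality caveat on the slide shows up when initializations are less forgiving.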

What is a Source Model?
A source model describes signal behavior: it encapsulates constraints on the form of the signal (any such constraint can be seen as a model). A model has parameters; model + parameters gives an instance.
E.g. a source-filter model: an excitation source g[n] driving a resonance filter H(e^jω).
What is not a source model?
- detail not provided in the instance, e.g. using the phase from the original mixture
- constraints on the interaction between sources, e.g. independence, clustering of attributes
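The "model + parameters → instance" idea can be sketched with the source-filter example: a pulse-train excitation g[n] passed through a two-pole resonator H(e^jω). The pitch, formant frequency, and bandwidth values here are arbitrary illustrative parameters.

```python
import numpy as np
from scipy.signal import lfilter

fs = 8000
f0 = 100                                  # parameter: pitch (Hz)
excitation = np.zeros(fs // 10)           # 100 ms of signal
excitation[:: fs // f0] = 1.0             # g[n]: pulse train at f0

fc, bw = 500.0, 100.0                     # parameters: resonance freq and bandwidth (Hz)
r = np.exp(-np.pi * bw / fs)              # pole radius for the two-pole resonator
a = [1.0, -2 * r * np.cos(2 * np.pi * fc / fs), r * r]
instance = lfilter([1.0], a, excitation)  # model + parameters -> signal instance

# the strongest harmonic of the instance sits near the resonance
spec = np.abs(np.fft.rfft(instance))
peak_hz = np.argmax(spec) * fs / len(instance)
```

Changing f0, fc, or bw yields different instances of the same model, which is exactly the sense in which the model encapsulates constraints on the signal's form.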

2. Monaural Speech Separation
Cooke & Lee's Speech Separation Challenge: short, grammatically-constrained utterances:
<command:4> <color:4> <preposition:4> <letter:25> <number:10> <adverb:4>
e.g. "bin white by R 8 again"; the task is to report the letter + number for "white". Special session at Interspeech 2006.
Separation or Description? Two routes from mixture to words:
- separation: identify the target energy, t-f masking + resynthesis, then ASR with speech models
- description: match speech models directly against the mixture and find the best word sequence
Both rely on source knowledge.

Codebook Models
Given codebooks for the sources, find the best (most likely) states for the spectra:
p(x | i1, i2) = N(x; c_i1 + c_i2, Σ)   (combination model)
{i1(t), i2(t)} = argmax_{i1,i2} p(x(t) | i1, i2)   (inference of source state)
Sequential constraints can also be included, and different domains are possible for combining the c's and defining the distortion.
E.g. stationary noise: [Figure: mel spectrograms of original speech, speech in speech-shaped noise, and the VQ-inferred states, with mel magnitude SNRs]
Roweis 2001, 2003; Kristjansson 2006
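The per-frame codebook inference above reduces, for Σ = I, to a nearest-neighbor search over all codeword sums. A minimal sketch with small random codebooks (purely illustrative, not trained speech codebooks):

```python
import numpy as np

rng = np.random.default_rng(1)
c1 = rng.normal(size=(8, 16))   # hypothetical codebook for source 1 (8 codewords, 16 bins)
c2 = rng.normal(size=(8, 16))   # hypothetical codebook for source 2

def best_states(x):
    """argmax over (i1, i2) of p(x | i1, i2) = N(x; c_i1 + c_i2, I)."""
    combo = c1[:, None, :] + c2[None, :, :]   # all sums c_i1 + c_i2: shape (8, 8, 16)
    err = ((combo - x) ** 2).sum(axis=-1)     # -log N(x; combo, I) up to a constant
    return np.unravel_index(np.argmin(err), err.shape)

# a frame truly generated as c1[3] + c2[5] plus a little noise
x = c1[3] + c2[5] + 0.01 * rng.normal(size=16)
i1, i2 = best_states(x)
```

The quadratic cost of searching all codeword pairs is what sequential constraints and pruning address in realistic systems.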

Speech Recognition Models
Decode with a Factorial HMM, i.e. two state sequences, one model for each voice; exploit sequence constraints and speaker differences.
IBM's "superhuman" Iroquois system (Kristjansson, Hershey et al. 2006): fewer errors than people for same-speaker, same-level mixtures; exploits grammar constraints - higher-level dynamics.
[Figure: WER vs. SNR (6 dB down to -9 dB, same-gender speakers) for the SDL recognizer with no dynamics, acoustic dynamics, and grammar dynamics, compared to human listeners]
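Factorial-HMM decoding can be sketched as ordinary Viterbi over the product state space, with the mixture observation scored against the sum of the two chains' outputs. This toy uses two states per voice, scalar observations, and a shared sticky transition matrix, all illustrative assumptions far smaller than a real recognizer.

```python
import numpy as np

means = np.array([0.0, 2.0])                        # per-voice state output levels
trans = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))  # sticky transitions, shared by both voices

def factorial_viterbi(obs):
    # joint state (i, j) -> flat index 2*i + j; mixture model: x ≈ means[i] + means[j]
    joint_mean = (means[:, None] + means[None, :]).ravel()
    joint_trans = (trans[:, None, :, None] + trans[None, :, None, :]).reshape(4, 4)
    delta = -0.5 * (obs[0] - joint_mean) ** 2       # init: emission only
    back = []
    for x in obs[1:]:
        scores = delta[:, None] + joint_trans       # [prev, cur]
        back.append(scores.argmax(axis=0))
        delta = scores.max(axis=0) - 0.5 * (x - joint_mean) ** 2
    path = [int(delta.argmax())]
    for b in reversed(back):                        # backtrack
        path.append(int(b[path[-1]]))
    path.reverse()
    return [(p // 2, p % 2) for p in path]          # back to (voice1, voice2) states

obs = np.array([0.0, 0.0, 2.0, 2.0, 4.0, 4.0])      # level rises as each voice switches on
path = factorial_viterbi(obs)
```

The product state space grows multiplicatively with the number of voices and states, which is why real factorial decoders rely on pruning and structured approximations.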

Speaker-Adapted (SA) Models
The Factorial HMM needs distinct speakers. Use an eigenvoice speaker space and iterate between estimating the voice and separating the speech. Performance falls midway between speaker-independent (SI) and speaker-dependent (SD) models. (Weiss & Ellis 2007)
[Figure: spectrograms of the mixture, adaptation iterations 1, 3, and 5, and the SD model separation; accuracy chart comparing SI, SA, and SD models]

3. Binaural Speech Separation
2 or 3 sources in reverberation; assume just 2 ears.
Tasks: identify the positions of the sources (and their number?); recover the source signals.

Spatial Estimation in Reverb (Mandel & Ellis 2007)
Model the interaural spectrum of each source as stationary level and time differences; converge via EM to a(ω) and τ for each source.
The mask is Pr(X(t,ω) dominated by source i).
[Figure: left/right spectrograms, per-source ILD and IPD estimates, and the resulting masks]
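The interaural cues this EM operates on can be sketched directly: per-T-F-cell interaural level difference (ILD) and phase difference (IPD) from a two-channel STFT. The single anechoic source, its gain, and its delay are toy assumptions; the actual model handles multiple sources and reverberation.

```python
import numpy as np
from scipy.signal import stft

fs = 16000
rng = np.random.default_rng(0)
src = rng.normal(size=fs)                 # 1 s of noise as a hypothetical source

delay = 8                                 # samples later at the right ear
left = src
right = 0.5 * np.concatenate([np.zeros(delay), src[:-delay]])  # quieter + delayed

f, t, L = stft(left, fs, nperseg=512)
_, _, R = stft(right, fs, nperseg=512)

eps = 1e-12
ild = 20 * np.log10(np.abs(L) / (np.abs(R) + eps) + eps)  # dB, per T-F cell
ipd = np.angle(L * np.conj(R))                            # radians, per T-F cell

# for this single-source toy, ILD should sit near 20*log10(1/0.5) ≈ 6 dB
med_ild = np.median(ild)
```

With several sources active, the per-cell (ild, ipd) values form clusters, and fitting those clusters (while modeling their spread) is what the EM on the slide does.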

Spatial Estimation Results
Modeling the uncertainty improves results: there is a tradeoff between constraints and noisiness.
[Figure: example separations with SNR improvements in dB for different model variants]

Combining Spatial + Speech Models
Interaural parameters give ILDi(ω), ITDi, and Pr(X(t,ω) = Si(t,ω)).
A speech source model can give Pr(Si(t,ω) is a speech signal).
These can be combined into one big EM framework:
E-step: u is Pr(cell from source i) and the phoneme sequence.
M-step: Θ is the interaural parameters and the speaker parameters.
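One simple reading of the combination is per-cell evidence fusion: multiply the spatial and speech-model probabilities that a cell belongs to source i, then renormalize across sources. The Dirichlet-sampled probabilities below are placeholders for the two models' actual outputs, and the independence assumption is only a sketch of the full joint EM.

```python
import numpy as np

rng = np.random.default_rng(2)
n_src, n_cells = 2, 100
# placeholder per-cell probabilities from the two evidence sources
p_spatial = rng.dirichlet(np.ones(n_src), size=n_cells).T  # Pr(cell from i | ILD, IPD)
p_speech = rng.dirichlet(np.ones(n_src), size=n_cells).T   # Pr(cell from i | speech model)

joint = p_spatial * p_speech                     # fuse assuming independence
mask = joint / joint.sum(axis=0, keepdims=True)  # renormalized per-cell posterior
```

In the full framework these probabilities are not fixed inputs but are themselves re-estimated each EM iteration along with the interaural and speaker parameters.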

Summary & Conclusions
Inferring model parameters is very general... and very difficult, in general.
Speech models can separate single channels: a better match to the individual gives better results.
Spatial cues can separate binaural signals, but must account for uncertainty from e.g. reverberation.
An EM-type approach can integrate them both.