Lecture 9: Speech Recognition: Front Ends

EE E682: Speech & Audio Processing & Recognition
Lecture 9: Speech Recognition: Front Ends

1. Recognizing Speech
2. Feature Calculation

Dan Ellis <dpwe@ee.columbia.edu>
http://www.ee.columbia.edu/~dpwe/e682/
Columbia University Dept. of Electrical Engineering
Spring 2003

1. Recognizing Speech

[Spectrogram: "So, I thought about that and I think it's still possible" — frequency vs. time]

What kind of information might we want from the speech signal?
- words
- phrasing, speech acts (prosody)
- mood / emotion
- speaker identity

What kind of processing do we need to get at that information?
- time scale of feature extraction
- signal aspects to capture in features
- signal aspects to exclude from features

Speech recognition as Transcription

Transcription = speech to text:
- find a word string to match the utterance

Best suited to small-vocabulary tasks:
- voice dialing, command & control, etc.

Gives a neat objective measure: word error rate (WER) %
- can be a sensitive measure of performance

Three kinds of errors:

    Reference:   THE  CAT  SAT  ON   THE      MAT
    Recognized:  -    CAT  SAT  AN   THE  A   MAT
                 Del.           Sub.      Ins.

    WER = (S + D + I) / N
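As a concrete illustration (not from the lecture), a minimal Python sketch of WER via word-level edit-distance alignment; the function name and the use of the slide's example strings are ours:

    # Minimal WER sketch: dynamic-programming alignment of word sequences.
    def word_error_rate(reference: str, hypothesis: str) -> float:
        ref, hyp = reference.split(), hypothesis.split()
        # d[i][j] = min edits to turn the first i ref words into the first j hyp words
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i                      # i deletions
        for j in range(len(hyp) + 1):
            d[0][j] = j                      # j insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = d[i-1][j-1] + (ref[i-1] != hyp[j-1])   # substitution or match
                d[i][j] = min(sub, d[i-1][j] + 1, d[i][j-1] + 1)
        return d[len(ref)][len(hyp)] / len(ref)   # (S + D + I) / N

    # The slide's example: one deletion, one substitution, one insertion.
    print(word_error_rate("THE CAT SAT ON THE MAT",
                          "CAT SAT AN THE A MAT"))   # 3/6 = 0.5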

Limitations of the Transcription paradigm

Starts to fall down with natural speech:
- some "words" may not even exist

[Spectrogram with phone alignment: ay m eh m b ae r ax s t s ey dh ax l ae s — "I'M EMBARRASSED (TO) SAY THE LAST"]

Word transcripts do not capture everything:
- speaker changes, intonation, phrasing

Word error rate treats all errors as equal:
- small words ("of") counted like big words
- small differences ("company's" vs. "companies") counted like large ones ("held police" vs. "health plans")

Move towards other measures:
- e.g. task-defined: was the meaning recognized?

Why is Speech Recognition hard?

Why not match against a set of waveforms?
- waveforms are never (nearly!) the same twice
- speakers minimize information/effort in speech

Speech variability comes from many sources:
- speaker-dependent (SD) recognizers must handle within-speaker variability
- speaker-independent (SI) recognizers must also deal with variation between speakers
- all recognizers are afflicted by background noise and variable channels

Need recognition models that:
- generalize, i.e. accept variations within a range, and
- adapt, i.e. tune in to a particular variant

Within-speaker variability

Timing variation:
- word duration varies enormously
- fast speech reduces vowels

[Spectrogram with word and phone alignment of "So I thought about that and I think it's still possible" — frequency vs. time, showing large duration variation]

Speaking style variation:
- careful/casual articulation
- soft/loud speech

Contextual effects:
- speech sounds vary with context, role: "How do you do?"

Between-speaker variability

Accent variation:
- regional / mother tongue

Voice quality variation:
- gender, age, huskiness, nasality

Individual characteristics:
- mannerisms, speed, prosody

[Spectrograms of the same utterance from two speakers, "mbma" and "fjdm2" — freq / Hz vs. time / s]

Environment variability

Background noise:
- fans, cars, doors, papers

Reverberation:
- "boxiness" in recordings

Microphone channel:
- huge effect on relative spectral gain

[Spectrograms: close mic vs. tabletop mic — freq / Hz vs. time / s]

How to recognize speech?

Cross-correlate templates?
- waveform?
- spectrogram?
- time-warp problems

Match short segments & handle time-warp later:
- model with slices of ~10 ms
- pseudo-stationary model of words: sil g w eh n sil
- other sources of variation...

[Spectrogram with pseudo-stationary segment boundaries "sil g w eh n sil" — freq / Hz vs. time / s]

Which segments to use?

Assume words can be broken down into pseudo-stationary segments:
- not a perfect fit, but worth a try

Linguists offer phonemes or phones:
- phonemes are the minimal set needed to disambiguate words
- phones are realizations of phonemes

Other possibilities:
- data-clustering techniques to define segments intrinsically
- lesson from synthesis: transitions may be as important as, or more important than, steady portions

...but how to model?

Probabilistic formulation

The probability that a segment label is correct gives the standard form of speech recognizers:

Feature calculation transforms the signal into an easily classified domain:

    s[n]  ->  X_m,   m = n / H

Acoustic classifier calculates the probability of each mutually exclusive state q_i:

    p(q_i | X)

Finite state acceptor (i.e. HMM) finds the MAP match of an allowable state sequence to the probabilities:

    Q^ = argmax_{q_0, q_1, ..., q_L} p(q_0, q_1, ..., q_L | X_0, X_1, ..., X_L)

[Illustration: feature frames X_0, X_1, X_2, ... along time, labeled with states q_0 = "ay", q_1, ...]
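To make the MAP sequence search concrete, a minimal Viterbi sketch (ours, not the lecture's); the toy transition matrix, priors, and observation likelihoods are invented for illustration:

    import numpy as np

    def viterbi(log_obs, log_trans, log_prior):
        """MAP state sequence: argmax over q_0..q_L of p(q | X).
        log_obs: (L, S) per-frame log p(X_m | q_i),
        log_trans: (S, S) log p(q_j | q_i), log_prior: (S,) initial log probs."""
        L, S = log_obs.shape
        delta = log_prior + log_obs[0]        # best score ending in each state
        back = np.zeros((L, S), dtype=int)    # backpointers
        for m in range(1, L):
            scores = delta[:, None] + log_trans     # (prev state, next state)
            back[m] = scores.argmax(axis=0)
            delta = scores.max(axis=0) + log_obs[m]
        path = [int(delta.argmax())]                # trace back the best path
        for m in range(L - 1, 0, -1):
            path.append(int(back[m, path[-1]]))
        return path[::-1]

    # Toy example: 2 states, 4 frames (all quantities assumed).
    logp = np.log
    obs = logp(np.array([[.9, .1], [.8, .2], [.3, .7], [.2, .8]]))
    trans = logp(np.array([[.8, .2], [.2, .8]]))
    print(viterbi(obs, trans, logp(np.array([.5, .5]))))   # -> [0, 0, 1, 1]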

Standard speech recognizer structure

    sound -> [Feature calculation] -> feature vectors
          -> [Acoustic classifier (network weights)] -> phone probabilities
          -> [HMM decoder (word models, language model)] -> phone & word labeling

Questions:
- what are the best features?
- how do we do the acoustic classification?
- how do we find/match the state sequence?

Outline

1. Recognizing Speech
2. Feature Calculation
   - Spectrogram, MFCCs & PLP
   - Improving robustness

2. Feature Calculation

Goal: find a representational space most suitable for classification:
- waveform: voluminous, redundant, variable
- spectrogram: better, but still quite variable
- ...?

Pattern recognition: the representation is an upper bound on performance:
- maybe we should use the waveform...
- or maybe the representation can do all the work

Feature calculation is intimately bound to the classifier:
- pragmatic strengths and weaknesses

Features develop by slow evolution:
- current choices are more historical than principled

Desired characteristics for features

Provide the right information:
- extract the signal information relevant to the classification task
- suppress irrelevant information

Be compatible with the acoustic classifier:
- relatively low dimensionality
- uncorrelated dimensions?

Be practical:
- applicable in all circumstances
- relatively inexpensive to compute

Be robust:
- so far as possible, exclude nonspeech information

How to evaluate features?
- normally: just put them in a recognizer

Features (1): Spectrogram

Plain STFT as features, e.g.

    X_m[k] = S[mH, k] = sum_n s[n + mH] w[n] exp(-j 2π k n / N)

[Male and female example spectrograms, with a feature-vector slice marked — freq / Hz vs. time / s]

Similarities between corresponding segments, but still large differences.
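A minimal numpy sketch of the framed STFT above (ours; 25 ms windows, 10 ms hop, 16 kHz sampling, and a Hann window are assumed parameters):

    import numpy as np

    def stft_features(s, sr=16000, win_ms=25, hop_ms=10):
        """Frames of |X_m[k]|: X_m[k] = sum_n s[n + mH] w[n] exp(-j 2 pi k n / N)."""
        N = int(sr * win_ms / 1000)          # window length
        H = int(sr * hop_ms / 1000)          # hop size
        w = np.hanning(N)                    # analysis window w[n]
        n_frames = 1 + (len(s) - N) // H
        frames = np.stack([s[m*H : m*H + N] * w for m in range(n_frames)])
        return np.abs(np.fft.rfft(frames, axis=1))   # magnitude spectrogram

    # Example: 1 second of synthetic noise standing in for speech.
    s = np.random.randn(16000)
    X = stft_features(s)
    print(X.shape)    # (98, 201): frames x frequency bins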

Features (2): Cepstrum

Idea: decorrelate and summarize spectral slices:

    X_m[l] = IDFT{ log |S[mH, k]| }

- good for Gaussian models
- greatly reduces feature dimension

[Male and female examples: spectrum vs. cepstrum — vs. time / s]
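Continuing that sketch, the per-frame real cepstrum — the IDFT of the log magnitude — truncated to a low-dimensional summary (the 13-coefficient cut is a conventional choice, assumed here):

    def cepstrum(X, n_coef=13):
        """Real cepstrum of each spectral slice: IDFT of the log magnitude,
        truncated to the first n_coef quefrency bins."""
        logmag = np.log(X + 1e-10)               # avoid log(0)
        cep = np.fft.irfft(logmag, axis=1)       # back to 'quefrency'
        return cep[:, :n_coef]                   # low quefrencies = envelope

    C = cepstrum(X)      # X from the STFT sketch above
    print(C.shape)       # (98, 13)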

Features (3): Frequency axis warp

A linear frequency axis gives equal space to 0-1 kHz and 3-4 kHz:
- but their perceptual importance is very different

Warp the frequency axis closer to a perceptual axis:
- mel, Bark, constant-Q...

    X[c] = sum_{k = l_c}^{u_c} |S[k]|^2

[Male and female examples: spectrum vs. warped "audspec" — vs. time / s]
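A sketch of the band-summing warp using a triangular mel filterbank (the slide's equation sums rectangular bands l_c..u_c; triangular weights are the common MFCC variant, assumed here):

    def mel_filterbank(n_bands=24, n_fft=400, sr=16000):
        """Triangular filters spaced evenly on the mel scale."""
        mel = lambda f: 2595 * np.log10(1 + f / 700)      # Hz -> mel
        imel = lambda m: 700 * (10**(m / 2595) - 1)       # mel -> Hz
        edges = imel(np.linspace(mel(0), mel(sr / 2), n_bands + 2))
        bins = np.floor((n_fft + 1) * edges / sr).astype(int)
        fb = np.zeros((n_bands, n_fft // 2 + 1))
        for c in range(n_bands):
            lo, mid, hi = bins[c], bins[c + 1], bins[c + 2]
            fb[c, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
            fb[c, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
        return fb

    # Band energies: X[c] = sum_k weight[c, k] * |S[k]|^2
    A = X**2 @ mel_filterbank().T      # X from the STFT sketch: (98, 24)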

Features (4): Spectral smoothing

Generalizing across different speakers is helped by smoothing (i.e. blurring) the spectrum.

Truncated cepstrum is one way:
- MSE approximation to log |S[k]|

LPC modeling is a little different:
- MSE approximation to |S[k]|: prefers detail at peaks

[Male example: audspec vs. PLP-smoothed — level / dB vs. freq / chan, and vs. time / s]
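Cepstral truncation can be shown directly by zeroing high quefrencies and transforming back; a sketch continuing the STFT example (the 13-coefficient cutoff is again an assumption):

    def smooth_spectrum(X, n_coef=13):
        """Low-quefrency 'lifter': keep the low quefrencies and return the
        smoothed log spectrum -- the MSE-optimal truncated-cepstrum fit
        to log |S[k]|."""
        logmag = np.log(X + 1e-10)
        cep = np.fft.irfft(logmag, axis=1)
        cep[:, n_coef:-n_coef] = 0        # zero high quefrencies, keep symmetry
        return np.fft.rfft(cep, axis=1).real

    S_smooth = smooth_spectrum(X)    # same shape as X, spectral detail blurred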

Features (5): Normalization along time

Idea: keep feature variations, not absolute level.

Hence: calculate the average level and subtract it (in the log domain):

    X[k] = S[k] - mean_m{ S[mH, k] }

This factors out a fixed channel frequency response:

    s[n] = h[n] * e[n]   =>   log |S[k]| = log |H[k]| + log |E[k]|

[Male and female examples: PLP vs. mean-normalized — vs. time / s]
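Mean normalization is one line in practice; a sketch over the truncated cepstra from above:

    def mean_normalize(C):
        """Subtract the per-dimension mean over time (cepstral mean normalization).
        A fixed channel adds a constant in the log/cepstral domain, so the
        long-term mean carries the channel and can be removed."""
        return C - C.mean(axis=0, keepdims=True)

    C_norm = mean_normalize(C)     # C from the cepstrum sketch above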

Features (6): RASTA filtering

Mean subtraction is high-pass filtering along time in the log-spectral domain:

    X[k] = S[k] - lpf{ S[k] }

Also smooth along time for more blurring: the combination is a band-pass filter in time.
- relates to modulation sensitivity in hearing?

[Filter response: gain / dB vs. modulation freq / Hz; male example: PLP vs. RASTA — vs. time / s]
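A sketch of the band-pass filtering along time, using the classic RASTA filter coefficients from Hermansky & Morgan (the pole at 0.98 follows their paper; published implementations vary slightly):

    from scipy.signal import lfilter

    def rasta_filter(logspec):
        """Band-pass filter each log-spectral channel along time.
        H(z) = 0.1 * (2 + z^-1 - z^-3 - 2 z^-4) / (1 - 0.98 z^-1)."""
        num = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])
        den = np.array([1.0, -0.98])
        return lfilter(num, den, logspec, axis=0)    # filter along frames

    R = rasta_filter(np.log(A + 1e-10))    # A: band energies from the mel sketch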

Delta features

We want each segment to have static feature values, but some segments are intrinsically dynamic! So calculate their derivatives, which may be steadier.

Append dX/dt (and d^2X/dt^2) to the feature vectors.

- relates to onset sensitivity in humans?

[Male example: PLP (mean/variance normalized), deltas, double-deltas — vs. time / s]
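A sketch of regression-based deltas and double-deltas appended to the feature frames (a window half-width of 2 frames is a conventional choice, assumed here):

    def add_deltas(F, width=2):
        """Append regression-based first and second differences (deltas and
        double-deltas) to feature frames F (frames x dims)."""
        def delta(F):
            pad = np.pad(F, ((width, width), (0, 0)), mode='edge')
            num = sum(t * (pad[width + t : len(F) + width + t] -
                           pad[width - t : len(F) + width - t])
                      for t in range(1, width + 1))
            return num / (2 * sum(t * t for t in range(1, width + 1)))
        d = delta(F)
        return np.hstack([F, d, delta(d)])

    feats = add_deltas(C_norm)    # (98, 39): statics + deltas + double-deltas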

Overall feature calculation

MFCCs and/or RASTA-PLP:

    MFCC path:      Sound -> FFT |X[k]| -> Mel-scale freq. warp -> log |X[k]|
                    -> IFFT -> Truncate -> Subtract mean (CMN) -> MFCC features

    RASTA-PLP path: Sound -> FFT |X[k]| -> Bark-scale freq. warp (audspec)
                    -> log |X[k]| -> RASTA band-pass -> LPC smooth
                    -> Cepstral recursion -> RASTA-PLP cepstral features

Key attributes:
- spectral, auditory scale
- decorrelation
- smoothed (spectral) detail
- normalization of levels
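Chaining the sketches above gives an end-to-end MFCC-style front end; a DCT is used as the decorrelating step, the usual modern stand-in for the slide's IFFT-of-log-spectrum, and all other parameters are as assumed earlier (16 kHz defaults):

    from scipy.fft import dct

    def mfcc(s, sr=16000, n_mfcc=13):
        """Sound -> STFT -> mel band energies -> log -> DCT -> truncate -> CMN."""
        X = stft_features(s, sr)                    # magnitude spectrogram
        A = X**2 @ mel_filterbank(sr=sr).T          # mel band energies
        logA = np.log(A + 1e-10)
        C = dct(logA, type=2, norm='ortho', axis=1)[:, :n_mfcc]
        return mean_normalize(add_deltas(C))        # CMN + delta features

    feats = mfcc(np.random.randn(16000))
    print(feats.shape)    # (98, 39)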

Features summary

[Male and female examples at successive stages: spectrum, audspec, rasta, deltas — vs. time / s]

- Normalize: the same phones should look the same
- Contrast: different phones should look different

Summary

Speech recognition as word transcription:
- neat definition, but limited
- hard because of variability

Feature calculation extracts information:
- smoothed, decorrelated spectral parameters
- long evolution to match classifiers

How to actually recognize feature sequences?