HCS 7367 Speech Perception


Dr. Peter Assmann, Fall 2012

Speech communication in adverse listening conditions

Babies "cry in mother's tongue"
Babies' cries imitate their mother tongue as early as three days old. German researchers say babies begin to pick up the nuances of their parents' accents while still in the womb.
http://news.bbc.co.uk/2/hi/health/834658.stm

Long-term spectrum of speech
[Figures: long-term spectra of connected speech and of vowels, for male and female talkers, shown relative to the absolute threshold of hearing.]
http://www.utdallas.edu/~assmann/aud636/sproc.pdf

Masking and interference
Energetic masking: reduced audibility of signal components due to overlap in spectral energy within the same auditory channel.
Informational masking: reduced audibility of signal components due to non-energetic factors such as target-masker similarity (e.g., forward vs. backward speech maskers; familiar vs. foreign language).
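Long-term spectra like those in the figures are typically estimated by averaging short-time power spectra over an extended passage of connected speech. A minimal sketch using Welch's method; the function name and FFT parameters are illustrative assumptions, not anything specified in the lecture:

```python
import numpy as np
from scipy.signal import welch

def long_term_spectrum(speech, fs):
    """Long-term average speech spectrum: average the power spectra
    of many overlapping short frames (Welch's method)."""
    f, pxx = welch(speech, fs=fs, nperseg=1024, noverlap=512)
    return f, 10 * np.log10(pxx + 1e-12)  # power spectral density in dB
```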

Resistance to distortion: effects of noise

Articulation score: the percentage of items correct on spoken lists of syllables, words, or sentences.

Signal-to-noise ratio (SNR), specified in decibels:

SNR = 20 log10 [ rms(speech) / rms(noise) ]

When speech and noise have the same average rms level (SNR = 0 dB), articulation scores are above 50% for listeners with normal hearing. Why is speech intelligible when the masker is presented at the same level as the speech?

Articulation Index
How much does audibility contribute to difficulty understanding speech in noise? The Articulation Index (AI) estimates the contribution of audibility (and other factors) to speech intelligibility. It:
1. divides the speech and masker spectrum into a small number of frequency bands;
2. estimates the audibility of speech in each band, weighted by its relative importance for intelligibility;
3. derives overall intelligibility by summing the contributions of each band (see the sketch below).
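As a concrete illustration, the sketch below scales a noise signal so that it mixes with speech at a requested SNR, then computes a toy band-weighted audibility sum in the spirit of the AI. The audibility rule roughly follows the classical AI convention of crediting band SNRs between -12 and +18 dB over a 30 dB range, but the function names and simplifications are mine, not the lecture's:

```python
import numpy as np

def rms(x):
    return np.sqrt(np.mean(x ** 2))

def mix_at_snr(speech, noise, target_snr_db):
    """Scale the noise so that SNR = 20*log10(rms(speech)/rms(noise))
    equals target_snr_db, then add it to the speech."""
    gain = rms(speech) / (rms(noise) * 10 ** (target_snr_db / 20))
    return speech + gain * noise

def articulation_index(speech_band_db, noise_band_db, importance):
    """Toy AI: per-band SNR mapped to audibility over a 30 dB range,
    then summed with band-importance weights."""
    band_snr = speech_band_db - noise_band_db
    audibility = np.clip((band_snr + 12) / 30, 0, 1)
    return np.sum(importance * audibility) / np.sum(importance)
```

With target_snr_db=0 the speech and noise have equal rms level, the condition under which normal-hearing listeners still score above 50%.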

Speech recognition in noise
Spectral properties of the noise: white, pink, speech-shaped, competing speech, speech babble.
Temporal properties of the noise: steady vs. modulated or interrupted.
[Figures: waveform and spectrum of white noise; spectra of pink noise and speech-shaped noise (amplitude in dB vs. frequency).]

Effects of noise on speech recognition
[Figure: recognition accuracy as a function of SNR. Source: Miller, Heise and Lichten, J. Exp. Psychol. 1951.]

Non-uniform noise: multi-talker babble
Speech babble: a mixture of sentences from multiple talkers.
[Figure: long-term spectrum of speech babble.]
[Demo: effect of increasing the number of competing voices — 1, 2, 4, 8, and 16 sentences.]
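The masker types above differ mainly in how noise power is distributed over frequency. A sketch of how each might be synthesized; the helper names are mine and the spectral shaping is deliberately simple:

```python
import numpy as np

def white_noise(n, rng=None):
    """Flat power spectrum."""
    rng = rng or np.random.default_rng()
    return rng.standard_normal(n)

def pink_noise(n, rng=None):
    """Power falls at 3 dB/octave: scale the amplitude spectrum by 1/sqrt(f)."""
    rng = rng or np.random.default_rng()
    spec = np.fft.rfft(rng.standard_normal(n))
    f = np.fft.rfftfreq(n)
    f[0] = f[1]                      # avoid dividing by zero at DC
    return np.fft.irfft(spec / np.sqrt(f), n)

def speech_shaped_noise(speech, rng=None):
    """Noise with the long-term magnitude spectrum of a speech sample:
    keep the magnitudes, randomize the phases."""
    rng = rng or np.random.default_rng()
    mag = np.abs(np.fft.rfft(speech))
    phase = rng.uniform(0, 2 * np.pi, mag.shape)
    return np.fft.irfft(mag * np.exp(1j * phase), len(speech))
```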

Effects of noise on vowel spectra
Broadband noise tends to fill up the valleys between the formant peaks, so spectral contrast (the peak-to-valley ratio) is reduced by the addition of noise. Because of the sloping long-term spectrum of speech, the upper formants (F3, F4, F5) are more susceptible to masking and distortion by the noise.
[Figures: spectrum and excitation pattern of a vowel in quiet and in pink noise at +6 dB SNR; effects of noise on formant peaks.]

Effects of filtering on speech
High-pass and low-pass filtering: low-pass filtering to remove frequencies above 1800 Hz reduces intelligibility from near perfect to around 67%. High-pass filtering to remove components below 1800 Hz also produces about 67%.
[Figure: identification accuracy (%) as a function of high-pass (HP) and low-pass (LP) cutoff frequency (Hz).]
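A sketch of the high-pass/low-pass manipulation with scipy; the Butterworth order is an arbitrary assumption, and the filters used in the classic studies differed:

```python
from scipy.signal import butter, sosfiltfilt

def lowpass(x, fs, cutoff_hz=1800.0, order=8):
    """Remove components above cutoff_hz (zero-phase Butterworth)."""
    sos = butter(order, cutoff_hz, btype="low", fs=fs, output="sos")
    return sosfiltfilt(sos, x)

def highpass(x, fs, cutoff_hz=1800.0, order=8):
    """Remove components below cutoff_hz."""
    sos = butter(order, cutoff_hz, btype="high", fs=fs, output="sos")
    return sosfiltfilt(sos, x)
```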

Bandpass filtering
Bandpass filtering with one-third octave filters centered at 1500-2100 Hz produces better than 95% accuracy for high-predictability sentences (Warren et al., 1995; Stickney & Assmann, 2001).

Speech communication has an extraordinary resilience to distortion.
1. Intelligibility remains high even when large portions of the spectrum are eliminated by filtering (Stickney and Assmann, JASA 2001).

Other frequency distortions
Notch filtering to remove frequencies between 800 and 3000 Hz leads to consonant identification scores better than 90% (Lippmann, 1996). Conclusion: speech cues are widely distributed.

Perception of filtered speech
Everyday English sentences filtered using narrow bandpass filters remain highly intelligible (>90% words correct): one-third octave bandwidth, 1500 Hz center frequency, steep (100 dB/octave) slopes (Warren et al., Percept Psychophys 1995; JASA 2000). A sketch of such a filter follows below.

Speech communication has an extraordinary resilience to distortion.
2. Large segments of the waveform can be deleted or replaced by silence.
[Figure: waveform interrupted at a rate of 5 Hz; 1-second scale bar. Stickney and Assmann (JASA 2001).]
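A one-third-octave band has edges at fc·2^(-1/6) and fc·2^(+1/6). The sketch below builds such a band around an arbitrary center frequency; note that a moderate-order Butterworth has far shallower skirts than the very steep slopes used by Warren et al., so this only approximates their conditions:

```python
from scipy.signal import butter, sosfiltfilt

def third_octave_bandpass(x, fs, center_hz=1500.0, order=8):
    """Pass only a one-third-octave band around center_hz."""
    lo = center_hz * 2 ** (-1 / 6)   # lower band edge
    hi = center_hz * 2 ** (1 / 6)    # upper band edge
    sos = butter(order, [lo, hi], btype="band", fs=fs, output="sos")
    return sosfiltfilt(sos, x)
```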

Speech communication has an extraordinary resilience to distortion.
3. Noise can be added to the speech signal at equal intensity (signal-to-noise ratio = 0 dB).
[Figures: spectrograms of speech, speech-shaped noise, and their mixture.]

Speech communication has an extraordinary resilience to distortion.
4. When the noise is from a competing voice, target and masker are similar and must be segregated.

How do listeners achieve this?
- Statistical redundancy of speech/language
- Combined strategies of top-down + bottom-up processing
- Grouping and segregation of auditory objects
- Tracking speech properties over time
- Glimpsing speech fragments during noise-free intervals

Redundancy in speech and language
Coker and Umeda (1974) define redundancy as "any characteristic of the language that forces spoken messages to have, on average, more basic elements per message, or more cues per basic element, than the barest minimum [necessary for conveying the linguistic message]."

Redundancy in error correction
"Redundancy can be used effectively; or it can be squandered on uneven repetition of certain data, leaving other crucial items very vulnerable to noise.... But more likely, if a redundancy is a property of a language and has to be learned, then it has a purpose." (Coker and Umeda, 1974, p. 349)

Redundancy contributes to speech perception in several ways:
1. by limiting perceptual confusion due to errors in speech production;
2. by helping to bridge gaps in the signal created by interfering noise, reverberation, and distortions of the communication channel; and
3. by compensating for momentary lapses in attention and misperceptions on the part of the listener.

Effects of context
Contextual cues lead to improved speech understanding in noise: acoustic-phonetic context, prosodic context, semantic context, and syntax.
[Figure: percent items correct as a function of SNR. Miller, Heise & Lichten, 1951.]

Recognition of interrupted speech in quiet
Interrupted speech: the speech is turned on and off at regular intervals using an electronic switch (Miller and Licklider, JASA 1950); see the gating sketch below.
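A sketch of the electronic-switch manipulation: gate the signal with a square wave at a given interruption rate and duty cycle. Gating the noise instead of the speech (keeping the speech continuous) gives the interrupted-noise conditions discussed next, and filling the speech gaps with noise gives the picket-fence stimuli described later. Function and parameter names are mine:

```python
import numpy as np

def interrupt(x, fs, rate_hz, duty=0.5, fill=None):
    """Square-wave gating of a signal.

    rate_hz: interruption rate (on/off cycles per second)
    duty:    fraction of each cycle during which the signal is on
    fill:    optional signal (same length as x, e.g. noise) inserted
             into the off intervals instead of silence
    """
    t = np.arange(len(x)) / fs
    gate = (t * rate_hz) % 1.0 < duty
    off = np.zeros(len(x)) if fill is None else fill
    return np.where(gate, x, off)
```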

Interrupted speech
[Figure: word identification accuracy (%) as a function of the frequency of interruptions. Miller and Licklider, JASA 1950.]

In quiet, speech can be interrupted (turned on and off) periodically without substantial loss of intelligibility (Miller & Licklider, 1950). Miller and Licklider found the worst intelligibility for interruption rates below 2 Hz, where large speech fragments (words, phrases) are missing, and improved performance for interruption rates between 10 and 100 Hz. Why? For very high interruption rates (>1 kHz) the signal sounded continuous, and performance was near perfect.

Masking of speech by interrupted noise
Miller and Licklider also measured speech intelligibility in conditions where the speech was continuous but the noise was interrupted.
[Demo: interruption rates of 16 Hz, 128 Hz, and 512 Hz.]

Masking of speech by interrupted noise
When the noise is intermittent rather than continuous there is a release from masking. The benefits of non-stationarity depend on the interruption rate and the duty cycle (on/off ratio) of the noise.

Interrupted noise
At low interruption rates the effects are similar to speech interrupted by silence. As the interruption rate increases there is a gradual improvement in speech recognition. With 10 interruptions per second, listeners receive several glimpses of each word and can patch together those glimpses to recognize about 75% of the words correctly.

When a noise masker is alternated with silence using a 50% duty cycle, there may be considerable masking release compared to a continuous masker, especially with alternation rates between 10 and 200 per second (Miller and Licklider, 1950).

Summary: interrupted noise
1. At alternation rates between about 10 and 200 per second, listeners can patch together cues from the clean segments between the bursts of noise.
2. With slower interruption rates, entire words or phrases are masked; others are noise-free.
3. At rates above 200 per second the masking effect is the same as uninterrupted, continuous noise.

Picket-fence effect
Interrupted speech can have a harsh, distorted quality. But when speech and noise are alternated periodically, filling the silent gaps with noise, the speech sounds smooth and continuous. Possibly, noise in the gaps enhances the listener's ability to exploit contextual cues.

Checkerboard noise maskers (Howard-Jones and Rosen, 1993)
Effects of interruption rate and frequency bandwidth of the checkerboard pattern. Can listeners exploit asynchronous time-frequency glimpses? Yes, but only over broad frequency ranges. (A sketch of such a masker follows below.)
[Figures: checkerboard time-frequency masker patterns, frequency vs. time.]

A glimpsing model of speech perception in noise
Martin Cooke, Journal of the Acoustical Society of America, Vol. 119, No. 3, pp. 1562-1573, March 2006.

Speech source separation
How do the ear and brain separate the target voice from the noise?
- spatial cues
- lip-reading
- semantic context
- auditory scene analysis (Bregman, 1990)
- glimpsing and tracking
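A sketch of a checkerboard masker in the spirit of Howard-Jones and Rosen (1993): the noise is split into frequency bands, and alternate bands are gated in antiphase so that glimpses are asynchronous across frequency. Band edges, filter order, and names are illustrative assumptions:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def checkerboard_noise(n, fs, band_edges_hz, rate_hz, rng=None):
    """Gate alternate noise bands in antiphase (50% duty cycle)."""
    rng = rng or np.random.default_rng()
    noise = rng.standard_normal(n)
    t = np.arange(n) / fs
    gate = (t * rate_hz) % 1.0 < 0.5          # square-wave gate
    out = np.zeros(n)
    for i, (lo, hi) in enumerate(zip(band_edges_hz[:-1], band_edges_hz[1:])):
        sos = butter(4, [lo, hi], btype="band", fs=fs, output="sos")
        band = sosfiltfilt(sos, noise)
        out += band * (gate if i % 2 == 0 else ~gate)  # antiphase bands
    return out
```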

Auditory scene analysis (Bregman, 1990)
The sound that reaches the eardrum of the listener is often a mixture of different sources. Acoustic signals originating from different sound sources combine additively. Unlike vision, the concept of occlusion is hard to define in audition: sounds overlap but also combine in complex ways.

Computational auditory scene analysis
Reviewed by Cooke and Ellis (2001). Human listeners are good at separating mixtures of sounds, as reflected in speech communication and listening to music in complex listening environments (cocktail parties). Attempts to reproduce this separation process using computational models have had limited success (a hard problem!).

Glimpsing speech in noise
Speech is a highly modulated signal; in time and frequency, regions of high energy are typically sparsely distributed. The information conveyed by the spectro-temporal energy distribution of clean speech is redundant, and this redundancy allows speech to be identified from relatively sparse evidence.
[Figures: spectrograms illustrating sparse high-energy regions, frequency vs. time.]

Can listeners take advantage of glimpses?
- Listeners can direct attention to spectro-temporal regions where the speech+noise mixture is dominated by the target speech.
- An ASR system was trained to recognize consonants in noise; the maskers differed in glimpse size.
- The ASR model was developed to exploit the non-uniform distribution of SNR in different time-frequency bands.
- Conclusion: both the model and listeners benefit from glimpsing.

Speech + noise mixtures
Some regions are dominated by the target voice; local SNR varies across time and frequency. Where the target voice dominates, the problem of source segregation is solved because the signal is effectively clean speech. Clean speech is highly redundant; it remains intelligible after 50% or more of its energy is removed by gating and/or filtering. (A glimpse-detection sketch follows below.)
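A minimal sketch of glimpse detection: compute the local SNR in time-frequency cells and keep the cells where it exceeds a threshold (~3 dB in Cooke, 2006). This uses a plain STFT rather than Cooke's auditory (STEP) front end, so it is only an approximation; it also assumes the clean speech and noise are available separately, which is exactly the "local SNR must be known in advance" limitation noted at the end of this section:

```python
import numpy as np
from scipy.signal import stft

def glimpse_mask(speech, noise, fs, threshold_db=3.0):
    """Binary mask of time-frequency cells where local SNR > threshold."""
    _, _, S = stft(speech, fs=fs, nperseg=512)
    _, _, N = stft(noise, fs=fs, nperseg=512)
    local_snr_db = 20 * np.log10(np.abs(S) / (np.abs(N) + 1e-12))
    return local_snr_db > threshold_db

# The proportion of glimpsed cells, glimpse_mask(...).mean(), is the
# quantity that correlates strongly with intelligibility across maskers.
```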

STEP model
Auditory excitation pattern (Moore, 2003): a spectrogram-like representation that
- reflects non-uniform frequency selectivity in different frequency bands;
- incorporates a sliding time window reflecting temporal analysis by the auditory system;
- captures relative audibility at different frequencies (loudness model).

Missing-data ASR
HMM-based speech recognizer. Missing-data models:
- Glimpses only: ignore missing information (in masked regions).
- Glimpses-plus-background: try to fill in missing information (based on the masked regions).

Sparseness and redundancy
Glimpses = spectro-temporal regions where the signal exceeds the masker by ~3 dB.
[Figures: glimpses for a single-talker masker, an eight-talker masker, and speech-shaped noise; syllable identification accuracy as a function of the number of competing voices, with the level of the target speech (monosyllabic nonsense words) held constant at 95 dB (after Miller, 1947).]

Results and conclusions
Best model:
- uses information in the glimpses and counter-evidence in the masked regions;
- constrains glimpses to a minimum area;
- treats all regions with local SNR > -5 dB as potential glimpses.

[Fig. 4 (Cooke, 2006): the correlation between intelligibility and the proportion of the target speech in which the local SNR exceeds 3 dB. Each point represents a noise condition, and proportions are means across all tokens in the test set. The best linear fit is also shown. The correlation between listeners and these putative glimpses is 0.955.]

A higher glimpse threshold (e.g., local SNR > 0 dB) produces fewer glimpses, but provides less distorted information than a lower threshold (e.g., -5 dB).

Limitations: the local SNR must be known in advance. Is there a way to estimate the local SNR directly from the mixture? Tracking problem: how to integrate glimpses over time?

Brungart et al. (2001)
[Figures: two-talker correct responses (%) as a function of target-to-masker ratio (dB) for different-talker, same-talker, same-sex, and modulated-noise maskers.]