HCS 7367 Speech Perception
Dr. Peter Assmann
Fall 2012

Babies 'cry in mother's tongue'
Babies' cries imitate their mother tongue as early as three days old. German researchers say babies begin to pick up the nuances of their parents' accents while still in the womb.
http://news.bbc.co.uk/2/hi/health/834658.stm

Speech communication in adverse listening conditions

Long-term spectrum of speech
[Figure: long-term spectrum of connected speech for males and females, plotted against the absolute threshold of hearing]
http://www.utdallas.edu/~assmann/aud636/sproc.pdf

Long-term spectrum of speech
[Figure: long-term spectra of vowels for males and females]

Masking and interference
Energetic masking: reduced audibility of signal components due to overlap in spectral energy within the same auditory channel.
Informational masking: reduced audibility of signal components due to non-energetic factors such as target-masker similarity (e.g., forward vs. backward speech maskers; familiar vs. foreign language).
Resistance to distortion

Effects of noise
Articulation score: % items correct on spoken lists of syllables, words or sentences.
Signal-to-noise ratio (SNR): when speech and noise have the same average rms level (SNR = 0 dB), articulation scores are above 50% for listeners with normal hearing.

Signal-to-noise ratio (SNR)
SNR = 20 log10 [ rms(speech) / rms(noise) ]
Specified in decibels.
When speech and noise have the same average rms level (SNR = 0 dB), articulation scores are above 50% for listeners with normal hearing.
Why is speech intelligible when the masker is presented at the same level as the speech?

Articulation Index
How much does audibility contribute to difficulty understanding speech in noise?
The Articulation Index (AI) estimates the contribution of audibility (and other factors) to speech intelligibility:
1. Divide the speech and masker spectrum into a small number of frequency bands.
2. Estimate the audibility of speech in each band, weighted by its relative importance for intelligibility.
3. Derive overall intelligibility by summing the contributions of each band.
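The SNR formula above is easy to verify numerically. The sketch below (illustrative only; signal choices are arbitrary) scales a noise to the same rms level as a tone, so the computed SNR comes out at 0 dB:

```python
import numpy as np

def rms(x):
    """Root-mean-square level of a signal."""
    return np.sqrt(np.mean(np.square(x)))

def snr_db(speech, noise):
    """Signal-to-noise ratio in decibels: 20 * log10(rms(speech) / rms(noise))."""
    return 20.0 * np.log10(rms(speech) / rms(noise))

fs = 16000
t = np.arange(fs) / fs
speech = np.sin(2 * np.pi * 440 * t)                   # stand-in for a speech signal
noise = np.random.default_rng(0).standard_normal(fs)
noise *= rms(speech) / rms(noise)                      # equalize average rms levels
print(snr_db(speech, noise))                           # essentially 0 dB
```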
Speech recognition in noise
Spectral properties of the noise: white, pink, speech-shaped, competing speech, speech babble.
Temporal properties of the noise: steady vs. modulated or interrupted.

White noise
[Figure: waveform and flat long-term spectrum of white noise]

Effects of noise on speech recognition
[Figure: word recognition as a function of SNR. Source: Miller, Heise and Lichten, J. Exp. Psychol. 1951]

Non-uniform noise
[Figure: spectra of pink noise and speech-shaped noise]

Non-uniform noise
Multi-talker babble: speech babble (mixture of 10 sentences from 10 talkers).
[Figure: spectrum of multi-talker babble]
Effect of increasing the number of competing voices: 1, 2, 4, 8, 16 sentences.
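The white/pink distinction can be made concrete by shaping Gaussian noise in the FFT domain. This is a generic sketch (not from the lecture): a slope of 0 leaves the power spectrum flat (white), while an amplitude slope of f^-0.5 gives power falling as 1/f, i.e. pink noise with equal power per octave:

```python
import numpy as np

def shaped_noise(n, fs, slope, seed=1):
    """Gaussian noise spectrally shaped in the FFT domain.
    slope = 0 -> white (flat power spectrum); slope = -0.5 -> pink (power ~ 1/f)."""
    rng = np.random.default_rng(seed)
    spec = np.fft.rfft(rng.standard_normal(n))
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    gain = np.ones_like(freqs)
    gain[1:] = freqs[1:] ** slope          # amplitude proportional to f^slope
    return np.fft.irfft(spec * gain, n)

fs, n = 16000, 2 ** 16
pink = shaped_noise(n, fs, -0.5)

# Pink noise has (approximately) equal power per octave:
spec = np.abs(np.fft.rfft(pink)) ** 2
freqs = np.fft.rfftfreq(n, d=1.0 / fs)
octave_power = lambda lo, hi: spec[(freqs >= lo) & (freqs < hi)].sum()
print(octave_power(500, 1000) / octave_power(1000, 2000))   # close to 1
```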
Effects of noise on vowel spectra
Broadband noise tends to fill in the valleys between the formant peaks.
Spectral contrast (peak-to-valley ratio) is reduced by the addition of noise.
Because of the sloping long-term spectrum of speech, the upper formants (F3, F4, F5) are more susceptible to masking and distortion by the noise.
[Figure: spectrum and excitation pattern of the vowel / / in quiet and in pink noise at +6 dB SNR, showing the effects of noise on the formant peaks]

Effects of filtering on speech
High-pass and low-pass filtering:
Low-pass filtering to remove frequencies above 1800 Hz reduces intelligibility from near perfect to around 67%.
High-pass filtering to remove components below 1800 Hz also produces about 67%.
[Figure: identification accuracy (%) as a function of cutoff frequency for high-pass (HP) and low-pass (LP) filtered speech]
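A minimal way to sketch the high-pass/low-pass manipulation is an ideal (brick-wall) filter that zeros FFT bins on one side of the cutoff; real experiments use filters with finite slopes, so this is only illustrative. With two tones straddling the 1800 Hz cutoff, each filter passes exactly one of them:

```python
import numpy as np

def brickwall(x, fs, cutoff_hz, mode="lowpass"):
    """Ideal (brick-wall) filter: zero the FFT bins on one side of the cutoff."""
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    if mode == "lowpass":
        spec[freqs > cutoff_hz] = 0.0
    else:
        spec[freqs < cutoff_hz] = 0.0
    return np.fft.irfft(spec, len(x))

fs = 16000
t = np.arange(fs) / fs
low_tone = np.sin(2 * np.pi * 500 * t)
high_tone = np.sin(2 * np.pi * 3000 * t)
mix = low_tone + high_tone

lp = brickwall(mix, fs, 1800, "lowpass")    # keeps only the 500 Hz component
hp = brickwall(mix, fs, 1800, "highpass")   # keeps only the 3000 Hz component
```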
Bandpass filtering
Bandpass filtering with one-third octave filters centered at 1500-2100 Hz produces better than 95% accuracy for high-predictability sentences (Warren et al., 1995; Stickney & Assmann, 2001).

Speech communication has an extraordinary resilience to distortion.
1. Intelligibility remains high even when large portions of the spectrum are eliminated by filtering.
Stickney and Assmann (JASA 2001)

Other frequency distortions
Notch filtering to remove frequencies between 800 and 3000 Hz leads to consonant identification scores better than 90% (Lippmann, 1996).
Conclusion: speech cues are widely distributed across frequency.

Perception of filtered speech
Everyday English sentences filtered using narrow bandpass filters remain highly intelligible (>90% words correct): one-third octave bandwidth, 1500 Hz center frequency, 100 dB/octave slopes.
Warren et al. (Percept Psychophys 1995; JASA 2000)

Speech communication has an extraordinary resilience to distortion.
2. Large segments of the waveform can be deleted or replaced by silence.
[Figure: speech waveform interrupted at a rate of 5 Hz over a 1-second interval]
Stickney and Assmann (JASA 2001)
Speech communication has an extraordinary resilience to distortion.
3. Noise can be added to the speech signal at equal intensity (signal-to-noise ratio = 0 dB).
[Figure: spectrograms of the speech, the speech-shaped noise, and the speech + noise mixture]

Speech communication has an extraordinary resilience to distortion.
4. When the noise is from a competing voice, target and masker are similar and must be segregated.
[Figure: spectrogram of a two-voice mixture]

How do listeners achieve this?
Statistical redundancy of speech/language
Combined strategies of top-down + bottom-up processing
Grouping and segregation of auditory objects
Tracking speech properties over time
Glimpsing speech fragments during noise-free intervals

Redundancy in speech and language
Coker and Umeda (1974) define redundancy as: any characteristic of the language that forces spoken messages to have, on average, more basic elements per message, or more cues per basic element, than the barest minimum [necessary for conveying the linguistic message].

Redundancy in error correction
Redundancy can be used effectively; or it can be squandered on uneven repetition of certain data, leaving other crucial items very vulnerable to noise.... But more likely, if a redundancy is a property of a language and has to be learned, then it has a purpose.
Coker and Umeda (1974, p. 349)
Redundancy contributes to speech perception in several ways:
1. by limiting perceptual confusions due to errors in speech production;
2. by helping to bridge gaps in the signal created by interfering noise, reverberation, and distortions of the communication channel; and
3. by compensating for momentary lapses in attention and misperceptions on the part of the listener.

Effects of context
Contextual cues lead to improved speech understanding in noise:
Acoustic-phonetic context
Prosodic context
Semantic context
Syntax

[Figures: percent items correct as a function of SNR, showing the benefit of context. Miller, Heise & Lichten, 1951]

Recognition of interrupted speech in quiet

Interrupted speech
In this condition the speech is turned on and off at regular intervals using an electronic switch.
Miller and Licklider (JASA 1950)
Interrupted speech
[Figure: word identification accuracy (%) as a function of the frequency of interruptions per second. Miller and Licklider, JASA 1950]

Interrupted speech
In quiet, speech can be interrupted (turned on and off) periodically without substantial loss of intelligibility (Miller & Licklider, 1950).
Miller and Licklider found the worst intelligibility for interruption rates < 2 Hz, where large speech fragments (words, phrases) are missing.

Interrupted speech
Miller and Licklider found improved performance for interruption rates between 10 and 100 Hz. Why?
For very high interruption rates (> 1 kHz) the signal sounded continuous, and performance was near perfect.

Masking of speech by interrupted noise
Miller and Licklider also measured speech intelligibility in conditions where the speech was continuous but the noise was interrupted.
[Figure: noise interrupted at rates of 16 Hz, 128 Hz, and 512 Hz]
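The electronic-switch manipulation is simple to state in code: multiply the signal by a periodic on/off gate defined by an interruption rate and a duty cycle. A minimal sketch (parameter names are mine, not Miller and Licklider's):

```python
import numpy as np

def interrupt(x, fs, rate_hz, duty=0.5):
    """Gate a signal on and off periodically: rate_hz interruption cycles per
    second, with the signal 'on' for a fraction `duty` of each cycle."""
    t = np.arange(len(x)) / fs
    gate = ((t * rate_hz) % 1.0) < duty     # True during the 'on' part of each cycle
    return x * gate, gate

fs = 16000
speech = np.sin(2 * np.pi * 300 * np.arange(fs) / fs)   # stand-in for speech

# 10 interruptions per second, 50% duty cycle: half the samples survive.
gated, gate = interrupt(speech, fs, rate_hz=10, duty=0.5)
print(gate.mean())   # → 0.5
```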
Masking of speech by interrupted noise
When the noise is intermittent rather than continuous there is a release from masking.
The benefits of non-stationarity depend on the interruption rate and the duty cycle (on-off ratio) of the noise.

Interrupted noise
At low interruption rates the effects are similar to speech interrupted by silence.
As the interruption rate increases there is a gradual improvement in speech recognition.
With 10 interruptions per second, listeners receive several glimpses of each word and can patch together those glimpses to recognize about 75% of the words correctly.

Interrupted noise
When a noise masker is alternated with silence using a 50% duty cycle, there may be considerable masking release compared to a continuous masker, especially with alternation rates between 1 and 200 per second (Miller and Licklider, 1950).

Summary: interrupted noise
1. At alternation rates between about 1 and 200 per second, listeners can patch together cues from the clean segments between the bursts of noise.
2. With slower interruption rates, entire words or phrases are masked; others are noise-free.
3. At rates > 200/sec the masking effect is the same as uninterrupted, continuous noise.
Picket-fence effect
Interrupted speech can have a harsh, distorted quality. But when speech and noise are alternated periodically, filling the silent gaps with noise, the speech sounds smooth and continuous.
Possibly, noise in the gaps enhances the listener's ability to exploit contextual cues.

Checkerboard noise maskers
Howard-Jones and Rosen (1993)
Effects of interruption rate and frequency bandwidth of the checkerboard pattern.
Can listeners exploit asynchronous time-frequency glimpses? Yes, but only over broad frequency ranges.
[Figure: time-frequency checkerboard masker patterns]

A glimpsing model of speech perception in noise
Martin Cooke, Journal of the Acoustical Society of America, Vol. 119, No. 3, pp. 1562-1573, March 2006

Speech source separation
How do the ear and brain separate the target voice from the noise?
spatial cues
lip-reading
semantic context
auditory scene analysis (Bregman, 1990)
glimpsing and tracking
Auditory scene analysis
Bregman (1990)
The sound that reaches the eardrum of the listener is often a mixture of different sources.
Acoustic signals originating from different sound sources combine additively.
Unlike vision, the concept of occlusion is hard to define in audition: sounds overlap but also combine in complex ways.

Computational auditory scene analysis
Reviewed by Cooke and Ellis (2001)
Human listeners are good at separating mixtures of sounds, as reflected in speech communication and listening to music in complex listening environments (cocktail parties).
Attempts to reproduce this separation process using computational models have had limited success (a hard problem!).

Glimpsing speech in noise
Speech is a highly modulated signal: in time and frequency, regions of high energy are typically sparsely distributed.
The information conveyed by the spectro-temporal energy distribution of clean speech is redundant. Redundancy allows speech to be identified based on relatively sparse evidence.
[Figure: spectrograms of clean speech and a speech + noise mixture]

Glimpsing speech in noise
Can listeners take advantage of glimpses, directing attention to spectro-temporal regions where the S+N mixture is dominated by the target speech?
An ASR system was trained to recognize consonants in noise; maskers differed in glimpse size.
The ASR model was developed to exploit the non-uniform distribution of SNR in different time-frequency bands.
Conclusion: both the model and listeners benefit from glimpsing.

Speech + noise mixtures
Some regions are dominated by the target voice; local SNR varies across time and frequency.
Where the target voice dominates, the problem of source segregation is solved because the signal is effectively clean speech.
Clean speech is highly redundant; it remains intelligible after 50% or more of its energy is removed by gating and/or filtering.
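The notion of a "glimpse" can be made concrete as a threshold on the local SNR of each time-frequency cell. The toy sketch below (my illustration, not Cooke's implementation) computes the proportion of cells where the speech exceeds the masker by a criterion, here 3 dB:

```python
import numpy as np

def glimpse_proportion(speech_mag, noise_mag, threshold_db=3.0):
    """Fraction of time-frequency cells whose local SNR exceeds threshold_db.
    speech_mag / noise_mag: magnitude spectrograms of the same shape."""
    local_snr_db = 20.0 * np.log10(speech_mag / noise_mag)
    return float(np.mean(local_snr_db > threshold_db))

# Toy 2x2 "spectrograms": two cells dominated by speech (local SNR = 20 dB),
# two where speech and noise are equal (local SNR = 0 dB).
speech_mag = np.array([[10.0, 1.0], [1.0, 10.0]])
noise_mag = np.ones((2, 2))
print(glimpse_proportion(speech_mag, noise_mag))   # → 0.5
```

In Cooke's study, the proportion of such above-threshold regions correlated strongly with listeners' intelligibility scores across noise conditions.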
STEP model
Auditory excitation pattern (Moore, 2003):
Spectrogram-like representation
Reflects non-uniform frequency selectivity in different frequency bands
Incorporates a sliding time window reflecting temporal analysis by the auditory system
Relative audibility at different frequencies (loudness model)

Missing data ASR
HMM-based speech recognizer.
Missing-data models:
Glimpses only: ignore missing information (in masked regions).
Glimpses-plus-background: try to fill in missing information (based on masked regions).

Sparseness and redundancy
Glimpses = spectro-temporal regions where the signal exceeds the masker by ~3 dB.
Maskers: single talker, eight-talker babble, speech-shaped noise.
[Figure: syllable identification accuracy as a function of the number of competing voices. The level of the target speech (monosyllabic nonsense words) was held constant at 95 dB. After Miller, 1947]

Results
[Figure: recognition results for listeners and the glimpsing model]
Conclusions
Best model:
Uses information in the glimpses and counter-evidence in the masked regions.
Glimpses constrained to a minimum area.
Treats all regions with local SNR > -5 dB as potential glimpses.

[FIG. 4. The correlation between intelligibility and the proportion of the target speech in which the local SNR exceeds 3 dB. Each point represents a noise condition, and proportions are means across all tokens in the test set. The best linear fit is also shown. The correlation between listeners and these putative glimpses is .955.]

Conclusions
A higher glimpse threshold (e.g. local SNR > 0 dB) produces fewer glimpses, but provides less distorted information than a lower threshold (e.g. -5 dB).

Conclusions
Limitation: the local SNR must be known in advance. Is there a way to estimate the local SNR directly from the mixture?
Tracking problem: how should glimpses be integrated over time?

Brungart et al. (2001)
[Figures: percent correct responses in the two-talker task as a function of target-to-masker ratio (dB), for same-talker, same-sex, different-sex, and modulated-noise maskers]