HCS 7367 Speech Perception
Dr. Peter Assmann
Fall 2012

Long-term spectrum of speech
[Figure: long-term spectrum of connected speech for males and females, with the absolute threshold]

Long-term spectrum of speech
[Figure: long-term spectrum of vowels for males and females, with the absolute threshold]

Audibility of speech
[Figure: sound pressure level (dB) vs. frequency (Hz, 62.5 to 32k) for conversational speech, with the absolute threshold of normal listeners]

Types of hearing loss
Conductive loss
Sensorineural loss
Audibility/distortion
Effect of noise

Pure-tone audiogram
[Figure: hearing loss in dB (ANSI-1996) vs. frequency (250 Hz to 4 kHz), left and right ears; panels: Normal and Conductive loss; bone conduction and air conduction thresholds]
Source: www.brainconnection.com
Source: www.bcm.tmc.edu/oto/studs/aud.html

Pure-tone audiogram
[Figure: hearing loss in dB (ANSI-1996) vs. frequency (250 Hz to 4 kHz), left and right ears; panels: Sensorineural loss and Mixed loss; bone conduction and air conduction thresholds]
Source: www.bcm.tmc.edu/oto/studs/aud.html

Speech audiometry
Nonsense syllables, real words, words in sentences
Threshold for recognizing 50% of test items
Percentage of items correctly reported
Speech tests provide a valid measure of hearing handicap.
Poor speech scores may indicate hearing loss of retrocochlear origin.

Sensorineural hearing loss
Listeners with cochlear hearing loss have difficulty recognizing speech when background noise is present.
Reduced audibility
Supra-threshold distortions
Impaired frequency selectivity
Loudness recruitment

Speech recognition in noise
Speech reception threshold, SRT (Plomp & Mimpen, 1969): the speech-to-noise ratio required to achieve a specific level of intelligibility, typically 50%
Effects of speech materials
Effects of type of masker (e.g., speech-shaped noise vs. a single competing talker)
Effects of spatial separation of target and masker

Speech recognition in noise
Masker type            Listening situation                    Deficit in SRT
Speech-shaped noise    Speech+masker in front, unaided        2.5-7.0 dB
Speech-shaped noise    Speech+masker in front, aided          2.5-6.0 dB
Single talker          Speech+masker in front, unaided        6.0-12.0 dB
Single talker          Speech+masker in front, aided          4.0-10.0 dB
Single talker          Speech+masker spatially separated      12.0-19.0 dB
Source: Moore, BCJ (2003) Speech Communication

Articulation Index
How much does audibility contribute to difficulty understanding speech in noise?
The Articulation Index (AI) estimates the contribution of audibility (and other factors) to speech intelligibility.
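
The SRT definition above (the speech-to-noise ratio giving 50% intelligibility) can be illustrated with a small sketch. The scores below are hypothetical, not data from the lecture or from Moore (2003), and linear interpolation between measured points is an assumption made for simplicity; adaptive procedures are normally used in practice.

```python
def speech_reception_threshold(snrs_db, percent_correct, target=50.0):
    """Interpolate the SNR (dB) at which intelligibility reaches `target` percent.

    Assumes snrs_db is sorted in increasing order and that scores rise with SNR.
    """
    points = list(zip(snrs_db, percent_correct))
    for (snr_lo, pc_lo), (snr_hi, pc_hi) in zip(points, points[1:]):
        if pc_lo <= target <= pc_hi and pc_hi > pc_lo:
            frac = (target - pc_lo) / (pc_hi - pc_lo)
            return snr_lo + frac * (snr_hi - snr_lo)
    raise ValueError("target intelligibility is not bracketed by the data")


# Hypothetical percent-correct scores at each speech-to-noise ratio.
snrs = [-12.0, -9.0, -6.0, -3.0, 0.0, 3.0]
normal_scores = [5.0, 20.0, 45.0, 80.0, 95.0, 99.0]
impaired_scores = [1.0, 5.0, 15.0, 40.0, 70.0, 90.0]

srt_normal = speech_reception_threshold(snrs, normal_scores)
srt_impaired = speech_reception_threshold(snrs, impaired_scores)

# The "deficit in SRT" in the table is the difference between the two thresholds.
srt_deficit_db = srt_impaired - srt_normal
print(f"SRT normal: {srt_normal:.2f} dB, impaired: {srt_impaired:.2f} dB, "
      f"deficit: {srt_deficit_db:.2f} dB")
```

A positive deficit means the impaired listener needs a more favorable speech-to-noise ratio to reach the same 50% score.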

Articulation Index
1. Divides the speech and masker spectrum into a small number of frequency bands.
2. Estimates the audibility of speech in each band, weighted by its relative importance for intelligibility.
3. Derives overall intelligibility by summing the contributions of each band.

Articulation Index
Most studies show that speech intelligibility is worse than predicted by the AI for hearing-impaired listeners, especially for moderate or severe hearing loss.

Articulation Index
Conclusion: factors other than audibility must be responsible for the difficulties experienced by hearing-impaired listeners understanding speech in noise.
What else?
Frequency selectivity
Temporal resolution

Frequency Selectivity
Frequency selectivity is the ability to resolve the spectral components of complex sounds.
Reduced frequency selectivity may lead to difficulty in understanding speech in noise.

Auditory filters
Fletcher (1940) suggested that the peripheral auditory system could be modeled as a bank of linear bandpass filters with continuously overlapping center frequencies.

Auditory filters
Each point along the basilar membrane corresponds to a filter with a different center frequency, with center frequencies increasing roughly logarithmically from the apex to the base.
[Figure: filter gain vs. frequency]
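
The three Articulation Index steps listed earlier on this page can be sketched as follows. This is a minimal illustration, not the procedure used in the lecture: the mapping of band SNR from -15 to +15 dB onto audibility 0 to 1 is an assumption borrowed from the usual AI/SII convention, and the band levels and importance weights are hypothetical.

```python
def band_audibility(speech_db, noise_db, dynamic_range_db=30.0, peak_offset_db=15.0):
    """Step 2: audibility of speech in one band, clipped to the range 0..1.

    Assumption: speech peaks lie 15 dB above the band level and the useful
    dynamic range is 30 dB, so SNR = -15 dB gives 0 and SNR = +15 dB gives 1.
    """
    snr_db = speech_db - noise_db
    audibility = (snr_db + peak_offset_db) / dynamic_range_db
    return min(1.0, max(0.0, audibility))


def articulation_index(bands):
    """Step 3: importance-weighted sum of band audibilities.

    `bands` is a list of (speech_db, noise_db, importance_weight) tuples
    (step 1: the spectrum has already been divided into bands).
    """
    total_weight = sum(weight for _, _, weight in bands)
    weighted = sum(weight * band_audibility(speech, noise)
                   for speech, noise, weight in bands)
    return weighted / total_weight


# Hypothetical three-band example: (speech level, noise level, importance).
example_bands = [(60.0, 45.0, 0.2), (55.0, 55.0, 0.5), (50.0, 65.0, 0.3)]
ai_value = articulation_index(example_bands)
print(f"AI = {ai_value:.2f}")
```

An AI near 1 means nearly all of the speech information is audible; the point made on the slides is that hearing-impaired listeners often perform worse than such an index predicts.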

Auditory filters
About half of the length of the human basilar membrane is devoted to the lowest kHz (F1 range of speech), with the majority of neural fibers responding best to low-to-mid frequencies.

Critical Bandwidth
Fletcher (1940) band-widening experiment
The threshold for detecting a pure tone in the presence of a bandpass noise masker increases as the noise bandwidth increases, until the width of the band exceeds the critical bandwidth of the auditory filter.
[Figure: tone detection threshold vs. noise masker bandwidth]

Critical Bandwidth
Sources of evidence for critical bandwidth:
Band-widening experiments (Fletcher, 1940)
Loudness summation (Zwicker et al., 1957)
Two-tone masking (Zwicker, 1954)
Discrimination of partials within complex tones (Plomp and Mimpen, 1968)
Only the lowest 5-8 partials can be reliably discriminated.

Critical Bandwidth
Fletcher (1940) made the simplifying assumption that the auditory filter could be modeled as a rectangle, with flat top and vertical slopes.
[Figure: rectangular filter of width CB, gain vs. frequency]

Power spectrum model of masking
Fletcher suggested that only a narrow band of frequencies in the region of the tone contribute to masking. He called this the critical bandwidth (CB).
[Figure: auditory filter of width CB, gain vs. frequency]

Power spectrum model of masking
But threshold changes gradually as the noise bandwidth increases, suggesting auditory filters with sloping rather than rectangular skirts (Patterson, 1976).
[Figure: auditory filter with sloping skirts and CB, gain vs. frequency]

Power spectrum model of masking
Detection of a probe tone in the presence of a noise masker depends on the relative power of probe and noise passed by the auditory filter centered on the tone (Patterson, 1976).
[Figure: auditory filter]

Power spectrum model of masking
Noise power is often specified as the power in a band of frequencies 1 Hz wide. This is called noise power density, designated $N$.
The total power in a band of noise is calculated as $W \times N$, where $W$ is the noise bandwidth in Hz.
[Figure: tone and noise masker of bandwidth W, plotted against frequency]

Power spectrum model of masking
When the noise just masks the tone, the ratio of the power in the tone, $P$, to the power in the noise is a constant, $K$:
$P / (W N) = K$, and therefore $W = P / (K N)$

Power spectrum model of masking
Assumptions:
Only frequencies within the passband of the auditory filter contribute to masking.
Detection is based on a single auditory filter, centered on the frequency of the tone.
Listeners ignore short-term fluctuations in the noise, and do not rely on phase differences between signal and noise.

Notched noise method
[Figure: tone in a spectral notch between low-pass (LP) and high-pass (HP) noise bands, with the auditory filter centered on the tone; Patterson (1976)]

Off-frequency listening
Tone detection can be improved by shifting the filter center frequency to maximize SNR.
[Figure: tone, HP noise, and shifted filter]
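
The relation $P / (W N) = K$ from the power spectrum model above can be turned into a small simulation of the band-widening experiment. This sketch assumes Fletcher's rectangular filter; the critical bandwidth of 160 Hz, the noise power density, and $K = 1$ are hypothetical values chosen only for illustration.

```python
import math


def tone_threshold_power(noise_bw_hz, noise_density, critical_bw_hz, k=1.0):
    """Tone power at masked threshold under a rectangular auditory filter.

    Only noise inside the filter passband masks the tone, so the effective
    noise bandwidth W is capped at the critical bandwidth: P = K * W * N.
    """
    effective_bw_hz = min(noise_bw_hz, critical_bw_hz)
    return k * effective_bw_hz * noise_density


def estimate_critical_bandwidth(threshold_power, noise_density, k=1.0):
    """Invert the model for a wide-band masker: W = P / (K * N)."""
    return threshold_power / (k * noise_density)


NOISE_DENSITY = 1e-6      # hypothetical noise power per 1-Hz band (arbitrary units)
CRITICAL_BW_HZ = 160.0    # hypothetical critical bandwidth

for bandwidth_hz in [20, 40, 80, 160, 320, 640]:
    power = tone_threshold_power(bandwidth_hz, NOISE_DENSITY, CRITICAL_BW_HZ)
    print(f"noise bandwidth {bandwidth_hz:4d} Hz -> "
          f"threshold {10 * math.log10(power):6.1f} dB")

wideband_threshold = tone_threshold_power(640, NOISE_DENSITY, CRITICAL_BW_HZ)
estimated_cb_hz = estimate_critical_bandwidth(wideband_threshold, NOISE_DENSITY)
```

Under these assumptions the threshold rises by about 3 dB per doubling of noise bandwidth and then stays flat once the band is wider than the critical bandwidth. The gradual rather than abrupt change seen in real data is what motivates filters with sloping skirts.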

Notched noise method
Patterson (1976) estimated auditory filter shapes from the function relating tone threshold to notch width.
The derived filters have a rounded top and steep skirts, with bandwidths 10-15% of filter center frequency.
[Figure: derived auditory filter shape, relative amplitude (dB) vs. frequency (400 to 1600 Hz)]

Simulation of reduced frequency selectivity
[Figure: derived auditory filter shapes, relative amplitude (dB) vs. frequency (400 to 1600 Hz); panels: Normal and Impaired (3 x Normal)]

Auditory filter shapes as a function of frequency
[Figure: frequency response of gammatone filter bank, filter gain (dB) vs. frequency; Fc = 1094 Hz, ERB = 143 Hz]

Auditory filter shapes as a function of level
[Figure: output level (dB) vs. frequency (Hz)]

Equivalent Rectangular Bandwidth
The equivalent rectangular bandwidth (or ERB) of a filter is the bandwidth of a rectangular filter which has the same power output as that filter, when the input is white noise.
[Figure: auditory filter with its ERB, relative amplitude (dB) vs. frequency (400 to 1600 Hz)]

Equivalent Rectangular Bandwidth
The ERB (equivalent rectangular bandwidth) of the estimated auditory filter is about 10-15% of the filter center frequency.
[Figure: ERB (Hz) vs. center frequency (Hz)]
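
The ERB definition above can be checked numerically. The sketch below is an illustration under two assumptions that are not stated on the slides: the filter is a symmetric rounded-exponential, roex(p), shape, and the ERB of a normal filter follows the Glasberg and Moore (1990) approximation ERB = 24.7(4.37F + 1), with F in kHz.

```python
import math


def erb_glasberg_moore(fc_hz):
    """Assumed ERB (Hz) of the normal auditory filter centered at fc_hz."""
    return 24.7 * (4.37 * fc_hz / 1000.0 + 1.0)


def roex_weight(f_hz, fc_hz, p):
    """Power weighting of a symmetric roex(p) filter (rounded top, steep skirts)."""
    g = abs(f_hz - fc_hz) / fc_hz          # normalized deviation from center
    return (1.0 + p * g) * math.exp(-p * g)


def numerical_erb(fc_hz, p, step_hz=0.5):
    """Width of the rectangle with the same peak gain and the same area.

    For white noise input, equal area under the power weighting function
    means equal output power, which is the ERB definition.
    """
    n_steps = int(2.0 * fc_hz / step_hz)
    weights = [roex_weight(i * step_hz, fc_hz, p) for i in range(n_steps + 1)]
    area = sum(0.5 * (a + b) * step_hz for a, b in zip(weights, weights[1:]))
    return area / max(weights)


fc = 1000.0
target_erb = erb_glasberg_moore(fc)
p = 4.0 * fc / target_erb                  # for roex(p), ERB = 4 * fc / p
measured_erb = numerical_erb(fc, p)
print(f"fc = {fc:.0f} Hz: formula ERB = {target_erb:.1f} Hz, "
      f"numerical ERB = {measured_erb:.1f} Hz, "
      f"ratio to fc = {100 * measured_erb / fc:.1f}%")
```

At 1 kHz the ratio comes out a little over 13% of the center frequency, inside the 10-15% range quoted on the slide.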

Cochlear frequency-place map
Greenwood (1961) developed a function to relate the characteristic frequency (CF) at each place on the cochlea to the distance (x) of that place from the apex.
[Figure: frequency-place function, human data]

ERB Scale
One ERB unit corresponds to a distance of about 0.89 mm along the basilar membrane.

ERB-rate scale
The ERB-rate scale is a warped frequency scale modeling changes in the ERB of the auditory filter as a function of frequency.
[Figure: ERB-rate as a function of frequency; ERB-rate (ERB) vs. frequency (Hz)]

Excitation patterns
Auditory excitation patterns show the composite output of a bank of simulated auditory filters as a function of filter center frequency.
[Figure: filter output vs. filter center frequency]

Excitation patterns
Excitation patterns provide a good model of auditory frequency selectivity and masking: frequency components that are resolved by the auditory system produce distinct peaks in the excitation pattern.

Excitation patterns
[Figure: block diagram: outer and middle ears, cochlear filtering, a bank of energy detectors ordered by frequency (ERB-rate), CNS]
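
The frequency-place map and the ERB-rate scale described above can be sketched numerically. The constants are assumptions not given on the slides: Greenwood's commonly quoted human parameters (A = 165.4, a = 2.1 over the full cochlear length, k = 0.88), a cochlear length of 35 mm, and the Glasberg and Moore (1990) ERB-rate formula.

```python
import math

COCHLEA_LENGTH_MM = 35.0   # assumed length of the human basilar membrane


def greenwood_frequency(x_mm):
    """Characteristic frequency (Hz) at distance x_mm from the apex."""
    return 165.4 * (10.0 ** (2.1 * x_mm / COCHLEA_LENGTH_MM) - 0.88)


def greenwood_place(f_hz):
    """Inverse map: distance from the apex (mm) for a characteristic frequency."""
    return COCHLEA_LENGTH_MM / 2.1 * math.log10(f_hz / 165.4 + 0.88)


def erb_rate(f_hz):
    """ERB-rate (number of ERBs below f_hz), assumed Glasberg and Moore form."""
    return 21.4 * math.log10(4.37 * f_hz / 1000.0 + 1.0)


def erb_rate_to_hz(erb_units):
    """Inverse of erb_rate."""
    return (10.0 ** (erb_units / 21.4) - 1.0) * 1000.0 / 4.37


def mm_per_erb(f_hz):
    """Distance along the basilar membrane spanned by one ERB unit around f_hz."""
    center = erb_rate(f_hz)
    low_hz = erb_rate_to_hz(center - 0.5)
    high_hz = erb_rate_to_hz(center + 0.5)
    return greenwood_place(high_hz) - greenwood_place(low_hz)


for freq_hz in [250.0, 1000.0, 4000.0]:
    print(f"{freq_hz:6.0f} Hz: place = {greenwood_place(freq_hz):5.2f} mm, "
          f"ERB-rate = {erb_rate(freq_hz):5.2f}, "
          f"one ERB spans {mm_per_erb(freq_hz):.2f} mm")
```

With these constants one ERB unit spans a little under 1 mm at mid frequencies, the same order as the figure of about 0.89 mm quoted above; the exact value depends on the assumed constants.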

Excitation patterns
[Figure: excitation pattern of a 500 Hz pure tone, amplitude (dB) vs. frequency (0.2 to 4.0 kHz)]

Excitation patterns
[Figure: excitation pattern of a complex tone with equal-amplitude harmonics, amplitude (dB) vs. frequency (0.2 to 4.0 kHz)]

Excitation patterns
[Figure: excitation pattern of the vowel /æ/, amplitude (dB) vs. frequency (0.2 to 4.0 kHz); F0 = 200 Hz, with harmonics at 200, 400, 600, and 800 Hz marked; F2 = 1450 Hz, F3 = 2450 Hz]

Auditory filterbank spectrogram
[Figure: frequency vs. time (ms)]

Simulation studies
Simulation of reduced frequency selectivity (spectral smearing of the short-term speech spectrum) results in lowered intelligibility for listeners with normal hearing, particularly in noise (ter Keurs et al., 1993; Baer & Moore, 1994).

Simulation of reduced frequency selectivity
[Figure: derived auditory filter shapes, relative amplitude (dB) vs. frequency (400 to 1600 Hz); panels: Normal and Impaired (3 x Normal)]

Effects of reduced frequency selectivity on vowel /æ/
[Figure: excitation patterns for filters 1 x, 2 x, and 3 x normal bandwidth, amplitude (dB) vs. frequency (0.2 to 4.0 kHz); F0 = 200 Hz, with harmonics at 200, 400, 600, and 800 Hz marked; F2 = 1450 Hz, F3 = 2450 Hz]

Distortion of spectral shape
Broader auditory filters produce a smeared excitation pattern: reduced prominence of peaks, smaller peak-to-valley ratios.
Introduction of noise fills up the valleys between the spectral peaks and reduces the distinctiveness of the spectral profile.

Distortion of temporal structure
Broader auditory filters alter the temporal fine structure of the output.
Increased contribution of adjacent components
Increase in within-channel modulation
Diminished differences between adjacent channels

Effects of reduced frequency selectivity on temporal structure
[Figure: filter outputs over time for filter center frequencies from 301 to 4515 Hz; panels: Normal x 1 and Normal x 3]

Loudness Recruitment
When a sound is increased in level above absolute threshold, the rate of growth of loudness is greater than normal.
At levels above 90-100 dB SPL, loudness returns to normal (the sound appears equally loud to hearing-impaired and normal listeners).

Loudness Recruitment
Loudness recruitment is associated with reduced dynamic range (the range between absolute threshold and highest comfortable level).
Recruitment may reduce the ability to listen in the dips in a fluctuating masker, such as a competing voice.
Recruitment distorts loudness relationships among components of speech sounds.
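
The smearing of the excitation pattern described under "Distortion of spectral shape" can be illustrated with a short sketch. It assumes symmetric rounded-exponential (roex) filters whose normal ERB follows the Glasberg and Moore (1990) formula, and an input of equal-amplitude harmonics of 200 Hz; the slides do not specify which filter model was used for their figures, so this is an illustration rather than a reproduction.

```python
import math


def erb_hz(fc_hz):
    """Assumed ERB (Hz) of a normal auditory filter centered at fc_hz."""
    return 24.7 * (4.37 * fc_hz / 1000.0 + 1.0)


def roex_weight(f_hz, fc_hz, broadening=1.0):
    """Power weighting of a roex filter whose ERB is `broadening` times normal."""
    p = 4.0 * fc_hz / (broadening * erb_hz(fc_hz))
    g = abs(f_hz - fc_hz) / fc_hz
    return (1.0 + p * g) * math.exp(-p * g)


def excitation_db(fc_hz, freqs_hz, powers, broadening=1.0):
    """Output level (dB) of the filter centered at fc_hz for a set of components."""
    total = sum(power * roex_weight(f, fc_hz, broadening)
                for f, power in zip(freqs_hz, powers))
    return 10.0 * math.log10(total)


# Equal-amplitude harmonics of a 200 Hz fundamental (hypothetical stimulus).
harmonics_hz = [200.0 * k for k in range(1, 21)]
harmonic_powers = [1.0] * len(harmonics_hz)


def peak_to_valley_db(broadening):
    """Level difference between a filter on a harmonic and one between harmonics."""
    peak = excitation_db(400.0, harmonics_hz, harmonic_powers, broadening)
    valley = excitation_db(500.0, harmonics_hz, harmonic_powers, broadening)
    return peak - valley


for factor in [1.0, 2.0, 3.0]:
    print(f"filters {factor:.0f} x normal: "
          f"peak-to-valley = {peak_to_valley_db(factor):.1f} dB")
```

Under these assumptions the peak-to-valley difference around the second harmonic shrinks sharply as the filters broaden, which is the loss of spectral contrast that the slides describe.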

Glimpsing speech in noise
A glimpsing model of speech perception in noise
Martin Cooke
Journal of the Acoustical Society of America, Vol. 119, No. 3, pp. 1562-1573, March 2006

Glimpsing speech in noise
Speech is a highly modulated signal in time and frequency; regions of high energy are typically sparsely distributed.
[Figure: spectrogram, frequency vs. time (ms)]

Glimpsing speech in noise
The information conveyed by the spectrotemporal energy spectrum of clean speech is redundant.
Redundancy allows speech to be identified based on relatively sparse evidence.
[Figure: spectrogram, frequency vs. time (ms)]

Glimpsing speech in noise
Can listeners take advantage of glimpses?
Direct attention to spectrotemporal regions where the S+N mixture is dominated by the target speech
ASR system trained to recognize consonants in noise
Maskers differed in glimpse size
ASR model developed to exploit the non-uniform distribution of SNR in different time-frequency bands
Conclusion: model and listeners both benefit from glimpsing.

Speech + noise mixtures
Some regions are dominated by the target voice.
Local SNR varies across time and frequency.
Where the target voice dominates, the problem of source segregation is solved because the signal is effectively clean speech.
Clean speech is highly redundant; it remains intelligible after 50% or more of its energy is removed by gating and/or filtering.

STEP model
Auditory excitation pattern (Moore, 2003)
Spectrogram-like representation
Reflects non-uniform frequency selectivity in different frequency bands
Incorporates a sliding time window reflecting temporal analysis by the auditory system
Relative audibility at different frequencies
Loudness model

Missing data ASR
HMM-based speech recognizer
Missing-data models:
Glimpses only: ignore missing information (in masked regions)
Glimpses-plus-background: try to fill in missing information (based on masked regions)

Sparseness and redundancy
Glimpses = spectrotemporal regions where the signal exceeds the masker by ~3 dB.
[Figure: target; single-talker masker, eight-talker masker, and speech-shaped noise; glimpses]

Results
[Figure: Syllable identification accuracy as a function of the number of competing voices. The level of the target speech (monosyllabic nonsense words) was held constant at 95 dB. (After Miller 1947).]

Results
FIG. 4. The correlation between intelligibility and proportion of the target speech in which the local SNR exceeds 3 dB. Each point represents a noise condition, and proportions are means across all tokens in the test set. The best linear fit is also shown. The correlation between listeners and these putative glimpses is 0.955.
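
The glimpse measure used in the results above (the proportion of the target in which the local SNR exceeds 3 dB) can be sketched as below. This is a simplified illustration, not Cooke's implementation: it assumes the target and masker levels are already available separately on a time-frequency grid in dB, and the numbers are hypothetical.

```python
def glimpse_proportion(target_db, masker_db, threshold_db=3.0):
    """Fraction of time-frequency cells where local SNR exceeds threshold_db.

    target_db and masker_db are lists of rows (frequency channels), each row
    a list of levels in dB over time frames; both grids have the same shape.
    """
    glimpsed = 0
    total = 0
    for target_row, masker_row in zip(target_db, masker_db):
        for target_level, masker_level in zip(target_row, masker_row):
            local_snr_db = target_level - masker_level
            if local_snr_db > threshold_db:
                glimpsed += 1
            total += 1
    return glimpsed / total


# Hypothetical levels: 2 frequency channels x 3 time frames.
target_levels = [[60.0, 40.0, 55.0],
                 [30.0, 62.0, 50.0]]
masker_levels = [[50.0, 50.0, 50.0],
                 [50.0, 50.0, 50.0]]

strict = glimpse_proportion(target_levels, masker_levels, threshold_db=3.0)
lenient = glimpse_proportion(target_levels, masker_levels, threshold_db=-5.0)
print(f"glimpse proportion at 3 dB: {strict:.2f}, at -5 dB: {lenient:.2f}")
```

Lowering the threshold admits more cells as glimpses, but those extra cells carry more masker energy. Note that the sketch needs the target and masker separately, so the local SNR has to be known in advance.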

Conclusions
Best model:
Uses information in glimpses and counterevidence in the masked regions
Glimpses constrained to a minimum area
Treats all regions with local SNR > -5 dB as potential glimpses

Conclusions
A higher glimpse threshold (e.g. local SNR > 0 dB) produces fewer glimpses, but this provides less distorted information than a lower threshold (e.g. -5 dB).

Conclusions
Limitation: local SNR must be known in advance.
Is there a way to estimate the local SNR directly from the mixture?
Tracking problem: how to integrate glimpses over time?

Brungart et al. (2001)
[Figure: 2-talker correct responses (%) vs. target-to-masker ratio (dB); masker conditions include same talker, different talker (same or different sex), and modulated noise]

Brungart et al. (2001)
[Figure: 2-talker correct responses (%) vs. target-to-masker ratio (dB)]