Auditory-Visual Speech Perception Laboratory
1 Auditory-Visual Speech Perception Laboratory
Research Focus:
- Identify perceptual processes involved in auditory-visual speech perception
- Determine the abilities of individual patients to carry out these processes successfully
- Design intervention strategies using signal processing technologies and training techniques to remedy any deficiencies that may be found
Primary Funding:
- NIH Grant: DC Al
- NSF Grant (subcontract): SBR, Learning and Intelligent Systems Initiative of the National Science Foundation
- DARPA Grant (subcontract): ONR Award N
Staffing:
- Lab Director: Ken W. Grant
- Research Associate: Mary T. Cord
2 Auditory-Visual Speech Perception Laboratory
Collaborations:
- Steven Greenberg - The Speech Institute, Oakland, CA
- David Poeppel - University of Maryland, College Park, MD
- Virginie van Wassenhove - University of Maryland, College Park, MD
3 Topics
- Auditory Supplements to Speechreading
- Bimodal Comodulation
- Spectro-Temporal Window of Integration
4 Auditory Supplements to Speechreading
5 Selective Needs of Severe-to-Profound Hearing Loss
- Speech input vs. hearing capacity: compressed dynamic range, compressed frequency range
- Speechreading/lipreading as primary channel of spoken language reception
6 Speech Recognition: Sentences
[Figure: percent correct sentence recognition as a function of S/N (dB) for auditory-visual and auditory-alone presentation]
Roughly 6 dB improvement in S/N; roughly 30% improvement in intelligibility for NH subjects
7 Speech Recognition: Consonants
[Figure: percent correct recognition as a function of speech-to-noise ratio (dB) for NH auditory consonants, HI auditory consonants, HI auditory sentences, and ASR sentences]
8 Auditory-Visual vs. Auditory Speech Recognition
[Figure: percent correct recognition as a function of speech-to-noise ratio (dB) for NH and HI audiovisual consonants, NH and HI auditory consonants, HI auditory sentences, and ASR sentences]
Roughly 6 dB improvement in S/N; roughly 30% improvement in intelligibility for NH subjects.
9-11 Possible Roles of Speechreading
- Provide segmental information that is redundant with acoustic information
- Provide segmental information that is complementary with acoustic information
- Direct auditory analyses to the target signal: who, where, when, what (spectral)
12 Auditory-Visual Speech Recognition: Consonants
- What information is available through speechreading?
- Which acoustic signals supplement speechreading?
- Are there significant audio-visual interactions in speech processing?
13 Visual Feature Recognition
[Figure: percent information transmitted (re: total information received) for voicing, manner, place, and other features; place dominates, with voicing, manner, and other features each transmitting only about 0-4%]
From Grant, K.W., and Walden, B.E. (1996). J. Acoust. Soc. Am. 100.
14 Speech Recognition: Consonants
Linguistic feature contributions to visual speech recognition. The top row represents typical feature classifications for speechreading alone (visemes). Each subsequent row represents the effects of adding information about another linguistic feature via an additional input channel (in this case auditory). Note that as additional features are added, consonant confusions associated with speechreading are resolved to a greater and greater extent.
Speechreading: {p,b,m} {t,d,n} {g,k} {f,v} {T,D} {s,z} {S,tS,dZ,Z} {l} {r} {w} {j}
+ Voicing: {p} {b,m} {t} {d,n} {g} {k} {f} {v} {T} {D} {s} {z} {S,tS} {dZ,Z} {l} {r} {w} {j}
+ Nasality: {p} {b} {m} {t} {d} {n} {g} {k} {f} {v} {T} {D} {s} {z} {S,tS} {dZ,Z} {l} {r} {w} {j}
+ Affrication: {p} {b} {m} {t} {d} {n} {g} {k} {f} {v} {T} {D} {s} {z} {S} {tS} {dZ} {Z} {l} {r} {w} {j}
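The logic of the table above can be sketched in a few lines of code: each added feature partitions the consonant set, and intersecting the visual (viseme) classes with the feature classes yields the refined confusion structure. This is an illustrative sketch, not the lab's analysis code; the symbol sets follow the table (T/D for the dental fricatives, S/Z/tS/dZ for the palato-alveolars).

```python
def refine(partition, feature_classes):
    """Split each perceptual class by a newly transmitted feature."""
    refined = []
    for group in partition:
        for fc in feature_classes:
            cell = [c for c in group if c in fc]
            if cell:
                refined.append(cell)
    return refined

# Viseme groups for speechreading alone (top row of the table):
visemes = [["p", "b", "m"], ["t", "d", "n"], ["g", "k"], ["f", "v"],
           ["T", "D"], ["s", "z"], ["S", "tS", "dZ", "Z"],
           ["l"], ["r"], ["w"], ["j"]]

voiceless = {"p", "t", "k", "f", "T", "s", "S", "tS"}
voiced = {"b", "m", "d", "n", "g", "v", "D", "z", "dZ", "Z", "l", "r", "w", "j"}

after_voicing = refine(visemes, [voiceless, voiced])
print(len(visemes), "visual classes ->", len(after_voicing), "classes with voicing added")
```

Running this reproduces the second row of the table: 11 visual classes become 18 once voicing is transmitted.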
15 Hypothetical Consonant Recognition - Perfect Feature Transmission
[Figure: probability correct for auditory (A) and predicted auditory-visual (PRE) consonant recognition when voicing, manner, place, or voicing + manner is transmitted]
Auditory consonant recognition based on perfect transmission of the indicated feature; responses within each feature category were uniformly distributed. Predicted AV consonant recognition based on the PRE model of integration (Braida, 1991).
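The auditory baseline described above has a simple closed form: if the indicated feature is transmitted perfectly and responses are uniformly distributed within each resulting feature class, expected proportion correct is the average over consonants of 1/(class size), which reduces to (number of classes)/(number of consonants). A minimal sketch (the two-class voicing partition below is illustrative; the full PRE-model AV predictions of Braida, 1991 are more involved):

```python
def p_correct(partition):
    """Expected proportion correct when the named feature is transmitted
    perfectly and the listener guesses uniformly within each feature class."""
    n = sum(len(g) for g in partition)
    # each consonant is identified with probability 1 / (size of its class)
    return sum(len(g) * (1 / len(g)) for g in partition) / n

# Perfect voicing transmission partitions the 22 consonants into two classes:
voicing = [["p", "t", "k", "f", "T", "s", "S", "tS"],
           ["b", "m", "d", "n", "g", "v", "D", "z", "dZ", "Z", "l", "r", "w", "j"]]
print(p_correct(voicing))  # 2 classes / 22 consonants, i.e. about 0.09
```

Perfect transmission of finer-grained features yields more classes and hence higher predicted scores, which is the pattern plotted in the slide.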
16 Designer Acoustic Signals: Minimal Bandwidth, Maximum Benefit
Sample tracking passage: Frank and Joe Hardy scanned the wide valley that appeared before them as their yellow sports sedan rounded the crest of a hill. In the distance, a huge cylindrical tower rose from the valley floor. "Looks like a giant barnacle," Joe remarked to his older, dark-haired brother Frank. Biff Hooper, a tall, muscular high school friend of the two amateur detectives, leaned forward from the back seat. "You're looking at the cooling tower. The reactor itself is in the building next to it." Biff's uncle, Jerry Hooper, was a nuclear engineer at the Bayridge Nuclear Power Plant located outside Bayport. He had invited the three boys on a private afternoon tour of the facility. The summer had just begun, and the Hardy brothers were eager for new adventures.
Conditions: WB, LPF, AM, AMFM
17 Designer Acoustic Signals: Minimal Bandwidth, Maximum Benefit
[Figure: tracking rate (WPM and % re: normal) for receiving conditions SA, AM, FM, AMFM, and LPF]
18 Designer Acoustic Signals: AM Bands
[Figure: percent correct for CUNY sentences under SA and WB conditions and under AM-band prefilter conditions at 500 Hz, 1600 Hz, and 3150 Hz]
19 AM Bands - Smoothing-Filter Effects
[Figure: percent correct for CUNY and IEEE sentences as a function of smoothing-filter cutoff (Hz) for each prefilter-carrier condition, with an SA baseline]
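The AM "designer signal" described in the preceding slides can be sketched as: band-limit the speech (the prefilter), rectify and low-pass the result to extract its amplitude envelope (the smoothing filter), then impose that envelope on a tonal carrier. The sketch below is illustrative only; the filter orders, cutoffs, and carrier frequency are assumptions for demonstration, not the study's values, and random noise stands in for a speech waveform.

```python
import numpy as np
from scipy.signal import butter, sosfilt, sosfiltfilt

fs = 16000
t = np.arange(0, 1.0, 1 / fs)
speech = np.random.randn(t.size)  # stand-in for a recorded speech waveform

# 1) Prefilter: band-limit the input, e.g. an octave band around 500 Hz.
band = sosfilt(butter(4, [350, 700], btype="bandpass", fs=fs, output="sos"),
               speech)

# 2) Rectify and smooth to get the amplitude envelope; the smoothing-filter
#    cutoff controls how much temporal detail of the envelope survives.
env = sosfiltfilt(butter(2, 50, btype="lowpass", fs=fs, output="sos"),
                  np.abs(band))
env = np.clip(env, 0, None)

# 3) Impose the envelope on a pure-tone carrier (the AM condition).
am_signal = env * np.sin(2 * np.pi * 500 * t)
```

Varying the lowpass cutoff in step 2 is the manipulation plotted in the smoothing-filter slide above.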
20 Problems With Articulation Theory
[Figure: effective AI with visual cues vs. calculated AI (ANSI)]
21 Auditory-Visual Spectral Interactions: Consonants
[Figure: percent correct for auditory (A) and auditory-visual (AV) presentation across filter conditions (Hz)]
22 Feature Distribution re: Center Frequency
[Figure: percent information transmitted (re: total received) for voicing + manner and for place, as a function of band center frequency (Hz)]
From Grant, K.W., and Walden, B.E. (1996). J. Acoust. Soc. Am. 100.
23 How It All Works - The Prevailing View
- Evaluation: information extracted from both sources independently
- Integration: integration of the extracted information
- Decision: decision statistic
From Massaro, 1998
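This evaluate-then-integrate view is often formalized as Massaro's fuzzy logical model of perception (FLMP): each modality independently yields a degree of support for a response alternative, and the supports are combined multiplicatively at the integration stage. A minimal sketch:

```python
def flmp(a, v):
    """FLMP combination: P(alternative) given independent auditory support a
    and visual support v, each a truth value in (0, 1)."""
    return (a * v) / (a * v + (1 - a) * (1 - v))

# An ambiguous auditory cue (support 0.5) is disambiguated by a strong
# visual cue (support 0.9): the combined support is close to 0.9.
print(flmp(0.5, 0.9))
```

Note the multiplicative rule makes the two channels super-additive when both weakly favor the same alternative, one way such "late integration" models capture audiovisual benefit.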
24 A More Expanded View of the Process
25-29 Auditory Supplements to Speechreading - SUMMARY
- Speechreading provides information mostly about place-of-articulation
- Auditory-visual speech recognition is determined primarily by complementary cues between the visual and auditory modalities
- The most intelligible auditory speech signals do not necessarily result in the most intelligible auditory-visual speech signal
- Acoustic cues for voicing and manner-of-articulation are the best supplements to speechreading
- These cues tend to be low frequency
30 Bimodal Coherence: Audio and Visual Comodulation
31 The Independence of Sensory Systems???
- Information is extracted independently from A and V modalities
- Early versus late integration: most models are "late integration" models
32 Back to Our Conceptual Framework
33-34 The Independence of Sensory Systems???
- Information is extracted independently from A and V modalities
- Early versus late integration: most models are "late integration" models
BUT
- Speechreading activates primary auditory cortex (cf. Sams et al., 1991)
- A population of neurons in cat superior colliculus responds only to bimodal input (cf. Stein and Meredith, 1993)
35 Sensory Integration: Many Questions
- How is the auditory system modulated by visual speech activity?
- What is the temporal window governing this interaction?
36 Bimodal Coherence Masking Protection (BCMP) (Grant and Seitz, 2000; Grant, 2001)
Detection of speech in noise is improved by watching a talker (i.e., speechreading) as the target speech is produced, provided that the visible movement of the lips and the acoustic amplitude envelope are highly correlated.
37 Basic Paradigm for BCMP: Exp. 1
[Diagram: auditory-only speech detection (speech embedded in noise intervals) vs. auditory-visual speech detection (the same, with the talker's face visible)]
38 Methodology for Orthographic BCMP: Exp. 2
[Diagram: auditory-only speech detection vs. auditory + orthographic speech detection (the target sentence displayed as text)]
39 Methodology for Filtered BCMP: Exp. 3
[Diagram: auditory-only vs. auditory-visual detection of filtered speech, with F1 ( Hz) and F2 ( Hz) band conditions]
40 Congruent versus Incongruent Speech
[Figure: masked threshold (dB) for A, AV M, and AV UM conditions for Sentences 1-3 and their average]
41 BCMP for Congruent and Incongruent Speech
[Figure: bimodal coherence protection (dB) for AV M and AV UM conditions for Sentences 1-3 and their average]
42 BCMP for Orthographic Speech
[Figure: bimodal coherence protection (dB) for Sentences 1-3 and their average]
43 BCMP for Filtered Speech
[Figure: masking protection (dB) for AV F1, AV F2, AV WB, and AV O conditions for Sentences 2 and 3 and their average]
44 Average BCMP
[Figure: bimodal masking release (dB) by condition: AV WB, AV F2, AV F1, AV O]
Bimodal comodulation masking protection for wideband speech (WB), filtered speech (F2 and F1), and orthographically cued speech (O).
45 Acoustic Envelope and Lip Area Functions
[Figure: RMS amplitude (dB) of the WB, F1, F2, and F3 bands and lip area (in²) over time (ms) for the sentence "Watch the log float in the wide river"]
46 Cross-Modality Correlation - Lip Area versus Amplitude Envelope
[Figure: scatterplots of RMS envelope vs. lip area for Sentences 2 and 3 in the WB, F1, F2, and F3 bands; correlations range from r = 0.32 to r = 0.65]
47 Local Correlations - Lip Area versus F2-Amplitude Envelope
[Figure: short-term correlation coefficient and RMS amplitude (dB) over time (ms) for "Watch the log float in the wide river"]
48 Local Correlations - Lip Area versus F2-Amplitude Envelope
[Figure: short-term correlation coefficient and RMS amplitude (dB) over time (ms) for "Both brothers wear the same size"]
49 Cross-Modality Correlations (Lip Area versus Acoustic Envelope)
[Figure: correlation coefficient and peak amplitude (dB) over time (ms) for Sentences 2 and 3 in the F1 and F2 bands]
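The correlation analyses above amount to Pearson correlations between a lip-area trajectory and an acoustic amplitude envelope on a common time base, computed over whole sentences (slide 46) and in short local windows (slides 47-49). A minimal sketch with synthetic stand-in signals (the 4 Hz "syllabic" rate, sampling rate, and window length are illustrative assumptions):

```python
import numpy as np

def sliding_r(x, y, win):
    """Pearson correlation of x and y in consecutive windows of `win` samples."""
    rs = []
    for i in range(0, len(x) - win + 1, win):
        rs.append(np.corrcoef(x[i:i + win], y[i:i + win])[0, 1])
    return np.array(rs)

np.random.seed(0)
fs = 100                                    # 100 Hz lip-area / envelope sampling
t = np.arange(0, 3, 1 / fs)
lip_area = 1 + np.sin(2 * np.pi * 4 * t)    # ~4 Hz mouth opening/closing
envelope = 1 + np.sin(2 * np.pi * 4 * t - 0.3) + 0.1 * np.random.randn(t.size)

global_r = np.corrcoef(lip_area, envelope)[0, 1]  # whole-sentence coherence
local_r = sliding_r(lip_area, envelope, win=50)   # 500-ms local windows
```

In the real data the local correlations fluctuate over the sentence, which is why band filtering (F1 vs. F2) changes the overall audio-visual coherence and, in turn, the BCMP.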
50-53 BCMP Summary
- Speechreading reduces auditory speech detection thresholds by about 1.5 dB (range: 1-3 dB, depending on sentence)
- The amount of BCMP depends on the degree of coherence between the acoustic envelope and facial kinematics
- Providing listeners with explicit (orthographic) knowledge of the identity of the target sentence reduces speech detection thresholds by about 0.5 dB, independent of the specific target sentence
- Manipulating the degree of coherence between the area of mouth opening and the acoustic envelope by filtering the target speech has a direct effect on BCMP
54 Temporal Window for Auditory-Visual Integration
55 AUDIO-ALONE EXPERIMENTS
56 Audio (Alone) Spectral Slit Paradigm
- Can listeners decode spoken sentences using just four narrow (1/3-octave) channels ("slits") distributed across the spectrum? YES (cf. next slide)
- The edge of each slit was separated from its nearest neighbor by an octave
- What is the intelligibility of each slit alone and in combination with others?
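The slit geometry can be sketched as four 1/3-octave bandpass filters whose edges are an octave apart. The sketch below is illustrative: the center frequencies are assumptions chosen to satisfy that geometry (each band's lower edge is about an octave above its lower neighbor's upper edge), not the study's exact values, and random noise stands in for speech.

```python
import numpy as np
from scipy.signal import butter, sosfilt

fs = 16000

def third_octave_edges(cf):
    """Lower/upper edges of a 1/3-octave band centred (geometrically) on cf."""
    return cf / 2 ** (1 / 6), cf * 2 ** (1 / 6)

def slit(x, cf):
    """Extract one narrow spectral 'slit' from signal x."""
    lo, hi = third_octave_edges(cf)
    return sosfilt(butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos"), x)

speech = np.random.randn(fs)              # 1 s stand-in waveform
centers = [330, 850, 2135, 5400]          # illustrative CFs, octave-spaced edges
slits = [slit(speech, cf) for cf in centers]
four_slit = sum(slits)                    # all four slits presented together
```

Presenting single slits, pairs, or all four, as in the next slide, is then just a matter of which filtered channels are summed.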
57 Word Intelligibility - Single and Multiple Slits
[Figure: word intelligibility by slit number and center frequency (Hz); single slits yield only 2-9% correct, while slit combinations yield 13-89%]
58 Slit Asynchrony Affects Intelligibility
- Desynchronizing the slits by more than 25 ms results in a significant decline in intelligibility
- The effect of asynchrony on intelligibility is relatively symmetrical
- These data are from a different set of subjects than those participating in the study described earlier, hence the slightly different numbers for the baseline conditions
59 Cross-Spectral Temporal Asynchrony Effects
[Figure: TIMIT word recognition (%) as a function of slit asynchrony (ms)]
From Greenberg, Arai, and Silipo (1998). Proc. ICSLP, Sydney, Dec. 1-4.
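Desynchronizing a slit, as in the experiments above, amounts to delaying that channel's waveform relative to the others before summing. A minimal sketch (the sampling rate and the `delay_ms` helper are illustrative, not the study's implementation):

```python
import numpy as np

def delay_ms(x, ms, fs):
    """Delay x by `ms` milliseconds, zero-padding the onset (length preserved)."""
    n = int(round(ms * fs / 1000))
    if n == 0:
        return x.copy()
    return np.concatenate([np.zeros(n), x[:-n]])

fs = 16000
band = np.random.randn(fs)              # stand-in for one filtered slit
shifted = delay_ms(band, 25, fs)        # 25-ms asynchrony = 400 samples at 16 kHz
```

Applying opposite-sign delays to different slits produces the cross-spectral asynchronies whose intelligibility costs are plotted above.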
60 AUDIO-VISUAL EXPERIMENTS
61 Auditory-Visual Tasks
IEEE Sentences:
- Recognition of key words
- Audio slits; video presented at various temporal asynchronies
CV Syllables:
- Recognition of McGurk pairs: audio /pa/, /ba/, /ta/, /da/; video /ka/, /ga/, /ta/, /da/
- Synchrony identification and discrimination: yes/no single-interval simultaneity judgments; 2IFC adaptive tracking
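The 2IFC adaptive tracking listed above is typically run as a 2-down/1-up staircase (Levitt, 1971), which converges on the 70.7%-correct point of the psychometric function. The sketch below shows the generic procedure; the starting level, step size, stopping rule, and the simulated observer are all illustrative assumptions, not the study's parameters.

```python
import math
import random

def staircase(respond, start=200.0, step=20.0, n_reversals=8):
    """2-down/1-up adaptive track; `respond(level)` returns True if correct."""
    level, correct_run, last_dir = start, 0, 0
    reversals = []
    while len(reversals) < n_reversals:
        if respond(level):
            correct_run += 1
            if correct_run == 2:            # two in a row correct -> harder
                correct_run = 0
                if last_dir == +1:          # direction change = reversal
                    reversals.append(level)
                level, last_dir = level - step, -1
        else:                               # any error -> easier
            correct_run = 0
            if last_dir == -1:
                reversals.append(level)
            level, last_dir = level + step, +1
    return sum(reversals[-6:]) / 6          # threshold: mean of late reversals

random.seed(1)

def simulated_listener(level, threshold=120.0, slope=10.0):
    """Hypothetical 2IFC observer: 50% correct at chance, rising above threshold."""
    p = 0.5 + 0.5 / (1 + math.exp(-(level - threshold) / slope))
    return random.random() < p

estimate = staircase(simulated_listener)
```

Here "level" would be the audio-video asynchrony being tracked; the same machinery serves any 2IFC discrimination task.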
62 Auditory-Visual Asynchrony - Paradigm
63 Cross-Modality Temporal Asynchrony Effects: Sentences
[Figure: IEEE word recognition (%) as a function of audio delay (ms), with audio-alone (bands) and speechreading-alone baselines]
64 Auditory-Visual Integration - by Individual Subjects
These data are complex, but the implications are clear:
- A leading video signal is better than synchronous presentation for 8 of 9 subjects
- Audio-visual integration is a complicated, poorly understood process, at least with respect to speech intelligibility
- There is substantial variation across subjects
65 McGurk Synchrony Paradigm
66 Temporal Integration in the McGurk Effect
[Figure: response probability for /pa/, /ta/, and /ka/ as a function of audio delay (ms)]
67 Simultaneity Judgements - Natural vs. McGurk AV Tokens
[Figure: probability of perceived simultaneity as a function of audio delay (ms) for congruent tokens (A d V d, A t V t) and McGurk tokens (A p V k, A b V g)]
68 Spectro-Temporal Synchrony Discrimination
[Figure: audio-lead and audio-lag discrimination thresholds (ms) by filter band condition (Hz), including WB]
69 Temporal Window of Integration
[Figure: audio delay (ms) windows for McGurk synchrony discrimination and identification, McGurk fusion identification, CV synchrony discrimination and identification, IEEE synchrony discrimination and word identification, and TIMIT word identification]
70-76 Spectro-Temporal Integration: Summary
Within modality (cross-spectral auditory integration):
- TWI is symmetrical
- TWI is roughly 50 ms or less (phoneme?)
Across modality (cross-modal AV integration):
- TWI is highly asymmetrical, favoring visual leads
- TWI is roughly ms (syllable?)
- TWI for incongruent CVs (McGurk stimuli) is not as wide as TWI for natural congruent CVs
77 Auditory-Visual Speech Perception Laboratory
Walter Reed Army Medical Center, Army Audiology and Speech Center, Washington, DC USA
wramc.amedd.army.mil/departments/aasc/avlab
Related publication: Grant, K.W., van Wassenhove, V., and Poeppel, D. (2004). "Detection of auditory (cross-spectral) and auditory-visual (cross-modal) synchrony," Speech Communication 44, 43-53.
More informationSpeech (Sound) Processing
7 Speech (Sound) Processing Acoustic Human communication is achieved when thought is transformed through language into speech. The sounds of speech are initiated by activity in the central nervous system,
More informationSpeech Spectra and Spectrograms
ACOUSTICS TOPICS ACOUSTICS SOFTWARE SPH301 SLP801 RESOURCE INDEX HELP PAGES Back to Main "Speech Spectra and Spectrograms" Page Speech Spectra and Spectrograms Robert Mannell 6. Some consonant spectra
More informationTactual Display of Consonant Voicing to Supplement Lipreading
Tactual Display of Consonant Voicing to Supplement Lipreading by Hanfeng Yuan S.B., Shanghai Jiao Tong University (1995) S.M. Massachusetts Institute of Technology (1999) Submitted to the Department of
More informationThe role of low frequency components in median plane localization
Acoust. Sci. & Tech. 24, 2 (23) PAPER The role of low components in median plane localization Masayuki Morimoto 1;, Motoki Yairi 1, Kazuhiro Iida 2 and Motokuni Itoh 1 1 Environmental Acoustics Laboratory,
More informationCHAPTER 1 INTRODUCTION
CHAPTER 1 INTRODUCTION 1.1 BACKGROUND Speech is the most natural form of human communication. Speech has also become an important means of human-machine interaction and the advancement in technology has
More informationApplying the summation model in audiovisual speech perception
Applying the summation model in audiovisual speech perception Kaisa Tiippana, Ilmari Kurki, Tarja Peromaa Department of Psychology and Logopedics, Faculty of Medicine, University of Helsinki, Finland kaisa.tiippana@helsinki.fi,
More informationSound localization psychophysics
Sound localization psychophysics Eric Young A good reference: B.C.J. Moore An Introduction to the Psychology of Hearing Chapter 7, Space Perception. Elsevier, Amsterdam, pp. 233-267 (2004). Sound localization:
More informationRESEARCH ON SPOKEN LANGUAGE PROCESSING Progress Report No. 27 (2005) Indiana University
AUDIOVISUAL ASYNCHRONY DETECTION RESEARCH ON SPOKEN LANGUAGE PROCESSING Progress Report No. 27 (2005) Indiana University Audiovisual Asynchrony Detection and Speech Perception in Normal-Hearing Listeners
More informationTopic 4. Pitch & Frequency
Topic 4 Pitch & Frequency A musical interlude KOMBU This solo by Kaigal-ool of Huun-Huur-Tu (accompanying himself on doshpuluur) demonstrates perfectly the characteristic sound of the Xorekteer voice An
More informationResearch Article The Acoustic and Peceptual Effects of Series and Parallel Processing
Hindawi Publishing Corporation EURASIP Journal on Advances in Signal Processing Volume 9, Article ID 6195, pages doi:1.1155/9/6195 Research Article The Acoustic and Peceptual Effects of Series and Parallel
More informationRESEARCH ON SPOKEN LANGUAGE PROCESSING Progress Report No. 24 (2000) Indiana University
COMPARISON OF PARTIAL INFORMATION RESEARCH ON SPOKEN LANGUAGE PROCESSING Progress Report No. 24 (2000) Indiana University Use of Partial Stimulus Information by Cochlear Implant Patients and Normal-Hearing
More informationComputational Perception /785. Auditory Scene Analysis
Computational Perception 15-485/785 Auditory Scene Analysis A framework for auditory scene analysis Auditory scene analysis involves low and high level cues Low level acoustic cues are often result in
More informationPerceived Audiovisual Simultaneity in Speech by Musicians and Non-musicians: Preliminary Behavioral and Event-Related Potential (ERP) Findings
The 14th International Conference on Auditory-Visual Speech Processing 25-26 August 2017, Stockholm, Sweden Perceived Audiovisual Simultaneity in Speech by Musicians and Non-musicians: Preliminary Behavioral
More informationWhat you re in for. Who are cochlear implants for? The bottom line. Speech processing schemes for
What you re in for Speech processing schemes for cochlear implants Stuart Rosen Professor of Speech and Hearing Science Speech, Hearing and Phonetic Sciences Division of Psychology & Language Sciences
More informationEvaluating the role of spectral and envelope characteristics in the intelligibility advantage of clear speech
Evaluating the role of spectral and envelope characteristics in the intelligibility advantage of clear speech Jean C. Krause a and Louis D. Braida Research Laboratory of Electronics, Massachusetts Institute
More informationVisual impairment and speech understanding in older adults with acquired hearing loss
Visual impairment and speech understanding in older adults with acquired hearing loss Jean-Pierre Gagné École d orthophonie et d audiologie, Université de Montréal, Montréal, Québec, Canada H3C3J7 jean-pierre.gagne@umontreal.ca
More informationTwenty subjects (11 females) participated in this study. None of the subjects had
SUPPLEMENTARY METHODS Subjects Twenty subjects (11 females) participated in this study. None of the subjects had previous exposure to a tone language. Subjects were divided into two groups based on musical
More informationAn Update on Auditory Neuropathy Spectrum Disorder in Children
An Update on Auditory Neuropathy Spectrum Disorder in Children Gary Rance PhD The University of Melbourne Sound Foundations Through Early Amplification Meeting, Chicago, Dec 2013 Overview Auditory neuropathy
More informationSimulations of high-frequency vocoder on Mandarin speech recognition for acoustic hearing preserved cochlear implant
INTERSPEECH 2017 August 20 24, 2017, Stockholm, Sweden Simulations of high-frequency vocoder on Mandarin speech recognition for acoustic hearing preserved cochlear implant Tsung-Chen Wu 1, Tai-Shih Chi
More informationWIDEXPRESS. no.30. Background
WIDEXPRESS no. january 12 By Marie Sonne Kristensen Petri Korhonen Using the WidexLink technology to improve speech perception Background For most hearing aid users, the primary motivation for using hearing
More informationThe role of periodicity in the perception of masked speech with simulated and real cochlear implants
The role of periodicity in the perception of masked speech with simulated and real cochlear implants Kurt Steinmetzger and Stuart Rosen UCL Speech, Hearing and Phonetic Sciences Heidelberg, 09. November
More informationGrozdana Erjavec 1, Denis Legros 1. Department of Cognition, Language and Interaction, University of Paris VIII, France. Abstract. 1.
INTERSPEECH 2013 Effects of Mouth-Only and Whole-Face Displays on Audio-Visual Speech Perception in Noise: Is the Vision of a Talker s Full Face Truly the Most Efficient Solution? Grozdana Erjavec 1, Denis
More informationCONSTRUCTING TELEPHONE ACOUSTIC MODELS FROM A HIGH-QUALITY SPEECH CORPUS
CONSTRUCTING TELEPHONE ACOUSTIC MODELS FROM A HIGH-QUALITY SPEECH CORPUS Mitchel Weintraub and Leonardo Neumeyer SRI International Speech Research and Technology Program Menlo Park, CA, 94025 USA ABSTRACT
More informationFitting Frequency Compression Hearing Aids to Kids: The Basics
Fitting Frequency Compression Hearing Aids to Kids: The Basics Presenters: Susan Scollie and Danielle Glista Presented at AudiologyNow! 2011, Chicago Support This work was supported by: Canadian Institutes
More informationWho are cochlear implants for?
Who are cochlear implants for? People with little or no hearing and little conductive component to the loss who receive little or no benefit from a hearing aid. Implants seem to work best in adults who
More informationVariability in Word Recognition by Adults with Cochlear Implants: The Role of Language Knowledge
Variability in Word Recognition by Adults with Cochlear Implants: The Role of Language Knowledge Aaron C. Moberly, M.D. CI2015 Washington, D.C. Disclosures ASA and ASHFoundation Speech Science Research
More informationCategorical Perception
Categorical Perception Discrimination for some speech contrasts is poor within phonetic categories and good between categories. Unusual, not found for most perceptual contrasts. Influenced by task, expectations,
More informationSpeech conveys not only linguistic content but. Vocal Emotion Recognition by Normal-Hearing Listeners and Cochlear Implant Users
Cochlear Implants Special Issue Article Vocal Emotion Recognition by Normal-Hearing Listeners and Cochlear Implant Users Trends in Amplification Volume 11 Number 4 December 2007 301-315 2007 Sage Publications
More informationThe development of a modified spectral ripple test
The development of a modified spectral ripple test Justin M. Aronoff a) and David M. Landsberger Communication and Neuroscience Division, House Research Institute, 2100 West 3rd Street, Los Angeles, California
More informationEffects of slow- and fast-acting compression on hearing impaired listeners consonantvowel identification in interrupted noise
Downloaded from orbit.dtu.dk on: Jan 01, 2018 Effects of slow- and fast-acting compression on hearing impaired listeners consonantvowel identification in interrupted noise Kowalewski, Borys; Zaar, Johannes;
More informationEffects of noise and filtering on the intelligibility of speech produced during simultaneous communication
Journal of Communication Disorders 37 (2004) 505 515 Effects of noise and filtering on the intelligibility of speech produced during simultaneous communication Douglas J. MacKenzie a,*, Nicholas Schiavetti
More informationDiagnosis and Management of ANSD: Outcomes of Cochlear Implants versus Hearing Aids
Diagnosis and Management of ANSD: Outcomes of Cochlear Implants versus Hearing Aids Gary Rance PhD The University of Melbourne International Paediatric Conference, Shanghai, April 214 Auditory Neuropathy
More informationAlthough considerable work has been conducted on the speech
Influence of Hearing Loss on the Perceptual Strategies of Children and Adults Andrea L. Pittman Patricia G. Stelmachowicz Dawna E. Lewis Brenda M. Hoover Boys Town National Research Hospital Omaha, NE
More informationAudio-Visual Integration in Multimodal Communication
Audio-Visual Integration in Multimodal Communication TSUHAN CHEN, MEMBER, IEEE, AND RAM R. RAO Invited Paper In this paper, we review recent research that examines audiovisual integration in multimodal
More informationThe functional importance of age-related differences in temporal processing
Kathy Pichora-Fuller The functional importance of age-related differences in temporal processing Professor, Psychology, University of Toronto Adjunct Scientist, Toronto Rehabilitation Institute, University
More informationComment by Delgutte and Anna. A. Dreyer (Eaton-Peabody Laboratory, Massachusetts Eye and Ear Infirmary, Boston, MA)
Comments Comment by Delgutte and Anna. A. Dreyer (Eaton-Peabody Laboratory, Massachusetts Eye and Ear Infirmary, Boston, MA) Is phase locking to transposed stimuli as good as phase locking to low-frequency
More informationRecognition of Multiply Degraded Speech by Young and Elderly Listeners
Journal of Speech and Hearing Research, Volume 38, 1150-1156, October 1995 Recognition of Multiply Degraded Speech by Young and Elderly Listeners Sandra Gordon-Salant University of Maryland College Park
More informationBinaural Hearing for Robots Introduction to Robot Hearing
Binaural Hearing for Robots Introduction to Robot Hearing 1Radu Horaud Binaural Hearing for Robots 1. Introduction to Robot Hearing 2. Methodological Foundations 3. Sound-Source Localization 4. Machine
More informationSound Texture Classification Using Statistics from an Auditory Model
Sound Texture Classification Using Statistics from an Auditory Model Gabriele Carotti-Sha Evan Penn Daniel Villamizar Electrical Engineering Email: gcarotti@stanford.edu Mangement Science & Engineering
More informationBinaural Hearing. Why two ears? Definitions
Binaural Hearing Why two ears? Locating sounds in space: acuity is poorer than in vision by up to two orders of magnitude, but extends in all directions. Role in alerting and orienting? Separating sound
More informationEssential feature. Who are cochlear implants for? People with little or no hearing. substitute for faulty or missing inner hair
Who are cochlear implants for? Essential feature People with little or no hearing and little conductive component to the loss who receive little or no benefit from a hearing aid. Implants seem to work
More information