A visual concomitant of the Lombard reflex


ISCA Archive http://www.isca-speech.org/archive
Auditory-Visual Speech Processing 2005 (AVSP 05), British Columbia, Canada, July 24-27, 2005

A visual concomitant of the Lombard reflex

Jeesun Kim 1, Chris Davis 1, Guillaume Vignali 1 and Harold Hill 2
1 Department of Psychology, The University of Melbourne, Melbourne, Australia
2 Department of Vision Dynamics, ATR, Kyoto, Japan

ABSTRACT

The aim of the study was to examine how visual speech (speech-related head and face movement) might vary as a function of communicative environment. To this end, measurements of head and face movement were recorded for four talkers who uttered the same ten sentences in quiet and in four types of background noise condition (babble and white noise, presented through ear plugs or loudspeakers). These sentences were also spoken in a whisper. Changes between the normal and in-noise conditions were apparent in many of the Principal Components (PCs) of head and face movement. To simplify the analysis of differences between conditions, only the first six movement PCs were considered. The strength and composition of the changes was variable. Large changes occurred for jaw and mouth movement, face expansion and contraction, and head rotation in the Z axis. Minimal change occurred for PC3 (rigid head translation in the Z axis). Whispered speech showed many of the characteristics of speech produced in noise but was distinguished by a marked increase in head translation in the Z axis. Analyses of the correlation between auditory speech intensity and movement under the different production conditions also revealed a complex pattern of changes. The correlations between RMS speech energy and the PCs that involved jaw and mouth movement (PC1 and PC2) increased markedly from the normal to the in-noise production conditions. An increase in the RMS-movement correlation also occurred for head Z-rotation as a function of speaking condition. No increases were observed for the movement associated with head Z-translation, lip protrusion, or mouth opening with face contraction. These findings suggest that the relationships underlying audio-visual speech perception may differ depending on how that speech was produced.

1. INTRODUCTION

The Lombard effect (or Lombard reflex or sign) refers to the tendency for a person to increase the loudness of his/her voice in the presence of noise [1]. Compared with normal speech, speech in noise is typically produced with increased volume, decreased speaking rate, and changes in articulation and pitch [2]. The Lombard effect occurs in children [3], macaques [4] and birds [5]. In most instances the Lombard effect occurs reflexively; it is not diminished by training or instruction [6; although see 7].

Studies on the Lombard effect have concentrated on describing the changes that occur in the auditory signal. In this paper, we extend this approach to include the changes that occur in the visual correlates of speech articulation. In the current study we also examined whispered speech, as it has been claimed that some of the effects of whispering are similar to those that occur for speech production made with increased vocal effort [8].

2. METHOD

Participants. Four people participated in the experiment (3 males, 1 female). All were native speakers of English (one British, two Australian and one American); ages ranged from 32 to 54 years.

Materials. The materials were 10 sentences selected from the 1965 revised list of phonetically balanced sentences (Harvard Sentences, [9]).

Noise. Two types of background noise were employed: multi-talker babble and white noise. A commercial babble track (Auditec, St. Louis, MO) was used; the flat white noise was generated at the same average RMS as the babble track. The noise was presented to participants either through ear plugs or through two loudspeakers (Yamaha MS 101-II); see Figure 1 for the layout. The conditions will be referred to as Babble_Plug and Babble_LS for the babble-noise ear plug and loudspeaker conditions, and White_Plug and White_LS for the white-noise ear plug and loudspeaker conditions.

Figure 1. The layout of the capture session. The participant whose head and face movements were recorded sat in an adjustable chair facing a conversational partner who stood behind the two OPTOTRAK machines.
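As an aside, the RMS matching of the white noise to the babble track described above can be illustrated with a minimal sketch. This is not the authors' procedure; it assumes a hypothetical babble recording file name and uses the numpy and soundfile packages.

```python
import numpy as np
import soundfile as sf  # any WAV I/O library would do

# Load the babble track (file name is hypothetical).
babble, fs = sf.read("babble_auditec.wav")
if babble.ndim > 1:                       # mix to mono if the track is stereo
    babble = babble.mean(axis=1)

# Average RMS of the babble track.
babble_rms = np.sqrt(np.mean(babble ** 2))

# Flat (white) Gaussian noise of the same duration, scaled to the same RMS.
rng = np.random.default_rng(0)
white = rng.standard_normal(len(babble))
white *= babble_rms / np.sqrt(np.mean(white ** 2))

sf.write("white_matched.wav", white, fs)
```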

Apparatus. Movement capture: two Northern Digital Optotrak machines were used to record the movement data. The configuration of the markers on the face and head-rig is shown in Figure 2. Sound capture: sound was captured from both a head-mounted microphone (Shure SM12A) and a floor microphone (Sennheiser MKH416P48U-3). Video capture: video was captured using a digital camera (Sony HDCAM HKDW-702).

Figure 2. Position of the 24 facial markers and the four rigid-body markers (on the extremities of the head-rig). Also shown are the directions of the XYZ axes.

Procedure. Each session began with the placement of the movement sensors (see Fig 1), during which time participants were asked to memorize the ten sentences to be spoken. Each participant was recorded individually in a session that lasted approximately 90 minutes. Participants were seated in an adjustable dentist chair in a quiet room and were asked to say aloud the ten sentences (one at a time) to a person who was directly facing them at a distance of approximately 2.5 metres (see Fig 1). The participant then repeated the ten sentences. This basic procedure was repeated five more times, once for each speech mode condition. These conditions consisted of the participant speaking while hearing multi-talker babble through a set of ear plugs (at approximately 80 dB SPL); hearing the same babble through two loudspeakers; hearing white noise through ear plugs (at the same intensity); hearing white noise through the loudspeakers; and finally whispering the sentences (at a level judged loud enough for the conversational partner to hear).

Data processing. Non-rigid facial and rigid head movement was extracted from the raw marker positions. The data were recorded at a sampling rate of 60 Hz. Each frame was represented in terms of its displacement from the first frame, and Principal Component (PC) analysis was used to reduce the dimensionality of the data. The discrete data were fitted as continuous functions using B-spline basis functions (so-called functional data, see [10]). This process of converting discrete data to trajectories required that all instances were of the same duration; this was achieved by a reversible time-alignment procedure (with time differences recorded). The characteristic shape of the data was maintained over the time-warping procedure by using manually placed landmarks which were then aligned in time (the beginning and end of each curve were also taken as landmarks).

3. RESULTS

Analysis of the auditory data indicates that there was a Lombard effect. For instance, speech produced in the noise conditions was louder than that produced without background noise [F(1,78) = 365.41, p < 0.05]. On average the size of this effect was 11 dB. The length of the renditions also increased for the in-noise conditions compared to the no-noise Normal speech [F(1,78) = 33.4], with an average increase in production time of 290 ms.

Figure 3 shows the cumulative contribution of the eigenvalues that resulted from the PCA. It can be seen that 26 PCs were sufficient to recover 99% of the variance of the head and face movement.

Figure 3. The cumulative contribution of the eigenvalues of the Principal Component Analysis (PCA).

In order to parameterize the contribution of the PCs, the absolute values of each PC were summed to represent the amount that each PC contributed to head and face movement over time (see Fig 4). These data ("PC strength") were used as the dependent measure in a series of ANOVAs to determine whether there were differences in the amount of movement across speech modes, sentences and persons.

Figure 4. The absolute value of each PC's contribution across time was summed to provide a PC strength measure.
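The displacement representation and PCA step described above can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes the Optotrak data have already been assembled into a frames x coordinates matrix, and it uses a plain SVD-based PCA on the per-frame displacements from the first frame.

```python
import numpy as np

def movement_pcs(frames: np.ndarray, n_pcs: int = 26):
    """PCA of head and face movement.

    frames : (n_frames, n_dims) array of marker coordinates sampled at 60 Hz
             (in the actual recordings n_dims would cover the 24 facial markers
             plus the rigid head parameters; here it is arbitrary).
    Returns the PC basis vectors, the per-frame PC score trajectories, and the
    proportion of variance explained by each retained PC.
    """
    # Represent each frame as its displacement from the first frame.
    disp = frames - frames[0]

    # PCA via SVD of the mean-centred displacement matrix.
    centred = disp - disp.mean(axis=0)
    u, s, vt = np.linalg.svd(centred, full_matrices=False)

    components = vt[:n_pcs]                 # PC basis vectors (movement "shapes")
    scores = centred @ components.T         # PC score trajectories over time
    explained = (s ** 2) / np.sum(s ** 2)   # proportion of variance per PC
    return components, scores, explained[:n_pcs]
```

The cumulative sum of `explained` corresponds to the eigenvalue contribution plotted in Figure 3; in the study, 26 PCs recovered about 99% of the movement variance.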
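The "PC strength" measure can be sketched directly from the PC score trajectories (for example, those returned by the hypothetical movement_pcs helper above): the strength of each PC for an utterance is the sum of the absolute values of its scores across frames.

```python
import numpy as np

def pc_strength(scores: np.ndarray) -> np.ndarray:
    """Sum of absolute PC scores over time.

    scores : (n_frames, n_pcs) PC score trajectories for one utterance.
    Returns a length-n_pcs vector giving how much each PC contributed to the
    head and face movement of that utterance (the dependent measure used in
    the ANOVAs reported below).
    """
    return np.abs(scores).sum(axis=0)
```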

In the analysis that follows we chose to concentrate on the first six PCs, since visualization of these showed that they largely represented specific head and face movements. That is, PC1 could be glossed as jaw motion (and mouth opening); PC2 as mouth opening and eyebrow raising (without jaw motion); PC3 as head translation (in the Z axis); PC4 as lip protrusion; PC5 as mouth opening and eyebrow closure (cf. PC2); and PC6 as head rotation in the Z axis (see Fig 5). Approximately 90% of the variance was captured by the first six PCs.

Figure 5. The first six Principal Components (PCs). See text for details.

In terms of these PCs, the effect of talking in noise (as measured from the no-noise Normal condition) can be characterized as a marked increase in jaw motion (PC1) and in mouth motion and eyebrow expansion (PC2), together with increases in lip protrusion (PC4), mouth and eyebrow closure (PC5) and head rotation (PC6). There was, however, no change in the translation of the head (PC3). Figure 6 (top) shows the increase in the size of the first six PCs for speech in noise (relative to each of the associated PCs for the normal speech condition). Whispered speech is also shown (bottom panel). Note how PC3 differs across the two panels of the figure.

Each of the in-noise conditions was compared to normal speech. There was a main effect of speech mode (in-noise vs. normal) for each contrast [Babble_Plug, F(1,200) = 146.46, p < 0.05; Babble_LS, F = 56.67, p < 0.05; White_Plug, F = 99.58, p < 0.05; White_LS, F = 80.96, p < 0.05]. Whispered speech movements also differed from those of normal speech [F = 137.8, p < 0.05]. Speech in noise produced a different pattern of PCs compared to normal speech, as indicated by significant interactions between the main effect of noise vs. normal and the different PCs [Babble_Plug vs. Normal x PCs, F(5,200) = 35.4, p < 0.05; Babble_LS vs. Normal x PCs, F = 15.86, p < 0.05; White_Plug vs. Normal x PCs, F = 23.58, p < 0.05; White_LS vs. Normal x PCs, F = 19.2, p < 0.05]. There was also an interaction for whispered speech vs. Normal x PCs [F = 11.04].

Figure 6. The change in the size of PCs 1-6 (relative to the no-noise condition) for the four speech-in-noise conditions (top) and the whispered speech condition (bottom).

The four speech-in-noise conditions were analyzed separately to determine whether the type of noise affected the amount (main effect) and constitution (interaction effect) of head and face movement. The analyses indicated that there was a main effect of noise type [F(3,585) = 3.82, p < 0.05] and that this interacted with the effect of the different PCs [F(15,585) = 2.86, p < 0.05]. ANOVAs were then conducted with noise type and mode of noise delivery (Plug vs. Loud Speaker) as factors to determine whether these properties altered the pattern of head and face movement. The strengths of PCs 1-6 are shown in Figure 7, separated into Plug vs. Loud Speaker (top) and Babble vs. White noise (bottom). Analysis revealed a significant effect of noise type [F(1,195) = 4.41, p < 0.05, with the sum of PCs 1-6 being larger for babble noise than for white noise] and a significant effect of the mode of noise delivery [F(1,195) = 4.2, p < 0.05, with the sum of PCs 1-6 being larger for the Plug than for the Loud Speaker conditions]. There was also an interaction between these effects and the pattern of PCs: noise type x PCs [F(5,195) = 4.46, p < 0.05] and noise delivery x PCs [F(5,195) = 2.25, p = 0.05]. Analysis of the mean RMS intensities for the four noise conditions showed that there was no difference in the intensity of the productions under the different noise conditions [F(3,340) = 1.06, p > 0.05].
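The ANOVAs above operate on the per-utterance PC strength values. As a rough illustration only, the following sketch fits a two-factor model (speech mode x PC) with statsmodels on a hypothetical long-format table; it is not the authors' analysis and, unlike the reported ANOVAs, it ignores the repeated-measures structure of the design.

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hypothetical long-format table: one row per utterance x PC, with the PC
# strength value, the speech mode (e.g. Normal, Babble_Plug, ...), and which
# of the first six PCs the value belongs to.
df = pd.read_csv("pc_strength_long.csv")     # columns: strength, mode, pc

# Main effects of speech mode and PC plus their interaction
# (the "mode x PCs" interactions reported above).
model = smf.ols("strength ~ C(mode) * C(pc)", data=df).fit()
print(anova_lm(model, typ=2))
```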

Figure 7. The strength of PCs 1-6 for the Plug and Loud Speaker noise conditions (top) and the Babble and White noise conditions (bottom).

The following analyses examined the relationship between the RMS of the auditory signal, for the wide band (WB) and two frequency sub-bands (approximately the F1 and F2 ranges), and PCs 1-6 across the different in-noise conditions. These analyses were conducted using the data from a single participant (male, aged 45). Given that there might be an asynchrony between the auditory and visible speech data, B-spline basis functions were fitted to the auditory RMS data in order to time-align it with the PC movement data. The correlation coefficient was calculated as the zeroth lag of the normalized covariance function. Figure 8 shows the average correlation coefficients between wide band (WB) RMS energy and PCs 1-6 for normal speech and for an average of the speech-in-noise conditions.

Figure 8. Average correlation coefficients between WB RMS and PCs 1-6 for the normal and in-noise speech.

As can be seen in the figure, the correlations for PC1, PC2 and PC6 all increase for the speech produced in the noise conditions, whereas those of components 3, 4 and 5 do not.

Previous studies have shown that the correlation of auditory RMS and speech-related movement (typically mouth movement) changes as a function of speech frequency band (e.g., [11]). To examine this, the auditory stimuli were band-pass filtered into two speech bands that should contain F1 (100-800 Hz) and F2 (800-2200 Hz) energy respectively. Figure 9 shows the average correlation coefficients for F1-band RMS energy and PCs 1-6 for normal speech and for an average of the speech-in-noise conditions. Correlations for the F2 band are shown in Figure 10.

Figure 9. Average correlation coefficients between F1-band RMS and PCs 1-6 for the normal and in-noise speech.

The pattern of the correlations shown in Figure 9 closely resembles that found for WB speech (Fig 8). Once again, there was a stronger correlation between those movement components that relate to jaw and mouth opening and head rotation (nodding) for speech produced in noise than for normal speech.

Figure 10. Average correlation coefficients between F2-band RMS and PCs 1-6 for the normal and in-noise speech.

Overall, the general pattern of correlations shown in Figure 10 is similar to that found for the WB and F1-band speech. It should be noted, however, that the correlations for the first two PCs for the F2 data are considerably higher than those for the WB or F1-band signals. This result of an increased coupling between auditory energy fluctuations and mouth and jaw movement confirms the results of previous studies [11, 12]. This increase in the speech energy and movement correlation was not shown for PC6 (head rotation in the Z axis).
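A minimal sketch of the audio-movement correlation measure described above, under the assumptions that the audio has already been time-aligned with the 60 Hz movement data and that a frame-rate RMS envelope per band is sufficient. The band edges follow the F1 (100-800 Hz) and F2 (800-2200 Hz) ranges used in the analysis, and the zeroth lag of the normalized covariance reduces to a Pearson correlation between the two sequences; the scipy filtering step is an assumed stand-in for the original processing.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def band_rms_envelope(audio, fs, band=None, frame_rate=60):
    """Frame-rate RMS envelope of (optionally band-passed) speech."""
    if band is not None:                                   # e.g. (100, 800) for F1
        sos = butter(4, band, btype="bandpass", fs=fs, output="sos")
        audio = sosfiltfilt(sos, audio)
    hop = int(fs / frame_rate)                             # one value per movement frame
    n_frames = len(audio) // hop
    frames = audio[: n_frames * hop].reshape(n_frames, hop)
    return np.sqrt(np.mean(frames ** 2, axis=1))

def zero_lag_correlation(rms_env, pc_scores):
    """Zeroth lag of the normalized covariance between RMS energy and one PC trajectory."""
    n = min(len(rms_env), len(pc_scores))
    return np.corrcoef(rms_env[:n], pc_scores[:n])[0, 1]
```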

4. DISCUSSION

Head and face movement undergo a complex pattern of change when a person talks in noisy conditions compared to when they talk in quiet. We examined the first six PCs of head and face movement for different types of background noise and different noise presentation methods. In general, it was found that the largest changes from the normal to the noise conditions were in jaw and mouth movement, face expansion and contraction, and head rotation (it should be noted that these differences extend to the other PCs, as an analysis of PCs 7-26 also revealed a significant difference in speech-related movement between the normal and the in-noise conditions [F(1,19) = 41.35, p < 0.05]). However, not all of the movement PCs analyzed showed a change from the normal to the in-noise conditions; there was no change in head translation in the Z axis (although this movement PC did show an increase for whispered speech).

Apart from the gross differences associated with the change from quiet to noisy conditions, there were also different movement patterns displayed across the different types of noise and presentation methods. Speech produced in multi-talker babble tended to have more jaw and mouth movement than that produced in white noise. This pattern was also evident for the ear plug vs. loudspeaker contrast, with larger jaw and mouth movement occurring with the ear plugs. The difference between the noise types and presentation methods might be related to the perceived loudness/interfering properties of the noise (certainly the audibility of one's own voice differs under these listening conditions).

Analysis of the correlation between auditory intensity (RMS) and the measure of each PC's contribution also revealed an intricate series of differences between the correlations found for normal speech and those of the speech-in-noise conditions. Speaking in noise increased the correlation between RMS energy and the PCs that involved jaw and mouth movement (PC1 and PC2). Likewise, speaking in noise also increased the correlation between auditory RMS and rigid head rotation in the Z axis. The increased association of this nodding movement with speech energy might be a by-product of the amplification of jaw movement, pulmonary effort and increased syllabic stress. The stronger correlation between the RMS energy profile in the F2 band and jaw- and mouth-related movement (compared to the WB and F1 RMS) appeared to reduce the size of the increase in correlation associated with the in-noise conditions. This suggests that the two effects (band and noise) might act on the same mechanism.

In sum, it is clear that speech-related head and face movements (visible speech) are different for speech produced in quiet and in noise. These results call for new studies examining AV speech perception in noise with visual speech actually produced under the various noise conditions (in contrast to those that have used visual speech recorded under quiet conditions). The current results are also pertinent to studies of speech perception in noise that have used indices of AV correlation collected in quiet speech conditions.

Acknowledgments: This work was supported by the Australian Research Council. We would like to thank Denis Burnham, Marcia Riley and Kuratate Takaaki for their help in conducting the Optotrak recordings.

5. REFERENCES

[1] Lombard, E. (1911). Le signe de l'elevation de la voix. Ann. Mal. Oreil. Larynx, 37, 101-199.
[2] Junqua, J-C. (1996). The influence of acoustics on speech production: A noise-induced stress phenomenon known as the Lombard reflex. Speech Communication, 20, 13-22.
[3] Amazi, D. K. & Garber, S. R. (1982). The Lombard sign as a function of age and task. Journal of Speech and Hearing Research, 25, 581-585.
[4] Sinnott, J. M., Stebbins, W. C. & Moody, D. B. (1975). Regulation of voice amplitude by the monkey. Journal of the Acoustical Society of America, 58, 412-414.
[5] Potash, L. M. (1972). Noise-induced changes in calls of the Japanese quail. Psychonomic Science, 26, 252-254.
[6] Pick, H. L., Siegel, G. M., Fox, P. W., Garber, S. R. & Kearney, J. K. (1989). Inhibiting the Lombard effect. Journal of the Acoustical Society of America, 85(2), 894-900.
[7] Tonkinson, S. (1994). The Lombard effect in choral singing. Journal of Voice, 8, 24-29.
[8] Traunmüller, H. & Eriksson, A. (2000). Acoustic effects of variation in vocal effort by men, women, and children. Journal of the Acoustical Society of America, 107, 3438-3451.
[9] IEEE Subcommittee on Subjective Measurements (1969). IEEE Recommended Practices for Speech Quality Measurements (Appendix: Harvard sentences). IEEE Transactions on Audio and Electroacoustics, 17, 227-246.
[10] Ramsay, J. O. & Silverman, B. W. (1997). Functional Data Analysis. Springer.
[11] Grant, K. W. & Seitz, P. F. (2000). The use of visible speech cues for improving auditory detection of spoken sentences. Journal of the Acoustical Society of America, 108, 1197-1208.
[12] Kim, J. & Davis, C. (2003). Hearing foreign voices: does knowing what is said affect visual-masked-speech detection? Perception, 32, 111-120.