Cognitive resources in audiovisual speech perception


Cognitive resources in audiovisual speech perception

by

Julie Noelle Buchan

A thesis submitted to the Department of Psychology
in conformity with the requirements for
the degree of Doctor of Philosophy

Queen's University
Kingston, Ontario, Canada
September 2011

Copyright © Julie Noelle Buchan, 2011

Abstract

Most events that we encounter in everyday life provide our different senses with correlated information, and audiovisual speech perception is a familiar instance of multisensory integration. Several approaches will be used to further examine the role of cognitive factors in audiovisual speech perception. The main focus of this thesis will be to examine the influences of cognitive load and selective attention on audiovisual speech perception, as well as the integration of auditory and visual information in talking distractor faces. The influence of cognitive factors on the temporal integration of auditory and visual speech, and on gaze behaviour during audiovisual speech, will also be addressed. The overall results of the experiments presented here suggest that the integration of auditory and visual speech information is quite robust to various attempts to modulate it. Adding a cognitive load task shows minimal disruption of the integration of auditory and visual speech information. Changing attentional instructions to get subjects to selectively attend to either the auditory or the visual speech information also has a rather modest influence on the observed integration. Generally, the integration of temporally offset auditory and visual information seems rather insensitive to cognitive load or selective attentional manipulations. The

processing of visual information from distractor faces seems to be limited. The language of the visually articulating distractors doesn't appear to provide information that is helpful for matching together the auditory and visual speech streams. Audiovisual speech distractors are not really any more distracting than auditory distractor speech paired with a still image, suggesting limited processing or integration of the visual and auditory distractor information. Gaze behaviour during audiovisual speech perception appears to be relatively unaffected by an increase in cognitive load, but is somewhat influenced by attentional instructions to selectively attend to the auditory or visual information. Additionally, both the congruency of the consonant and the temporal offset of the auditory and visual stimuli have small but rather robust influences on gaze.

Co-Authorship

Kevin G. Munhall is a co-author for the papers that appear in Chapters 3 and 4.

Acknowledgments

First, I would like to thank the Academy (since that's probably the only time I'm going to be able to say that!), my supervisor, Dr. Kevin Munhall, and my committee, Dr. Ingrid Johnsrude and Dr. Daryl Wilson. Thank you Kevin, without your help and support (and selective application of patience and nagging) this thesis never, ever, would have been written. You have been the best supervisor (though I should caveat that we are dealing with a small sample size, n = 1), and there is absolutely no way that I would have been able to go through this journey without you. Thank you Daryl and Ingrid for your helpful and insightful comments on the draft of the thesis. I would also like to thank all the members of my committee, including my external, Dr. Vincent Gracco, and my internal/external, Dr. Greg Lessard, for making the defense a positive experience. A big thank you goes out to everyone in the Speech Perception and Production Lab, both current and past members. Special thanks go out to Paul Plante and Agnès Alsius. Paul, thanks so much for all your help in creating stimuli and analyzing eyetracking data. Without your help there would not have been a set of 2781 symbols for the experiments in Chapter 2, the videos for Experiment 3b would still have all their high spatial frequency components, and the initial extraction of the eyetracking data from the .asc files for Experiments 2a, 2b, 3a, 3b and 6c would have taken much

much much longer. Agnès, thanks so much for all of your help, support and collaboration on the experiments in Chapter 6 (Experiments 5, 6a, 6b and 6c). Thanks for all your help and input in designing the studies, filming the videos, putting together the stimuli, running the studies and analyzing the data; I couldn't (and probably wouldn't!) have done it without you. A huge hug and thank you needs to go to my parents, Nancy and Jack Buchan, and to my sister, Leah Buchan. Thanks Mom and Dad for everything. Without your emotional (and sometimes financial!) support I never would have been able to finish. Thanks for listening and cheering me on. Leah, thanks for all the cromulent proofreading. If anyone finds any typos or awkward sentences, they can address them to Leah Buchan c/o (... just kidding. Leah Buchan is a brilliant woman with lots of well thought-out, practical ideas. Oh yes, and her personal hygiene is above reproach). Thank you Alana for the interpretive dance of Chapter 2. I'm still waiting for the video on YouTube. A big thank you to my aunt, Heather Buchan. Thank you for all your support over the years, and more recently, your B & B. Thank you to everyone else who has been along for the ride and had to listen to me regale them with tales of thesis woe and/or research, and yet still picked up the phone when they saw my name on caller ID. So, thank you Sarah Sjoholm, Jolene Lamoureux, Amber Williamson, Ryan Lake and Sau-Ling Hum for listening. While you will never read this, nor have any idea what any of these words even mean, thanks Andie, Snickers and Sam for all your hugs. Oh, and thank you, Piled Higher and Deeper, a.k.a. PhD Comics, for keeping me sane. Finally, a huge thank you to my significant other, Fred Kroon. Thanks for helping

me with much of the data analysis in some way or another, with some of the figures, and with the formatting of the thesis (without your help, the page numbers would still be in the wrong place on the page, and all of the appendices would still be labelled 'A'). Thank you for introducing me to LaTeX, and Python, and Linux, since they all wound up playing a huge role in this thesis. Thank you for listening, thank you for being there, and thank you for believing in me when I wasn't able to believe in myself. Take that, space coyote!

Table of Contents

Abstract
Co-Authorship
Acknowledgments
Table of Contents
List of Tables
List of Figures

Chapter 1: Introduction
    1.1 Is the fusion of visual and auditory speech automatic?
    1.2 Overview
    1.3 Ethics
    1.4 Published material

Chapter 2: Cognitive Load
    2.1 Experiment 1a
    2.2 Experiment 1b

Chapter 3: Cognitive Load and Temporal Offsets
    Abstract
    Introduction
    Methods
    Results
    Discussion

Chapter 4: Selective Attention
    Abstract
    Introduction
    Methods
    Results
    Discussion

Chapter 5: Gaze Centralization
    Experiment

Chapter 6: Distractor Faces
    Experiment
    Experiment 6a
    Experiment 6b
    Experiment 6c

Chapter 7: General Discussion
    Summary
    Discussion
    Some limitations of the current research
    Some considerations for future research
    Conclusion

Appendix A: Ethics board approval letters
    A.1 General Research Ethics Board approval letters
    A.2 Completion of Course in Human Research Participant Protection

Appendix B: Pilot Experiment for Chapter 2
    B.1 Pilot Experiment

Appendix C: Additional data for Chapter
    C.1 Gaze on screen by task condition and offset

Appendix D: Appendix to Chapter

Appendix E: Sentence stimuli in Chapter
    E.1 Example sentences for Experiment
    E.2 Target sentences for Experiment

References

List of Tables

4.1 Experiment 3 d′ stimulus/response
Experiment 3a Gaze on lower half of the screen
Experiment 3b Gaze on lower half of the screen
Experiment 4 original experiments
Experiment 6c Distractor words
Experiment 6c Distractor words
C.1 Exp 4a Appendix samples on screen
C.2 Exp 4b Appendix samples on screen

List of Figures

2.1 Experiment 1 symbols
Experiment 1a Speech task
Experiment 1a Symbols task
Experiment 1b Speech task
Experiment 1b Symbols task
Experiment 1b First block only Speech task
Experiment 1b First block only Symbols task
Experiment 2 Stimuli
Experiment 2a Speech task - congruent
Experiment 2a Speech task - incongruent
Experiment 2a Numbers task
Experiment 2b Speech task - congruent
Experiment 2b Speech task - incongruent
Experiment 2b Numbers task
Experiment 2a Gaze centralization
Experiment 2a Gaze on eyes
Experiment 2a Gaze on mouth
Experiment 2b Gaze centralization
3.12 Experiment 2b Gaze on eyes
Experiment 2b Gaze on mouth
Experiment 3a Stimuli
Experiment 3b Stimuli - 16.1 cpf
Experiment 3b Stimuli - 4.3 cpf
Experiment 3a Speech task congruent
Experiment 3a Speech task incongruent
Experiment 3a Controls
Experiment 3a d′ values
Experiment 3b Speech task congruent
Experiment 3b Speech task incongruent
Experiment 3b Controls
Experiment 3b d′ values
Experiment 3a Gaze behavior
Experiment 3b Gaze behavior
Experiment 4a Time Series y axis
Experiment 4b Time Series y axis
Experiment 4c Time Series y - Baseline
Experiment 4c Attend audio vs attend video
Experiment 4c Time Series y - Attend-audio
Experiment 4c Time Series y - Attend-video
Experiment 4d Time Series y - Baseline
Experiment 4d Time Series y - Attend-audio
Experiment 4d Time Series y - Attend-video
6.1 Experiment 5 Stimuli
Experiment 5 Reaction times
Screen image for Experiment 6a
Experiment 6a Speech task
Screen image for Experiment 6b
Experiment 6b Speech task
Experiment 6c Speech task
A.1 Ethics board approval
A.2 Ethics board approval
A.3 Ethics board approval
A.4 Ethics board approval
A.5 Course in research ethics
B.1 Pilot Experiment Speech task
B.2 Pilot Experiment Symbols task
B.3 Pilot Experiment Working memory task
B.4 Pilot Experiment Working memory task with controls
D.1 Experiment 4a Time Series x axis
D.2 Experiment 4b Time Series x axis
D.3 Experiment 4c Time Series x - Baseline
D.4 Experiment 4c Time Series x - Attend-audio
D.5 Experiment 4c Time Series x - Attend-video
D.6 Experiment 4d Time Series x - Baseline
D.7 Experiment 4d Time Series x - Attend-audio
D.8 Experiment 4d Time Series x - Attend-video

Chapter 1

Introduction

Most events that we encounter in everyday life provide our different senses with correlated information, and audiovisual speech perception is a familiar instance of multisensory integration. Visual speech information can increase the intelligibility of acoustically degraded speech. In acoustically noisy environments, the intelligibility of acoustic speech can be enhanced by allowing people to see a talker articulating (Sumby & Pollack, 1954; O'Neill, 1954; Neely, 1956; Erber, 1969; Ross, Saint-Amour, Leavitt, Javitt, & Foxe, 2007). Visual information can also influence the perception of perceptually ambiguous replicas of natural speech. Visual speech information has been shown to increase the intelligibility of sine wave speech (Remez, Fellowes, Pisoni, Goh, & Rubin, 1998), and to aid in the learning of noise vocoded speech (Pilling & Thomas, 2011). In addition to aiding the perception of speech in acoustically degraded environments, visual information can also modify the perception of perfectly audible speech. Visual speech information has been shown to increase the intelligibility of difficult-to-understand clear speech (Reisberg, McLean, & Goldfield, 1987), and presenting conflicting visual information can influence the perception of

auditory speech (McGurk & MacDonald, 1976; Summerfield & McGrath, 1984). This conflicting auditory and visual speech can be integrated, causing an illusory percept of a sound not present in the actual acoustic speech (i.e., the McGurk effect). The extent to which the integration of audiovisual speech information is automatic, or can be modified with attentional manipulations, is currently a topic of debate (Navarra, Alsius, Soto-Faraco, & Spence, 2010).

1.1 Is the fusion of visual and auditory speech automatic?

Evidence that the integration is automatic

Several lines of evidence suggest that the integration of auditory and visual speech information may be largely obligatory and automatic. For example, McGurk and MacDonald (1976) observed that the experimenters themselves still experienced the McGurk illusion despite being well aware of the dubbing process used to create it, and a later study by Liberman (1982) showed that awareness of the dubbing process used to create the McGurk stimuli does not seem to have much of an effect on the illusion. The McGurk effect seems to be robust across a wide range of situations, and seems to occur regardless of the knowledge that the listener may have about the stimuli. For example, the McGurk effect even occurs when the auditory and visual speech information streams come from talkers of different genders (Green, Kuhl, Meltzoff, & Stevens, 1991; Walker, Bruce, & O'Malley, 1995), suggesting that specific matches between faces and voices are not critical for the visual speech information to influence the perception of the acoustic speech information. Prelinguistic infants have

also been shown to be susceptible to the McGurk effect (Rosenblum, Schmuckler, & Johnson, 1997; Burnham & Dodd, 2004). The McGurk effect can even occur when subjects are not aware that they are looking at a face, as shown in a study using point-light displays of talking faces (Rosenblum & Saldaña, 1996). There also doesn't seem to be a reaction time cost in processing the illusory McGurk percept as compared to the actual speech token in a speeded classification task (Soto-Faraco, Navarra, & Alsius, 2004). Soto-Faraco et al. (2004) used a syllabic interference task that involved classifying the first syllable of a bisyllabic non-word in which the second (irrelevant) syllable varied from trial to trial. When the second syllable varies from trial to trial, response latencies to the first syllable have been shown to increase. Soto-Faraco et al. (2004) used the McGurk effect to both produce and eliminate this effect using bimodal speech stimuli. In one experiment they used stimuli for the second syllable that were acoustically identical, yet were perceived differently. One stimulus was accompanied by congruent visual information, and the other stimulus was accompanied by incongruent visual information (i.e., producing the McGurk effect). In this case, even though the stimuli were acoustically identical, they were perceived as being different and showed increased response latencies to the first syllable. In a second experiment they used stimuli that were acoustically different yet were perceived as being the same. Again, one stimulus was accompanied by congruent visual information and the other stimulus was accompanied by incongruent visual information to produce the McGurk effect. In this case, even though the stimuli were acoustically different, they were perceived as being the same and therefore did not increase response latencies. Soto-Faraco et al. (2004) showed that the actual speech token or the McGurk percept can equally provide a benefit or interfere with a concurrent

syllable categorization task. This benefit or interference depends on the perceived syllable and not the actual auditory syllable, suggesting that the integration of the visual information with the auditory information takes place even if it is costly to the task. Several studies that have used event-related potential (ERP) electroencephalography (EEG) and event-related magnetoencephalography (MEG) to examine the integration of auditory and visual speech information have also suggested that this integration is likely automatic and occurs without attention. Several of these studies use the well-known electrophysiological component, the mismatch negativity (MMN) (Colin, Radeau, Soquet, & Deltenre, 2004; Colin et al., 2002; Kislyuk, Möttönen, & Sams, 2008), or its magnetic counterpart (MMNm) (Sams et al., 1991), which is elicited by an infrequent, discriminable change (the 'oddball') in a repetitive aspect (the 'standard') of an auditory stimulus (Näätänen, 1999). MMNs can be generated by differences in the physical aspects of a sound, for example duration, intensity, or frequency. While MMNs are often shown for spectrally simple stimuli such as sine waves, they have also been shown for changes between standard and deviant stimuli in spectrally complex stimuli such as phonemes (Näätänen, 1999). Most importantly, MMNs can be elicited in the seeming absence of attention; for example, the MMN is still seen when subjects are engaged in a secondary task such as reading (Näätänen, 1991), and in patients who are comatose (Näätänen, 2000). Because of this, MMNs are taken as evidence that attention is not required for the process that elicits them. These MMN studies examining the integration of auditory and visual speech information (Sams et al., 1991; Colin et al., 2004, 2002; Saint-Amour, Sanctis, Molholm, Ritter, & Foxe, 2007; Kislyuk et al., 2008) have shown that the illusory acoustic percept produced

by the McGurk effect produces the same early response as the actual acoustic token for that percept. Thus, the visual stimuli in these experiments have been shown to influence how the sounds are acoustically categorized. Since the MMN has been shown to occur in the absence of attention, and since similar MMNs were elicited by the McGurk percept as by the actual acoustic token, the MMN research suggests that this integration can occur in the absence of attention.

Cognitive influences on audiovisual integration

While there is considerable evidence suggesting that the integration of auditory and visual speech information is automatic, cognitive influences have been shown in the perception of audiovisual speech. For example, knowledge of the correspondence of the auditory and visual information has been shown to influence the perception of the McGurk effect. While auditory and visual speech information can still be integrated when the information streams come from talkers of different genders (Green et al., 1991; Walker et al., 1995), the susceptibility to this effect decreases with talker familiarity (Walker et al., 1995). Additionally, mismatching talker genders can influence temporal order judgments, making it easier to judge which modality has been presented first when talkers are mismatched (Vatakis & Spence, 2007; Vatakis, Ghazanfar, & Spence, 2008). The integration of ambiguous auditory or visual speech information can be dependent on whether the ambiguous stimuli are perceived as speech. That is, the perception of the ambiguous stimuli as talking faces or voices can influence the strength of the McGurk effect. For example, Munhall, ten Hove, Brammer, and Paré (2009) have shown that the perception of the McGurk effect is related to the perception of the

bistable stimulus. They used a dynamic version of Rubin's vase-face illusion in which the vase turns and the faces speak a visual vowel-consonant-vowel syllable that is different from the acoustic vowel-consonant-vowel syllable. With the stimuli held constant, significantly more McGurk responses were reported when the faces were the figure percept than when the vase was the figure percept. Perceptually ambiguous replicas of natural speech (sine wave speech; see Remez, Rubin, Pisoni, & Carrell, 1981) have also been shown to be influenced by visual speech information (Tuomainen, Andersen, Tiippana, & Sams, 2005). This study showed that visual speech information can influence the perception of sine wave speech (producing a McGurk effect) when participants perceive the audio as speech, but that perception is only negligibly influenced by the visual information when the audio is not perceived as speech.

Attentional modulation of audiovisual speech perception

Attentional manipulations have been shown to influence the McGurk effect. Directing visual attention has been shown to modulate the influence of the visual speech information on the perception of auditory information in the McGurk effect (Tiippana, Andersen, & Sams, 2004; Andersen, Tiippana, Laarni, Kojo, & Sams, 2009). Performing demanding visual, auditory (Alsius, Navarra, Campbell, & Soto-Faraco, 2005) and tactile (Alsius, Navarra, & Soto-Faraco, 2007) tasks at the same time as a speech task has also been shown to modulate the influence of the visual speech information in the McGurk effect. Directing visual attention to either a face or a concurrently presented leaf on a screen (Tiippana et al., 2004) has been shown to change the influence of the visual information on the McGurk effect. The stimuli consisted of consonant-vowel-consonant

stimuli designed to elicit the McGurk effect. During each trial, as an utterance was spoken, a semi-transparent leaf floated in front of the talker's face, near the mouth, without obscuring it. The stimuli were the same in each attention condition; only the instructions differed by condition. In one condition the instructions were to attend to the face, and in the other, the instructions were to attend to the leaf. They found fewer McGurk responses (i.e., responses differing from the auditory syllable) to incongruent stimuli when subjects were attending to the leaf instead of the face. The manipulation of visual spatial attention using a Posner cueing paradigm to one of two simultaneously presented talkers has also been shown to change the influence of the visual information on the McGurk effect. Andersen et al. (2009) presented talkers on either side of a fixation cross, and an arrow was used to indicate which side of the screen the subject was to attend. Although the effect was rather modest, the attended talker had a greater influence on the perception of the auditory syllable than the unattended talker. Performing demanding tasks at the same time as a speech task has also been shown to modulate the influence of the visual speech information in the McGurk effect. In Alsius et al. (2005) subjects had to perform a McGurk task either alone or with a secondary task. In both the single and dual task conditions the stimuli were the same. The stimuli for the secondary task were either superimposed over the video (for a concurrently presented visual task in one experiment), or over the audio (for a concurrently presented auditory task in a second experiment). In the single task condition, subjects just had to respond to the speech task. In the dual task condition subjects had to respond to both the speech task and the secondary task. There were fewer McGurk responses in the dual task conditions than in the

single task conditions, suggesting less audiovisual integration. However, it appears as though the stimuli for the secondary task may have been masking some of the speech information. While attempts were made to avoid direct masking of the stimuli, it is clear that the superimposed line drawings and everyday sounds were influencing the perception of the McGurk effect in the single task condition. Ideally, in the single task condition, where subjects just had to respond to the speech task, there should have been no difference in performance between the speech stimuli with the superimposed visual stimuli and the superimposed auditory stimuli. Unfortunately this was not the case. There was a large difference in the single task conditions depending on the superimposed stimuli for the secondary task. The number of McGurk responses in the single task decreased from 81% with the superimposed auditory stimuli to 33% with the superimposed visual stimuli. The number of auditory responses in the single task decreased from 61% with the superimposed visual stimuli to only 16% with the superimposed auditory stimuli. Even though there was an overall effect of performing in the single or the dual task condition, the competing information for the secondary task was having a profound effect even when it was irrelevant to the task. Alsius et al. (2007) performed a follow-up experiment to Alsius et al. (2005) in which they used a secondary tactile task to avoid competing information in the auditory and visual modalities. There was still an effect of having the subjects perform a single versus a dual task, but this difference was much less pronounced than in Alsius et al. (2005). For instance, the difference between the McGurk responses in the single and dual task conditions fell from an average difference of 24% fewer McGurk responses (Alsius et al., 2005) to an average difference of 11% (Alsius et al., 2007).

1.2 Overview

Several approaches will be used to further examine the role of cognitive factors in audiovisual speech perception. The main focus of this thesis will be to examine the influences of cognitive load and selective attention on audiovisual speech perception, as well as the integration of auditory and visual information in talking distractor faces. The influence of cognitive factors on the temporal integration of auditory and visual speech, and on gaze behaviour during audiovisual speech, will also be addressed.

The influence of cognitive load on audiovisual speech perception

The above studies suggesting that attentional resources play a role in the integration of auditory and visual speech information (e.g., Tiippana et al., 2004; Alsius et al., 2005, 2007; Andersen et al., 2009) all had concurrent competing perceptual information that had to be either monitored or ignored during the speech task. These studies suggest that challenging attentional resources to selectively attend to or ignore competing perceptual information can influence how audiovisual speech information is perceived. On the other hand, none of the above studies suggesting that the integration of auditory and visual information is automatic had competing perceptual information (e.g., Soto-Faraco et al., 2004; Sams et al., 1991; Colin et al., 2004, 2002; Saint-Amour et al., 2007; Kislyuk et al., 2008). It remains unclear to what extent attentional influences may be a result of interference with gathering the perceptual speech information. Experiments presented in Chapters 2 and 3 will

examine the influence of adding a cognitive task without competing perceptual information on audiovisual speech perception. This will be done by having participants perform a concurrent working memory task as a cognitive load task. The cognitive demands of the two tasks overlap, but the stimuli for each task are not presented at the same time; the perceptual information for the speech task is presented without other experimentally relevant competing perceptual information.

The influence of selective attention to auditory and visual speech information

Experiments presented in Chapter 4 will address the extent to which it is possible to break the integration of audiovisual speech information by selectively attending to either the auditory or the visual information. Research with non-speech stimuli has shown that selective attention to either the auditory or visual modality can attenuate multisensory integration (Mozolic, Hugenschmidt, Peiffer, & Laurienti, 2008). On the other hand, some earlier research on selective attention with speech stimuli (with only 6 subjects) examined the integration of videos of a talker paired with synthetic speech (Massaro, 1987). This work suggests that it may be easier to pay attention to the visual speech information than to the auditory speech information. However, the results of each attentional condition were not directly compared with one another. The experiments in Chapter 4 will make direct comparisons between the attention conditions. Additionally, the experiments will examine whether the effect of attentional instructions can be influenced by weakening the influence of the visual speech information.

The influence of cognitive factors on temporal integration

The integration of auditory and visual speech information occurs not just for synchronous speech stimuli, but over a range of temporal asynchronies. The tolerance for asynchrony, or synchrony window, tends to be asymmetric, with a greater tolerance for visual stimuli leading the auditory stimuli rather than vice versa, and the integration of auditory and visual information tends to decrease as the amount of asynchrony is increased (Conrey & Pisoni, 2006; Munhall, Gribble, Sacco, & Ward, 1996; van Wassenhove, Grant, & Poeppel, 2007; Grant, van Wassenhove, & Poeppel, 2004; Dixon & Spitz, 1980). It is not yet known whether taxing cognitive resources can influence the synchrony window for speech. Experiments in Chapters 2 and 3 will examine this by looking at the influence of a concurrent working memory task at various temporal offsets between the auditory and visual speech. It is also not yet known whether selectively attending to the auditory or visual speech information can be influenced by the temporal congruency of the auditory and visual information. An experiment in Chapter 4 will examine whether increasing the temporal offset between the auditory and visual speech stimuli makes it easier to selectively attend to either the auditory or the visual speech information.

Gaze behaviour during audiovisual speech perception

Gaze behaviour during audiovisual speech perception will be examined in Chapters 3, 4 and 5. Previous research looking at gaze behaviour during audiovisual speech perception found that gaze tended to become more centralized on the face, clustering

around the nose, when moderate acoustic noise was added to a sentence comprehension task, as compared to when participants heard the sentences without noise (Buchan, Paré, & Munhall, 2007, 2008). Experiments in Chapter 3 will examine whether this centralization can be explained by an increase in cognitive load. The influence of task instruction on gaze behaviour will be examined in experiments in Chapter 4. Task instructions have previously been shown to modify gaze behaviour in audiovisual speech tasks (Lansing & McConkie, 1999; Buchan et al., 2007). Chapter 4 will examine whether different strategies are used to gather visual information depending on whether subjects are trying to selectively attend to the auditory or the visual information. Chapter 5 will examine the influence of congruent and incongruent trials, and of the temporal offset between the auditory and visual speech information, on gaze behaviour.

The integration of auditory and visual information in talking distractor faces

Chapter 6 will focus on the integration of auditory and visual information in distracting faces. Language experience has been shown to influence the perception of audiovisual speech (Navarra, Alsius, Velasco, Soto-Faraco, & Spence, 2010). The question of whether a cognitive factor, in this case knowledge of a language, can influence the binding of auditory speech information to a target face among several distractor faces will be addressed in an experiment. Other experiments in Chapter 6 will further explore the integration of auditory and visual speech information in distractor faces. This will be done by examining the extent to which the auditory and visual speech information of distractors is integrated, by looking at the influence

of audiovisual speech distractors on the processing of target audiovisual speech.

1.3 Ethics

All procedures were approved by Queen's University's General Research Ethics Board. Board approval letters, and a certificate of completion for Queen's Course in Human Research Participant Protection, are included in Appendix A.

1.4 Published material

Everything in Chapter 4 (except the paragraph above the abstract) has been accepted for publication. Everything in Chapter 3 has been accepted for publication pending minor revisions, which have been submitted. Everything in those two chapters represents the versions of the manuscripts that have either been accepted, or accepted pending minor changes. This includes the data, analyses, figures, etc. (The numbering for the figures is different. In some cases the figure captions are slightly different, but only because the published versions contain, for example, Figures 3a, 3b and 3c with only one caption for Figure 3, whereas in the thesis each figure has its own caption. In these cases, the captions are essentially identical, but have been broken apart to refer to the separate figures.) Chapter 3 has been sent to Seeing and Perceiving, and Chapter 4 has been sent to Perception. Some additional data for Chapter 4 is included in an appendix mentioned at the beginning of the chapter.

Chapter 3: Buchan, J.N., & Munhall, K.G. The Effect of a Concurrent Cognitive Load Task and Temporal Offsets on the Integration of Auditory and Visual Speech

Information. Revisions submitted to Seeing and Perceiving, Special Issue on Multisensory Integration, September 29.

Chapter 4: Buchan, J.N., & Munhall, K.G. The Influence of Selective Attention to Auditory and Visual Speech on the Integration of Audiovisual Speech Information. Perception. Manuscript accepted for publication September 04.

Chapter 2

The influence of cognitive load on audiovisual speech integration

Studies suggesting that attentional resources play a role in the integration of auditory and visual speech information have all had concurrent competing perceptual information that had to be either monitored or ignored during the speech task. For example, paying attention to concurrent irrelevant visual, auditory (Alsius et al., 2005) and tactile (Alsius et al., 2007) stimuli as a secondary task has been shown to reduce the amount of audiovisual integration in a McGurk task. A reduction in audiovisual integration has also been shown by getting participants to direct visual attention to either a face or a concurrently presented leaf on a screen (Tiippana et al., 2004). Directing visual spatial attention to one of two simultaneously presented talkers (Andersen et al., 2009) has also been shown to influence the perception of the McGurk effect. However, what is not yet understood is the extent to which these attentional influences are due to interference with gathering the perceptual speech information,

possibly due to attentional capture by information from the competing non-speech task. For instance, in Alsius et al. (2005), subjects had to perform a McGurk task with a secondary task. This secondary task was either a concurrently presented visual task (in one experiment), or a concurrently presented auditory task (in a second experiment). There was both a single and a dual task condition. In the single task condition, subjects just had to respond to the speech task. In the dual task condition subjects had to respond to both the speech task and the secondary task. There were fewer McGurk responses in the dual task conditions than in the single task conditions, suggesting less audiovisual integration. However, there is reason to suspect that the stimuli for the secondary task (which were present in both the single and dual task conditions) had a larger influence on audiovisual integration than the actual secondary task. In the single task condition, using the same speech stimuli, with the secondary visual stimuli there were 33% McGurk responses, whereas with the secondary auditory stimuli there were 81% McGurk responses. Thus, there were more McGurk responses when there was competing auditory information than when there was competing visual information. This suggests that the speech stimuli may have been somewhat masked by the secondary task stimuli, which could have been due to something akin to either energetic or informational masking (Leek, Brown, & Dorman, 1991; Brungart, 2001) of either the auditory or visual speech stimuli. Alsius et al. (2007) performed a follow-up experiment to Alsius et al. (2005) in which they used a secondary tactile task to avoid competing information in the auditory and visual modalities. There was still an effect of having the subjects perform a single versus a dual task, but the effect was not as pronounced, falling from an average difference of 24% fewer McGurk responses in the dual task condition in Alsius et al.

(2005) to 11% in Alsius et al. (2007). This difference suggests that the effect of the dual task in Alsius et al. (2005) was likely modulated by the competing stimuli in the auditory and visual modalities. A second consideration is the fact that both tasks overlapped in time, and thus the (verbal) responses to the speech task and the (button press) responses to the secondary task could also have overlapped, depending on when the target was presented in the speech task. The manipulation of visual spatial attention using a Posner cueing paradigm to one of two simultaneously presented talkers may also be influenced by perceptual load. Andersen et al. (2009) included both bilateral trials (with two simultaneously presented visual talkers paired with an auditory syllable) and unilateral trials (with only one visual talker paired with an auditory syllable) in their study. The talkers were presented to the left and right of a fixation cross, and an arrow was used to indicate to which side of the screen the subject was to attend. Depending on the specific combination of stimuli, the difference between the bilateral and unilateral presentation conditions varied from a few percent to as much as 31%. There was no statistically significant main effect of bilateral versus unilateral presentation. However, there were some significant interactions of the bilateral versus unilateral presentation conditions with the attended talker, showing that, generally, the instructions to attend to a face were less influential when two faces were present at the same time (bilateral) than when only one face was present (unilateral). The influences of perceptual load and cognitive load have been shown to have different effects on visual distractor processing (Lavie, 2005). It is not entirely clear exactly how these two types of load would interact in a speech task. There are not necessarily any distractors in these tasks, and when there are distractors, they

have not been audiovisual talkers. Additionally, the stimuli are multisensory, and the findings on visual distractor processing may not generalize to multisensory distractors. For example, work by Santangelo and Spence (2007) manipulated perceptual load using both unimodal and audiovisual cues. In this work, visual and audiovisual cues are all effective at capturing visual attention in the absence of a perceptual load, which would follow from Lavie (2005). Auditory and visual cues are not effective at capturing attention under high perceptual load, which again would follow from Lavie (2005). However, unlike the unimodal cues, the audiovisual cues used in Santangelo and Spence (2007) were effective at capturing attention in high perceptual load conditions, which would not have been directly predicted based on the work mentioned in Lavie (2005). Nevertheless, the concepts of perceptual and cognitive load are still useful in an audiovisual context. It seems as though some of the effects of task on audiovisual integration may result from an interaction between perceptual and cognitive load. Links have been shown between attention and working memory (Awh & Jonides, 2001), and the cognitive load tasks used in Lavie (2005) are working memory tasks. The experiments presented in this chapter will look at the effect of cognitive load on the integration of auditory and visual speech information. In the experiments presented in this chapter (and in Chapter 3), the cognitive demands of the concurrent working memory task and the speech task are overlapped, but the stimuli for each task are not presented at the same time. The to-be-remembered stimuli for the working memory task are presented before the speech task, and the response to the working memory task is asked for after the response to the speech task. The working memory tasks in the experiments in this chapter were chosen to be fairly easy, since the results of the study by Tiippana et al. (2004) with directed

attention to a slightly pixelated drawing of a leaf suggested that this integration should be relatively easy to break. Many working memory tasks also involve a verbal component, and it is possible that verbal rehearsal could interfere with the perception of speech. The working memory tasks in this chapter were also chosen because there was no clear verbal rehearsal strategy available to participants. While this certainly doesn't rule out verbal rehearsal, the lack of a pre-associated word with the shape or sound, and the relatively fast presentation speed of the working memory stimuli, should minimize verbal rehearsal during the speech task in Experiments 1a and 1b.

2.1 Experiment 1a

Experiment 1a looked at the effect of a working memory task on both audio-only and audiovisual speech using a speech-in-noise task. There were two speech conditions, an audio-only and an audiovisual condition. Two levels of the working memory task were chosen, 1 symbol and 7 symbols. In Experiment 1a the symbols were chosen from a set of 2781 randomly generated shapes.

Methods

Participants

There were 32 subjects (22 females) with a mean age of years (18-35). All subjects were native speakers of English and reported normal or corrected-to-normal vision, and no speech or hearing difficulties.

Stimuli

Speech task

Forty-eight Central Institute for the Deaf (CID) Everyday Sentences were used from lists A through F (Davis & Silverman, 1970). The stimuli were filmed using digital audio and video recording equipment, and edited into clips in Final Cut Pro. The audio level was normalized using custom MATLAB software. The intelligibility of the speech was decreased by adding a commercial twenty-talker noise babble (Auditec, St. Louis, MO). The relative levels of the talker's voice and the noise babble were chosen based on performance in a pilot experiment (see Appendix B for further details). Loose key word scoring (Bench & Bamford, 1979) was used to score the standard CID key words. The sentences were grouped into four approximately equally difficult groups based on the scores for each sentence across all subjects in the pilot experiment (see Appendix B). For each subject, two of the sentence groups were presented audio-only and the other two were presented audiovisually. The overall volume of the speech and noise played together was approximately 60 dB(A).

Symbols task

There were 2781 shapes used in this experiment (see Figure 2.1 for examples). Participants were presented with either 1 or 7 symbols (depending on the condition) at the beginning of each trial. The symbols were presented sequentially, each at the centre of the screen for 550 ms. At the end of the trial subjects were presented with one symbol, and had to respond either 'Yes' or 'No' on two marked keys on the keyboard, indicating whether or not they had seen that symbol in the previous set.
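To make the noise-mixing step concrete: the thesis states only that audio levels were normalized and mixed with the twenty-talker babble using custom MATLAB software, at relative levels chosen from the pilot experiment. The sketch below is an illustrative Python/NumPy reconstruction under those assumptions; the function names, the placeholder waveforms, and the example signal-to-noise ratio are hypothetical and are not taken from the thesis.

```python
import numpy as np

def rms(x):
    """Root-mean-square level of a waveform."""
    return np.sqrt(np.mean(np.square(x)))

def mix_at_snr(speech, babble, snr_db):
    """Scale the babble so the speech-to-babble RMS ratio equals snr_db (in dB),
    add the two signals, and rescale the mixture to avoid clipping."""
    babble = babble[:len(speech)]  # trim the babble to the sentence length
    gain = rms(speech) / (rms(babble) * 10 ** (snr_db / 20.0))
    mix = speech + gain * babble
    return mix / np.max(np.abs(mix))

# Hypothetical usage with placeholder waveforms (stand-ins for a recorded CID
# sentence and the commercial twenty-talker babble), sampled at 44.1 kHz.
fs = 44100
speech = 0.1 * np.random.randn(2 * fs)
babble = 0.1 * np.random.randn(3 * fs)
mixture = mix_at_snr(speech, babble, snr_db=-6.0)  # the SNR value is illustrative only
```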

Figure 2.1: Here are 12 of the 2781 symbols that were used in Experiments 1a and 1b.
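As a concrete illustration of the symbols task described above, the sketch below builds a single trial: a study set of 1 or 7 symbol IDs drawn from the pool of 2781 shapes, each nominally shown for 550 ms, followed by a yes/no recognition probe. This is only an assumed reconstruction of the trial logic; the presentation and response-collection code, the probability of an 'old' probe, and any constraints on symbol re-use across trials are not specified in the thesis.

```python
import random

N_SYMBOLS = 2781      # size of the randomly generated shape pool
STUDY_MS = 550        # nominal on-screen duration of each study symbol

def make_symbols_trial(set_size, p_old=0.5, rng=random):
    """Build one working-memory trial: a study set of `set_size` symbol IDs and
    a probe that comes from the study set with probability `p_old`."""
    study = rng.sample(range(N_SYMBOLS), set_size)
    probe_is_old = rng.random() < p_old
    if probe_is_old:
        probe = rng.choice(study)
    else:
        probe = rng.choice([s for s in range(N_SYMBOLS) if s not in study])
    return {"study": study, "probe": probe, "probe_is_old": probe_is_old}

def score_probe_response(trial, said_yes):
    """A 'Yes' response is correct only when the probe was in the study set."""
    return said_yes == trial["probe_is_old"]

trial = make_symbols_trial(set_size=7)   # 1 or 7, depending on the condition
```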

Experimental procedures

The experiments took place in a single-walled sound booth and participants were seated approximately 57 cm away from a 22-inch flat CRT monitor (ViewSonic P220f). The audio signal was played from speakers (Paradigm Reference Studio/20) positioned on either side of the monitor. The working memory tasks and speech tasks were blocked and counterbalanced (see Design and Analysis for details). The speech task was sandwiched within the working memory task: the to-be-remembered stimuli were presented first, followed by the speech task. Participants had to respond verbally with the words that they heard the talker say. When they were finished, they hit a key that would display a symbol or sound, and they had to respond on the keyboard whether the symbol or sound was in the set presented at the beginning of the trial. Participants were told that their primary task was the symbols task, and were asked to be as accurate as they could be on both tasks.

Design and Analysis

Experiment 1a had a within-subjects design. There were two speech conditions, an audio-only and an audiovisual condition, and two working memory levels, either 1 or 7 symbols. The experiment was split into 4 blocks, and one sentence group was used per block. Each of the speech levels (audio-only and audiovisual speech) was paired with each working memory level (1 or 7 symbols), creating four different conditions (audio-only with 1 symbol, audio-only with 7 symbols, audiovisual with 1 symbol and audiovisual with 7 symbols). The sentence groups, speech levels, and working memory levels were all counterbalanced so that across the entire experiment each word list appeared an equal number of times in each of the four block positions, each

sentence group was presented the same number of times audio-only and audiovisually, and each sentence group was paired with each symbol condition an equal number of times. Additionally, each speech type and working memory level appeared in each block position an equal number of times. A 2 × 2 (audio-only vs audiovisual speech task × 1 vs 7 symbols working memory task) within-subjects repeated measures ANOVA was used to analyze the data.

Results and Discussion

For the speech task, as expected, performance in the audiovisual condition was significantly higher than performance in the audio-only condition [F(1, 31) = , p < .001]; see Figure 2.2. There was no significant effect of the number of symbols in the working memory task (p > .05), nor was there a significant interaction between the speech type (audio-only versus audiovisual) and the symbols level (1 versus 7). For the symbols task, performance on the 1 symbol task was significantly higher than performance on the 7 symbols task [F(1, 31) = 63.23, p < .001]. Interestingly, there was an influence of the speech type on performance of the symbols task [F(1, 31) = 31.87, p < .001]. Performance on the 7 symbols task was somewhat worse in the audio-only condition as compared to the audiovisual condition [F(1, 31) = 11.36, p < .001]. See Figure 2.3. There are several potential explanations for the lack of effect of the working memory task on either the audiovisual performance or the difference in performance between the audio-only and audiovisual conditions. First, the tasks chosen may not have been difficult enough.

Figure 2.2: This shows performance on the speech task (percent key words correct) for Experiment 1a for the audio-only and audiovisual conditions, by symbols condition. The error bars indicate standard error of the mean.
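For readers who want to see the shape of the analysis, the following sketch runs the 2 × 2 (speech type × working-memory load) within-subjects repeated measures ANOVA described in the Design and Analysis section, using statsmodels. The data frame here is filled with randomly generated placeholder scores that only stand in for the real percent-key-words-correct data, and the column names are assumptions.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(0)
rows = []
for subject in range(32):                       # 32 subjects, as in Experiment 1a
    for speech in ("audio-only", "audiovisual"):
        for load in ("1 symbol", "7 symbols"):
            base = 40.0 if speech == "audio-only" else 70.0   # placeholder means
            rows.append({"subject": subject, "speech": speech, "load": load,
                         "pct_correct": base + rng.normal(0.0, 5.0)})
data = pd.DataFrame(rows)

# Two-way repeated measures ANOVA: main effects of speech type and load,
# plus their interaction, each tested against its subject-by-factor error term.
result = AnovaRM(data, depvar="pct_correct", subject="subject",
                 within=["speech", "load"]).fit()
print(result)
```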

Figure 2.3: This shows performance on the symbols task for Experiment 1a by speech task condition. The error bars indicate standard error of the mean.

The speech task was quite difficult, and in a pilot experiment using an easier version of the speech task the working memory task also had no effect, so it seems unlikely that the difficulty of the speech task is responsible for the lack of effect. The working memory task was not all that difficult, although this will be explored in Chapter 3 in Experiments 2a and 2b, where a more difficult working memory task is used. A second possibility is that the speech task is more salient, or more engaging, than the working memory task. In Experiment 1a, when the speech task was more difficult (i.e., the audio-only condition), performance on the more difficult symbols task (7 symbols) suffered. This could be due to subjects prioritizing the speech task over the symbols task, which seems likely given the decrease in performance seen in the 7 symbols task when it was paired with the audio-only condition. Although both the visual speech benefit to speech-in-noise and the McGurk effect are often cited as evidence of audiovisual speech integration, it is possible that tasks that can disrupt the McGurk effect don't have as noticeable an effect on speech-in-noise tasks. It is

possible that the same symbols task could have an influence on the performance of a McGurk task.

2.2 Experiment 1b

Experiment 1b used the same working memory task as Experiment 1a, but a McGurk task was used instead of a speech-in-noise task. Experiment 1b also weakened the McGurk effect by presenting the stimuli at several audiovisual offsets. The integration of auditory and visual speech information occurs not just for synchronous speech stimuli, but over a range of temporal asynchronies. The tolerance for asynchrony, or synchrony window, tends to be asymmetric, with a greater tolerance for visual stimuli leading the auditory stimuli rather than vice versa, and the integration of auditory and visual information tends to decrease as the amount of asynchrony is increased (Conrey & Pisoni, 2006; Munhall et al., 1996; van Wassenhove et al., 2007; Grant et al., 2004; Dixon & Spitz, 1980). Additionally, the synchrony window for audiovisual speech stimuli tends to be more generous than that observed for non-speech stimuli (see Dixon & Spitz, 1980; Lewkowicz, 1996; Conrey & Pisoni, 2006). Speech information is usually longer in duration and more complex than the non-speech stimuli used (for instance, a hammer hitting a nail (Dixon & Spitz, 1980), or a bouncing green disk (Lewkowicz, 1996)). It is possible that the recruitment of working memory is responsible for this greater tolerance, allowing information to be maintained and integrated over a longer time period. Working memory likely plays a role in speech perception. It is not yet known whether taxing cognitive resources will have an effect on the duration of the synchrony

window for speech. Will adding a secondary working memory task lead to a greater decrease in the influence of the visual information as the offset between the audio and video is increased?

Methods

Participants

There were 19 subjects (13 females) with a mean age of 19.1 years (range 18-30). All subjects were native speakers of English and reported normal or corrected-to-normal vision, and no speech or hearing difficulties.

Stimuli

Only incongruent stimuli created to elicit the McGurk effect were used in this experiment. Congruent stimuli were omitted to keep the experiment under an hour. To create the incongruent stimuli, the auditory syllable /aba/ was dubbed onto videos of the syllables /ada/, /aga/, /atha/ and /ava/. To maintain the timing with the original soundtrack, the approximate acoustic releases of the consonants of the dubbed syllables were aligned to the acoustic releases of the consonants in the original acoustic syllable.

Offsets

Four offsets were chosen, with the video leading the audio by 0, 100, 200 or 300 ms. Video-leading asynchronies were chosen because they tend to be more naturalistic. These offsets were chosen based on results from Conrey and Pisoni (2006), showing that synchronous judgements drop off substantially between 100 and 200 ms offsets, and fall to a very low level by 300 ms. Synchrony judgements do roughly parallel the perceptual integration of speech. For example, Munhall et al. (1996) also showed a small reduction in the McGurk effect between 100 and 200 ms, and a noticeable reduction in the McGurk effect, as shown by an increase in auditory responses, between 200 and 300 ms. The offsets were created using custom MATLAB software. To create the 100 ms, 200 ms and 300 ms offsets, the onset of the syllable was selected and then shifted so that the audio trailed the video by 100, 200 or 300 ms, respectively. The beginning of the audio track was zero padded to make the audio and video of equal duration.
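The padding step can be illustrated with a few lines of NumPy. The thesis says the offsets were created with custom MATLAB software by zero padding the start of the audio track; the sketch below is an assumed Python equivalent, and the sampling rate, the placeholder waveform, and the choice to trim the delayed track back to its original length are illustrative assumptions rather than details taken from the thesis.

```python
import numpy as np

def delay_audio(audio, offset_ms, fs):
    """Zero-pad the start of the audio track so it lags the (unchanged) video by
    offset_ms, then trim the end so the track keeps its original duration."""
    pad_samples = int(round(fs * offset_ms / 1000.0))
    delayed = np.concatenate([np.zeros(pad_samples, dtype=audio.dtype), audio])
    return delayed[:len(audio)]

fs = 48000                                          # assumed sampling rate
audio = np.random.randn(2 * fs).astype(np.float32)  # stand-in for a dubbed /aba/ track
offsets_ms = [0, 100, 200, 300]                     # video-leading offsets used in Experiment 1b
delayed_tracks = {ms: delay_audio(audio, ms, fs) for ms in offsets_ms}
```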

Experimental procedures

The experiments took place in a single-walled sound booth and participants were seated approximately 57 cm away from a 22-inch flat CRT monitor (ViewSonic P220f). The audio signal was played from speakers (Paradigm Reference Studio/20) positioned on either side of the monitor.

Speech task

Participants were instructed to watch the talker the entire time he was on the screen and, when the video was done, to use a key press to select which consonant sound they heard him say. The response choices were B, D, G, TH, V and other.

Secondary working memory task

The working memory task was the same task as that used in Experiment 1a. Subjects were presented with either 1 or 7 symbols at the beginning of each trial. The speech task was then presented. After subjects made their response to the speech task on the keyboard, they were presented with a symbol and had to respond on the keyboard 'yes' or 'no' whether they had seen that

symbol in the set at the beginning of the trial.

Design and Analysis

The experiment was run as a within-subjects design. The working memory condition was blocked, with one block for the 1 symbol condition and another block for the 7 symbols condition. The number of occurrences of each syllable at each offset was the same for each block. Block order was counterbalanced. Performance on the speech task and the symbols task were analyzed separately, each with a 4 × 2 ANOVA (4 speech offsets × 2 symbol set sizes).

Results and Discussion

For the speech task, there was a significant effect of offset [F(1.36, 24.52) = 14.38, p < .001]. There were progressively fewer McGurk responses as the offset between the audio and the video was increased. There was no significant main effect of the working memory task (p > .05), nor was there a significant interaction between offset and the difficulty of the working memory task (number of symbols) (p > .05) on the number of McGurk responses. See Figure 2.4. For the symbols task, not surprisingly, performance was lower on the seven symbols task than on the one symbol task [F(1, 18) = , p < .001]. There was no influence of the offset of the speech stimuli on performance in the symbols task (p > .05), nor was there a significant interaction between the offset in the speech task and performance on the symbols task at the 1 and 7 symbol levels (p > .05). See Figure 2.5.

Figure 2.4: This shows performance on the speech task (proportion of McGurk responses) for Experiment 1b by offset and symbols condition. The error bars indicate standard error of the mean.

Figure 2.5: This shows performance on the symbols task for Experiment 1b by symbol condition and by the offset between the video and the audio that occurred in the speech task. The error bars indicate standard error of the mean.

Figure 2.6: This shows performance on the speech task for just the first block of Experiment 1b by offset and symbols condition. The error bars indicate standard error of the mean.

Although the symbols task did not have an overall influence of reducing the number of McGurk responses in the speech task, the data was further explored to see whether or not a follow-up experiment was in order. The experiment was rather monotonous and repetitive, so for exploratory purposes, only trials from the first blocks were analyzed in the speech task with a mixed design ANOVA. Results from just the first blocks can be seen in Figure 2.6. While the effect of the symbols condition on the McGurk effect is not significant (p > .05), there is a noticeable trend toward a difference in the number of McGurk responses as the offset between the audio and video is increased. Additionally, the symbols task was also analyzed for just the first block. Here, as one can see in Figure 2.7, performance in the 7 symbols condition is much higher than in the overall experiment. In just the first block, performance on the 7 symbols task is not significantly different from performance on the 1 symbol task.

49 CHAPTER 2. COGNITIVE LOAD Proportion Correct symbol 7 symbols Offset between video and audio (ms) in speech task Figure 2.7: This shows performance on the symbols task for just the first block of Experiment 1b by symbol condition and by the offset between the video and the audio that occurred in the speech task. The error bars indicate standard error of the mean. that performing well on the 7 symbols task in the first block could be responsible for the slight trend towards fewer McGurk responses at the 300 ms offset in the 7 versus 1 symbol condition in the first block. Results from analyzing just the first blocks suggest that it may be possible to influence performance on a McGurk task by adding a secondary working memory task, although this task would likely need to be more difficult. Experiments 2a and 2b in Chapter 3 will further explore this. The results from just the first blocks of Experiment 1b suggest that there could be a trade-off between performance on the speech task and performance on the working memory task. To encourage subjects to devote more effort to the working memory task a more difficult working memory 33

The results from just the first blocks of Experiment 1b suggest that there could be a trade-off between performance on the speech task and performance on the working memory task. To encourage subjects to devote more effort to the working memory task, a more difficult working memory task will be used. Experiments 2a and 2b will use a working memory task adapted from de Fockert, Rees, Frith, and Lavie (2001) that involves memorizing the order of eight digits. At the end of the trial subjects were presented with one digit, and they needed to respond with the next digit in the sequence. This task is also expected to interfere with the phonological loop, which will likely play a role in both the speech perception task and the working memory task (Baddeley, 2000). If the two processes share a common resource, then we could expect greater interference when they are competing for that resource.

In Experiment 1b, the McGurk effect did not show much of a reduction with increasing offset, and was not particularly reduced even at the 300 ms offset. It is possible that because all of the stimuli were incongruent, and thus likely not good examples of the consonant category, subjects' responses were biased. Because the McGurk effect was quite strong, there would have been few instances where subjects would have heard a /b/ (the actual acoustic consonant), which perhaps made them less likely to respond /b/. A bias away from responding /b/ could have increased the number of McGurk responses at all offsets. To address this possibility, Experiments 2a and 2b will include at least one congruent syllable that corresponds to the auditory syllable used in the McGurk stimuli. Also, based on the results of Experiment 1b, a greater range of offsets will be used. The offsets between the audio and video chosen for Experiments 2a and 2b are 0 ms, 175 ms and 350 ms. It is expected that the McGurk effect should fall off noticeably at 350 ms.

Lastly, eye movements were not monitored in Alsius et al. (2005, 2007) or Tiippana et al. (2004), nor in Experiments 1a and 1b. It is always possible that subjects could be altering their gaze behaviour (or closing their eyes) during some of the trials. The lack of difference between working memory conditions in Experiments 1a and 1b, and the high number of McGurk responses in Experiment 1b, suggests that subjects were probably adopting similar gaze behaviour in both working memory conditions. However, when the task becomes more difficult in Experiments 2a and 2b, it is possible that subjects may be more tempted to alter their eye movements. To monitor eye movements, gaze will be tracked with an eyetracker.

Chapter 3

The Effect of a Concurrent Cognitive Load Task and Temporal Offsets on the Integration of Audiovisual Speech Information

This chapter has been submitted as a revised paper to Seeing and Perceiving (special issue on multisensory integration). Experiments 2a and 2b in the thesis are referred to as Experiments 1 and 2, respectively, in the paper submitted to Seeing and Perceiving. Also, the figure and table numbers have been modified in the thesis.

3.1 Abstract

Audiovisual speech perception is an everyday occurrence of multisensory integration. Conflicting visual speech information can influence the perception of acoustic speech (namely the McGurk effect), and auditory and visual speech are integrated over a rather wide range of temporal offsets. This research examined whether the addition of a concurrent cognitive load task would affect the audiovisual integration in a McGurk speech task, and whether the cognitive load task would cause more interference at increasing offsets. The amount of integration was measured by the proportion of responses in incongruent trials that did not correspond to the audio (McGurk response). An eyetracker was also used to examine whether the amount of temporal offset and the presence of a concurrent cognitive load task would influence gaze behaviour. Results from this experiment show a very modest but statistically significant decrease in the number of McGurk responses when subjects also perform a cognitive load task, and that this effect is relatively constant across the various temporal offsets. Participants' gaze behaviour was also influenced by the addition of a cognitive load task. Gaze was less centralized on the face, less time was spent looking at the mouth, and more time was spent looking at the eyes when a concurrent cognitive load task was added to the speech task.

3.2 Introduction

Speech perception is an example of how we integrate information from different senses in our everyday lives. It has been known for some time that visual speech information can influence the perception of acoustic speech. The presence of visual speech information in acoustically degraded conditions can increase the intelligibility of acoustic speech (Sumby & Pollack, 1954; Erber, 1969; Ross et al., 2007). Visual speech information can also be perceptually useful when the acoustic information has not been degraded. For example, visual speech information can increase the intelligibility of difficult to understand clear speech (Reisberg et al., 1987). Presenting conflicting visual information can influence the perception of auditory speech (e.g., the McGurk effect; McGurk & MacDonald, 1976; Summerfield & McGrath, 1984). This auditory and visual speech can be integrated, causing an illusory percept of a sound not present in the actual acoustic speech.

The McGurk effect has been used extensively in the literature to study the integration of auditory and visual speech information. Several lines of evidence suggest that this integration may be largely automatic. Young infants seem to be susceptible to the McGurk effect (Rosenblum et al., 1997), and informing participants about the mismatch between the auditory and visual stimuli doesn't seem to influence the McGurk effect (Liberman, 1982). Further, there doesn't seem to be a reaction time cost in terms of processing the illusory McGurk percept as compared to the actual acoustic speech token in a speeded classification task (Soto-Faraco et al., 2004). The actual speech token or McGurk percept can equally provide a benefit or interfere with a concurrent syllable categorization task (using a syllabic interference paradigm) depending on the perceived syllable and not the actual auditory syllable.

On the other hand, attention may play a role in the integration of auditory and visual speech information. For example, paying attention to concurrent irrelevant visual, auditory (Alsius et al., 2005) and tactile (Alsius et al., 2007) stimuli has been shown to reduce the amount of audiovisual integration in a McGurk task. A reduction in audiovisual integration has also been shown by getting participants to direct visual attention to either a face or a concurrently presented leaf on a screen (Tiippana et al., 2004). Directing visual spatial attention to one of two simultaneously presented talkers (Andersen et al., 2009) has also been shown to influence the perception of the McGurk effect.

The above studies suggesting that attentional resources play a role in the integration of auditory and visual speech information all had concurrent competing perceptual information which had to be either monitored or ignored during the speech task. These studies suggest that challenging attentional resources to selectively attend to or ignore competing perceptual information can influence how audiovisual speech information is perceived. However, what is not yet understood is the extent to which these attentional influences are due to interference with gathering the perceptual speech information, possibly due to attentional capture by information from the competing non-speech task. The current paper seeks to extend the findings of the small literature on attentional influences on audiovisual speech perception by having participants perform a concurrent working memory task as a cognitive load task. The cognitive demands of the tasks are overlapped, but the stimuli for each task are not presented at the same time; the perceptual information for the speech task is presented without other experimentally relevant competing perceptual information. A speech task was presented either alone (speech-only condition) or was placed within a cognitive load task (speech-numbers condition). In the speech-numbers condition the numbers to be memorized in the cognitive load task were presented before the speech task, and participants had to make their response to the cognitive load task after the speech task. Additionally, the cognitive load task was also presented alone for comparison (numbers-only condition).

The integration of auditory and visual speech information occurs not just for synchronous speech stimuli, but over a range of temporal asynchronies. The tolerance for asynchrony, or synchrony window, tends to be asymmetric, with a greater tolerance for visual stimuli leading the auditory stimuli rather than vice versa, and the integration of auditory and visual information tends to decrease as the amount of asynchrony is increased (Conrey & Pisoni, 2006; Munhall et al., 1996; van Wassenhove et al., 2007; Grant et al., 2004; Dixon & Spitz, 1980). It is not yet known whether taxing cognitive resources will have an effect on the duration of the synchrony window for speech. It is possible that this synchrony window will become narrower as cognitive resources are taxed, causing less integration to be shown at more extreme offsets. Such a finding would imply that asynchronous integration was more costly in terms of cognitive resources. In this study, the audio and video were either synchronous (0 ms) or asynchronous. Two video-leading asynchronies were chosen where the video led the audio by 175 or 350 ms.

Finally, this research will examine whether the concurrent cognitive load task alters the gaze behaviour used to gather the visual speech information. Previous research looking at gaze behaviour during audiovisual speech perception found that gaze tended to become more centralized on the face, clustering around the nose, when moderate acoustic noise was added to a sentence comprehension task as compared to when participants heard the sentences without noise (Buchan et al., 2007, 2008). This occurred despite the visual stimuli being the same when acoustic noise was present and when it was absent during the speech task. Since the visual stimuli were the same, changes in visual stimulus properties could not be responsible for this change in gaze behaviour (Parkhurst, Law, & Niebur, 2002; Parkhurst & Niebur, 2003). The reason for this change in gaze behaviour with the addition of acoustic noise is not understood. One possibility is that when acoustic noise is present the speech task becomes more cognitively demanding, and it is the cognitive demands of the task driving this gaze centralization on the face.

The centralization of gaze behaviour will be examined by looking at the average distance of the eyetracker samples from the centre of the nose. Eye tracking data will be compared for the speech tasks with and without the concurrent cognitive load task. Eye tracking data will also be compared across the audiovisual offsets to examine whether the offset between the auditory and visual speech stimuli has an effect on gaze behaviour.

3.3 Methods

All procedures were approved by Queen's University's General Research Ethics Board.

Participants

Participants were native English speakers and reported having normal or corrected to normal vision, and no speech or hearing difficulties. Written consent was obtained from each participant. There were 25 participants (22 females) in Experiment 2a with a mean age of 21.4 years (range 19-35), and 25 participants (20 females) in Experiment 2b with a mean age of 20.5 years (range 18-90).

Stimuli

Experiment 2a

In Experiment 2a the syllables /aba/, /ada/, /atha/, /ava/, /ibi/, /idi/, /ithi/ and /ivi/ were used. The syllables /aba/ and /ibi/ were used for the congruent stimuli. The congruent stimuli had the auditory /aba/ paired with the visual /aba/, and the auditory /ibi/ paired with the visual /ibi/. The congruent stimuli were used to ensure that participants were performing the speech task correctly and to serve as a baseline to contrast with the McGurk effect. The incongruent stimuli were created to elicit the McGurk effect by dubbing the auditory syllable /aba/ onto the videos of the syllables /ada/, /atha/ and /ava/, and the auditory syllable /ibi/ onto the videos for /idi/, /ithi/ and /ivi/, using custom MATLAB software. To maintain the timing with the original soundtrack, the approximate acoustic releases of the consonants in the dubbed syllables were aligned to the acoustic releases of the consonants in the original acoustic syllable.

Experiment 2b

In Experiment 2b the syllables /aba/, /ava/, /ibi/ and /ivi/ were used. For congruent stimuli the auditory /aba/, /ava/, /ibi/ and /ivi/ were paired with the visual /aba/, /ava/, /ibi/ and /ivi/, respectively. For incongruent stimuli, an acoustic syllable with the same vowel as the one articulated in the video but a different consonant was dubbed onto the video using custom MATLAB software. That is, the /aba/ and /ava/ syllables were paired with one another, and the /ibi/ and /ivi/ syllables were paired with one another, with each member of a pair serving as either the visual or the auditory token: an auditory /aba/ was paired with a visual /ava/, an auditory /ava/ with a visual /aba/, an auditory /ibi/ with a visual /ivi/, and an auditory /ivi/ with a visual /ibi/. As in Experiment 2a, to maintain the timing with the original soundtrack, the approximate acoustic releases of the consonants of the dubbed syllables were aligned to the acoustic releases of the consonants in the original acoustic syllable.
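The dubbing itself was carried out with custom MATLAB software that is not reproduced in the thesis. As a rough illustration of the alignment step described above, the following Python sketch shifts a to-be-dubbed audio track so that its hand-marked consonant release lines up with the release time measured in the original soundtrack. The function name, variable names and example release times are hypothetical, and in practice the shifted track would also be trimmed or padded at the end to match the video duration.

    import numpy as np

    def align_dubbed_audio(dub_audio, release_dub_s, release_orig_s, sr):
        # Shift the dubbed syllable so its consonant release occurs at the
        # same time as the release in the original soundtrack.
        #   dub_audio      : 1-D array of samples for the syllable being dubbed in
        #   release_dub_s  : hand-marked release time (s) within dub_audio
        #   release_orig_s : hand-marked release time (s) within the original audio
        #   sr             : sampling rate in Hz
        shift = int(round((release_orig_s - release_dub_s) * sr))
        if shift >= 0:
            # Release needs to happen later: pad the front with silence.
            return np.concatenate([np.zeros(shift, dtype=dub_audio.dtype), dub_audio])
        # Release needs to happen earlier: trim samples from the front.
        return dub_audio[-shift:]

    # Hypothetical usage: dubbing an auditory /aba/ onto the video of /ava/.
    # aligned = align_dubbed_audio(aba_audio, release_dub_s=0.412,
    #                              release_orig_s=0.436, sr=48000)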

Temporal offsets

In both experiments the strength of the McGurk effect (as measured by the proportion of responses that do not correspond to the auditory token) was manipulated by varying the temporal offsets of the auditory and visual streams. Video-leading asynchronies were chosen since they tend to be more naturalistic, and since there tends to be a greater asynchrony tolerance for video-leading speech stimuli than for auditory-leading speech. While the influence of the visual information at a 350 ms offset tends not to be as strong as when the audio and video are synchronous, the influence of the video on the auditory token in a McGurk task has been shown in several studies to extend out to rather large video-leading offsets. In many studies, at about a 350 ms offset, around 30% of responses still do not correspond to the auditory token. For example, Jones and Jarick (2006) have shown that a 360 ms offset still produced about 45% non-auditory token responses, and Munhall et al. (1996) have shown that a 360 ms offset produced about 30-40% non-auditory token responses. Grant et al. (2004) and van Wassenhove et al. (2007) have also found that there is still a noticeable influence of the visual token on responses between 333 and 467 ms, with about 30-40% of the responses corresponding to non-auditory token responses.

The offsets were created using custom MATLAB software. To create the 175 ms and 350 ms offsets, the onset of the syllable was selected, and then offset so that the audio trailed the video by either 175 or 350 ms. The beginning of the audio track was zero padded to make the audio and video of equal duration.
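As with the dubbing, the offset stimuli were generated with custom MATLAB software; the sketch below shows the zero-padding logic in its simplest form in Python, with hypothetical names. How the end of the delayed track was handled is not stated in the text, so trimming the tail to preserve the original duration is an assumption here.

    import numpy as np

    def delay_audio(audio, offset_ms, sr):
        # Return an audio track that trails the unchanged video by offset_ms.
        # The front is zero padded; the same number of samples is dropped from
        # the end so the overall duration stays equal to the video (assumption).
        n = int(round(offset_ms / 1000.0 * sr))
        padded = np.concatenate([np.zeros(n, dtype=audio.dtype), audio])
        return padded[:len(audio)]

    # Hypothetical usage for the two video-leading offsets:
    # audio_175 = delay_audio(audio, offset_ms=175, sr=48000)
    # audio_350 = delay_audio(audio, offset_ms=350, sr=48000)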

Experimental task

The experiments were both carried out as within-subjects designs. There were two tasks, a speech task and a numbers task.

Speech task

The speech task involved watching and listening to the talker say a syllable and choosing which consonant sound was heard. In Experiment 2a the response choices were B, D, TH, V and other. In Experiment 2b the response choices were B and V. Participants were informed that they had to wait until the video was finished to respond, and that their key press would not be recorded until after the video was finished playing.

Numbers task

For the numbers task participants were presented with a random set of eight digits from the digits 0-9 at the beginning of each trial. The digits were randomized without replacement so that each digit could only appear once in each sequence. A new set of eight digits was generated every trial. The numbers were presented sequentially. Each number was on the screen for 550 ms. After the last digit, two masker screens with greyscale noise were presented for 550 ms each to reduce an afterimage of the last digit during the presentation of the video. Participants were asked to remember the order of the digits, and at the end of the trial they were presented with one digit from the set and asked to respond on the keyboard with the digit that came after that digit in the series. If the digit happened to be at the end of the series, then participants were to report the number that came at the beginning of the series (this trial logic is sketched below, after the description of the experimental equipment). The digit remained on the screen until participants made their response. Participants were asked to try to be as accurate as they possibly could be on the numbers task, and they were instructed that it might be helpful if they rehearsed the numbers silently to themselves.

Experimental conditions

These two tasks were used to create three experimental conditions: 1) the speech-only condition, where participants were just given the speech task, 2) the numbers-only condition, where participants were just given the numbers task, and 3) the speech-numbers condition, where participants were given the speech task sandwiched within the numbers task. In the speech-numbers condition participants were first presented with the digit series from the numbers task, then the speech stimulus was presented. After the speech stimulus was presented, participants responded to the speech task, then were presented with a digit from the numbers task and made their response to the numbers task.

Experimental equipment

The experiments took place in a single walled sound booth and participants were seated approximately 57 cm away from a 22 in flat CRT computer monitor (ViewSonic P220f). Participants' heads were stabilized with a chin rest. The audio signal was played from speakers (Paradigm Reference Studio/20) positioned on either side of the monitor. Eye position was monitored with an Eyelink II eye tracking system (SR Research, Osgoode, Canada) (see Eyetracking analysis for further details).
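The trial logic of the numbers task described above is simple enough to sketch in a few lines of Python. The sketch only covers sequence generation, probe selection and scoring; presentation timing, masker screens and response collection are omitted, and all names are illustrative rather than taken from the actual experiment code.

    import random

    def make_numbers_trial(n_digits=8):
        # One numbers-task trial: a digit sequence, a probe digit, and the
        # correct "next digit" answer, wrapping to the start of the series
        # when the probe is the last digit.
        sequence = random.sample(range(10), n_digits)   # digits 0-9, no repeats
        probe_index = random.randrange(n_digits)
        probe = sequence[probe_index]
        correct = sequence[(probe_index + 1) % n_digits]
        return sequence, probe, correct

    def score_response(response, correct):
        # 1 if the participant reported the digit that followed the probe.
        return int(response == correct)

    # Hypothetical usage:
    # seq, probe, correct = make_numbers_trial()
    # print(seq, "probe:", probe, "expected answer:", correct)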

Speech task analysis

Congruent and incongruent trials were analyzed separately. Congruent trials were measured using the proportion of trials correct. Incongruent trials were measured using the proportion of trials showing the McGurk effect. Trials in which participants reported hearing a consonant sound other than the one present in the audio file were considered to show the McGurk effect. Participant responses to the speech task were analyzed using a 2 × 3 (task conditions containing a speech task × temporal offsets) repeated measures ANOVA. In instances where there was a violation of sphericity, a Greenhouse-Geisser correction was used. Pairwise comparisons were done with paired samples t-tests, with Bonferroni corrections used for multiple comparisons.

Numbers task analysis

Participant responses to the numbers task were analyzed using an ANOVA. Performance on the numbers task in the numbers-only condition was compared directly with performance on the numbers task for each of the audiovisual offsets in the speech task in the speech-numbers condition. Pairwise comparisons were done with paired samples t-tests, with Bonferroni corrections used for multiple comparisons.
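The analyses were presumably run in a standard statistics package; the following Python sketch shows one way a comparable 2 × 3 repeated measures ANOVA and Bonferroni-corrected pairwise comparisons could be set up on a long-format table of McGurk proportions. The column names are hypothetical, and note that statsmodels' AnovaRM does not apply the Greenhouse-Geisser correction reported in the text, which would require an additional tool.

    from scipy import stats
    from statsmodels.stats.anova import AnovaRM

    def analyze_mcgurk(df):
        # df: a pandas DataFrame with one row per participant x task x offset
        # cell, and columns 'participant', 'task' (speech-only / speech-numbers),
        # 'offset' (0, 175, 350) and 'mcgurk' (proportion of McGurk responses).
        anova = AnovaRM(data=df, depvar='mcgurk', subject='participant',
                        within=['task', 'offset']).fit()
        print(anova)

        # Bonferroni-corrected pairwise comparisons between the offsets.
        offsets = sorted(df['offset'].unique())
        pairs = [(a, b) for i, a in enumerate(offsets) for b in offsets[i + 1:]]
        for a, b in pairs:
            x = df[df['offset'] == a].groupby('participant')['mcgurk'].mean()
            y = df[df['offset'] == b].groupby('participant')['mcgurk'].mean()
            t, p = stats.ttest_rel(x, y)
            p_bonf = min(p * len(pairs), 1.0)
            print(f'{a} vs {b} ms: t = {t:.2f}, Bonferroni-corrected p = {p_bonf:.3f}')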

Eyetracking analysis

Eye tracking data was analyzed for the two conditions with speech tasks (the speech-only and the speech-numbers conditions). Eye tracking data from one participant in Experiment 2a was not collected due to equipment problems. Eye position was monitored using an Eyelink II eye tracking system (SR Research, Osgoode, Canada) using dark pupil tracking with a sampling rate of 500 Hz. Each sample contains an x and y coordinate which corresponds to the location of gaze on the screen. A nine-point calibration and validation procedure was used. The maximum average error was 1.0 degrees of visual angle, and the maximum error on a single point was 1.2 degrees of visual angle, with the exception of the central point, which was always less than 1.0 degrees. A drift correction was performed before each trial.

Four of the videos (/aba/, /ada/, /atha/ and /ava/) had been used in a previous experiment where the positions of the eyes, nose and mouth in each frame had been coded. The videos for /ibi/, /idi/, /ithi/ and /ivi/ show very similar head position and movement but had not been coded. For each experiment, the positions of the eyes, nose and mouth in each trial were estimated based on the average positions of the eyes, nose and mouth in the /a*a/ syllables. To further describe the eyetracking data, and to allow for some comparison across experiments, the overall proportions of samples in each experiment falling within both 4 and 10 degrees of visual angle of the nose are reported.

Based on the coding of the eyes, nose and mouth, three analyses of the eyetracking data were performed. The first gaze analysis was a gaze centralization analysis, where the average distance of the eyetracking samples from the centre of the nose was calculated for each trial (containing a video of the talker) for each participant, for each task condition and offset. Paré, Richler, Hove, and Munhall (2003) showed similar gaze patterns for congruent and incongruent trials, so gaze for congruent and incongruent trials was pooled together. The average distance of the eyetracking samples from the centre of the nose was analyzed using a 2 × 3 (task conditions containing a speech task × temporal offsets) repeated measures ANOVA. In instances where there was a violation of sphericity, a Greenhouse-Geisser correction was used.

The second and third gaze analyses looked at the proportion of each trial that participants spent looking at the eyes and the mouth, respectively. Based on the previous coding of the videos mentioned in the preceding paragraph, boxes 3.1 degrees of visual angle on the x axis by 2.5 degrees of visual angle on the y axis were centered around the average position of each eye. A box 5 degrees of visual angle on the x axis and 3.1 degrees of visual angle on the y axis was positioned around the centre of the leftmost, rightmost, topmost and bottommost boundaries of the mouth. The box around the mouth was large enough to contain the maximal mouth movements in the coded videos. See Figure 3.1. For each of the proportion of the trial spent looking at the eyes and the proportion of the trial spent looking at the mouth, a 2 × 3 (task conditions containing a speech task × temporal offsets) repeated measures ANOVA was performed. In instances where there was a violation of sphericity, a Greenhouse-Geisser correction was used. Pairwise comparisons were done with paired samples t-tests, with Bonferroni corrections used for multiple comparisons.
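As a concrete illustration of these gaze measures, the sketch below (Python, with hypothetical names and screen geometry) converts gaze samples to degrees of visual angle at the 57 cm viewing distance, computes the average distance from the centre of the nose, the proportions of samples within 4 and 10 degrees of the nose, and the proportion of samples falling inside the eye and mouth boxes illustrated in Figure 3.1. The simple angle conversion and the rectangular regions are simplifications of the actual analysis.

    import numpy as np

    def px_to_deg(px, px_per_cm, viewing_cm=57.0):
        # Convert an on-screen distance in pixels to degrees of visual angle,
        # treating the distance as if it were centred on the line of sight.
        return np.degrees(np.arctan2(px / px_per_cm, viewing_cm))

    def gaze_metrics(gx, gy, nose_xy, eye_boxes, mouth_box, px_per_cm):
        # gx, gy   : arrays of gaze sample coordinates (pixels) for one trial
        # nose_xy  : (x, y) pixel position of the centre of the nose
        # eye_boxes, mouth_box : (x0, x1, y0, y1) rectangles in pixels, taken
        #                        from the coded video positions (hypothetical)
        dist_deg = px_to_deg(np.hypot(gx - nose_xy[0], gy - nose_xy[1]), px_per_cm)

        def in_box(box):
            x0, x1, y0, y1 = box
            return (gx >= x0) & (gx <= x1) & (gy >= y0) & (gy <= y1)

        on_eyes = np.zeros(len(gx), dtype=bool)
        for box in eye_boxes:          # one box per eye
            on_eyes |= in_box(box)

        return {'mean_dist_from_nose_deg': float(dist_deg.mean()),
                'prop_within_4_deg': float(np.mean(dist_deg <= 4.0)),
                'prop_within_10_deg': float(np.mean(dist_deg <= 10.0)),
                'prop_on_eyes': float(on_eyes.mean()),
                'prop_on_mouth': float(in_box(mouth_box).mean())}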

[Figure 3.1: The white boxes illustrate the regions used for the proportion of the trial spent looking at the eyes and the mouth. The space between each black line around the edge is approximately 2 degrees of visual angle. The screen subtended approximately 45 degrees of visual angle along the horizontal, and the video of the talker subtended approximately 40 degrees of visual angle.]

3.4 Results

Behavioural data

Experiment 2a

Performance on the speech task was compared between the speech-only condition and the speech-numbers condition. Performance was very high, with a proportion of at least 0.89 correct, for the congruent trials in the speech task in both the speech-only and speech-numbers conditions (see Figure 3.2). The proportion of correct responses in the congruent trials was not affected by the concurrent cognitive load task (p > .05), although performance in the congruent trials was somewhat affected by offset [F(1.56, 37.40) = 4.54, p = .025]. While this influence was statistically significant, the extent of the influence was rather subtle (see Figure 3.2).

[Figure 3.2: Responses to the speech task for congruent trials in Experiment 2a. This shows the proportion of correct responses for the congruent trials. The error bars indicate standard errors of the mean.]

In the incongruent trials in the speech task, the proportion of trials showing the McGurk effect was lower in the speech-numbers task than it was in the speech-only task [F(1, 24) = 14.07, p = .001]. The difference in proportions of McGurk responses between the speech-only task and the speech-numbers task was rather modest (see Figure 3.3). As expected, the proportion of trials showing the McGurk effect was affected by offset [F(1.13, 27.21) = 13.67, p = .001], with fewer McGurk responses as the offset between the audio and video was increased. There was no significant interaction between the task condition and offset (p > .05).

[Figure 3.3: This shows the proportion of McGurk responses for the incongruent trials. The error bars indicate standard errors of the mean.]

The presence of a concurrent speech task had an effect on performance on the numbers task. Not surprisingly, performance on the numbers task was a bit higher when participants didn't have to do the concurrent speech task (see Figure 3.4). Performance on the numbers task was higher in the numbers-only condition than at the 0 ms speech offset [t(24) = 4.33, p = .001], the 175 ms offset [t(24) = 4.84, p < .001] and the 350 ms offset [t(24) = 8.96, p < .001] in the speech-numbers task. On average, the proportion of correct responses was somewhat higher when participants only had to do the numbers task. There was no difference in performance on the numbers task across the speech offsets in the speech-numbers condition (p > .05).

[Figure 3.4: Performance on the concurrent cognitive load task (the numbers task) by condition for Experiment 2a. The error bars indicate standard errors of the mean.]

The concurrent numbers task did have an effect on the proportion of trials showing the McGurk effect in the speech task, but this effect was very modest. The auditory and visual speech stimuli used in this task were very conducive to integration, as evidenced by the high proportion of trials showing McGurk responses. The modest effect of the concurrent cognitive load task on the speech task could be due to the fact that the speech stimuli seem to elicit a strong McGurk effect. While the increasing offset did weaken the McGurk effect, it was still quite strong at 350 ms. The influence of the visual information at 350 ms was more pronounced than in some other studies (e.g., Jones & Jarick, 2006; Grant et al., 2004; Conrey & Pisoni, 2006). This could be due to the particular talker used in the experiment, who in previous experiments with no offsets has consistently produced very strong McGurk effects (Buchan, Wilson, Paré, & Munhall, 2005; A. Wilson, Wilson, Hove, Paré, & Munhall, 2008), and to the particular speech tokens used. For instance, both talker and speech token can influence the number of McGurk responses (Paré et al., 2003). Would stimuli and response combinations that produce a weaker McGurk effect as the offset is increased show more interference from the cognitive load task? In Experiment 2b, stimuli were used from another experiment that had been shown to produce a strong McGurk effect when the audio and video were aligned, but one that becomes much weaker at the 350 ms offset.

Experiment 2b

In Experiment 2b participants were very good at discriminating the consonant sound in the congruent trials. Performance on the congruent trials was at ceiling in all conditions (see Figure 3.5). Unlike Experiment 2a, there was no influence of offset on performance in the congruent trials (p > .05).

[Figure 3.5: This shows the proportion of correct responses for the congruent trials. The error bars indicate standard errors of the mean.]

For the incongruent trials, there was an obvious influence of offset [F(1.37, 32.97) = 47.12, p < .001], with the proportion of McGurk responses dropping from an average of 0.89 at 0 ms, to 0.80 at 175 ms, to 0.56 at 350 ms (see Figure 3.6). This is not surprising, since the particular stimuli were chosen because they show considerably less audiovisual integration as offset is increased. As in Experiment 2a, there was an effect of the cognitive load task [F(1, 24) = 8.05, p = .009], with slightly more McGurk responses in the speech-only task compared with the speech-numbers task. The difference in the proportion of McGurk responses between the speech-only task and the speech-numbers task was again modest across offsets. Like Experiment 2a, there was no significant interaction between the task condition and audiovisual offset (p > .05).

[Figure 3.6: Responses to the speech task for incongruent trials in Experiment 2b. The error bars indicate standard errors of the mean.]

While the overall pattern of performance on the numbers task in Experiment 2b was similar to Experiment 2a (see Figure 3.7), the difference at the 0 ms speech offset between the numbers-only task and the speech-numbers task was not significant (p > .05). Performance on the numbers task was significantly different between the numbers-only condition and the 175 ms and 350 ms speech offsets in the speech-numbers task [t(24) = 3.53, p = .011] and [t(24) = 4.86, p < .001], respectively. On average, the proportion of correct responses on the numbers task was higher in the numbers-only condition than at the 175 and 350 ms offsets. The 175 ms speech offset and the 350 ms speech offset in the speech-numbers condition were not significantly different from one another (p > .05). The proportion of correct trials for the numbers task at the 0 ms offset was not significantly different from the 175 or 350 ms offsets (p > .05).

[Figure 3.7: Performance on the concurrent cognitive load task (the numbers task) by condition for Experiment 2b. The error bars indicate standard errors of the mean.]

Gaze behaviour

Experiment 2a

Overall, eyetracking samples tended to fall quite close to the centre of the nose. Approximately 0.68 of all eyetracker samples in Experiment 2a fell within 4 degrees of visual angle of the centre of the nose. Most samples fell either on the face or very close to the face. Approximately 0.90 of eyetracker samples in Experiment 2a fell within 10 degrees of visual angle of the centre of the nose. The presence of a concurrent cognitive load task in the speech-numbers condition did influence the centralization of gaze compared to the speech-only condition (see Figure 3.8), although this influence was fairly subtle [F(1, 23) = 5.18, p = .033].

[Figure 3.8: This shows the average distance of the eye tracking samples from the centre of the nose in degrees of visual angle, showing the amount of gaze centralization on the face for Experiment 2a. The error bars indicate standard errors of the mean.]

Surprisingly, gaze was more centralized in the speech-only condition than in the speech-numbers condition. The offset in the speech task also had an effect on gaze centralization [F(1.36, 31.28) = 15.60, p < .001], with gaze showing a general tendency to become more centralized as the offset was increased. There was no significant interaction between task condition and offset (p > .05).

The presence of a cognitive load task did influence the amount of time spent looking at the eyes and mouth in Experiment 2a. The addition of a cognitive load task to the speech task was accompanied by a shift of gaze away from the mouth and towards the eyes. A larger proportion of the trial was spent looking at the eyes in the speech-numbers condition compared with the speech-only condition [F(1, 23) = 21.52, p < .001] (see Figure 3.9). There was no significant effect of offset, nor a significant interaction between task condition and offset (p > .05).

[Figure 3.9: This shows the proportion of the trial spent looking at the eyes in Experiment 2a. The error bars indicate standard errors of the mean.]

A larger proportion of the trial was spent looking at the mouth in the speech-only condition compared with the speech-numbers condition [F(1, 23) = 13.28, p = .001] (see Figure 3.10). A slightly, though significantly, larger proportion of the trial was also spent looking at the mouth with increasing offset [F(1.40, 32.11) = 4.52, p = .030]. There were significant differences between the 0 ms offset and the 350 ms offset [t(23) = 2.75, p = .034], and between the 175 ms offset and the 350 ms offset [t(23) = 3.55, p = .003]. There was no significant interaction between task condition and offset (p > .05).

[Figure 3.10: This shows the proportion of the trial spent looking at the mouth in Experiment 2a. The error bars indicate standard errors of the mean.]

Experiment 2b

Like Experiment 2a, the eyetracking samples overall tended to fall quite close to the centre of the nose. Approximately 0.54 of eyetracker samples in Experiment 2b fell within 4 degrees of visual angle of the centre of the nose. Most samples fell either on the face or very close to the face. As in Experiment 2a, approximately 0.90 of samples in Experiment 2b fell within 10 degrees of visual angle.

Gaze behaviour in Experiment 2b was very similar to that seen in Experiment 2a, although overall gaze seemed to be more centralized in Experiment 2b (see Figure 3.11). The addition of the cognitive load task caused gaze to become less centralized. Gaze was more centralized in the speech-only condition than in the speech-numbers condition [F(1, 24) = 19.36, p < .001]. Gaze also tended to become more centralized as offset was increased [F(1.15, 27.59) = 8.94, p = .004]. There was also no significant interaction between task condition and offset (p > .05).

As in Experiment 2a, the addition of a cognitive load task in Experiment 2b also caused a slight shift of gaze away from the mouth and towards the eyes. A larger proportion of the trial was spent looking at the eyes in the speech-numbers condition compared with the speech-only condition [F(1, 24) = 21.51, p < .001] (see Figure 3.12). There was no significant effect of offset, nor a significant interaction between task condition and offset (p > .05), on the proportion of the trial spent looking at the eyes. A larger proportion of the trial was spent looking at the mouth in the speech-only condition compared with the speech-numbers condition [F(1, 24) = 31.59, p = .001] (see Figure 3.13). There was no significant effect of offset, nor a significant interaction between task condition and offset (p > .05), on the proportion of the trial spent looking at the mouth.

[Figure 3.11: This shows the average distance of the eye tracking samples from the centre of the nose in degrees of visual angle, showing the amount of gaze centralization on the face for Experiment 2b. The error bars indicate standard errors of the mean.]

[Figure 3.12: This shows the proportion of the trial spent looking at the eyes in Experiment 2b. The error bars indicate standard errors of the mean.]

[Figure 3.13: This shows the proportion of the trial spent looking at the mouth in Experiment 2b. The error bars indicate standard errors of the mean.]

3.5 Discussion

The results of both experiments show an effect of a concurrent cognitive load task on the integration of audiovisual speech. Slightly less audiovisual integration was observed when participants had to perform the secondary cognitive load task. Although the reduction in reported audiovisual integration was quite modest, this effect was replicated across both experiments. As expected, increasing the temporal offset between the auditory and visual speech information decreased the observed integration in both experiments, although the decrease was much more pronounced in Experiment 2b.

The overall influence of the secondary task on the integration of auditory and visual speech information in the current experiments was quite modest. The very modest nature of the interference is interesting since the cognitive load task (numbers task) in this experiment was reasonably difficult. For example, a similar task to the one used in the current experiment was used by de Fockert, Rees, Frith, and Lavie (2001), and was shown to influence the processing of distractors in a visual attention task. The task used in the current study was likely more difficult than the task used by de Fockert et al. (2001), since in the current experiment participants had to remember the order of eight digits rather than the five digits required of de Fockert et al.'s participants. However, the influence of visual speech information on the perception of speech has previously been shown not to be influenced by a concurrent cognitive load task. Baart and Vroomen (2010) found no influence of either a secondary visuospatial or verbal cognitive load task on phonetic recalibration, a phenomenon in which lipread information can adjust the perceived boundary between two phonetic categories. The verbal cognitive load task that was used by Baart and Vroomen (2010) appeared to be somewhat easier than the task used in the current paper. Baart and Vroomen (2010) used either three, five or seven memory items, while the current paper used eight. In addition, their task was a recognition task, whereas participants were required to memorize the order of the digits in the current task. It is possible that in both the current paper and the Baart and Vroomen (2010) paper a more difficult cognitive load task could have a greater influence on audiovisual integration, although at some point it is possible that participants would start to give up on the task.

The fact that a concurrent cognitive load task had a slight effect on audiovisual integration shows that a reduction of audiovisual integration can be achieved in the absence of competing perceptual information. The exact nature of the interference remains unclear. A general issue in dual task paradigms is that participants adopt strategies in order to perform both tasks. The interference could possibly be due to participants silently repeating the items from the cognitive load task to themselves during the speech task. However, as Baart and Vroomen (2010) have pointed out, there is no guarantee that participants were silently repeating the items during the speech task. While participants may have used a verbal strategy to help them with the cognitive load task, the high performance on the congruent trials during the speech task would suggest that the strategy used by participants during the cognitive load task was not particularly detrimental to the speech task.

It is certainly possible that a different secondary task could show greater deficits in the integration of auditory and visual speech information. However, the lack of interaction between the effect of the cognitive load task and the effect of offset on the McGurk responses suggests that the large size of the synchrony window is relatively independent of cognitive resources that overlap with the cognitive load task. The window for synchrony perception for audiovisual speech stimuli tends to be more generous than that observed for non-speech audiovisual stimuli (see Dixon & Spitz, 1980; Lewkowicz, 1996; Conrey & Pisoni, 2006). It seems likely that the rather large size of the synchrony window observed in audiovisual speech is determined by the inherent dynamic properties of the stimuli, with speech generally having a richer time series in both the visual and auditory modalities than the stimuli tested in the non-speech tasks. For example, Arrighi, Alais, and Burr (2006) showed that video sequences of conga drumming with natural biological speed variations showed generally greater temporal delays for perceptual synchrony than artificial stimuli based on the videos that moved at a constant speed. The non-speech stimuli used in studies on perceptual synchrony (e.g., Dixon & Spitz, 1980; Lewkowicz, 1996) have less dynamic variation than the speech stimuli.

Both the concurrent cognitive load task and offset did have an influence on gaze behaviour. Gaze was somewhat less centralized on the video of the talker during the concurrent cognitive load task than when no concurrent cognitive load task was present. Increasing the temporal offset generally showed an increase in the amount of gaze centralization on the face. Increasing the temporal offset could have increased the difficulty of the speech task. However, in the behavioural data in the incongruent trials there was no interaction between the effect of the cognitive load task and the effect of offset. The decrease in gaze centralization with the addition of a cognitive load task was also accompanied by a decrease in the proportion of the trial spent looking at the mouth, and an increase in the proportion of the trial spent looking at the eyes. The general proportions of the trial spent looking at the eyes and mouth are in line with Paré et al. (2003) and Buchan et al. (2005), who also looked at gaze using a McGurk task, although Paré et al. (2003) did show more fixations on the mouth. It is interesting that experiments using longer stimuli such as sentences (Buchan et al., 2007) and extended monologues (Vatikiotis-Bateson, Eigsti, Yano, & Munhall, 1998) do show far more time spent looking at the eyes than both the current study and Paré et al. (2003), which both used short vowel-consonant-vowel stimuli.

The gaze centralization observed in previous speech-in-noise studies (Buchan et al., 2007, 2008) therefore does not appear to be strictly due to the increased cognitive demands caused by the acoustic stimuli being harder to hear with the addition of acoustic noise. The addition of a cognitive load task actually caused a decrease in gaze centralization. This decrease in gaze centralization with the addition of a cognitive load task was also accompanied by a tendency to look slightly less at the mouth, and slightly more at the eyes. It is possible that the different gaze patterns seen in the speech task alone compared to the speech task with the concurrent cognitive load task could be driving the decrease in integration seen with the addition of the cognitive load task. However, this seems unlikely, as visual speech information can still be gathered without direct fixations on the mouth (Paré et al., 2003; Andersen et al., 2009). Also, highly detailed visual information is not necessary for the visual speech information to be acquired and integrated with the auditory speech information (Munhall, Kroos, Jozan, & Vatikiotis-Bateson, 2004; MacDonald, Andersen, & Bachmann, 2000). Paré et al. (2003) showed that fixating on either the mouth, eyes or hairline of a talking face seems to provide rather similar vantage points in terms of gathering visual information during audiovisual speech processing in a McGurk task. It is not until gaze is fixed well away from the mouth that the influence of the visual information on the McGurk effect is significantly lessened, and some visual speech information persists even at 40 degrees of eccentricity.

In summary, the data presented here show a very small but statistically significant decrease in the number of McGurk responses when subjects perform a concurrent cognitive load task. This suggests a rather modest role for cognitive resources such as working memory in the integration of audiovisual speech information. While a distracting cognitive load task can slightly modulate the multisensory integration of auditory and visual speech information, it appears that the integration of audiovisual speech occurs relatively independently of cognitive resources such as working memory, which further suggests that this integration is primarily an automatic process.

Chapter 4

The influence of selective attention on the integration of audiovisual speech information

This chapter has been accepted for publication in Perception pending minor changes. (Note that in the version of this chapter submitted to Perception, Experiment 3a was referred to as Experiment 1, and Experiment 3b was referred to as Experiment 2. Also, the figure numbers have been changed in the thesis.) Additional data not included in the paper submitted to Perception is included in Appendix C.

4.1 Abstract

Conflicting visual speech information can influence the perception of acoustic speech, causing an illusory percept of a sound not present in the actual acoustic speech (namely the McGurk effect). The current research examined whether participants can voluntarily selectively attend to either the auditory or visual modality by instructing participants to pay attention to the information in one modality and to ignore competing information from the other modality. This research also examined how performance under these instructions was affected by weakening the influence of the visual information by manipulating the temporal offset between the audio and video channels (Experiment 3a), and the spatial frequency information present in the video (Experiment 3b). Gaze behavior was also monitored to examine whether attentional instructions influenced the gathering of visual information. While task instructions did have an influence on the observed integration of auditory and visual speech information, participants were unable to completely ignore conflicting information, particularly information from the visual stream. Manipulating temporal offset had a more pronounced interaction with task instructions than manipulating the amount of visual information. Participants' gaze behavior suggests that the attended modality influences the gathering of visual information in audiovisual speech perception.

4.2 Introduction

The perception of speech is not a purely acoustic phenomenon, and it has been known for some time that visual speech information can influence the perception of acoustic speech. For example, the intelligibility of acoustic speech can be increased when a talker can be seen as well as heard, both under acoustically degraded conditions (Sumby & Pollack, 1954; O'Neill, 1954; Erber, 1969) and for difficult to understand clear speech (Reisberg et al., 1987). Visual speech information can also influence the perception of the spatial location of the acoustic speech (Radeau & Bertelson, 1974), and aid in separating the sound streams of two different talkers coming from the same location (Driver, 1996). Additionally, presenting conflicting visual information can also influence the perception of auditory speech in perfectly audible conditions (e.g., the McGurk effect; McGurk & MacDonald, 1976; Summerfield & McGrath, 1984). This conflicting visual information tends to influence the perceived phonetic category of the acoustic stimuli, creating an illusory acoustic percept.

The McGurk effect has been used extensively to study the integration of auditory and visual speech information. However, the extent to which this integration of audiovisual speech information is automatic, or can be modified with attentional manipulations, is currently a topic of debate (Navarra, Alsius, Soto-Faraco, & Spence, 2010). First, there are several lines of research that suggest that this integration may be automatic. Being naïve about the illusion is not necessary for it to occur. Informing adult participants that the stimuli had been dubbed doesn't seem to affect the McGurk effect (Liberman, 1982). The McGurk effect can occur when subjects are not aware that they are looking at a face, as shown in a study using point-light displays of talking faces (Rosenblum & Saldaña, 1996). Cross-modal integration of speech even occurs when the auditory and visual speech information streams come from talkers of different genders (Green et al., 1991; Walker et al., 1995), suggesting that specific matches between faces and voices are not critical for the visual speech information to influence the perception of the acoustic speech information. There also doesn't seem to be a reaction time cost in terms of processing the illusory McGurk percept as compared to the actual speech token in a speeded classification task (Soto-Faraco et al., 2004). The actual speech token or McGurk percept can equally provide a benefit or interfere with a concurrent syllable categorization task (using a syllabic interference paradigm). This benefit or interference depends on the perceived syllable and not the actual auditory syllable, suggesting that the integration takes place even if it is costly to the task.

Several studies have examined the integration of auditory and visual speech information using event-related potential (ERP) electroencephalography (EEG) (Kislyuk et al., 2008; Saint-Amour et al., 2007; Colin et al., 2002, 2004) and event-related magnetoencephalography (MEG) (Sams et al., 1991). These EEG and MEG studies used the well-known electrophysiological component, the mismatch negativity (MMN), or its magnetic counterpart (MMNm). The MMN is elicited by a discriminable change in a repetitive aspect of an auditory stimulus (Näätänen, 1999), can be elicited in the seeming absence of attention (Näätänen, 1991, 2000), and has even been elicited in some comatose patients, although the amplitude may be attenuated (Näätänen, 2000). These EEG and MEG studies show that the illusory acoustic percept produced by the McGurk effect produces the same early response as the actual acoustic token. Since the MMN can occur in the absence of attention, and similar MMNs were elicited by the McGurk percept and the actual acoustic token, this suggests that the integration can occur in the absence of attention. Other ERP studies using the N1-P2 complex, another electrophysiological component, have generally reached similar conclusions and argue against many attentional explanations for the modulations of the N1-P2 complex in audiovisual speech perception (Besle, Fort, Delpuech, & Giard, 2007; van Wassenhove, Grant, & Poeppel, 2005) (although see Pilling, 2009, for discussion).

On the other hand, cognition can influence the perception of audiovisual speech. While auditory and visual speech information can still be integrated when the information streams come from talkers of different genders (Green et al., 1991; Walker et al., 1995), the susceptibility to this effect decreases with talker familiarity (Walker et al., 1995). Additionally, mismatching talker genders can influence temporal order judgments, making it easier to judge which modality has been presented first when talkers are mismatched (Vatakis & Spence, 2007; Vatakis et al., 2008). People are able to match the identity of faces and voices across modalities better than chance (Kamachi, Hill, & Lander, 2003), and this, combined with a general knowledge of speech and gender, can influence how audiovisual speech is integrated. Whether or not ambiguous stimuli are perceived to be talking faces or voices can influence the strength of the McGurk effect. For example, Munhall et al. (2009) have shown that the perception of the McGurk effect is related to the perception of the bistable stimulus. They used a dynamic version of Rubin's vase-face illusion where the vase turns and the faces speak a visual vowel-consonant-vowel syllable which is different from the acoustic vowel-consonant-vowel syllable. With the stimuli held constant, significantly more McGurk responses were reported when the faces were the figure percept than when the vase was the figure percept. Perceptually ambiguous replicas of natural speech (sine wave speech, see Remez et al., 1981) have also been shown to be influenced by visual speech information (Tuomainen et al., 2005). This study showed that visual speech information can influence the perception of sine wave speech (producing a McGurk effect) when participants perceived the audio as speech, but that the visual information had only a negligible influence when the audio was not perceived as speech. Lastly, attentional manipulations have been shown to influence the McGurk effect.

91 CHAPTER 4. SELECTIVE ATTENTION This has been shown for concurrent visual, auditory (Alsius et al 2005) and tactile (Alsius et al 2007) tasks which addressed the effect of perceptual load on the mechanisms for audiovisual integration. Additionally, directing visual spatial attention to one of two simultaneously presented talkers (Andersen et al 2009), or directing visual attention to either a face or a concurrently presented leaf (Tiippana et al 2004) has also been shown to reduce the influence of the visual information on the reported percept. However, while the perceptual load and directed attention tasks in these studies have shown a modulation of the McGurk effect, these manipulations have not been able to completely break the integration. To what extent is it possible to break the McGurk effect by selectively attending to the auditory and visual information? Research with non-speech stimuli has shown that selective attention to either the auditory or visual modality can attenuate multisensory integration (Mozolic et al., 2008). The present research examines the extent to which selective attention to each modality in audiovisual speech can affect the degree of audiovisual integration. Preliminary research on selective attention with speech stimuli has been done looking at the integration of videos of a talker paired with synthetic speech (Massaro, 1987) [pp66-74]. Six participants ran in four conditions. Participants were instructed to pay attention to either the audio or the video (and produce one response), to both the audio and the video together (and produce one response), or to both the audio and the video separately (produce two responses, one for the audio and one for the video). This work suggests that it may be easier to pay attention to the visual speech information than the auditory speech information. The current paper seeks to extend the findings by making direct comparisons between the attention conditions and examining whether the effect of attentional instructions 75

can be influenced by weakening the influence of the visual information. Gaze behavior will be monitored in order to make sure that participants are actually looking at the video, and to examine whether attentional instructions influence gaze behavior. In the current paper, selective attention was manipulated by instructing participants to pay attention to either the auditory or the visual speech. An initial baseline condition was run prior to the selective attention conditions, in which participants were naïve to the McGurk effect and were told to watch the talker and report what consonant sound they heard the talker say. Control auditory-only and visual-only conditions were also run to establish unimodal baselines for comparison. In the selective attention conditions (attend-audio and attend-video conditions) participants were instructed to determine the consonant presented in the selected modality, and to ignore the information coming from the competing modality. Although previous research (Liberman, 1982) suggests that knowledge of the McGurk effect does not affect participants' experience of the McGurk effect, participants were informed about the McGurk effect just prior to the selective attention conditions. Participants were also informed about the stimulus properties that were manipulated in the experiment in order to give them the best chance of reducing the influence of the competing speech information. Participants' gaze was monitored with an eye tracker in order to ensure that they were watching the screen in all conditions with visual information. Monitoring gaze also provided evidence that in the attend-audio condition participants were adopting an attentional strategy and not simply diverting their gaze away from the screen. This research further examines the extent to which the instructions to selectively

pay attention to either the auditory or the visual information are affected by manipulating the stimulus to produce a weaker McGurk effect. Two stimulus properties that are known to affect the audiovisual integration seen in the McGurk effect are the amount of temporal offset between the auditory and visual stimuli (Munhall et al., 1996; Jones & Jarick, 2006; van Wassenhove et al., 2007), and the amount of visual information present in the video at different spatial frequencies (Buchan et al., 2005; Thomas & Jordan, 2002). It has been shown that audiovisual integration occurs not just for synchronous speech stimuli, but over a range of asynchronies. This synchrony window varies depending on task and stimuli, though it tends to be asymmetric, with a greater tolerance for visual stimuli leading the auditory stimuli than vice versa (Conrey & Pisoni, 2006; Munhall et al., 1996; Dixon & Spitz, 1980), and the integration of auditory and visual information tends to fall off as the amount of asynchrony is increased. Three offsets were chosen for Experiment 3a: the audio and video were either synchronous (0 ms), or the video led the audio by 175 or 350 ms. The visual spatial frequency information present in the stimuli can also affect the observed McGurk effect (Munhall et al., 2004). A cutoff of around 16 cycles per face (cpf) seems to retain a considerable amount of the spatial frequency information used in the integration of auditory and visual speech. High spatial frequency information can be removed above a particular cutoff, effectively blurring the image. Although quite blurry, a 16 cpf cutoff still provides enough information to produce approximately the same number of trials showing the McGurk effect as an unfiltered image. Cutoffs below 16 cpf (i.e., 8 cpf and 4 cpf) remove enough visual information to noticeably reduce the number of trials showing the McGurk effect (Buchan et al., 2005). For Experiment 3b, two spatial frequency cutoffs were selected: a cutoff of 16.1 cpf, which

should provide approximately as much visual speech information for integration as an unfiltered video, and a cutoff of 4.3 cpf, which will provide noticeably less visual speech information for integration. Additionally, this research will examine whether different task instructions alter the gaze behavior used to gather the visual speech information. While stimulus properties affect gaze behavior (Parkhurst et al., 2002; Parkhurst & Niebur, 2003), the locations selected for visual processing are also known to be knowledge-driven (Henderson, 2003; Findlay & Gilchrist, 2003). The spatial distribution of gaze on an image can be modified depending on the task asked of a subject (Yarbus, 1967; Henderson, Weeks Jr, & Hollingworth, 1999). Task has also been shown to modify gaze behavior with dynamic speech stimuli in a silent speech-reading task (Lansing & McConkie, 1999), and in an audiovisual speech task using sentences (Buchan et al., 2007). In these two papers the tasks were either to make speech perception judgments, to judge intonation (Lansing & McConkie, 1999), or to judge emotion (Buchan et al., 2007). The current study will examine whether instructing participants to pay attention to the auditory or the visual information will influence gaze behavior. It was noticed in pilot work done without an eye tracker that participants reported a tendency to look more towards the lower part of the face when specifically trying to pay attention to the visual information. Gaze behavior will therefore be examined by testing whether task instructions alter the amount of time spent looking at the lower half of the screen, as measured by the number of eye tracker samples falling on the lower half of the screen. The lower half of the screen corresponds roughly to the lower half of the talker's face (see Figure 4.1). Eye tracking data will be compared between the three audiovisual instruction conditions as well as the video-only control conditions. Because the stimuli

are identical across tasks, image properties known to influence fixations, such as color and spatial frequency (Parkhurst et al., 2002; Parkhurst & Niebur, 2003), cannot account for differences found between the different task instructions.

4.3 Methods

4.3.1 Subjects
All subjects were native English speakers and reported having normal or corrected-to-normal vision, and no speech or hearing difficulties. The experiments were undertaken with written consent from each subject. All procedures were approved by Queen's University's General Research Ethics Board, comply with the Canadian Tri-Council Policy Statement on ethical conduct for research involving humans, and are in accordance with the World Medical Association Helsinki Declaration. There were 33 participants (18 females) with a mean age of 18.6 years (range 17-23) in Experiment 3a. Experiment 3b had 36 participants (27 females) with a mean age of 20.5 years (range 19-26).

4.3.2 Stimuli
For both experiments, a male volunteer was used as the talker, and was filmed in color saying the vowel-consonant-vowel nonsense syllables /aba/, /ava/, /ibi/ and /ivi/. The video was edited into clips in Final Cut Pro. For congruent stimuli the auditory and visual tokens were the same. The incongruent stimuli were created to elicit the McGurk effect. These stimuli were created by dubbing a different auditory consonant, but the same auditory vowel as the one articulated in the video, onto

Figure 4.1: This shows the talker on the screen during the experiment. The screen subtended approximately 45 degrees of visual angle along the horizontal. The video was centered inside a medium grey border and subtended approximately 40 degrees of visual angle. The talker's eyes are located approximately 4 degrees of visual angle from the centre of the nose, and the mouth is located approximately 2 degrees of visual angle from the centre of the nose. The black line illustrates the lower half of the screen that was used in the eye tracking analyses.

the video using custom MATLAB software. An auditory /aba/ was paired with a visual /ava/, an auditory /ava/ was paired with a visual /aba/, an auditory /ibi/ was paired with a visual /ivi/, and an auditory /ivi/ was paired with a visual /ibi/. To maintain the timing with the original soundtrack, the approximate acoustic releases of the consonants in the dubbed syllables were aligned to the acoustic releases of the consonants in the original acoustic syllable. Additionally, there were equal numbers of congruent and incongruent trials, and within each trial type /b/ and /v/ were equally likely to occur as the auditory token and as the visual token.

Temporal offsets (Experiment 3a)
In Experiment 3a, the strength of the McGurk effect (as measured by the proportion of responses that correspond to the auditory token) was manipulated by varying the temporal offsets of the auditory and visual streams. Video-leading asynchronies were chosen since they tend to be more naturalistic, and show a greater asynchrony tolerance than auditory-leading speech. The influence of the visual information tends to be strongest when the video leads the audio by 0 ms to ms, and generally falls off fairly steeply after that. However, the influence of the video on the auditory token in a McGurk task has been shown in several studies to extend out to rather large video-leading offsets. For example, Jones and Jarick (2006) showed that a 360 ms offset still produced 45% non-auditory token responses, and Munhall et al. (1996) showed that a 360 ms offset produced about 30-40% non-auditory token responses. Grant et al. (2004) and van Wassenhove et al. (2007) also showed that while the auditory percept starts to be reported more often than the visually influenced percept

(in these cases fusion responses) as the visual-leading offset is increased to somewhere between 200 and 350 ms, there is still a noticeable influence of the visual token on responses between 333 and 467 ms, with about 30-40% of responses corresponding to the non-auditory token. The offsets were created using custom MATLAB software. To create the 175 ms and 350 ms offsets, the onset of the syllable was shifted so that the audio trailed the video by either 175 or 350 ms. The beginning of the audio track was zero padded, and the end was cut to make the audio and video of equal duration.

Spatial frequency filtering (Experiment 3b)
In Experiment 3b, in order to reduce the influence of the visual information on the perception of the acoustic syllable, visual information was removed using (Gaussian) spatial frequency filtering. Two spatial frequency cutoffs were chosen: one condition that should perform similarly to an unfiltered video, and one condition in which the visual information would have considerably less influence. Based on previous research using both McGurk tasks (Buchan et al., 2005; Thomas & Jordan, 2002) and speech-in-noise tasks (Munhall et al., 2004), the 16.1 cpf condition should perform similarly to unfiltered video, whereas the 4.3 cpf condition will contain considerably less visual speech information (also see Figures 4.2 and 4.3).

4.3.3 Experimental task
The experiments were both carried out as within-subjects designs. Both experiments had three audiovisual conditions: a baseline condition, an attend-audio condition, and an attend-video condition, with the same audiovisual speech stimuli used in all

Figure 4.2: Illustration of the 16.1 cpf spatial frequency filtering applied to the videos in Experiment 3b.

Figure 4.3: Illustration of the 4.3 cpf spatial frequency filtering applied to the videos in Experiment 3b.

three conditions. In addition, auditory-only and visual-only control conditions were included to determine how well participants could discriminate between the two consonant tokens in each modality. In both experiments, each participant was first run in the baseline condition, where they were instructed to watch the talker for the entire trial and respond on the keyboard which consonant sound they heard. Keypress responses were used for all conditions. For the baseline condition, participants were told that the video might look a bit odd, since in the 350 ms offset the misalignment of the audio and video is noticeable. Then, in each experiment, a visual-only and an auditory-only control condition was run where participants had to report which consonant sound they saw or heard, respectively. The stimuli for the visual-only and auditory-only control conditions were identical to the baseline condition, except that they were presented unimodally. The visual-only and auditory-only stimuli were presented in separate blocks, since eyetracking data was not gathered in the auditory-only condition. Participants were then informed about the McGurk effect, and about the stimuli used in this experiment. They were told that there was an equal likelihood that they would be presented with matched or mismatched auditory and visual consonants. They were also told that both /b/ and /v/ were equally likely to appear as an audio token, and equally likely to appear as a video token. They were then asked to determine either which consonant they heard in the attend-audio condition or which consonant sound they saw in the attend-video condition, and to try to ignore information coming from the other modality. Participants were randomly presented with the attend-audio and attend-video conditions, with a cartoon of an ear briefly presented on the screen to cue the participant to attend to the audio, and a cartoon of an eye briefly presented on the screen to cue the participant to attend to the video.
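As a brief aside, the offsetting procedure described above under Temporal offsets (Experiment 3a) can be illustrated with a short sketch. This is not the custom MATLAB code that was actually used; the use of Python/NumPy, the sample rate, and the variable names are assumptions for illustration only.

```python
import numpy as np

def delay_audio(audio, offset_ms, sample_rate=48000):
    """Return a copy of `audio` whose content is delayed by `offset_ms`.

    The start is zero padded and the end trimmed, so the track keeps its
    original length and the audio now trails the (unchanged) video.
    """
    n_pad = int(round(sample_rate * offset_ms / 1000.0))
    padded = np.concatenate([np.zeros(n_pad, dtype=audio.dtype), audio])
    return padded[:audio.shape[0]]  # keep audio and video the same duration

# Example: a 1 s dummy track delayed by the two offsets used in Experiment 3a.
track = np.random.randn(48000).astype(np.float32)
for offset_ms in (175, 350):
    shifted = delay_audio(track, offset_ms)
    assert shifted.shape == track.shape
```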

In Experiment 3a, the experiment was broken up into two one-hour sessions. The baseline condition and the audio-only and video-only control conditions were run in the first session. The attend-audio and attend-video conditions, as well as a second set of audio-only and video-only conditions, were run in the second session. In Experiment 3b, the experiment was run in a single one-hour session, with the baseline condition followed by the video-only and audio-only conditions, and finished off with the attend-audio and attend-video conditions.

4.3.4 Experimental equipment
Both experiments took place in a single-walled sound booth. Subjects were seated approximately 57 cm away from a 22-inch flat CRT computer monitor (ViewSonic P220f). Subjects' heads were stabilized with a chin rest. The audio signal was played from speakers (Paradigm Reference Studio/20) positioned on either side of the monitor. Experiment Builder (SR Research, Osgoode, Canada) was used to present the stimuli and record keypresses. To ensure that participants were watching the screen, eye position was monitored using an Eyelink II eye tracking system (SR Research, Osgoode, Canada). See Eye tracking analysis (section 4.3.6) for further details.

4.3.5 Speech task analysis
While the strength of the McGurk effect is often measured by the proportion of responses that do not correspond to the auditory token, because there are only two response possibilities the strength of the McGurk effect can equally be inferred from the proportion of responses corresponding to the auditory token. On each trial, participants had to make a choice as to whether

they heard the consonant b or v. In order to directly compare the results from each condition, the proportion of trials on which the response corresponded to the auditory token was taken as the measure. Because there are only two choices, in the visual-only condition the proportion of correct responses to the visual token is equal to one minus the proportion corresponding to the auditory token. Congruent and incongruent trials were analyzed separately. For Experiment 3a, participant responses to the speech task were analyzed using a 3 × 3 (task conditions × temporal offsets) repeated measures ANOVA. For Experiment 3b, participant responses to the speech task were analyzed using a 3 × 2 (task conditions × spatial frequency) repeated measures ANOVA. In instances where there was a violation of sphericity, a Greenhouse-Geisser correction was used. Pairwise comparisons were done with paired samples t-tests, with Bonferroni corrections used for multiple comparisons. A signal detection analysis was also run on the data to look at changes in discriminability of the auditory token. Results of the behavioral tests were used to calculate d′ (see Table 4.1 for the classification of the responses, which is based on the classification of responses in Kislyuk et al., 2008). When the proportion of trials corresponding to the auditory token was calculated (mentioned above), the congruent trials essentially represent hits, and incongruent trials essentially represent correct rejections (see Table 4.1). The d′ values were also calculated for the auditory-only control conditions using just the auditory tokens from Table 4.1. Similar to Kislyuk et al. (2008), repeated measures ANOVAs were used to analyze the d′ values. In Experiment 3a, a 3 × 3 (task conditions × temporal offsets) repeated measures ANOVA was used, and in Experiment 3b, a 3 × 2 (task conditions × spatial frequency) repeated measures ANOVA was used. In instances where there was a violation of sphericity, a

Greenhouse-Geisser correction was used. Pairwise comparisons were done with paired samples t-tests, with Bonferroni corrections used for multiple comparisons.

Auditory token | Visual token | Response | Classification
/ava/ or /ivi/ | /ava/ or /ivi/ | V | hit
/ava/ or /ivi/ | /ava/ or /ivi/ | B | miss
/aba/ or /ibi/ | /ava/ or /ivi/ | V | false alarm
/aba/ or /ibi/ | /ava/ or /ivi/ | B | correct rejection
/aba/ or /ibi/ | /aba/ or /ibi/ | B | hit
/aba/ or /ibi/ | /aba/ or /ibi/ | V | miss
/ava/ or /ivi/ | /aba/ or /ibi/ | B | false alarm
/ava/ or /ivi/ | /aba/ or /ibi/ | V | correct rejection

Table 4.1: This shows how the stimulus/response combinations were classified to calculate d′. This classification is based on a classification used by Kislyuk et al. (2008). Note that the vowels of the auditory token were always matched with the vowels of the visual token; only the consonants were mismatched.

4.3.6 Eye tracking analysis
Eye tracking data was analyzed for all conditions that showed a video of the talker (i.e., the baseline, attend-audio, attend-video and video-only control conditions). Eye position was monitored using an Eyelink II eye tracking system (SR Research, Osgoode, Canada) using dark pupil tracking with a sampling rate of 500 Hz. Each sample contains an x and y coordinate corresponding to the location of gaze on the screen. A nine-point calibration and validation procedure was used. The maximum average error was 1.0 degree of visual angle, and the maximum error on a single point was 1.2 degrees, with the exception of the central point, which was always less than 1.0 degree. A drift correction was performed before each trial. Every sample recorded from the onset to the offset of the video was analyzed to

determine whether it fell on the screen. Custom software was used to determine the average proportion of samples falling on the screen out of the total number of samples recorded during the video. The proportions of samples falling on the screen were quite high (at least 0.96) in each condition. Only samples that fell on the screen were used to calculate the proportion of samples falling on the lower half of the screen. Research by Paré et al. (2003) suggests that as long as participants were looking at the screen they would be significantly influenced by the visual information of the talker's face. Any sample that fell on the screen and below the vertical midpoint of the screen was deemed to be on the lower half of the screen. The lower half of the screen corresponds roughly to the lower half of the talker's face (see Figure 4.1). There was very little motion of the talker's head. Two of the videos, /aba/ and /ava/, had been used in a previous experiment where the positions of the nose had been coded for another analysis. The videos for /ibi/ and /ivi/ show similar head movement but had not been coded. Using the data from /aba/ and /ava/, it was determined that the maximum head movement of the talker during the videos in the current experimental setup was 0.3 degrees of visual angle. The proportion of samples falling on the lower half of the screen for the baseline, attend-audio, attend-video and video-only tasks was analyzed using a one-way repeated measures ANOVA. Because behavioral performance was not statistically significantly different between the two video-only control sessions, the eyetracking data for the video-only condition was averaged across both sessions. In instances where there was a violation of sphericity, a Greenhouse-Geisser correction was used. Pairwise comparisons were done with paired samples t-tests, with Bonferroni corrections used for multiple comparisons.
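To make the two measures just described concrete (the proportion of samples falling on the screen, and the proportion of on-screen samples falling on the lower half of the screen), a minimal sketch is given below. The screen resolution, the top-left coordinate origin, and the variable names are assumptions; this is a sketch, not the custom analysis software that was actually used.

```python
import numpy as np

SCREEN_W, SCREEN_H = 1024, 768   # assumed resolution (pixels); origin at top-left

def gaze_proportions(x, y):
    """x, y: arrays of gaze sample coordinates (pixels) for one trial.

    Returns the proportion of samples that fell on the screen, and the
    proportion of on-screen samples that fell on the lower half of the screen.
    """
    on_screen = (x >= 0) & (x < SCREEN_W) & (y >= 0) & (y < SCREEN_H)
    p_on_screen = on_screen.mean()
    if not on_screen.any():
        return p_on_screen, np.nan
    # With a top-left origin, larger y values are lower on the screen.
    p_lower = (y[on_screen] >= SCREEN_H / 2.0).mean()
    return p_on_screen, p_lower
```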

Task instructions, rather than the offset between the video and the audio stimuli or the spatial frequency filtering of the video, appeared to be the major factor in determining the proportion of samples falling on the lower half of the screen (see Tables 4.2 and 4.3). To examine this question in Experiment 3a, a further 3 × 3 (baseline, attend-video and attend-audio task conditions × temporal offsets) repeated measures ANOVA was also run. A paired-samples t-test was also run between the video-only condition and the attend-video condition at 0 ms. For Experiment 3b, a further 4 × 2 (attention conditions and video-only condition × spatial frequency) repeated measures ANOVA was also run. Where there was a violation of the sphericity assumption in the ANOVAs, a Greenhouse-Geisser correction was used. Pairwise comparisons were done with paired samples t-tests, with Bonferroni corrections used for multiple comparisons.

4.4 Results

4.4.1 Behavioral Data

Experiment 3a
Participants were very good at discriminating between the consonant sounds in the congruent trials (see Figure 4.4). For the congruent trials there were minor effects of offset and task instructions on the proportion of responses corresponding to the auditory token. The effects of task instructions [F(1.46, 46.79) = 4.54, p = .025] and offset [F(2, 64) = 5.64, p = .006] were statistically significant. Performance in both the audio-only and video-only control conditions was not significantly different between sessions (p > .05), so only the means of the two sessions were used in any further analyses.

Figure 4.4: The proportion of trials where participants reported the auditory token for Experiment 3a. This shows the responses for the congruent trials for each set of task instructions, by offset between video and audio (ms). The error bars indicate standard errors of the mean.

In the incongruent trials the different task instructions affected the proportion of responses corresponding to the auditory token [F(1.51, 48.18) = 21.17, p < .001], suggesting differences in the influence of the visual information on the auditory percept (see Figure 4.5). Across the three temporal offsets in the incongruent trials, pairwise comparisons showed significantly more responses corresponding to the auditory token in the attend-audio condition than in the baseline condition [t(32) = 2.88, p = .021], and more auditory token responses in the baseline than in the attend-video condition [t(32) = 5.08, p < .001]. The amount of temporal offset between the audio and video in the incongruent trials also significantly affected the proportion of auditory token responses [F(2, 64) = , p < .001], in line with other research examining the integration of auditory and visual speech information (Conrey & Pisoni, 2006). As can be seen in Figure 4.5, the overall interaction between task instructions and offset was also significant [F(4, 1) = 20.88, p < .001]. Task instructions altered the number of responses corresponding to the auditory token, showing that selective attention to either modality has an effect on the amount of audiovisual integration shown. However, the effect of task instructions was quite modest at 0 ms. In the incongruent condition, the difference between the three task instruction conditions at the 0 ms offset was not significant (p > .05). The attend-video condition did diverge from the baseline as the offset was increased. However, the difference between the attend-audio and baseline conditions remained small and fairly constant across the three offsets in the incongruent condition (see Figure 4.5). In both the attend-audio and attend-video conditions there seemed to be some interference from the speech information in the to-be-ignored modality, since performance didn't reach the same levels as in the audio-only and video-only control conditions.

Figure 4.5: The proportion of trials where participants reported the auditory token for Experiment 3a. This shows the responses for the incongruent trials for each set of task instructions, by offset between video and audio (ms). The error bars indicate standard errors of the mean.

Participants were slightly better at determining the visual token in the visual-only control condition than in the attend-video condition. Paired samples t-tests using a Bonferroni correction showed a significant difference between the video-only control condition and the attend-video condition at the 0 ms [t(32) = 20.13, p < .001], 175 ms [t(32) = 13.23, p < .001], and 350 ms [t(32) = 9.25, p < .001] offsets. As is quite apparent from a comparison between Figures 4.5 and 4.6, participants had a great deal of difficulty determining the auditory token in the attend-audio condition as compared with the audio-only control condition. Paired samples t-tests also showed a significant difference between the audio-only control condition and the attend-audio condition at the 0 ms [t(32) = 3.42, p = .012], 175 ms [t(32) = 4.73, p < .001] and 350 ms [t(32) = 5.45, p < .001] offsets.

Figure 4.6: This shows the responses for the audio-only and video-only control conditions (sessions 1 and 2) for Experiment 3a. To allow for easy comparison across all figures, performance for both audio-only and video-only conditions is reported as a proportion of trials where participants reported the auditory token. The error bars indicate standard errors of the mean.

The d′ values for Experiment 3a can be seen in Figure 4.7. The effects of task instructions [F(1.45, 46.49) = 30.09, p < .001] and offset [F(2, 64) = 78.90, p < .001] are both significant, as is the interaction between those factors [F(4, 128) = 17.48, p < .001]. Pairwise comparisons for the task instructions show that the attend-video condition differs significantly from the attend-audio [t(32) = 8.15, p < .001] and baseline [t(32) = 6.00, p < .001] conditions. Although the d′ values are slightly higher in the attend-audio condition as compared with the baseline condition, this difference is not significant (p > .05). The lack of significance is likely due to the fact that the hit rate was at ceiling in the congruent trials (see Figure 4.4), even though there was a significant effect on the correct rejections between the attend-audio and baseline conditions in the incongruent trials (see Figure 4.5). Pairwise comparisons for the offsets show that the 0 ms offset differs significantly from the 175 ms [t(32) = 5.48, p < .001] and 350 ms [t(32) = 11.89, p < .001] offsets. The 175 ms and 350 ms offsets were also significantly different from one another [t(32) = 7.42, p < .001]. The differences between each of the three task instructions at the 0 ms offset did not reach significance (p > .05).

Figure 4.7: This shows the d′ values for Experiment 3a. The error bars indicate standard errors of the mean.

For comparison with Figure 4.7, the d′ value for the combined two sessions of the auditory-only condition was 3.60 (SE 0.13). The comparison between the control and the attention conditions in the incongruent trials showed that competing visual speech information seems to be especially difficult to ignore. One possibility is that because the visual information is somewhat more accurate for determining the speech token than the auditory information in this experiment (see Figure 4.6), the visual information is more salient, making it harder to ignore. In Experiment 3b, some visual information will be removed to see if the visual information becomes easier to ignore compared with the baseline.
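The d′ values reported above follow from the hit and false alarm rates implied by the classification in Table 4.1. A minimal sketch of that calculation is given below; the small correction used to keep ceiling-level hit rates finite is a common convention and an assumption here, not necessarily the procedure used in this thesis or in Kislyuk et al. (2008), and the example counts are made up.

```python
import numpy as np
from scipy.stats import norm

def d_prime(n_hits, n_signal, n_false_alarms, n_noise):
    """Compute d' from response counts.

    Rates of exactly 0 or 1 are nudged inward by half a trial so that the
    z-transform (norm.ppf) stays finite when performance is at ceiling.
    """
    hit_rate = np.clip(n_hits / n_signal, 0.5 / n_signal, 1 - 0.5 / n_signal)
    fa_rate = np.clip(n_false_alarms / n_noise, 0.5 / n_noise, 1 - 0.5 / n_noise)
    return norm.ppf(hit_rate) - norm.ppf(fa_rate)

# Example with made-up counts: 58 hits on 60 congruent trials and
# 20 false alarms on 60 incongruent trials.
print(round(d_prime(58, 60, 20, 60), 2))
```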

Figure 4.8: The proportion of trials where participants reported the auditory token for Experiment 3b. This shows the responses for the congruent trials for each set of task instructions, by spatial frequency cutoff. The error bars indicate standard errors of the mean.

Experiment 3b
As expected, removing visual information by using spatial frequency filtering influenced the discriminability of the visual tokens. Visual-only performance was worse in the 4.3 cpf condition than in the 16.1 cpf condition [t(35) = 4.64, p < .001] (see Figure 4.10). It is also worth noting that performance in the attend-video condition in the incongruent trials at 4.3 cpf is noticeably worse than performance in the attend-video condition at 0 ms in Experiment 3a (compare Figures 4.9 and 4.5). Interestingly, and somewhat unexpectedly, instructions had a noticeable effect on performance in the congruent trials. This is unexpected because the auditory and visual information is redundant in this case. There was a significant effect of instructions

Figure 4.9: The proportion of trials where participants reported the auditory token for Experiment 3b. This shows the responses for the incongruent trials for each set of task instructions, by spatial frequency (cycles per face). The error bars indicate standard errors of the mean.

Figure 4.10: This shows the responses for the audio-only and video-only control conditions for Experiment 3b. To allow for easy comparison across all figures, performance for both audio-only and video-only conditions is reported as a proportion of trials where participants reported the auditory token. The error bars indicate standard errors of the mean.

[F(1.29, 45.16) = 13.58, p < .001], and of spatial frequency [F(1, 35) = 17.04, p < .001], as well as a significant interaction [F(2, 70) = 8.02, p = .001]. As can be seen in Figure 4.8, the effects of instructions, spatial frequency, and the interaction between these factors were primarily driven by the attend-video condition. Performance in the attend-video condition was significantly lower than in the baseline [t(35) = 4.94, p < .001] and attend-audio [t(35) = 3.14, p = .010] conditions. A pairwise comparison showed that performance in the 4.3 cpf attend-video condition in the congruent trials was significantly lower than performance in the 16.1 cpf attend-video condition [t(35) = 4.37, p < .001]. Performance in the 4.3 cpf attend-video condition was also significantly worse than performance in the audio-only condition [t(35) = 3.99, p < .001]. Better performance in the audio-only condition than in the congruent audiovisual attend-video condition suggests that participants were somewhat able to ignore useful, redundant auditory speech information. For the incongruent trials, there was an overall effect of instructions in Experiment 3b [F(2, 70) = 5.30, p = .007]. Pairwise comparisons show that the significant differences were between the baseline and attend-audio conditions [t(35) = 2.50, p = .046], and between the attend-audio and attend-video conditions [t(35) = 2.84, p = .023] (see Figure 4.9). As expected, there was an effect of spatial frequency, with significantly more responses corresponding to the auditory token in the 4.3 cpf condition than in the 16.1 cpf condition [F(1, 35) = 67.79, p < .001], but the interaction between instructions and the spatial frequency filtering was not significant [F(2, 70) = 1.62, p > .05]. The d′ values for Experiment 3b can be seen in Figure 4.11.

Figure 4.11: This shows the d′ values for Experiment 3b. The error bars indicate standard errors of the mean.

As in Experiment 3a, there was a significant effect of task instructions [F(2, 70) = 11.63, p < .001]. The effect of spatial frequency was also significant [F(1, 35) = 22.95, p < .001]; however, there was no significant interaction between task instructions and spatial frequency (p > .05). Pairwise comparisons for the task instructions show that the attend-video condition differs significantly from the attend-audio [t(32) = 4.15, p = .012] and baseline [t(32) = 3.06, p = .001] conditions. As in Experiment 3a, although the d′ values are slightly higher in the attend-audio condition as compared with the baseline condition, this difference is not significant (p > .05). For comparison with Figure 4.11, the d′ value for the auditory-only condition was 3.76 (SE 0.10). In order to allow for comparisons between the behavioral results of Experiments 3a and 3b, the mean of the two video-only sessions in Experiment 3a was also compared to the 16.1 cpf video-only condition in Experiment 3b. There were no significant differences between the video-only condition in Experiment 3a and the 16.1 cpf video-only condition in Experiment 3b [t(47.5) = 0.97, p > .05],

suggesting that the 16.1 cpf face video provided a comparably useful amount of visual information for audiovisual integration. Performance was also compared across the 0 ms baseline condition from Experiment 3a and the 16.1 cpf baseline condition from Experiment 3b. There were no significant differences between those conditions [t(57.3) = 1.61, p > .05], showing that both conditions were equally effective at providing visual information that could be integrated with the auditory information to produce the McGurk effect.

4.4.2 Gaze behavior
In both experiments, in all conditions, participants spent the overwhelming majority of the trial looking at the screen, showing that participants did follow the instructions to watch the screen in every condition.

Experiment 3a
Behavioral performance was not statistically significantly different between the two video-only control sessions, and in both sessions participants spent roughly the same amount of time looking at the screen. Because of this, the gaze behavior data, like the behavioral data, was averaged across both sessions. Results show that, overall, task instructions did influence gaze behavior [F(2.04, 65.31) = 27.54, p < .001] even though the video information was the same in all of the conditions (see Figure 4.12). Gaze behavior was similar between the baseline and attend-audio conditions (p > .05) and between the attend-video and video-only conditions (p > .05). Pairwise comparisons show that participants spent more time looking at the lower half of the screen in the attend-video condition than in the baseline [t(35) = 6.03, p < .001] or

the attend-audio [t(35) = 6.16, p < .001] conditions. Participants also spent more time looking at the lower half of the screen in the video-only condition compared to the baseline [t(35) = 7.39, p < .001] and attend-audio [t(35) = 6.22, p < .001] conditions.

Figure 4.12: This shows the proportion of eye tracking samples falling on the lower half of the screen (see Figure 4.1 for an illustration) for Experiment 3a. The error bars indicate standard errors of the mean.

Task instructions, and not the offset between the audio and the video, seem to be responsible for the increase in the proportion of time spent looking at the lower half of the screen in the attend-video and video-only conditions compared with the baseline and attend-audio conditions (see Table 4.2). There was no influence of offset on the proportion of samples falling on the lower half of the screen in the audiovisual conditions, nor was there an interaction between task instructions and offset (p > .05). The video-only condition was also not statistically different from the 0 ms attend-video condition (p > .05).

Experiment 3a | Baseline | Attend-audio | Attend-video | Video-only
0 ms offset | 0.60 (SE 0.04) | 0.64 (SE 0.04) | 0.83 (SE 0.02) | 0.87 (SE 0.02)
175 ms offset | 0.62 (SE 0.04) | 0.64 (SE 0.04) | 0.83 (SE 0.02) |
350 ms offset | 0.64 (SE 0.04) | 0.63 (SE 0.04) | 0.84 (SE 0.02) |

Table 4.2: This shows the mean proportion of eye tracker samples falling on the lower half of the screen for Experiment 3a by instruction and stimulus condition. Standard errors of the mean are shown in parentheses. Task instructions, rather than stimulus properties, are the main driving force behind the gaze behavior strategy of looking towards the lower half of the screen in these experiments.

Experiment 3b
Gaze behavior in Experiment 3b shows the same patterns as gaze behavior in Experiment 3a despite the different spatial frequencies in the videos. Overall, task had a significant effect on the proportion of time participants spent looking at the lower half of the screen [F(2.42, ) = 24.65, p < .001] (see Figure 4.13). Like Experiment 3a, gaze behavior was similar between the baseline and attend-audio conditions (p > .05) and between the attend-video and video-only conditions (p > .05). Pairwise comparisons show that participants spent more time looking at the lower half of the screen in the attend-video condition than in the baseline [t(35) = 4.78, p < .001] or the attend-audio [t(35) = 5.68, p < .001] conditions. Participants also spent more time looking at the lower half of the screen in the video-only condition compared to the baseline [t(35) = 7.39, p < .001] and attend-audio [t(35) = 6.50, p < .001] conditions. Task instructions, and not spatial frequency, are primarily responsible for the increase in the proportion of time spent looking at the lower half of the screen in the attend-video and video-only conditions compared with the baseline and attend-audio conditions (see Table 4.3). There was no influence of spatial frequency on the proportion of

samples falling on the lower half of the screen in the audiovisual conditions, nor was there an interaction between task instructions and spatial frequency (p > .05).

Figure 4.13: This shows the proportion of eye tracking samples falling on the lower half of the screen (see Figure 4.1 for an illustration) for Experiment 3b. The error bars indicate standard errors of the mean.

Experiment 3b | Baseline | Attend-audio | Attend-video | Video-only
16.1 cpf | 0.62 (SE 0.04) | 0.63 (SE 0.04) | 0.82 (SE 0.02) | 0.83 (SE 0.03)
4.3 cpf | 0.64 (SE 0.04) | 0.61 (SE 0.04) | 0.83 (SE 0.03) | 0.85 (SE 0.02)

Table 4.3: This shows the mean proportion of eye tracker samples falling on the lower half of the screen for Experiment 3b by instruction and stimulus condition. Standard errors of the mean are shown in parentheses. Task instructions, rather than stimulus properties, are the main driving force behind the gaze behavior strategy of looking towards the lower half of the screen in these experiments.

4.5 Discussion
The results of these experiments show that, overall, directing attention to either the auditory or the visual information in an audiovisual speech task can have an influence on the integration of audiovisual speech information. However, despite participants being fully informed as to the stimuli in the attention conditions, they were still unable to completely ignore competing information in the other modality. Participants were rather unsuccessful at ignoring the competing visual information in the

attend-audio condition. Overall, participants weren't much better at identifying the auditory token when they were trying to specifically attend to the audio and ignore the visual information (attend-audio condition) than when they were just watching the talker and reporting what they heard him say (baseline condition). Across both experiments there was a rather large difference in performance when participants were asked to report what they heard when only the acoustic speech information was present (audio-only condition) versus attending to the auditory information when competing visual information was present (incongruent trials, attend-audio condition). Participants were very good at discriminating the speech sounds when just the acoustic speech information was present, but were only moderately successful at the same discrimination when competing visual information was present and they were asked to selectively pay attention to what they heard. In Experiment 3a, offsetting the auditory and visual speech information allowed participants to diverge their behavior from the baseline condition in the attend-video condition. While participants were not as influenced by the visual information at 350 ms in the baseline condition compared to the 0 ms condition, they were still able to selectively attend to this information in the attend-video condition. It was somewhat

surprising that the difference between the baseline and attend-audio conditions did not increase with increasing offset. At the 350 ms offset the asynchrony is quite noticeable, and participants should have had no trouble telling that the video preceded the audio (for example, see Vatakis & Spence, 2007, who showed this at a 300 ms offset). Even though participants in the current experiment likely realized that the video led the audio in the 350 ms offset condition, they did not seem to make use of the asynchronies to help them in separating the auditory and visual speech information. This could be because separate processes may underlie the perception of synchrony between the auditory and visual speech, and the perceptual integration of the auditory and visual speech. For example, participants can experience the illusory percepts from the McGurk effect and at the same time make accurate temporal judgments (Soto-Faraco & Alsius, 2007). An fMRI study has also suggested a dissociation between neural systems involved in the evaluation of cross-modal coincidence and those that mediate the perceptual binding (Miller & D'Esposito, 2005). This dissociation between synchrony perception and perceptual binding could explain why participants did not seem to use the offset between the auditory and visual stimuli when they were trying to selectively attend to the auditory or visual information. In the case of attending to the auditory information, the difference between the baseline condition and the attend-audio condition remained quite consistent across offsets. In the attend-video condition, while performance did diverge from the baseline condition as the offset was increased, the overall performance in the attend-video condition was very similar across offsets. In Experiment 3a, participants were still unable to break the McGurk effect and ignore the visual information to an appreciable degree. Participants did respond

slightly more often to the auditory token when asked to attend to the audio compared to the baseline. However, the overall sensitivity to the auditory token was not affected by the instructions to attend to the auditory information, as compared to the baseline. This is especially interesting since participants were very good at discriminating the stimuli unimodally. However, this inability to eliminate the McGurk effect is in line with the effects of attentional load (Alsius et al., 2005, 2007) and directed visual attention tasks (Tiippana et al., 2004; Andersen et al., 2009). While those studies showed a modulation of the McGurk effect, the McGurk effect certainly did not go from ceiling in one condition to floor in another. In Experiment 3b, removing spatial frequency information did not make it easier for participants to selectively attend to the auditory information compared to the baseline when the auditory and visual information were mismatched. Removing visual information did have an influence when the auditory and visual information were matched. In the congruent trials of the 4.3 cpf attend-video condition, participants were actually performing worse in the audiovisual condition than they were in the audio-only control condition, even though the auditory information was redundant and more reliable than the visual information. However, removing spatial frequency information may have not only removed visual information useful for disambiguating the consonants, but may have actually made the visual stimuli similar to other consonants that were not used in the experiment. This may have made the 4.3 cpf visual information somewhat misleading when the auditory and visual information were matched. Although the visual 4.3 cpf information was not as informative for the task as the auditory information in Experiment 3b, a strong influence of the visual information persisted. This

could be because, generally, on a day-to-day basis, visual speech information is informative. For example, visual information can be useful for disambiguating aurally similar phonemes based on place of articulation (Binnie, Montgomery, & Jackson, 1974). Previous research on bimodal integration suggests that this integration is relatively optimal, and that information is generally weighted depending on its reliability (Ernst & Banks, 2002; Alais & Burr, 2004; Wozny, Beierholm, & Shams, 2008). However, speech perception is a very over-practiced task, and the weighting of the auditory and visual speech information likely reflects a weighting learned over a lifetime. While the fusion of information between senses is not necessarily mandatory (Hillis, Ernst, Banks, & Landy, 2002), the over-practiced nature of audiovisual speech perception may help to explain why the visual information was influencing the perception of the auditory speech even when the reliability of the visual speech information was reduced. For instance, this weighting most likely includes some knowledge about how likely the auditory and visual signals are to come from the same sensory event, i.e., the unity assumption (Vatakis & Spence, 2007). Subsequent research will need to test whether the general results found in the current study can be extended to other consonants. If the difficulty in ignoring the visual information is based on the likelihood of the visual and auditory information coming from the same event, then it seems likely that the results will generalize to other consonant pairings used to produce the McGurk effect. Attentional instructions did influence gaze behavior. The overall influence of task instructions on gaze behavior is in line with other eye tracking research on the effect of task in visual-only (Lansing & McConkie, 1999) and audiovisual speech perception (Buchan et al., 2007). The overall gaze strategy remained the same over both

experiments, despite differences in the visual stimuli. The strategies for gathering visual information from the lower half of the screen in this study seem to be driven primarily by whether participants are trying to attend specifically to the visual speech information or not, and not by the properties of the visual stimuli. This is consistent with other eyetracking research showing cognitive factors, rather than visual stimulus properties, playing a dominant role in active gaze control (Henderson, Brockmole, Castelhano, & Mack, 2007). It is interesting to note that the baseline and attend-audio conditions show similar gaze patterns, suggesting similar strategies for gathering visual information in those two conditions. Participants in those conditions actually spent less time looking towards the lower part of the screen than when they were attending specifically to the video, suggesting that they may have been attempting to ignore (although not particularly successfully) speech information from the lower half of the face. This is also interesting because, despite the contributions of visual speech information to the perception of acoustic speech, the similarities between gaze in the baseline and attend-audio conditions suggest that participants in the baseline condition are treating it like the attend-audio task, at least in terms of gathering visual information. The differences in gaze between the attend-audio and attend-video conditions suggest that participants have somewhat different goals in terms of gathering visual speech information in the two conditions. While visual speech information occurs mainly in the mouth region (Thomas & Jordan, 2004), it is not strictly limited to the mouth and is more broadly distributed across the face (Benoît, Guiard-Marigny, Le Groff, & Adjoudani, 1996). Extra-oral face movements alone (with the mouth digitally removed) do provide some visual speech information (Thomas & Jordan,

2004). The similarities in gaze to the lower part of the face in the baseline and attend-audio conditions, with more time spent looking away from the lower half of the face than in the attend-video and video-only conditions, suggest that the gathering of lip movement information in audiovisual speech may not be the only priority during face-to-face communication. The gaze patterns probably reflect the concurrent gathering of other facial social information, which is either not available in the lower part of the face (e.g., gaze location) or not confined to it (e.g., emotional information). Gaze location is often taken to be an indicator of visual attention (Findlay & Gilchrist, 2003), and in many cases it likely is a good indicator of visual attention. On the other hand, we need not fixate directly on the information in the visual field that we are attending to, and may not direct our eyes to what we are attending to if there is no benefit to doing so (Posner, 1980). It was only when instructed to pay attention to the visual information, or when only provided with visual information, that participants really seemed to focus their gaze near the visible speech information on the lower half of the face. Direct fixation tends to be necessary for highly detailed visual information (for example, in reading tasks), but is not strictly necessary to gather visual information for audiovisual speech perception. Visual speech information can still be gathered without directly fixating on the lower part of the face (Paré et al., 2003; Andersen et al., 2009). Highly detailed visual information is not necessary for the visual speech information to be acquired and integrated with the auditory speech information (Munhall et al., 2004; MacDonald et al., 2000), and visible speech information is conveyed across a broad range of spatial frequencies (Munhall et al., 2004). Although spatial frequency filtering can influence audiovisual speech perception, both distance and face size seem to have fairly negligible effects. At distances

of up to 20 m, incongruent visual information can still produce a McGurk effect, and visual information in the congruent task allows for better accuracy compared with audio-only performance (Jordan & Sergeant, 2000). The size of the face can be enlarged up to approximately five times life size (Vatikiotis-Bateson et al., 1998), or reduced to about ten percent of its original size (Jordan & Sergeant, 1998), without substantially altering the contribution of visual information to the perception of audiovisual speech. Complementing the findings of the above studies, Paré et al. (2003) showed that fixating on either the mouth or the eyes seems to provide very similar vantage points in terms of gathering visual information during audiovisual speech processing in a McGurk task. It is not until gaze is fixed beyond degrees away from the mouth that the influence of the visual information on the McGurk effect is significantly lessened, and some visual speech information persists even at 40 degrees of eccentricity. In summary, the data presented here suggest only a modest role for attentional effects in audiovisual speech perception. Despite the effect of attentional instructions, audiovisual integration, as shown by the McGurk effect, still occurred to some degree under all incongruent stimulus and attentional conditions. The results support the idea that while attentional instructions can modulate the integration of audiovisual speech, the multisensory integration in this well-learned task is difficult to voluntarily ignore.

Chapter 5

Further analyses of gaze centralization from Experiments 2 and 3

In Experiments 2a and 2b, the addition of a working memory task didn't have much influence on gaze centralization, and to the extent that it did, the addition of the concurrent working memory task led to less, not more, centralization. One explanation for the increased gaze centralization seen in Buchan et al. (2007, 2008) with the addition of acoustic noise was an increased cognitive load due to increased task difficulty. However, the failure of Experiments 2a and 2b to show increased gaze centralization with the addition of a cognitive load task suggests that this is unlikely. In order to better understand the tendency towards gaze centralization seen in Buchan et al. (2007, 2008), exploratory data analysis was performed on the eyetracking data, first for Experiment 2a, and later for Experiments 2b, 3a and 3b. The eyetracking data for the congruent and incongruent trials in Experiments 2

and 3 (in Chapters 3 and 4) had been pooled together because previous eyetracking studies using the McGurk effect have shown very similar gaze patterns for congruent and incongruent trials (Paré et al., 2003; Buchan et al., 2005; Buchan, 2006). In addition, there are several other lines of evidence that suggest that congruent and incongruent stimuli are processed very similarly. For instance, data from a speeded classification task (Soto-Faraco et al., 2004) showed no difference in reaction times between congruent and incongruent stimuli that were perceived as the same consonant. Event-related potential studies show similar ERPs for congruent and incongruent stimuli perceived as the same consonant (e.g., Saint-Amour et al., 2007; see Chapter 1 for further details). Taken together, this would suggest similar processing and gathering of information by the perceptual/cognitive system. However, I was somewhat surprised by the very modest effect of the working memory task on gaze behaviour in Experiments 2a and 2b. Anecdotally, the incongruent stimuli can sometimes seem to be less clear, or harder to understand, than congruent stimuli. Most of the stimuli used in Experiments 2 and 3 are really quite compelling, particularly the acoustic B dubbed onto the visual V, and the acoustic V dubbed onto the visual B, used in Experiments 2b, 3a and 3b. Both of those stimuli produce a large proportion of McGurk responses. That said, the illusory consonant produced by the incongruent tokens may still not be a very good exemplar of that consonant. Despite very similar gaze patterns having been shown for congruent and incongruent stimuli (Paré et al., 2003; Buchan et al., 2005; Buchan, 2006), it seems reasonable to examine the possible influence of congruency on gaze centralization. If there is an influence of congruency on gaze centralization, a similar pattern

of differences should emerge across Experiments 2a, 2b, 3a and 3b, which all have congruent and incongruent stimuli. Because the focus of the exploratory re-analysis of the gaze data is to explore the influence of congruency on gaze centralization, the re-analysis will be referred to as Experiment 4. Table 5.1 shows the original experiments as well as the additional factors that were manipulated in addition to congruency.

Experiment 4 | Original experiment | Other experimental factors
4a | 2a | Working memory task, Temporal offset
4b | 2b | Working memory task, Temporal offset
4c | 3a | Attention condition, Temporal offset
4d | 3b | Attention condition, Spatial frequency

Table 5.1: This shows, for Experiment 4, the original experiments that were re-analyzed, as well as the additional factors that were manipulated in addition to congruency.

Two analyses will be performed. For the first analysis, gaze centralization will be analyzed for Experiments 4a, 4b, 4c and 4d to see if congruency is a significant factor across all four experiments. Gaze centralization will be measured using the same methods as described in Experiment 2 (Chapter 3), using the average distance from the centre of the screen. Gaze centralization will be analyzed using multifactorial repeated measures ANOVAs (see Methods for details), and is averaged across the entire trial. If gaze centralization is consistently influenced by congruency, then we should see a divergence between the gaze behaviour on the incongruent and congruent trials emerge during the speech. While there would be some coarticulatory information about the consonant in the initial vowel, it seems that the consonant burst would be the most critical piece of information about the consonant. For the second analysis, a descriptive time series analysis will be performed on the gaze for

the trial in order to examine if there are any consistent patterns that emerge over time across the different experiments.

5.1 Experiment 4

Gaze centralization analysis

Methods

Data from all participants from Experiments 2a, 2b, 3a and 3b were included in the re-analysis. See Chapter 3 for details about the eyetracking gaze centralization measure. The gaze centralization data for Experiments 4a and 4b were analyzed with (congruency × working memory task × offset) ANOVAs. For Experiment 4c a (congruency × attention condition × offset) ANOVA was used, and for Experiment 4d a (congruency × attention condition × spatial frequency) ANOVA was used.

Results

Gaze was more centralized in incongruent trials than in congruent trials, although the size of this difference varied across experiments. In Experiment 4a this difference between congruent and incongruent trials was approximately 1.5 degrees of visual angle. In Experiment 4b, the difference was approximately 0.4 degrees of visual angle. In Experiments 4c and 4d, the difference was approximately 0.1 degrees of visual angle. Congruency had a significant influence on gaze centralization in Experiments 4a [F(1, 23) = 26.36, p < .001], 4b [F(1, 23) = 6.46, p = .018] and

4c [F(1, 32) = 4.21, p = .048]. Although the degree of difference was the same in Experiments 4c and 4d, the difference in Experiment 4d failed to reach significance [F(1, 35) = 3.82, p = .059]. The influence of temporal offset has already been explored in Chapter 3 for Experiments 4a and 4b. The same pattern of gaze becoming more centralized with increasing offset that was seen in Chapter 3 has also been found in the re-analysis of Experiments 4a, 4b and 4c. The difference between the 0 ms offset and the 350 ms offset varies between 1.3 (Experiment 4a) and 0.3 (Experiment 4b) degrees of visual angle. This influence was significant for all three experiments with temporal offsets: Experiments 4a [F(2, 46) = 8.35, p = .002], 4b [F(1.24, 28.10) = 7.78, p = .006], and 4c [F(2, 64) = 4.20, p = .019]. The influence of the working memory task was not completely consistent across Experiments 4a and 4b. The influence of working memory was significant for Experiment 4b [F(1, 23) = 18.01, p < .001], but just failed to reach significance for Experiment 4a [F(1, 23) = 4.08, p = .055]. There were no significant interactions of congruency or temporal offset with the working memory task on gaze centralization in either Experiment 4a or 4b (p > .05). There was no significant main effect of attention condition in either Experiment 4c or 4d (p > .05), although Experiment 4d was close [F(1.32, 46.35) = 3.82, p = .054]. In Experiment 4c, however, there were significant interactions of both congruency [F(2, 64) = 3.49, p = .037] and temporal offset [F(2.78, 89.01) = 4.56, p = .006] with the attention condition. In Experiment 4d, there were no significant interactions (p > .05) of any factors.
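To make the analysis above concrete, the following is a minimal sketch, in Python, of how the gaze centralization measure and one of the repeated measures ANOVAs could be computed. The variable and column names are hypothetical, the original analyses were run with other software, and the sketch does not apply the sphericity corrections reflected in the fractional degrees of freedom reported above. It assumes each trial's gaze samples are available as arrays of x and y screen coordinates in pixels, and uses the screen centre and pixel-to-degree scaling given in the time series Methods below.

import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Values taken from the time series Methods below (800 x 600 display,
# 50 pixels is approximately 2.8 degrees of visual angle).
SCREEN_CENTRE = np.array([400.0, 300.0])
DEG_PER_PIXEL = 2.8 / 50.0

def gaze_centralization(x, y):
    # Mean Euclidean distance of gaze from the screen centre over a trial,
    # in degrees of visual angle (larger values mean less centralized gaze).
    samples = np.column_stack([x, y])
    dist_px = np.linalg.norm(samples - SCREEN_CENTRE, axis=1)
    return dist_px.mean() * DEG_PER_PIXEL

def centralization_table(trials):
    # Build a long-format table with one row per trial.  `trials` is a
    # hypothetical list of dicts with keys: subject, congruency, wm_task,
    # offset, and the sample arrays x and y.
    rows = [{
        "subject": t["subject"],
        "congruency": t["congruency"],
        "wm_task": t["wm_task"],
        "offset": t["offset"],
        "centralization": gaze_centralization(t["x"], t["y"]),
    } for t in trials]
    return pd.DataFrame(rows)

def run_anova(df):
    # Congruency x working memory task x offset repeated measures ANOVA
    # (the design used for Experiments 4a and 4b), averaging over the
    # repeated trials within each cell.
    return AnovaRM(df, depvar="centralization", subject="subject",
                   within=["congruency", "wm_task", "offset"],
                   aggregate_func="mean").fit()

For Experiments 4c and 4d the within-subject factors would simply be swapped for attention condition and either offset or spatial frequency, as described in the Methods above.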

Time series graphs

Congruency seems to be a factor affecting gaze centralization. In three of the four experiments (4a, 4b and 4c), there was a significant difference between congruent and incongruent trials, and a near-significant difference in the fourth experiment (4d). The difference in gaze centralization between congruent and incongruent trials must emerge over time, because the difference between the congruent and incongruent trials cannot be detected at the beginning of the trial. While there is some coarticulatory information about the consonant in the initial vowel, it seems likely that the most useful information to distinguish between a congruent and incongruent trial would be the information in the consonant of the VCV syllable. For experiments with temporal offset as a factor (Experiments 4a, 4b and 4c), temporal offset also appears to reliably influence gaze centralization. The temporal offset between the video and audio is not apparent at the beginning of the trial and, like the influence of congruency, must also emerge over time. If both the congruency and the offset of the auditory and visual information influence the gathering of the visual information, then we should see the difference between gaze in the congruent and incongruent trials shift with increasing offset. Although there were no significant influences of either congruency or spatial frequency in Experiment 4d, graphs were still included for that experiment to see whether there were any similarities with the time series graphs for Experiments 4a, 4b and 4c. For each of the horizontal (x) and vertical (y) axes, time series graphs of the distance from the centre of the nose over time were plotted. This was done to see if similar patterns would emerge between congruent and incongruent trials in the different experiments. Differences in gaze that emerge after the consonant burst

could reflect decision-making processes about which consonant sound was produced. It is quite possible that subjects will either look down towards the keyboard, or at least deflect their eyes down somewhat, when making a response. It is expected that differences based on congruency would most likely be evident on the vertical axis. Graphs have also been included for the horizontal axis, although it is not expected that there would be any meaningful influences of congruency or offset on gaze behaviour on the horizontal axis. In order to pick ahead of time which graphs would be most meaningful, rather than create every possible graph and pick among those that seemed to show a similar pattern, the results of the gaze centralization analysis, as well as the results from the original analyses of the original studies, were used to decide which conditions should be collapsed together or separated out in the graphs. Congruency was separated for all graphs, so there are separate lines on the graphs for congruent and incongruent trials. Temporal offset also seems to be a factor of interest in this analysis, so the three temporal offsets (0 ms, 175 ms and 350 ms) have been separated out for the experiments that include this factor (Experiments 4a, 4b and 4c). To allow for some comparison between the graphs, the same conditions were collapsed or separated for Experiments 4a and 4b, and for Experiments 4c and 4d. For Experiments 4a and 4b, there were no significant interactions of congruency or temporal offset with the working memory condition. Because of this, and the fact that congruency and temporal offset were the main focus of this analysis, data were collapsed across both working memory conditions. For Experiments 4c and 4d, the data was separated according to the attention condition. There were significant interactions between the attention condition and both the congruency and temporal offset

factors. This interaction with the attention condition is likely due to the different overall gaze strategies in Experiments 4c and 4d. More time was spent looking towards the bottom of the screen in the attend-video condition as compared with the baseline and attend-audio conditions. Although the attention condition was only shown to interact with congruency in Experiment 4c, separate graphs have been created for each of the different attention conditions for both Experiments 4c and 4d. A graph has been included for Experiment 4c to further examine the overall gaze strategy of looking towards or away from the lower half of the screen for the attend-audio and attend-video conditions. Because there were no significant influences of spatial frequency, nor interactions of spatial frequency with the other factors, the data for Experiment 4d was collapsed across spatial frequency.

Methods

The eyetracking data was sampled at 500 Hz. For both the x and y coordinates, the average position relative to the centre of the talker's nose (as reported in the original methods for Experiments 2 and 3) was used. For each trial, for each participant, the positions of the samples were binned in 50 ms increments, and the average position of all these samples was recorded on the graph. A 50 ms bin size was used in order to smooth the data. This binned data was then sorted by condition, and data for a given trial condition for all participants was averaged together for each corresponding bin. Because these experiments were not designed with this (re)analysis in mind, the length of the videos and the placement of the speech were not exactly identical in all of the stimuli. While the overall lengths of each speech stimulus, as well as the durations

and temporal locations of the speech, are very similar, there are small differences between the stimuli. In particular, the videos with the vowel a, while very similar in length, were not identical in length to those with the vowel i. This latter discrepancy should not affect the congruency analysis, as only the consonants were incongruent, while the vowels were always matched together. In order to group all of the congruent trials together, and all of the incongruent trials together, the trials were aligned according to the start of the consonant. This was accomplished by locating the start of the acoustic burst for the consonant in the auditory stimuli (the same location used to align the audio and video consonants to create the incongruent stimuli) for the 0 ms (or non-offset) stimuli. Because the videos were of slightly different lengths, the initial two 50 ms bins and the final 50 ms bins were trimmed from the graphs because they did not contain data from all of the conditions. The additional offsets, with the audio trailing the video, were aligned based on where the burst would have been at 0 ms. This was done so that all of the videos were aligned to the consonant burst in the auditory file. The location of the acoustic burst at 0 ms is noted on each graph. For the graphs of Experiments 4a, 4b and 4c, the locations of the acoustic burst at 175 and 350 ms are also noted on the graph. The screen resolution in the experiments was 800 horizontal pixels by 600 vertical pixels. The axes of the graphs are in pixels. Fifty pixels corresponds to approximately 2.8 degrees of visual angle. The graphs are plotted with 0 as the centre point of the nose. The nose was located very close to the centre of the screen, which was used as the reference point in the gaze centralization analyses: the centre of the screen is at 400 (x) and 300 (y), while the actual coordinates of the nose were 413 (x) and 313 (y). This difference between the centre of the screen and the nose location is approximately a quarter of a degree of visual angle.
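As a concrete illustration of the binning and alignment just described, here is a minimal sketch in Python. The variable names are hypothetical (the original processing used in-house scripts), and it assumes each trial provides timestamped x and y gaze samples together with the time of the acoustic consonant burst in the corresponding 0 ms stimulus.

import numpy as np

BIN_MS = 50                  # bin size used to smooth the 500 Hz samples
NOSE_X, NOSE_Y = 413, 313    # nose coordinates in pixels, as reported above

def bin_trial(times_ms, x_px, y_px, burst_ms):
    # Bin one trial's gaze samples into 50 ms bins, expressing positions
    # relative to the centre of the talker's nose and re-aligning time so
    # that 0 corresponds to the acoustic burst of the 0 ms (non-offset) stimulus.
    t = np.asarray(times_ms, dtype=float) - burst_ms
    x = np.asarray(x_px, dtype=float)
    y = np.asarray(y_px, dtype=float)
    bins = np.floor(t / BIN_MS).astype(int)
    out = {}
    for b in np.unique(bins):
        sel = bins == b
        out[int(b) * BIN_MS] = (x[sel].mean() - NOSE_X, y[sel].mean() - NOSE_Y)
    return out  # {bin start time in ms: (mean x, mean y) relative to the nose}

def average_condition(trials):
    # Average the binned positions across all trials of one condition,
    # keeping only the bins that every trial contributes to, which
    # effectively trims the initial and final partial bins.
    binned = [bin_trial(t["times_ms"], t["x"], t["y"], t["burst_ms"])
              for t in trials]
    common = sorted(set.intersection(*(set(b.keys()) for b in binned)))
    return {k: (np.mean([b[k][0] for b in binned]),
                np.mean([b[k][1] for b in binned]))
            for k in common}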

The distance from the centre of the nose on each of the horizontal (x) and vertical (y) axes is plotted on the graphs.

Graphs

The congruency of the stimuli seems to have an influence on the average vertical position on the screen. Overall, a similar pattern can be seen in the graphs for Experiments 4a, 4b and 4c-baseline (see Figures 5.1, 5.2 and 5.3). In both congruent and incongruent trials, the average position on the vertical axis remains quite close to the centre of the screen until sometime after the consonant burst, where the gaze dips downwards towards the bottom of the screen. After the consonant burst, there appears to be a slight divergence between the congruent and incongruent trials. Gaze on the vertical axis for the incongruent trials appears to be shifted slightly later in time compared to the gaze position for the congruent trials. This pattern is present most clearly in the graphs for Experiments 4a, 4b and the baseline condition for Experiment 4c (see Figures 5.1, 5.2, and 5.3, respectively), although the same general tendency can still be seen in the baseline condition for Experiment 4d, and the attend-audio conditions for Experiments 4c and 4d. There does not appear to be much difference between the congruent and incongruent trials in the attend-video conditions for Experiments 4c and 4d. In the original Experiments 3a and 3b, participants generally seemed to be more able to selectively ignore the auditory information than the visual information. When paying attention to the visual information, they may have been relatively successful at ignoring the auditory information, and therefore less likely to be influenced by the congruency of the auditory and visual stimuli.

The temporal offset between the auditory and visual stimuli also seems to have an influence on the average gaze behaviour on the vertical axis of the screen. Gaze behaviour is similar across the three offsets, yet as the offset is increased, the divergence between the congruent and incongruent trials is shifted later in time. This can be seen in Experiments 4a, 4b and 4c, although the actual size of the time shift by offset does vary from experiment to experiment. For instance, the divergence between the congruent and incongruent gaze seems to appear later in Experiments 4b (Figure 5.2) and 4c-baseline (Figure 5.3) than it does in Experiment 4a (Figure 5.1). Remember that gaze had been aligned to the moment of the acoustic burst in the audio file at 0 ms. Because the offsets were created by having the audio trail the video by either 175 or 350 ms, the videos have essentially all been aligned to the same point. Thus, the auditory acoustic burst for the 175 ms and 350 ms offsets will be 175 ms and 350 ms, respectively, after the point of alignment. There does not appear to be any influence of either congruency or temporal offset on the average gaze behaviour on the horizontal axis. These figures can be found in Appendix D.

Discussion

Gaze behaviour in audiovisual speech has been previously shown to be influenced by auditory information. For example, Buchan et al. (2007) and Buchan et al. (2008) showed that increasing the acoustic noise levels in an audiovisual speech-in-noise task

Figure 5.1: This shows the average position, on the vertical axis of the screen during the video, for Experiment 4a. The time of the acoustic burst at 0 ms is marked on the graph, as is the acoustic burst shifted by 175 and 350 ms.

Figure 5.2: This shows the average position, on the vertical axis of the screen during the video, for Experiment 4b. The time of the acoustic burst at 0 ms is marked on the graph, as is the acoustic burst shifted by 175 and 350 ms.

Figure 5.3: This shows the average position, on the vertical axis of the screen during the video, for the baseline condition in Experiment 4c. The time of the acoustic burst at 0 ms is marked on the graph, as is the acoustic burst shifted by 175 and 350 ms.

Figure 5.4: This compares the average position, on the vertical axis of the screen during the video, for the attend-audio and attend-video conditions in Experiment 4c. The time of the acoustic burst at 0 ms is marked on the graph. This is included to show that the average gaze position is lower on the vertical axis of the screen for the attend-video as compared with the attend-audio condition. This is the same pattern shown in the original experiment (Experiment 3a, see Chapter 4 for more details). This difference in gaze strategies between the two conditions is adopted quite early in the trial, and persists for most of the trial. It is because of this overarching gaze strategy adopted by subjects that the graphs for Experiments 4c and 4d were split by attention condition.

Figure 5.5: This shows the average position, on the vertical axis during the video, for the attend-audio condition in Experiment 4c. The time of the acoustic burst at 0 ms is marked on the graph, as is the burst shifted by 175 and 350 ms.

Figure 5.6: This shows the average position, on the vertical axis during the video, for the attend-video condition in Experiment 4c. The time of the acoustic burst at 0 ms is marked on the graph, as is the burst shifted by 175 and 350 ms.

Figure 5.7: This shows the average position, on the vertical axis during the video, for the baseline condition in Experiment 4d. The time of the acoustic burst at 0 ms is marked on the graph.

Figure 5.8: This shows the average position on the vertical axis during the video for the attend-audio condition in Experiment 4d. The time of the acoustic burst at 0 ms is marked on the graph.

Figure 5.9: This shows the average position on the vertical axis during the video for the attend-video condition in Experiment 4d. The time of the acoustic burst at 0 ms is marked on the graph.

led to an increased tendency to look away from the eyes and move gaze towards the nose. However, the divergence in gaze between the congruent and incongruent trials in a McGurk task is interesting, especially considering that Paré et al. (2003), Buchan et al. (2005) and Buchan (2006) did not show differences in gaze behaviour between congruent and incongruent trials. Paré et al. (2003) looked at the number of fixations falling on the right eye, left eye and mouth at the start of the acoustic speech and at the start of the acoustic burst, and saw that the numbers were very similar for congruent and incongruent trials. Buchan et al. (2005) and Buchan (2006) looked at the proportion of the trial spent looking at the right eye, left eye, nose and mouth, and found the proportions to be similar for both congruent and incongruent trials. A likely explanation for the discrepancy between those studies, which showed no gaze difference between congruent and incongruent trials, and the current study is that the previous studies (Paré et al., 2003; Buchan et al., 2005; Buchan, 2006) all used region of interest analyses looking at areas of the face such as the eyes, nose and mouth. The differences in gaze shown in Experiment 4 between the incongruent and congruent trials are rather subtle, and it is quite probable that the gaze centralization analysis and the time series graphs revealed differences that would not have been shown using a region of interest analysis. Several studies presented in Chapters 1, 3 and 4 have argued that the processing of congruent and incongruent trials is very similar. For instance, the EEG studies (Kislyuk et al., 2008; Saint-Amour et al., 2007; Colin et al., 2002, 2004) and the MEG study (Sams et al., 1991) mentioned in earlier chapters suggest that congruent and incongruent trials are processed according to how they are perceived, and not according to the auditory token. The perceived acoustic token and the actual acoustic token are

processed in the same manner according to the mismatch negativity (MMN) components. Additionally, the speeded reaction time study by Soto-Faraco et al. (2004), also previously mentioned in Chapters 1, 3, and 4, showed similar reaction times to congruent and incongruent stimuli. It is possible that analyzing eye gaze data may offer a way to show a subtle difference in the processing of incongruent and congruent stimuli. The influence of temporal offset on gaze behaviour is interesting. One possible explanation for the influence of the temporal offset of the auditory information on the gathering of visual information is that it reflects how subjects bind the audiovisual information. Auditory information is generally more reliable than visual information in the temporal domain (Spence, 2007), and subjects may be using the auditory temporal information to determine the timing of the visual information. For example, auditory stimuli have been shown to influence the perceived presentation rate of rapidly presented auditory and visual stimuli (Shipley, 1964; Welch, DuttonHurt, & Warren, 1986; Recanzone, 2003). Perhaps most interestingly, visual stimuli have been shown to be pulled, or temporally ventriloquized, into approximate temporal alignment with corresponding auditory stimuli (Morein-Zamir, Soto-Faraco, & Kingstone, 2003; Vroomen & Keetels, 2006). It is possible that the shift in time of the divergence in gaze patterns between the incongruent and congruent stimuli with increasing offset reflects a reliance on the temporal information to try and figure out when the syllable occurred. Related to this, the visual speech stimuli may have been at least partially temporally ventriloquized into alignment with the acoustic speech. This could occur even if subjects are also aware of the misalignment of the auditory and visual stimuli. The perception of the actual temporal alignment of the auditory and visual stimuli does not seem to necessarily correspond to the perceptual integration of the auditory

and visual speech information. At certain (in this case auditory-leading) offsets, it was possible for subjects to reliably identify that the auditory token preceded the visual token, yet perceive the resulting percept as if the visual token had preceded the auditory token (Soto-Faraco & Alsius, 2007). Additionally, there seems to be a dissociation between the brain networks involved in processing the temporal alignment and those involved in mediating the perceptual binding of the auditory and visual speech information (Miller & D'Esposito, 2005). The shift in gaze related to the different temporal offsets could be a reflection of these processes.

Chapter 6

Examining the integration of auditory and visual speech information in distractor talking faces

This chapter will focus on the integration of auditory and visual information in distracting faces. In Experiment 5 the question of whether knowledge of a language can influence the binding of auditory speech information to a target face among several distractor faces will be addressed. Experiments 6a-6c will address whether the auditory and visual information for distractors is integrated by manipulating the visual information of the distractors and seeing if this alters performance on the audiovisual target task.

6.1 Experiment 5

Cognitive factors have been shown to influence the segregation of simultaneous acoustic speech streams. Distractors that are intelligible to a listener seem to be more distracting than those that are not, and this may be due to informational masking (Leek et al., 1991; Brungart, 2001). For example, the intelligibility of auditory target speech has been shown to be better if the distractor speech is time-reversed than if it is played forward (Rhenbergen & Versfeld, 2005). For native speakers of a target language, distractors in the same language as the target tend to be more distracting than distractors in a different language. This has been shown for various language combinations, including English targets in English or Mandarin babble (Van Engen & Bradlow, 2007), Dutch targets in Dutch or Swedish babble (Rhenbergen & Versfeld, 2005), and English targets in English or Swedish babble (Garcia Lecumberri & Cooke, 2006). Cognitive factors have also been shown to influence the integration of auditory and visual information. For instance, talker familiarity has been shown to improve acoustic speech identification in noise (Nygaard & Pisoni, 1998). Talker familiarity is also influential in perceiving the McGurk effect. While it has been shown that the auditory and visual speech information streams can be integrated from talkers of different genders (Green et al., 1991; Walker et al., 1995), the susceptibility to this effect decreases when subjects are familiar with the talkers (Walker et al., 1995). Mismatching talker genders can also influence judgments of temporal order for temporally offset audiovisual speech. Subjects are better at judging the temporal order of the auditory and visual speech when talker genders are mismatched (Vatakis & Spence, 2007; Vatakis et al., 2008).

Language experience has also been shown to influence the perception of synchrony in audiovisual speech. Navarra, Alsius, Velasco, et al. (2010) have shown a slight difference in audiovisual synchrony perception between native and non-native speakers of a language. They used both native English and native Spanish speakers, and both English and Spanish sentence stimuli. The visual stimuli had to lead the auditory stimuli more for native than for non-native speakers. These differences in synchrony perception between native and non-native speakers decreased with language experience. Matching an auditory speech target to one of several visually articulated faces seems to be influenced by the number of distractors present. An experiment looking at matching auditory speech targets to a set of either 2, 3 or 4 articulating faces (Alsius & Soto-Faraco, 2011) has shown that reaction times to detecting and locating the face matching the auditory target increase with the number of faces. This suggests that the matching of a target voice to one of several faces is not likely an automatic process. Can language experience facilitate the matching of an auditory and visual target, and will the language of distractor articulating faces affect the matching of the targets? To address these questions, a paradigm similar to one of the conditions in one of the experiments in Alsius and Soto-Faraco (2011) was used. In this condition participants had to match an auditory sentence to one of four faces presented simultaneously but at different locations on the screen, and respond using arrows on the keyboard. In Experiment 5, four faces were presented simultaneously on the screen in each trial, and subjects had to locate which face matched the target voice. The target voice and matching face were either in English or in Spanish, and the distractor faces were either in English or in Spanish.

Methods

Participants

There were 19 subjects (11 females) with a mean age of 21.1 years (range 19-30). Subjects were native speakers of English, and were not native or fluent speakers of Spanish. All subjects reported normal or corrected to normal vision, and no speech or hearing difficulties.

Stimuli

The stimuli used in this experiment were derived from the original audiovisual stimuli used by Navarra, Alsius, Velasco, et al. (2010). The audiovisual stimuli were from a male bilingual speaker who spoke both English and Spanish sentences. The sentences in both languages were obtained from different (and non-popular) tales on literature-specialized web pages. Mirror performance by native and non-native speakers of English and Spanish on synchrony judgments (Navarra, Alsius, Velasco, et al., 2010) suggests that the sentences were reasonably comparable across languages. Appendix D provides examples of the sentences. In total, eighty-eight English sentences and eighty-eight Spanish sentences were used in the experiment. Several other sentences were used in practice trials to familiarize subjects with the task. The videos were edited so that only the lower half of the face, just below the tip of the nose, was used (see Figure 6.1). The four faces were arranged similarly to the faces used by Alsius and Soto-Faraco (2011). An auditory sentence was played on each trial that matched one of the visually presented articulating faces. The other three distractor articulating faces were pseudorandomly chosen from the set of videotaped sentences, so that each distractor was presented

Figure 6.1: This shows the configuration for the talkers on the screen for Experiment 5. Each face subtends approximately 10.1 degrees of visual angle at its widest point and each mouth subtends approximately 3.8 degrees of visual angle. Though not shown here, a small white fixation cross was displayed at the centre of the screen.

the same number of times throughout the experiment.

Experimental procedures

The experiments took place in a single-walled sound booth and participants were seated approximately 57 cm away from a 22in flat CRT monitor (ViewSonic P220f). The audio signal was played from speakers (Paradigm Reference Studio/20) positioned on either side of the monitor. English target and Spanish target sentences were run in separate blocks, and at the beginning of the block participants were presented with practice trials to familiarize

them with the task. A small white fixation cross was present before and during the video. Subjects were told to look at the fixation cross at the start of the trial, and encouraged to keep fixated on the cross during the trial. While subjects could move their gaze if they wanted to, observation of the subjects showed that they tended to keep their eyes fixed on or near the central cross. Subjects were told that the voice matched one of the faces, and that they needed to match the voice with the face. For each trial, subjects were asked to respond as soon as they detected which face corresponded to the audio by pressing the 0 key on the number keypad on the keyboard, and were told that this was a speeded reaction time task. After participants made their response to detecting a match, they were then asked to locate which face matched the auditory sentence using arrow keys (up, down, right or left arrows on the number pad).

Design and Analysis

Experiment 5 was carried out as a within-subjects design. A factorial design using both English and Spanish as targets and distractors produced four conditions: English targets with English distractors, English targets with Spanish distractors, Spanish targets with English distractors, and Spanish targets with Spanish distractors. The experimental conditions were presented in four blocks. The two blocks with targets in the same language were paired together. For example, in the first two blocks, all of the target sentences were presented in English (in one block the distractors were in English, and in the other in Spanish), and in the last two blocks all of the target sentences were presented in Spanish (in one block the distractors were in English, and in the other in Spanish). The order of the English targets and Spanish targets was

counterbalanced between subjects, as was the order of the distractor language. Reaction times were analyzed using a 2 × 2 (target language × distractor language) within-subjects repeated measures ANOVA. (Percent correct data was not analyzed statistically since performance was at ceiling across the conditions.)

Results and Discussion

Overall, participants were quite accurate at matching the auditory target with the visual target. The mean accuracy was 99.4% correct (SE 0.2) and performance in all conditions was at ceiling (ranging from %). Because of this, only reaction times to correct trials were included in the analysis. The language of the target sentence had an influence on reaction times. Reaction times were significantly shorter when English, as opposed to Spanish, was the target language [F(1, 18) = 15.91, p < .001] (see Figure 6.2). However, the language of the distractor faces did not affect reaction times for matching the auditory and visual target sentences (p > .05), and there was no significant interaction between the target language and the distractor faces (p > .05). Faster reaction times to the English targets rather than the Spanish targets could be because subjects were more engaged in the task when the target was the English sentence. The reaction times for the English targets (approximately 2677 ms) were similar to the reaction times found in Alsius and Soto-Faraco (2011) in the four-face distractor localization condition (approximately 2716 ms). The longer reaction times in Alsius and Soto-Faraco (2011) likely reflect slight differences in the tasks used in each experiment. It is possible that the Spanish sentences are actually harder to

Figure 6.2: This shows the reaction times to matching the target auditory sentence with the target face. Only the difference between target languages is significant. The error bars indicate standard error of the mean.

match audiovisually, although if this was the case it seems likely that there would be either a cost or a benefit to having the distractors in the same or a different language, and there was no interaction observed between target and distractor language in the current experiment. Also, the mirror performance of native English speakers and native Spanish speakers on synchrony judgments with the same sentences in Alsius and Soto-Faraco (2011) suggests that this is unlikely to be the case. The lack of influence of the distractor language may offer some insight into how subjects might have been performing this task, as well as the audiovisual matching tasks used by Alsius and Soto-Faraco (2011). Since the distractor language does not have an effect, nor is there an interaction between the language of the target and the distractor, it seems that there is little useful information about language conveyed in the moving face in terms of helping to find the target. Once a possible match has been found, knowledge of the language does seem to speed up reaction times. However, the lack of an effect of the distractor language suggests that the actual language of the auditory and visual information likely does not help subjects to find the match initially, but knowledge of the auditory language may allow subjects to more quickly confirm a match when they find it. The overall results suggest that while an understanding of the auditory sentence could help match auditory and visual speech streams, other linguistic features, such as the different rhythmic structures and intonation patterns of the two languages (Bahrick & Pickens, 1988), are not likely very important for matching the auditory and visual speech streams. It could be that we are not as sensitive to this information in the visual modality as we are in the auditory modality. For example, infants have been shown to integrate auditory and visual speech information as evidenced

by the McGurk effect (Rosenblum et al., 1997), and to be sensitive to the different rhythmic structures and intonation patterns between audiovisual English and Spanish sentences (Bahrick & Pickens, 1988). On the other hand, infants do not seem to be sensitive to these differences using video-only presentations of the same sentences (Bahrick & Pickens, 1988). So, despite the fact that the motion of the face contains a fairly high amount of correlated information with the speech acoustics (Yehia, Rubin, & Vatikiotis-Bateson, 1998), adults may also be less sensitive to language differences in the visual domain than they are in the auditory domain.

6.2 Experiment 6a

There have been several studies that have used either one face and multiple voices (Driver, 1996; Alsius & Soto-Faraco, 2011; Helfer & Freyman, 2005), or multiple faces and one voice (Alsius & Soto-Faraco, 2011; Andersen et al., 2009) (and Experiment 5). In these experiments, when the information from an auditory and a visual stream was combined, the distractors would have remained unimodal. In the real world, unless someone is perhaps watching TV with the sound muted and listening to someone else speak, talking faces usually go along with talking voices, and vice versa. Distractors are commonly used in attentional studies (Treisman & Gelade, 1980; Lavie, 2005; Hickey, Di Lollo, & McDonald, 2009; D. E. Wilson, Muroi, & MacLeod, 2011), yet both the influence of audiovisual speech distractors on the processing of target audiovisual speech, and the extent to which the auditory and visual speech information of distractors are integrated, remain largely unexplored. Both the influence of audiovisual speech distractors on the processing of target

audiovisual speech, and the extent to which the auditory and visual speech information of distractors are integrated, will be addressed in Experiments 6a, 6b and 6c. In each of these experiments, the visual stimuli of the target, and the auditory stimuli of both the target and distractor talkers, were held constant. Only the visual stimuli (the videos) of the distractor talkers were manipulated. The distractors were manipulated to create three conditions: 1) the motion of the distractor videos matched the audio of the distractors, 2) the motion of the distractor videos was mismatched to the audio of the distractors, and 3) there was no motion in the videos of the distractors (still images). If the audio and the video of the distractors are integrated, matching audio and video distractors could help segregate the speech streams by facilitating streaming of the target from the distractors. This could increase performance on the target task. In this case we could expect better performance in the matching condition than in the mismatching or no motion conditions, since neither of the latter two types of video would provide information useful for audiovisual integration. It is also possible that just the presence of the videos of the distractors is distracting. In this case we could expect better performance in the no motion condition than in the mismatch condition.

Methods

Participants

There were 32 subjects (22 females) with a mean age of years (range years). All subjects were native speakers of English, and reported normal or corrected to normal vision, and no speech or hearing difficulties.

Stimuli

Because of the long reaction times for identifying matching auditory and visual stimuli with multiple competing stimuli (see Experiment 5 and Alsius & Soto-Faraco, 2011), relatively long sentences were needed in order to give participants a good chance to integrate the voice and face. Each sentence was used only once in the experiment. See Appendix E for the target sentences as they appeared in the experiment. Five talkers were filmed saying the sentences using digital audio and video recording equipment. The stimuli were converted to black and white to minimize differences in colour balance. (Colour versus black and white does not affect audiovisual speech perception (Jordan, McCotter, & Thomas, 2000).) The stimuli were edited into clips in Final Cut Pro. Audio levels were normalized using custom MATLAB software. The stimuli were then grouped into twenty sentence sets of 5 sentences (one sentence from each talker), with the target sentences either the same duration as or shorter than the distractor sentences. The sentences were then aligned based on the acoustics so that the sentences in a set all started at the same time. Sentence sets of approximately the same length were then paired together (for example, sentence set 1 could be paired with sentence set 2), and edited in length so that the sentences from the target talker for the paired sentence sets were the same number of frames in length. Occasionally words from the target sentences had to be cut for length, but the cuts were always made between words (based on both the acoustics and the video). All the sentences are in Appendix E. Sentences that were cut for length can be identified by "..." placed after the last word before a cut. The videos of the talkers were arranged with the target talker in the centre (see Fig. 6.3). This created the stimuli for the matching condition, and the audio for all three conditions. To create the sentences

for the mismatching motion condition, the videos from the distractor sentences were swapped with those from the paired sentence set, so that all the sentences were the same length, and started and stopped at the same time, but the videos for the distractors didn't match the audio for the distractors. This ensures that the matching and mismatching conditions have the same audio for target and distractors, and the same videos for the target, and only differ in terms of the videos for the distractors. Somewhat surprisingly, the mismatching distractor motion isn't particularly obvious when watching the stimuli. To create the videos for the no-motion condition, for each distractor, a video consisting of a still frame with the talker's mouth closed and displaying a fairly neutral face was pulled from the start of a video used in the matching condition, and was edited to the same length as the audio for each sentence set. The order of presentation of the sentence sets was randomized for each participant. The videos were all the same size, and the small size of the video should not affect the perception of audiovisual speech. Shrinking the face to as small as 10% of its original size does not dramatically reduce the effectiveness of the visual cues (Jordan & Sergeant, 1998). While it is recognized that subjects would likely fixate on the target at the centre of the screen, thus presenting the distractors peripherally on the retina, no attempt was made to scale the images to correct for this. Both stimulus and task complexity seem to be factors in determining whether spatial scaling can equate performance across the visual field (Melmoth, Kukkonen, Mäkelä, & Rovamo, 2000). There is also some disagreement as to whether certain attributes can be scaled to equate performance. For example, the extent to which biological motion can be scaled to equate for differences in retinal eccentricity is debated. Studies using point light displays have argued that biological motion may either not be scalable, as argued by

Ikeda, Blake, and Watanabe (2005), or it may be scalable, but motion perception with peripheral vision is relatively more sensitive to masking noise, and may take more time to process (Thompson, Hansen, Hess, & Troje, 2007). In any case, all of the videos were presented within 40 degrees of eccentricity. At that distance Paré et al. (2003) have shown that visual speech information can still be fused with the auditory information. Preliminary testing with the same audio levels for each of the talkers showed the task to be very difficult. The audio levels of the distractors were therefore attenuated to make it a bit easier to hear the target. Separate audio files, one for the target and one for the four distractors, were also exported from Final Cut Pro to test audio levels. The audio for the target (without the distractors) was measured at 60 dB(A). The audio for all the distractors (without the target) was also measured at 60 dB(A).

Experimental procedures

The experiment took place in a single-walled sound booth and participants were seated approximately 57 cm away from a 22in flat CRT monitor (ViewSonic P220f). The audio signal was played from speakers (Paradigm Reference Studio/20) positioned on either side of the monitor. DMDX display software ( ~kforster/dmdx/dmdx.htm) was used to control the presentation of the stimuli. Subjects were assigned to one of the three experimental conditions (either matching, mismatching, or no motion). To ensure that participants were familiar with the target's voice, they were given two trials of the same sentence (not used in the experiment) with just the target's audio and video, and asked to repeat back what the target said. They were then given two trials with the same target sentence, but with

the distractor audio and video as well. For the experiment, subjects were asked to watch the target, and repeat back as many words as possible. Subjects' verbal responses were recorded with Cool Edit Pro and scored later from the .wav files. The scoring used was similar to that used for the CID sentences (Davis & Silverman, 1970) in Experiments 1a and 1b. All of the words that were in the target sentence, but not in any of the acoustic distractor sentences that were played concurrently with the target sentence, were considered key words. Noun pluralization and verb tense were ignored. There were several contractions in the sentences. In cases where both parts of the contraction were reported, for example had not instead of hadn't, the response was considered correct. However, if the participant just reported had, then it would not be considered correct. Homonyms were considered correct. Sometimes, when reporting words, participants will provide sufficient context to determine that they meant a different homonym, e.g., two versus too or to. These were still scored as correct, since if the participant had only said the word without other words to provide context the word would have been scored as correct. The sentences that we used were much longer than those typically used for loose key word scoring. We also wanted to allow for as much variability in the scored words as possible, so any word in the target sentence that did not also appear in any of the concurrent distractor sentences was used as a key word. (A sketch of this key-word scoring procedure is given at the end of this section.) There were five talkers visible on the monitor (see Figure 6.3). Each talker spoke a sentence, and all of the sentences were played simultaneously. The centre talker was the target. Subjects were asked to watch the target talker and repeat back as many words as they could once the sentence was finished.
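The following is a minimal sketch, in Python, of the key-word scoring logic described above. The function names are hypothetical, and the sketch deliberately leaves out the special cases (contractions, homonyms, pluralization and verb tense) that were handled by a human scorer in the actual experiment.

import re

def words(sentence):
    # Lowercase a sentence and split it into words, dropping punctuation
    # but keeping apostrophes so that contractions stay intact.
    return re.findall(r"[a-z']+", sentence.lower())

def key_words(target_sentence, distractor_sentences):
    # Key words are words in the target sentence that do not appear in any
    # of the distractor sentences played concurrently with it.
    distractor_words = set()
    for d in distractor_sentences:
        distractor_words.update(words(d))
    return [w for w in words(target_sentence) if w not in distractor_words]

def score_trial(target_sentence, distractor_sentences, response):
    # Proportion of key words that the participant reported on one trial.
    keys = key_words(target_sentence, distractor_sentences)
    reported = set(words(response))
    if not keys:
        return None  # no scoreable words on this trial
    return sum(w in reported for w in keys) / len(keys)

In the experiment itself, decisions about verb tense, noun pluralization, contractions and homonyms were made by hand rather than by exact string matching, so this sketch is only an approximation of the scoring rules.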

Figure 6.3: This figure shows the configuration for the five talkers for Experiment 6a. This frame was taken from a video in the matching condition. The centre talker was the target.

Design and Analysis

The experiment was carried out as a between-subjects design so that the same sentences could be used in each condition. A one-way ANOVA (distractor motion condition) was used to analyze the percentage of words correct in each condition.

Results and Discussion

As can be seen in Figure 6.4, performance was the same in all three conditions, with no significant differences (p > .05) between the conditions. One possibility for this is that with four distractors it becomes difficult to segregate each talker. For instance, it has been shown that people can very accurately judge the number of concurrent talkers present up to two talkers, but that accuracy in making that judgement starts to decline with more talkers (Kashino & Hirahara, 1996). Perhaps fewer distractors might make it easier to segregate the distractors.

6.3 Experiment 6b

Experiment 6b uses the same target talker and only two of the distractor talkers from Experiment 6a.

Methods

Participants

There were 37 subjects (28 females) with a mean age of 18.3 years (range years). All subjects were native speakers of English, and reported normal or corrected to normal vision, and no speech or hearing difficulties.

Figure 6.4: This shows performance on the speech task for Experiment 6a by distractor condition. The error bars indicate standard error of the mean.

Figure 6.5: This figure shows the configuration for the three talkers for Experiment 6b. This frame was taken from a video in the matching condition. The centre talker was the target.

Stimuli

The original audio and videos from Experiment 6a were edited in Final Cut Pro to remove the audio and videos of two of the distractor talkers. The same two talkers were removed from all the videos. In Experiment 6b the original normalized audio levels were used. Separate audio files, one for the target and one for the two distractors, were also exported from Final Cut Pro to test audio levels. The audio for the target (without the distractors) was measured at 60 dB(A). The audio for the two distractors (without the target) was measured at 56 dB(A).

Experimental procedures

The procedures for Experiment 6b were the same as those used in Experiment 6a, except that there were three talkers visible on the monitor. As in Experiment 6a, the centre talker was the target (see Figure 6.5).

Design and Analysis

The design and analysis of Experiment 6b were the same as those used in Experiment 6a. The three experimental conditions (matching, mismatching and no motion) were between-subjects, and a one-way ANOVA (distractor motion condition) was used to analyze the data.

Results and Discussion

As can be seen in Figure 6.6, performance was the same in all three conditions, with no significant differences (p > .05) between the conditions. While overall performance was slightly higher in Experiment 6b than in Experiment 6a, the same pattern of results appears in Experiments 6a and 6b. These results suggest that participants don't seem to be particularly influenced by the visual information from the distractors, since the auditory information was the same in all three cases.

6.4 Experiment 6c

Although no eyetracking data was collected in Experiments 6a and 6b, experimenters observed that in those experiments participants seemed to be fixated on the target talker during the trial. This is unsurprising since participants were instructed to watch

Figure 6.6: This shows performance on the speech task for Experiment 6b by distractor condition. The error bars indicate standard error of the mean.

the talker. Also, there is no benefit to looking away from the talker, and, depending on how far away participants averted their gaze from the talker, there could be a cost to not looking at the talker. The faces are approximately 10.2 degrees of visual angle apart, measured from one talker's nose to another talker's nose. Even though all of the face stimuli would have fallen within 40 degrees of visual angle (Paré et al., 2003) if participants were looking at the centre of the frame at the central talker, more detailed information would have been available for the central target talker than for the distractors. Substantially reducing (Jordan & Sergeant, 1998) or enlarging (Vatikiotis-Bateson et al., 1998) the size of the face does not seem to dramatically reduce or enhance the effectiveness of the visual cues, and speech information is present across a wide range of spatial frequencies (Munhall et al., 2004), so it seems unlikely that the level of detailed information from the distractors would account for the results. Perhaps more importantly, fixating on the central talker could have biased visual attention towards the central target talker. For instance, visual target detection has been shown to be more accurate when the target is at the same location as the location of fixation (Hoffman & Subramaniam, 1995), and visual distractors at fixation have been shown to be more difficult to filter out (Beck & Lavie, 2005) than those presented more peripherally. For audiovisual stimuli, the ventriloquist effect (for non-speech stimuli) for peripherally presented stimuli does not seem to depend on the location of endogenous (Bertelson, Vroomen, de Gelder, & Driver, 2000) or exogenous (Vroomen, Bertelson, & de Gelder, 2001) visual attention. On the other hand, it has been shown that it is more difficult to ignore auditory speech that appears to be located at the locus of visual attention (Spence, Ranson, & Driver, 2000) rather than speech

which appears to be located more peripherally (to one side). It was more difficult to ignore auditory speech at the location of fixation whether subjects were engaged in a visual task at fixation or were passively fixating the visual location. This suggests that somewhat different processing could be going on for central and peripherally presented audiovisual talkers. In Experiment 6c participants were instructed to fixate on one of the distractors, yet pay attention to the central target talker. The stimuli used for Experiment 6c were the same as those used for Experiment 6b, but the instructions were changed so that participants would have to fixate on a distractor talker. Would the visual information from the distractors be more influential when participants were fixating one of the distractors as opposed to the target?

Methods

Participants

There were 27 subjects (19 females), with a mean age of 20.2 (range 18-25), in the study. Three of the participants ran in an audio-only control condition. Data from one subject was removed for failing to follow instructions (eyetracking data showed that they watched the target talker in every trial). All subjects were native speakers of English, and reported normal or corrected to normal vision, and no speech or hearing difficulties.

Stimuli

The stimuli used for Experiment 6c were the same as those used for Experiment 6b. Because of screen resolution constraints in the program that controls the

eyetracker and the experiment (the Eyelink II system mentioned in Experiments 2 and 3, in Chapters 3 and 4 respectively), the display on the screen was slightly different from that in Experiments 6a and 6b. The video was presented at a lower resolution than the screen resolution, and a black border was inserted around the video to fill up the rest of the screen. Due to the insertion of the black border, the faces are slightly closer together than in Experiment 6b, approximately 10 degrees of visual angle apart, measured from one talker's nose to another talker's nose. The audio levels were the same as those in Experiment 6b. The audio for the target (without the distractors) was measured at 60 dB(A). The audio for the two distractors (without the target) was measured at 56 dB(A).

Experimental procedures

Apart from the use of an eyetracker to monitor gaze, and the instruction to fixate on one of the distractors while paying attention to the central target talker, the procedures for Experiment 6c were the same as those used in Experiment 6b. Each subject was told to watch one of the distractors. Whether subjects looked at the talker on the left or the right of the central target talker was counter-balanced. Eye position was monitored using an Eyelink II eye tracking system (SR Research, Osgoode, Canada) using dark pupil tracking with a sampling rate of 500 Hz. Each sample contains an x and y coordinate which corresponds to the location of gaze on the screen. A nine-point calibration and validation procedure was used. The maximum average error was 1.0 visual degrees, and the maximum error on a single point was 1.2 visual degrees, with the exception of the central point, which was always less than 1.0 degrees. A drift correction was performed before each trial.

Design and Analysis

For the behavioural data, the design and analysis of Experiment 6c were the same as those used in Experiments 6a and 6b. The three experimental conditions (matching, mismatching and no motion) were between-subjects, and a one-way ANOVA (distractor motion condition) was used to analyze the data. Because Experiments 6a and 6b showed no influence of the distractor condition on the number of words reported from the target talker, an additional descriptive analysis was done looking at the number of words from the distractor sentences that were reported. Only words that appeared in one distractor sentence, and not in the other concurrent distractor sentence or the target sentence, were counted. In practice, it was actually quite clear which sentence/talker the words came from. This analysis was not performed for Experiments 6a and 6b because it was noted in the scoring of the target words that words from the distractor sentences were only rarely reported, with most subjects not reporting any words from the distractor sentences. This analysis of distractor words was added when it was noted during the scoring that subjects were regularly reporting a couple of words or a phrase from the distractor sentences. The data from the eyetracker was used to ensure that subjects were looking at the appropriate distractor. Because of this, only a descriptive analysis is provided for the eyetracking data. The location of the x and y eyetracking coordinates was used to determine whether subjects were looking at the screen during the trial. Additionally, the location of each talker's video on the screen was determined, and the location of the x and y eyetracking coordinates was used to determine the number of samples spent fixated on each talker.
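As a rough illustration of how the eyetracking samples can be assigned to the screen and to each talker's video, here is a minimal sketch in Python. The region coordinates below are hypothetical placeholders; the actual analysis used the known pixel positions of each talker's video on the display.

import numpy as np

# Hypothetical talker regions as (left, top, right, bottom) in screen pixels.
TALKER_REGIONS = {
    "left_distractor":  (40, 150, 290, 450),
    "target":           (295, 150, 545, 450),
    "right_distractor": (550, 150, 800, 450),
}
SCREEN = (0, 0, 800, 600)  # placeholder screen bounds in pixels

def proportions_by_region(x, y):
    # Proportion of a trial's gaze samples that fall on the screen and
    # within each talker's video region.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    props = {"on_screen": float(np.mean((x >= SCREEN[0]) & (x <= SCREEN[2]) &
                                        (y >= SCREEN[1]) & (y <= SCREEN[3])))}
    for name, (left, top, right, bottom) in TALKER_REGIONS.items():
        inside = (x >= left) & (x <= right) & (y >= top) & (y <= bottom)
        props[name] = float(np.mean(inside))
    return props

def fixated_distractor(x, y):
    # Which distractor the subject spent most of the trial looking at,
    # used to check compliance with the fixation instructions.
    props = proportions_by_region(x, y)
    return max(("left_distractor", "right_distractor"), key=props.get)

Proportions of this kind underlie the compliance figures reported in the Results (for example, the percentage of the trial spent looking at the instructed distractor), although the exact region boundaries used here are placeholders.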

Results and Discussion

Subjects were very good at looking at the screen during the experiment (approximately 99% of eyetracking samples during each trial fell on the screen). For the most part, subjects were also good at following the instructions. Nineteen of the twenty-three subjects spent an average of between 93 and 99% of the trial looking at the distractor talker they were instructed to watch. Three participants spent an average of 78-88% of the trial looking at a distractor, and one participant spent an average of only 62-69% of the trial looking at a distractor. A trial-by-trial examination of the data showed that one participant, while not looking directly at the talker, was looking at the same side of the screen as the talker over 89% of the time on all trials. For the distractor words by talker descriptive analysis, the eyetracking data were used to determine which distractor the subject was looking at for each trial. In cases where subjects spent less than 90% of the trial on average looking at the distractor, the behavioural data were unremarkable compared with those of subjects who spent over 90% of the trial looking at the correct distractor talker, so their data were included in the analysis.

Behavioural results can be seen in Figure 6.7. As in Experiments 6a and 6b, there was no significant effect of the distractor motion condition on the percentage of target words reported (p > .05). For comparison, audio-only control subjects reported 10.0 (SE 2.1) target words. The number of distractor words reported by condition can be seen in Table 6.1. The number of distractor words did not differ across conditions (p > .05). For comparison, audio-only control subjects reported 28.7 (SE 2.1) distractor words. The number of distractor words was further broken down by whether or not the words came from the distractor talker that the subject was looking at.
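For the between-subjects comparison reported above, a one-way ANOVA on the per-subject scores is the relevant test. The sketch below shows the general form using scipy; the scores are invented placeholder numbers, not the actual data from Experiment 6c.

    from scipy import stats

    # Invented per-subject scores (percent target words correct), grouped by
    # the between-subjects distractor motion condition.
    matching    = [42, 38, 45, 40, 36, 44, 39, 41]
    mismatching = [40, 37, 43, 39, 42, 38, 41, 36]
    no_motion   = [41, 39, 44, 38, 40, 42, 37, 43]

    f_val, p_val = stats.f_oneway(matching, mismatching, no_motion)
    print(f"F = {f_val:.2f}, p = {p_val:.3f}")  # a p > .05 would mirror the null result above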

Figure 6.7: This shows performance (percent words correct) on the speech task for Experiment 6c by distractor condition (matching motion, mismatching motion, no motion). The error bars indicate standard error of the mean.

Table 6.1: This shows the mean number of distractor words reported by distractor condition. Standard errors of the mean are in parentheses.

             Match           Mismatch        No Motion
    words    27.6 (SE 5.2)   23.3 (SE 4.2)   30.9 (SE 2.0)
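The cell entries in Tables 6.1 and 6.2 are means with standard errors of the mean. For reference, the sketch below computes those two quantities for one group of subjects; the word counts are invented values used only to show the calculation.

    import statistics

    def mean_and_se(values):
        # Standard error of the mean = sample SD / sqrt(n).
        m = statistics.mean(values)
        se = statistics.stdev(values) / len(values) ** 0.5
        return m, se

    # Invented counts of distractor words reported by each subject in one condition.
    words_reported = [25, 31, 22, 35, 28, 24, 30, 26]
    m, se = mean_and_se(words_reported)
    print(f"{m:.1f} (SE {se:.1f})")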

Although it is certainly possible that a greater sample size would show a different pattern, with the current sample there does not appear to be a particularly strong bias towards reporting the words from either the fixated or the non-fixated distractor.

Table 6.2: This shows the mean number of distractor words reported by distractor condition and whether the subject was fixated on (looking at) the distractor or not (not looking). Standard errors of the mean are in parentheses.

                   Match           Mismatch        No Motion
    looking at     8.8 (SE 2.9)    14.4 (SE 3.6)   16.6 (SE 3.4)
    not looking    19.4 (SE 6.6)   8.8 (SE 2.8)    14.3 (SE 2.9)

Overall performance in terms of target words reported appears to be lower in Experiment 6c than in Experiment 6b. Even though all of the face stimuli would have fallen well within visual eccentricities shown to provide sufficient visual information (Paré et al., 2003), having subjects fixate on one of the distractors seemed to make the task more difficult. The fact that performance was the same across the two motion conditions and the no motion condition suggests that retinal eccentricity, rather than the distractor motion itself, may be a factor. Paré et al. (2003) used only one talker, and it is possible that the effects of eccentricity may be modulated when there are multiple talkers. It is possible that the lower performance when subjects had to fixate a distractor, instead of being able to fixate the talker, is due to less visual information being available at the increased eccentricity. On the other hand, there could be attentional influences of fixating on stimuli that participants are trying to ignore. One solution would be to try to equate the information available for centrally and peripherally presented talkers. However, it is not clear how the information for foveal and peripheral vision could be equated.

While increasing the size of the peripheral images would compensate for the decreases in visual acuity associated with the increased eccentricity of the non-foveated target, biological motion stimuli may (Thompson et al., 2007) or may not (Ikeda et al., 2005) be able to be scaled to equate performance in foveal and peripheral vision.

Another possible reason for not finding a difference between motion conditions is item effects. Although the audio stimuli from the distractors, and the audio and video stimuli from the target, were held constant across conditions, the actual sentences chosen for the target and distractors could still have influenced the results (or lack thereof). Some sentences were likely harder or easier to understand than others, and some sentences likely contained higher frequency words than other sentences. This could have increased the amount of noise in the data, possibly masking a small effect. However, there were no differences across the motion conditions for either the target words or the distractor words in Experiment 6c, despite the fact that the actual sentences were held constant across all three motion conditions. While the actual items used could have influenced the results, the lack of influence of the motion condition on both the target words and the distractor words suggests that item effects are not likely the sole reason for the absence of a motion condition effect in Experiment 6c. Taken together, the results from Experiments 6a, 6b and 6c have shown that audiovisual speech distractors are not particularly more distracting than auditory distractor speech paired with a still image.

Chapter 7

General discussion and conclusion

7.1 Summary

The overall results of the experiments presented here suggest that the integration of auditory and visual speech information is quite robust to various attempts to modulate the integration. Experiments 1a, 1b, 2a and 2b (in Chapters 2 and 3) showed very minimal, if any, disruption of the integration of auditory and visual speech information by the addition of a cognitive load task. Experiments 3a and 3b (in Chapter 4) showed that changing attentional instructions to get subjects to selectively attend to either the auditory or the visual speech information can have a slight influence on the observed integration of auditory and visual speech information. What is interesting is that, in spite of being fully informed about the stimuli, the influence of attentional instructions remained for the most part rather modest, even when the stimuli were obviously temporally misaligned or contained little visual information. The integration of temporally offset auditory and visual information seems rather insensitive to cognitive load or selective attentional manipulations.

The processing of visual information from distractor faces seems to be limited. The language of the visually articulating distractors in Experiment 5 (in Chapter 6) does not appear to provide information that is helpful for matching together the auditory and visual speech streams. In Experiments 6a, 6b and 6c (in Chapter 6), audiovisual speech distractors were not any more distracting than auditory distractor speech paired with a still image, suggesting limited processing or integration of the visual and auditory distractor information. The gaze behaviour during audiovisual speech perception appears to be relatively unaffected by an increase in cognitive load (Experiments 2a and 2b), but is somewhat influenced by attentional instructions to selectively attend to the auditory and visual information (Experiments 3a and 3b). Additionally, both the congruency of the consonant and the temporal offset of the auditory and visual stimuli have small but rather robust influences on gaze.

7.2 Discussion

Despite the seeming controversy in the literature, studies supporting the automaticity of the integration of auditory and visual speech information (Soto-Faraco et al., 2004; Sams et al., 1991; Colin et al., 2004, 2002; Saint-Amour et al., 2007; Kislyuk et al., 2008) can be reconciled with studies showing that attentional influences can decrease this integration (Tiippana et al., 2004; Alsius et al., 2005, 2007; Andersen et al., 2009). It seems that the integration of auditory and visual speech information is fairly unavoidable when there is little other perceptual information competing for attention. Also, subjects seem rather unable to break the influence of the visual information on the perception of the auditory speech information (i.e., Experiments 3a and 3b), even when the visual information is less reliable than the auditory information.

On the other hand, despite the robustness of the McGurk effect to attentional and cognitive load manipulations, the auditory and visual information may require attention in order to be bound together when there are multiple sources of either auditory or visual information. Andersen et al. (2009) found a greater effect of spatial attentional instructions when there were two competing faces instead of just one talking face. The reduction in audiovisual integration in Alsius et al. (2005, 2007) occurred with perceptual information presented concurrently with the speech task. When there is competing perceptual information, increasing the task demands or directing spatial attention may influence the amount of integration observed. It should be noted that even under the high attentional load conditions of Alsius et al. (2005, 2007), the influence of the visual information on perception was not eliminated. That is, despite the modulations of the McGurk effect, the illusion was never broken. The results of Experiments 2a and 2b suggest that increasing the cognitive load alone may not have much influence on the integration of auditory and visual speech information.

It should be noted that these conclusions are all based on studies that used conflicting auditory and visual speech information (i.e., the McGurk effect). It is possible that the binding of conflicting visual information may be more difficult than the binding of corresponding visual information. The results of Experiment 4, while far from conclusive, hint that a very slight difference in visual information gathering behaviour could be the result of greater difficulty in binding the conflicting visual information compared with the corresponding visual information.

7.3 Some limitations of the current research

7.3.1 McGurk stimuli

In the McGurk experiments in this thesis, a rather limited number of consonant combinations and only one talker were used. This is consistent with other literature (for example, see McGurk & MacDonald, 1976; Sams et al., 1991; Jones & Jarick, 2006; Saint-Amour et al., 2007; van Wassenhove et al., 2007; Soto-Faraco & Alsius, 2007; Munhall et al., 2009; Andersen et al., 2009; Pilling, 2009), but it is always possible that the choice of stimuli could limit the generalizability of the results of some of the studies. For example, the specific McGurk stimuli (talker, consonant, etc.) used can have quite a large influence on the illusion. For instance, some talkers are better at eliciting the McGurk effect than others (for example, see Paré et al., 2003). It is also likely that certain talkers may produce more compelling illusions with certain utterances. For example, an auditory pa paired with a visual ka has been used by van Wassenhove, Grant and Poeppel in several studies (see Grant et al., 2004; van Wassenhove et al., 2005, 2007) and produces (for them) a reliable and replicable perceptual response of ta (roughly 70-75%). Yet in MacDonald and McGurk (1978) the same combination of an auditory pa paired with a visual ka only produced a ta response 10% of the time. That said, even though the talker and consonant combinations used likely had an influence on the specific results of the studies (i.e., exactly how many McGurk or auditory responses there were in each condition), the overall pattern of results is likely to generalize. For instance, even though different McGurk stimuli were used in various MMN experiments

(Sams et al., 1991; Colin et al., 2004, 2002; Kislyuk et al., 2008), similar conclusions were reached across these experiments. Of course, the generalizability of these results can be tested through replication.

7.3.2 Temporal offsets

In all of the experiments presented in this thesis, only visual leading offsets were used. Auditory leading offsets have also been used in many experiments looking at the influence of temporal offsets on the integration of auditory and visual speech information (Munhall et al., 1996; Jones & Jarick, 2006; van Wassenhove et al., 2007; Soto-Faraco & Alsius, 2007). The temporal discrepancy of auditory leading offsets may be more readily perceived than that of visual leading ones (Grant et al., 2004; Soto-Faraco & Alsius, 2007). It is possible that the integration of stimuli with auditory leading offsets would have been more greatly influenced by the concurrent cognitive load task (Experiments 1c, 2a and 2b) or the selective attention instructions (Experiment 3a). However, auditory leading asynchronies are not usually encountered in everyday instances of audiovisual speech integration. Video leading asynchronies were chosen because they tend to be more naturalistic. Because of this, the results of the current studies may be more generalizable to everyday instances of audiovisual speech perception.

7.4 Some considerations for future research

The influences of attentional demands on the integration of auditory and visual speech information (i.e., Alsius et al., 2005, 2007; Andersen et al., 2009) may be magnified by competing perceptual information. Further study is needed on the role of competing perceptual information in the influence of attentional manipulations. Of course, caution must be exercised to make sure that information from the speech streams is not masked by other competing perceptual information.

It should be noted that the vast majority of the research examining whether the integration of auditory and visual speech information is automatic, or can be influenced by cognitive factors, has used conflicting auditory and visual stimuli to produce the McGurk effect (for example, see Green et al., 1991; Walker et al., 1995; Rosenblum et al., 1997; Burnham & Dodd, 2004; Rosenblum & Saldaña, 1996; Soto-Faraco et al., 2004; Colin et al., 2004, 2002; Kislyuk et al., 2008; Sams et al., 1991; Munhall et al., 2009; Tuomainen et al., 2005; Alsius et al., 2005, 2007; Tiippana et al., 2004; Andersen et al., 2009). It remains an outstanding question whether complementary (or congruent) speech information, such as in a speech-in-noise paradigm, is more robust to attentional manipulations than conflicting (or incongruent) speech information.

Based on the results of Experiment 4, the influence of the temporal offset of the auditory information on the gathering of visual information may reflect processes related to the binding of the auditory and visual speech information. The fact that the influences of both congruency and offset were shown to occur across most of the experiments in Experiment 4 is encouraging, and suggests that these influences may be replicable in a future study designed specifically to examine this issue.

7.5 Conclusion

In summary, the data presented here suggest a modest influence of cognitive factors in audiovisual speech perception. The integration of auditory and visual speech information seems to be quite robust to the various attempts in this thesis to modulate the integration. On the other hand, it seems that there is minimal processing or integration of the visual information from distractors when subjects are free to fixate the target talker.

Appendix A

Ethics board approval letters

Letters of approval from Queen's General Research Ethics Board are included here as required by Queen's University School of Graduate Studies. A certificate of completion of Queen's Course in Human Research Participant Protection is also included.

A.1 General Research Ethics Board approval letters

A.2 Completion of Course in Human Research Participant Protection

Figure A.1: Ethics board approval
Figure A.2: Ethics board approval
Figure A.3: Ethics board approval
Figure A.4: Ethics board approval
Figure A.5: Course in research ethics
