Localization in the Presence of a Distracter and Reverberation in the Frontal Horizontal Plane: II. Model Algorithms

Size: px

Start display at page:

Download "Localization in the Presence of a Distracter and Reverberation in the Frontal Horizontal Plane: II. Model Algorithms"

Noah Perkins
5 years ago
Views:

1 Localization in the Presence of a Distracter and Reverberation in the Frontal Horizontal Plane: II. Model Algorithms Jonas Braasch Institut für Kommunikationsakustik, Ruhr-Universität Bochum, Germany Summary Previously, it was shown that humans are able to localize a broadband noise burst quite well, even if it is presented in the presence of a second distracting broadband noise burst at -db target-to-distracter ratio. In this investigation, different model algorithms are tested for their ability to predict the psychoacoustic results. First, a long-time and a running cross-correlation algorithm are introduced to simulate the psychoacoustic experiments. Both algorithms, which are used to analyze the time window in which target and distracter are presented, are not able to estimate the target position as accurately as human listeners. Therefore, a listening test was conducted to find out, if the listeners used the part of the distracter that preceded the target in the previous tests as a reference. The outcome indicates that the onsets of test sound and distracter must be separated temporally; otherwise, human localization performance deteriorates rapidly. These results are incorporated into the interaural cross-correlation difference (ICCD) model, the third model approach that is introduced. For each frequency band, the interaural time differences of the test sound are calculated from the difference in the interaural cross correlation (ICC) of the total sound (target signal+distracter) and the ICC of the distracter alone. The latter is calculated from the part of the distracter that precedes the test sound. The model is able to demonstrate a number of psychoacoustic effects, including shifts in the auditory events for low target-to-distracter ratios. PACS no Ba, Dc, Pn, Qp 1. Introduction This paper is a companion to another paper by the same authors [1]. In that paper, which is referred to as Paper I, results of a set of experiments are presented. The aim of this paper, which was motivated by many of those results, is to find a model algorithm that is able to simulate the psychoacoustic results. The model should include stages of the auditory periphery (outer ears, basilar membrane, hair cells), but also remain as simple as possible. The idea was to come to a better understanding of human sound localization by developing such a model and, further, to provide a new algorithm for computational auditory scene analysis (CASA). Wightman and Kistler showed in a localization test that interaural time difference (ITD) cues dominate interaural level difference (ILD) cues for natural ITD/ILD combinations if low frequencies are present [2]. For that reason, only model algorithms based on ITD cues are considered here. In 1948, Jeffress was the first to introduce this kind of model, proposing a coincidence mechanism [3]. Sayers and Cherry [4] later compared the left and the right ear signals calculating the interaural cross-correlation. Received 15 May 21, revised 16 May 22, accepted 1 July 22. Blauert and Cobben combined the cross-correlation algorithm with a peripheral stage containing a bandpass filter bank, a half-wave rectifier and a lowpass filter [5]. In the same year, Stern and Colburn proposed a lateralization model evaluating both ITDs and ILDs based on auditory nerve data [6]. Later, Lindemann introduced his inhibitory cross-correlation algorithm that processes both ILDs and ITDs and is able to simulate the precedence effect [7]. This model was extended by Gaik to process natural combinations of ITDs and ILDs, as they are found in head-related transfer functions (HRTFs) [8]. Stern et al. introduced their weighted-image model to emphasize cross-correlation peaks in different frequency bands that line up in one lag (straightness) and to enhance those that are found at small ITD magnitudes (centrality) [9]. These modifications allow a better prediction of subjective lateral position of bandpass stimuli. Recent overviews on localization models were written by Blauert [1] and Colburn [11]. Most of the research described above is not considered in this investigation. Besides simulating the auditory periphery, only an algorithm that takes into account the natural combinations of ITDs in the single frequency bands and an algorithm for estimating the target position, combining the output of the single frequency bands, are implemented. 956 c S. Hirzel Verlag EAA

2 t 1 t 2 t 3 t 4 Distracter Target time [ms] Figure 1. Time course of target and distracter of the stimuli that are used to evaluate the model algorithms. Before introducing the model algorithms, the test material to evaluate the model will be described in the next section. 2. Stimuli Principally, the same test signals are used to evaluate the model as in the psychoacoustic experiments in Paper I. The target and distracter are two broadband noise bursts (2 14Hz, 2-ms cos 2 -ramps, 2-ms and 5-ms duration). The delay between the onset of the distracter and that of the target is 2 ms. The time course of the stimuli is shown in Figure 1. For the anechoic environment, a catalog of human HRTFs is used. Details of the measurement procedure can be found in Paper I. The reverberant environment was generated in Matlab, using the mirror-image technique [12]. The simulation is very similar to the one used in Paper I, but in this investigation the late reflections are also calculated using the mirror-image technique. In Paper I, a reverb processor was used for this task because the Tucker-Davis System, on which the experiments were run, allowed the use of only a limited number of mirror-image sources. The advantage of calculating the complete impulse response using the mirror-image technique is that it is easier for the reader to reproduce it from the description. Qualitatively, both room impulse responses are very similar and the room size (6 m 5m 3 m) is consistent with Paper I. The room impulse response used here consisted of the first 1 mirror sources. For the auralization, the HRTF catalog is interpolated to a resolution of 1 azimuth and 1 elevation, using spherical splines in the frequency domain [13], [14]. Each mirror source was then filtered with the HRTFs at the closest available angle. Both the target and distracter are presented at a distance of 2 m from the virtual listener at the same height as his ears (1: m). The frequency dependent absorption coefficients of the walls and the floor are taken from measurements described in the Deutsche Industrie Norm (DIN, German Industrial Norm) [15]. Cocos-fiber/felt (a3.1-1) is chosen for the walls and ceiling. Data for a rug (a.6-1) are used for the floor. In the following, the target-to-distracter ratio (T/Dratio) is considered to be the power ratio of the target and the distracter before they are filtered with the HRTFs. Therefore, only when the target arrives from the same direction as the distracter is the T/D-ratio at the eardrums identical to the T/D-ratio of the source signals. For the remaining target positions, the T/D-ratio at the eardrums will be different, according to the magnitude transfer functions of the individual HRTFs. 3. First approach: A long-time crosscorrelation algorithm 3.1. Model structure Periphery In the first approach for simulating localization of a partly masked target, a model based on a simple interaural crosscorrelation algorithm and stages to simulate the auditory periphery is implemented, as shown in Figure 2. The model is similar to the one proposed by Blauert and Cobben [5]. The transformations from the sound sources to the eardrums (influence of the outer ear and occasionally room reflections) are taken into account by filtering the sounds with HRTFs from a specific direction. Afterwards, the outputs for all sound sources are added together for the left and the right channel. Basilar-membrane and hair-cell behavior are simulated with a gammatone filter bank with 36 bands at a sampling frequency of 48 khz, as described by Patterson et al. [16], and a simple halfwave rectification. Only the frequency bands 3 to 12 (2 12 Hz) are analyzed to take into account that the human auditory system cannot resolve the temporal fine structure at high frequencies, as well as the fact that the time differences in the fine structure of the lower frequencies are dominant if they are available [2] Cross correlation After the half-wave rectification, the interaural cross correlation is estimated within each frequency band over the whole target duration: Y l r (f ) = 1 t3 ; t2 t 3X t=t 2 Y l (f t)y r (f t + ) (1) with t2=2 ms and t3=4 ms (see Figure 1) Remapping and decision device For broadband signals, it is useful to remap the cross correlation functions from interaural time differences to azimuth positions. Otherwise, the peaks of the crosscorrelation functions will not necessarily line up at one lag for a single sound source because the ITDs of the HRTFs are frequency dependent. To calculate the ITDs of the HRTFs throughout the horizontal plane, the HRTF catalog, measured in a resolution of 15 in the horizontal plane, is interpolated to 1 resolution using the spherical spline method. After filtering the HRTFs with the gammatone filter bank, the ITDs for each frequency band and angle are estimated using an interaural cross-correlation (ICC) algorithm. This frequency-dependent relationship between ITDs and azimuthal angles is used to remap the output of the cross-correlation stage (ICC curves) from a basis of ITDs ( f i ) to a basis of azimuth angles in every 957

3 Decision d evice x RM to: HR TF r HRTF l HRTF l x RM HRTF r HRTF r from: sound sources x RM Sound sources Outer ear Bandpass filter bank Halfwave rectification Cross-correlation Remapping Halfwave rectification Bandpass filter bank Outer ear Figure 2. General model structure for all three evaluated localization algorithms. frequency band: ( f i ) = g(hrtf l HRTF r f i ) = g( f i ) (2) with = azimuth, = elevation =, r = distance = 2m, HRTF l=r =HRTF l=r ( r), f i = center frequency of bandpass filter. Next, the ICC curves ( ( f i )) are remapped to a basis of azimuth angles using a simple for-loop in Matlab: for alpha=1:1:3 psi rm(alpha,freq)=psi(g(alpha,freq),freq); end In the decision device, the average of the remapped ICC functions: psi rm(alpha,freq) over the frequency bands 3 12 is calculated and normalized to one. The model estimates the sound sources at the positions of the local peaks of the averaged ICC function Results Figure 3a shows the result of the simulation for a target at. The upper panel shows the ICC for every frequency band as a function of the azimuth angle. The ICC function is normalized to one in every frequency band. Afterwards, the values are interpolated linearly. The greyscale/colorcoding of the values is shown in the greyscale/color map for all three plots. In the lower panel, the ICC function averaged over all the frequency bands (3 12) is given. Note that there is not only a peak at, but also at approximately 15 ; without evaluating ILDs and monaural cues, the model is not able to discriminate between presentations in the frontal or rear hemisphere. Figure 3b shows the results for the distracter at azimuth. The two peaks of the averaged ICC function are at and 18 as expected. Figure 3c shows the results for the target at azimuth and the distracter at azimuth presented simultaneously at a T/D-ratio of db. Ideally, one would have expected four peaks in the averaged ICC function: at and, as well as at the equivalent positions in the rear hemisphere. But in reality, only two peaks appear at about 15, approximately midway between the target and distracter position and at approximately 165, the equivalent position in the rear hemisphere. However, the listeners in the previous psychoacoustic experiment were able to localize the target quite well at T/D-ratios of db. This first approach, the model based on a long-time cross-correlation analysis is not able to resolve the positions of the target and distracter. Instead of two separate cross-correlation peaks for the target and distracter, only one cross-correlation peak occurs. It appears between the peaks measured for the target and distracter position, when the signals are presented separately from each other. The next approach will be a running cross-correlation algorithm. As they comprise broadband noise bursts, the target and distracter fluctuate in the single frequency bands, possibly allowing both to be separated. 4. Second approach: A running crosscorrelation algorithm 4.1. Model structure The proposed model is basically identical to the model described in the previous section. The only difference is that the cross-correlation function is no longer averaged over the whole target duration, but implemented as a running cross-correlation. The step size is set to 1 ms and the window length is set to 1 ms (rectangular window). The latter time is in the order of the binaural sluggishness [17] Results The results of this second approach are shown in Figure 4 for the fifth frequency band (center frequency: 375 Hz). 958

4 Figure 3. Results for the first modeling approach based on an ICC algorithm for a target at (top-left graph), the distracter at (top-right graph) and both presented simultaneously at -db T/D-ratio (bottom-left graph). The upper panel of each graph shows the ICC functions for each frequency band. The lower panel displays the average over all the frequency bands (for details see text). The target is located at azimuth, while the distracter is kept at azimuth. The abscissa shows the time course of the distracter and target, as shown in Figure 1. The ordinate displays the azimuth. The maximum of the ICC curve of every time step t i : Y l r (t i max ) is scaled as follows: ^Yl r (t i max )=7+2log1 Y l r (t i max ) Y l r (t max max ) : (3) The normalized values are then calculated in the following way: ^Yl r (t i j )= Y l r (t i j ) ^Yl r (t i max ) Y l r (t i max ) : (4) This implies that the overall maximum ^Y l r (t max max ) is normalized to 7 db. The maximum of the ICC function of every time step is now 7 db minus the level difference between its maximum and the overall maximum. If the ICC function in every time step were normalized to one, the information about the local peak magnitude in every step would be lost. In addition, if it were scaled to a logarithmic scale at all, the ICC functions with a relatively low amplitude would not be visible for several step times. If all values in t i and j are strictly transformed into a logarithmic scale, the ICC functions would become too broad to judge a sound source position by eye. For a further improvement in contrast, all the negative values of ^Y l r are set to zero. Also in this approach, the cross-correlation functions are remapped from interaural time differences to azimuth positions. The top-left panel of Figure 4 shows the target-alone condition. The maximum of the ICC function is located at azimuth, but the magnitude of the peak fluctuates heavily. If a distracter is added at a T/D-ratio of db (topright panel), not only the magnitude of the ICC function fluctuates during the presentation of the target, but also its 959

5 Figure 4. Results for the second modeling approach based on a running ICC algorithm. Target ( azimuth) and distracter ( azimuth) are presented according to the time course in Figure 1 for three different T/D-ratios: db (top-left graph), db (top-right graph), ;1 db (bottom-left graph). The shown data are computed for the fifth frequency band (375-Hz center frequency). For details, especially the scaling, see text. position. Determining the two sound-source positions is very difficult, if not impossible, because the peak maxima are not only found at target and distracter positions, but also fluctuate arbitrarily between and even beyond them. In contrast, the listeners localization performance is quite accurate at a T/D-ratio of db (Figure 1, in Paper I). Even at a T/D-ratio of ;1 db they are still able to localize the target in most cases. Moreover, if the T/D-ratio of the running cross-correlation model is reduced to ;1 db, the chances of estimating the target position decrease dramatically (bottomleft panel). Only at approximately 3 ms does the ICC peak exceed, but with a magnitude that is at least db below the overall maximum. So far, the model has focused on times when target and distracter are presented together, but useful information on the distracter can be extracted in the time interval when it precedes the target ( 2 ms). To test if this assumption is plausible for the human auditory system, a further listening test was conducted in which the distracter and target had the same length and simultaneous onsets. 5. Psychoacoustic experiment: Importance of the onset delay between the target and distracter 5.1. Methods Listeners Four male listeners, who were not paid for their participation, took part in the experiment. Three of them had already participated in the experiments described in Paper I, and the fourth listener was an experienced listener, too. The listeners identification numbers were kept from Paper I, and the fourth listener was given a new identification number L12. Their ages ranged from 26 to years. 9

6 All the listeners had normal hearing (hearing loss 2 db at octave frequencies between 125 and 8 Hz) Apparatus and stimuli The experiment is very similar to the anechoic condition of Experiment I in Paper I. The sounds were also filtered with individual HRTFs and presented to the listeners in a virtual environment using headphones. Again, both the target and distracter were broadband noise bursts with a frequency range between 2 Hz and 14 khz. The main difference is in the time course of the distracter and target sound. In this case, no time delay between the distracter and target onset was provided (2 ms in Experiment I). Further, both signals had a duration of 2 ms, thus making their time courses identical. The target and distracter were filtered off-line with the individual HRTFs of the listeners. The direction for the distracter was set at azimuth and elevation throughout the experiments, the position of the target varied between 13 equidistant positions between ; and azimuth ( elevation). Each trial was repeated 1 times. The sound pressure level of the distracter was set at approximately 7-dB SPL. The T/D-ratio, the power ratio of the target and the distracter before they are filtered with the HRTFs, was set at db. The sounds were generated using a PC soundboard (Tripledat, Masterport) and were delivered to the listener through headphones (STAX, SR- Lambda) Procedure For each listener, the stimuli were presented in a pseudorandom order in one session lasting approximately 14 minutes. The session started with a training phase of 1 training trials. During the experiment, the listeners were seated on a chair and were asked to keep their heads still during the presentation of the stimuli. After the presentation of each trial, the listener had to report the azimuth of the auditory event of the stimulus using a graphical interface on the computer screen. Further, they were asked to report if they had perceived more than one distinct auditory event after each trial presentation. In cases where they reported more than one auditory event, the listeners were asked to report the azimuth of the one that is perceived more lateralized, and if both are equally lateralized to report the azimuth of either one of them. In the case of a single reported auditory event, they were asked to indicate its centroid. After a response had been made, the next stimulus was presented with a delay of 2 seconds. No trial-by-trial feedback was provided to the listeners either during the training phase or the recording phase Results Figure 5 shows the results for two of the four listeners. In each panel, the perceived L/R direction of the target is plotted against the presented L/R direction. Judgement angles and target angles are grouped into 15 -wide bins. The area of each circle is proportional to the number of perceived azimuth[ ] L7 L1 no onset delay onset delay: 2 ms presented azimuth[ ] Figure 5. Localization performance for two listeners L1 and L7 at -db T/D-ratio. The left panels show the case where the time courses of the target and distracter ( azimuth) are identical (both 2-ms duration). The target (2-ms duration) was presented with a 2-ms onset delay to the distracter (5-ms duration) for the data shown in the right panels. The data in the right panels are taken from Braasch and Hartung [1]. responses that are measured within the bin for the presented direction that coincides with the center of the bin. In the majority of the cases, the listeners perceived only one auditory event: L1: 98%, L6: 55%, L7: 67%, L12: 1% (percentages of incidents with one auditory event). All the listeners reported that in those cases where they perceived two distinct auditory events, they tended to perceive two unnatural auditory events instead of perceiving one sound clearly at the front and the second one at another direction. By using the term unnatural it is not only meant that the two perceived auditory events are at different positions from the presented position, but also that the positions of the events are in general very diffuse and not perceived externalized. This was reported by the listeners after they had participated in the experiment. The Spearman rank correlation coefficient r S is calculated to find out if there is a correlation between magnitude of azimuth (sideness) and percentage of cases in which two auditory events were perceived. The correlation coefficient is only estimated for L6 and L7 because L1 and L12 rarely perceived two auditory events. The value at azimuth is not considered in the calculation because target and distracter are presented from the same direction in this case and the natural binaural cues remain. A significant correlation exists in the case of L6 (r S =;:6364, =.5), but not for L7 (r S =;:1276). The top-left panel (Figure 5) shows the results for listener L7. The listener perceived the stimulus as being less lateralized than presented. It is perceived somewhere between the position of the distracter and the position of the target. The top-right panel shows the same situation, when 961

7 the distracter is partly preceding the target (2-ms onset difference, 5-ms distracter duration). The data are taken from Experiment I in Paper I. In this case, the target positions are perceived more lateralized than presented. For L1, however, the responses are different (bottom-left panel). In most of the synchronous cases, he perceives the sound coming from ; or, although he was able to indicate the target position more accurately when there was an onset delay between the target and distracter. The response patterns of L12 are similar to those of L7, while L6 showed behavior similar to L Discussion Although the response patterns of the listeners can be divided into two groups, none of the listeners is able to perceive two distinct auditory events for the target and distracter when the time courses of target and distracter are identical. It appears that the preceding part of the distracter is necessary to gain a clear idea of the auditory scene. In general, it is probably not necessary that the distracter precedes the target, but rather that both sounds do not fully overlap in frequency and do not have an identical time course. L7 and L12 perceive the auditory event as less lateralized than in the target-alone condition or in the condition with a partly preceding distracter. The auditory events are approximately located between the auditory event of the single target and the auditory event of the single distracter. L1 and L6 seem to be rather confused by the simultaneous presentation of the target and distracter. Their auditory events can hardly be related to either the target or distracter position, except that the side left or right of the auditory event corresponds to the side on which the target was presented. 6. Third approach: The cross-correlation difference function 6.1. Preliminary remarks The psychoacoustic experiment in the previous section indicates that there are general differences if the distracter partly precedes the target or if both signals are presented simultaneously. In the following, it will be assumed that humans use information about the distracter that is extracted while the distracter preceded the target. This assumption is based on the following considerations: 1. Two noise bursts are usually perceived as one auditory event if their envelope is identical and they overlap in spectrum. This can be observed even when the noise bursts have different spatial positions and are uncorrelated. 2. Our models that do not employ the temporal differences of onsets fail at low T/D-ratios. 3. From other investigations, we know that the auditory event of the target shifts sideward with the exposure time of the preceding part of the distracter. The first point is principally the outcome of the psychoacoustic test of the previous section. It is congruent with the auditory streaming theory that points out the importance of separated onsets for perceiving multiple auditory streams [18]. Furthermore, the effect is very similar to the summing localization effect, which occurs when (in this case) two correlated signals with different positions in the frontal horizontal plane are presented with a small delay ( < 1 ms) [1]. The listeners usually perceive only one auditory event sited in-between both presented sound sources. With regard to the second point, it seems evident that humans use the information on the distracter that is gained when it preceded the target. That is why the algorithms that have been previously described do not achieve human localization performance when it comes to localizing two broadband noises. This is due to the fact that they try to achieve something that humans do not achieve themselves: localizing the target just by analyzing the part where target and distracter are presented together and ignoring the information on the preceding part of the distracter. Alternatively to the hypothesis that it is necessary to have an interval during which the distracter is presented alone, one might propose that it is sufficient to have the knowledge about the distracter position. So far, an experiment that reveals which hypothesis is true has not been carried out. The third point will be discussed in more detail in another section. Now, it should just be mentioned that the exposure time of the preceding part of the distracter has a major influence on the perceived target position ([19], [2], [21], [22]). This evidence leads to the assumption that it is not only important to perceive two different onsets to be able to separate the target and distracter into two auditory streams, but it also leads to the conclusion that the exposure time of the distracter before the presentation of the target has a major impact on the perceived target position. In the following, an algorithm is presented which uses knowledge about the preceding part of the distracter to estimate the target position Model structure The ICC function of the total signal is the sum of the ICC function of the distracter and the ICC function of the target, if the distracter and target are assumed to be uncorrelated signals because the cross terms are zero: Y l r (f ) = T l r (f )+ D l r (f ) (5) where T indicates the target, D the distracter, and Y the total signal. The assumption that the distracter and target are uncorrelated is not only true for the psychoacoustic experiments described in Paper I, but it is also the normal case for scenarios observed in nature (if surface reflections are neglected). Now, the ICC function of the target T l r (f ) can be easily estimated from the ICC function of the total 962

8 signal Y l r (f ), if the ICC function of the distracter D l r (f ) is known: T l r (f ) = Y l r (f ) ; D l r (f ): (6) Although the exact ICC function of the distracter is not available because it is partly masked by the target, it can be substituted by the ICC function of the distracter of the part that precedes the target: D l r = D l r = 1 t3 ; t2 1 t2 ; t1 t 3X t=t 2 D l (t)d r (t + ) (7) t 2X t=t 1 D l (t)d r (t + ) (8) with t1= ms, t2=2 ms, and t3=4 ms according to the time course in Figure 1. The ICC function of the target is now estimated as follows: T l r = Y l r ; D l r = 1 t3 ; t2 t 3X ; 1 t2 ; t1 t=t 2 Y l (t)y r (t + ) ; t 2X (9) t=t 1 D l (t)d r (t + ): (1) Afterwards, T is remapped in the manner described in l r section 3.1 in every frequency band, and the maximum peak of the averaged ICC function is the estimated target position of the model Results Figure 6a shows the result of the simulation of the ICCD model for a target at azimuth and a distracter at azimuth. The method of presenting the data is identical to Figure 3. The reason for the changing positions of the crosscorrelation peaks in the single bands is the statistical variations of the distracter amplitude. The amplitude of the ICC function also varies because of this variation, and the amount subtracted according to equation (9) is usually either slightly too small or too large. The variations of the cross-correlation peaks found in the single bands almost vanish after averaging over the frequency bands At ;5-dB T/D-ratio (Figure 6b), the fluctuations in the single frequency bands increase. On average over all the frequency bands, however, the ICC function is still very similar to the -db condition. When the T/D-ratio is reduced to ;1 db, the fluctuations in the single frequency bands increase further and some bands no longer show any response at all because their values are all negative after subtracting D l r. At;15-dB T/D-ratio, the peaks of the ICC functions can no longer be found at the target position. The model estimates the source position at approximately. To test the reliability of the model, the interaural crosscorrelation difference (ICCD) function is estimated for ten repetitions with independent samples of noise. The results are shown in Figure 7 (top panel). The target and distracter are generated ten times according to the parameters described in section 2. The ICCD-peak of each repetition is normalized to 1. Variances only occur because of the statistical fluctuations of the distracter and target. As shown in the middle panel, the variations of the ICC peak cannot be found in the ICC function of the single target (solid line) and distracter (dashed-dotted line). Even though the signals are random noise bursts, the relationship between left and right signal is essentially determined by the HRTF of the left and right channel (no interference between two signals). The noise bursts only determine the amplitude of the cross-correlation function, but this influence disappears after the normalization. However, variations have already appeared for the total signal because of the interference between the distracter and target (Figure 7, bottom panel). The increase in variation for the ICCD function in Figure 7a is due to an additional uncertainty caused by the substitution of the ICC function of the distracter (equation 9) Influence of room reflections Figure 8 shows the ICC function (here again as the function of the interaural time difference) of the target in two different reverberant environments and the anechoic reference condition for the fourth (top graph) and the eighth frequency band (bottom graph). One of the reverberant environments is the simulated room that is used throughout this investigation (dotted line, label: mirror-image ). The other reverberant environment (dash-dotted line, label: real room ) is a real seminar room (11 m 5.5 m 3.5 m) for which the impulse response is measured using a custom-built dummy head and an omnidirectional loudspeaker (custom-built for architectural acoustic measurements). In all the cases, the single target (2-ms broadband noise burst) is presented from azimuth. In the eighth frequency band, the maximum of the cross correlation is located at.81 ms (corresponding to azimuth) for the anechoic condition (solid line). However, the peak shifts toward the middle (median plane) when room reflections are added. This can easily be explained by the fact that the interaural delay of the diffuse reflections is arbitrary and is therefore, for reasons of symmetry, on average approximately zero. The size of the shift depends on the ratio between the direct sound and the room reflections. For the reverberant environment, which was used in this investigation, the peak of the ICC function moves to.71 ms, which corresponds to 7 azimuth position (dotted line). For an impulse response measured in the real seminar room (dashed-dotted line), the ICC peak even moves to.53 ms (47 azimuth) in the eighth frequency band. The shift of the cross-correlation peak depends on the frequency. In the fourth frequency band, for example, the azimuth positions of the two reverberant environments are almost reversed 963

Figure 6. Results for the third modeling approach based on the ICCD algorithm. The target and distracter and the form of data presentation are identical to those in Figure 3 ( and azimuth).

9 Figure 6. Results for the third modeling approach based on the ICCD algorithm. The target and distracter and the form of data presentation are identical to those in Figure 3 ( and azimuth). The ICCD function is shown for four different T/D-ratios: db (top-left graph), ;5 db (top-right graph), ;1 db (bottom-left graph), ;15 db (bottom-right graph). The data are presented as in Figure 3. to their position in the eighth frequency band. The full understanding of the frequency dependance of the crosscorrelation peak shift would require the further analysis of room impulse responses. This, however, will not be further investigated here. The motivation to introduce the impulse response of a real room at this point is to show that the reverberance of our simulated room is comparable to the reverberance of a real seminar room. The reason why we do not observe lateralization shifts in that order in everyday life is most likely due to our ability to resolve them through head movements. However, small lateralization shifts could be observed in our virtual reverberant environment where head movements are not simulated (Figure 4, top-right panel, Paper I) Adaptation processes Braasch and Hartung (Paper I) observed lateralization shifts when a distracter was added to the target (e.g., see section 2.4 in Paper I, -db T/D-ratio). As discussed there, lateralization shifts, in this case for narrow-band signals, are also observed by others. Recall that Canévet and Meunier [21] and Meunier et al. [23] were able to show that direction and magnitude of those perceived lateralization shifts of the target strongly depend on the onset time difference between the distracter and target sound. The lateralization shift occurs toward the direction of the distracter for small onset time differences. Its magnitude decreases with increasing onset delay. As shown for a distracter position at azimuth, the lateralization shift disappears at a certain onset time difference, usually at approximately 1 ms ([21], [23]). Then, with a further increase in the delay time, the lateralization shift occurs in the opposite direction with increasing magnitude for increasing delay times. In Paper I, the delay time was set to 2 ms. At this point, lateralization shifts in the opposite direction of the distracter are usually observed. 964

10 1..5. amplitude [MU] 1..9 g= 23 g=.5 26 g=1 33 g= g=2 54 g=3 79 amplitude [MU] azimuth [ ] Figure 7. ICC functions averaged over the frequency bands Every condition is simulated 1 times. The target positions are at and the distracter positions are at. The top panel shows the ICC function for target-distracter pairs at -db T/D-ratio. The panel in the middle shows the ICC functions for the single targets (solid line) and distracters (dashed-dotted line). The bottom panel shows the ICCD functions of the target-distracter pairs. amplitude [MU] amplitude [MU] th frequency band mirror image 47 real room 68 anechoic condition interaural time difference [ms] th frequency band real room 47 mirror image interaural time difference [ms] 7 anechoic condition Figure 8. Shifts in the peaks of the ICC functions for a sound source presented in three different environments: anechoic (solid line), reverberant using the mirror-image technique (dotted line), measured binaural room impulse response (dashed-dotted line) at an azimuth of. The top graph shows the fourth frequency band (center frequency 293 Hz) and the bottom graph the eighth frequency band (center frequency 621 Hz). The constancy of these effects is observed for three different distracter positions (,, ) throughout the whole frontal horizontal plane in Paper I. This suggests the idea of introducing a simple time-dependent subtraction factor g(t) into the cross-correlation difference model as follows: T = Y ; g(t) D (11) with t being the delay time between the distracter and target onset interaural time difference [ms] Figure 9. Lateralization shifts in the seventh frequency band (center frequency 527 Hz) for different values of the subtraction factor g(t) (dotted lines). The ICC functions for the single target ( azimuth) and distracter ( azimuth) are shown by the solid and dashed-dotted lines. Figure 9 shows how the positions of the cross-correlation peak of the target ( ), and therefore its estimated positions, shift with a varying subtraction factor g(t) (distracter, -db T/D-ratio) for the seventh frequency band (center frequency 527 Hz). The position of the correlation peak for g(t)=1 is nearly identical to the position for the target-alone condition. When g(t) is decreased, the correlation peak moves toward the distracter, and if it is greater than 1, it moves in the opposite direction. The aim is not to estimate g(t) for numerous delay times, but rather to investigate if such an approach makes sense in general. The difficulty of estimating the whole time course of g(t) is that it is not only a function of the delay time, but that it also depends on the duration and the sound characteristics of the target and distracter. However, some general conclusions can be made. At t=, we expect g(t) to be zero as there is no preceding distracter part to estimate D. If the target and distracter have enough similarity, the summing localization effect can be observed [1]. The results of the listening test presented in section 5 do not disagree with that prediction, although one should keep in mind that two of the listeners seemed to be confused and showed a different localization behavior from the proposed model. At a certain point t, where no lateralization shifts are observed, the function g(t) will be one. It seems that g(t) increases monotically above a threshold, with g(t) < 1 for t < t and g(t) > 1 for t > t. Below that threshold, the estimation of g(t) is certainly more complex because of mechanisms related to the precedence effect that are observed for small delay times. In a physiological sense, we could interpret the subtraction process in the ICCD model as an inhibition process. The longer the distracter is exposed before the target onset, the greater the degree of inhibition. 7. Simulation of the localization experiments 7.1. Methods To evaluate whether the ICCD model is able to simulate localization at low T/D-ratios, the stimuli are presented to the algorithm analogous to Experiment I in Paper I. The ten different target and distracter noise bursts that were al- 965

11 ready used to create Figure 7 are presented to the model for all combinations of 13 target directions (; to azimuth in steps of 15 ) and 4 T/D-ratios from to ;15 db (in steps of 5 db). The position of the target is determined by the maximum of the ICCD function in the azimuth range between ; and. If not mentioned otherwise, the subtraction factor g(t) is set to 1.7 and the frequency bands 3 to 12 are analyzed throughout the model simulation Lateralization shifts The results for the model simulation for the target-alone condition and the distracted condition at a T/D-ratio of db are displayed in Figure 1, analogous to Figure 1 in Paper I, which showed the psychoacoustic results. Here and in the following, the data are presented in the L/R direction of the three-pole coordinate system (e.g. [24], [25]), to be congruent with the presentation of the psychoacoustic data. 1 It should be noted that, in the model simulation, the L/R direction is identical to the azimuth, because elevations other than are not considered. In each panel, the single-source condition is represented by x s and a solid curve, the condition at a T/D-ratio of db by + s and a dashed curve. The top-left panel shows the anechoic condition with a distracter at. In this case, a simple subtraction factor (g(t)=1.7) is sufficient to simulate the lateralization shifts. Several characteristics are present in both the model and the psychoacoustic results: the direction and magnitude of the shift are similar throughout the frontal horizontal plane; the curves coincide at the outer angles and the position of the distracter. Note that the largest shifts for both the model simulation and the psychoacoustic evaluation are found near, but not directly at, the position of the distracter. Even though the structure of the model is principally symmetric, asymmetries can be found between the left and the right hemispheres. These asymmetries occur because the human HRTFs, which are used in this model, are not perfectly symmetric. Lateralization shifts can also be simulated when the stimuli are presented in a reverberant condition (Figure 1, top-right panel). Further, it is apparent that the span of the response decreases to approximately. This effect has already been discussed in section 6.3.1, and it is caused by the interaural symmetry of the late reflections that move the auditory event toward the median plane. Interestingly, in the psychoacoustic test, the localization curves for the target-alone and the distracted condition coincide at ; and 45 and separate again at the outer angles. We do not provide an explanation for this, but the model simulation shows a similar coincidence effect at ;75 and 75. This indicates that the effect could be explained by the physical properties of the reverberant environment based on 1 In the L/R dimension, is in the left hemisphere and ; in the right hemisphere analogous to the azimuth of the head-related coordinate system [1]. estimated left/right[ ], anechoic, anechoic, echoic, anechoic presented left/right[ ] Figure 1. Model localization performance in the L/R dimension. Each data point shows the median averaged over ten different target-distracter presentations. The distracting sound is presented at a T/D-ratio of db at the following angles: /anechoic condition (top-left panel), /reverberant condition (top-right panel), /anechoic condition (bottom-left panel) and /anechoic condition (bottom-right panel). The + s with the dashed curves mark the distracted condition; the x s with the solid curves mark the target-alone condition. estimated left/right[ ] presented left/right[ ] Figure 11. Same as the distracter condition in Figure 1 (bottom-right panel), but only the frequency bands 3 9 (instead of 3 12) are considered in the simulation. the mirror-image technique and human HRTFs. The minor differences between the psychoacoustic data and the model simulation can be explained by differences in the HRTFs and the small differences in generating the reverberation that are used in both cases. The model also simulates psychoacoustic data that has been obtained for lateralized distracter positions (Figure 1, bottom panels). The localization curves for the asymmetric distracters are quite similar to those obtained in the psychoacoustic tests (g(t)=1.7). The simulation of the distracter position can be improved if the model is restricted to the frequency bands 3 to 9 instead of 3 to 12 (Figure 11). The reason for this is the existence of multiple cross-correlation peaks that are found in higher frequency 966

12 target alone db target alone db estimated left/right[ ] 5 db 1 db estimated left/right[ ] 5 db 1 db 15 db presented left/right[ ] 15 db presented left/right[ ] Figure 12. Model localization performance in an anechoic environment for ten target/distracter repetitions in the L/R dimension for different T/D-ratios. The distracter is placed at azimuth, the subtraction factor g(t) is set to 1.7. Figure 13. Same as Figure 12, but for g(t)=1.. bands. Near the peak position in the contralateral side, the estimated target position shifts toward the actual position of the distracter. Reducing the bandwidth is the easiest way to eliminate the effect, as no frequency dependent weighting factor has been included in the model so far Localization at low T/D-ratios Next, the model is examined for lower T/D-ratios. Figure 12 shows the results for the anechoic environment. Figure 13 shows the same simulation, but this time the subtraction factor g(t) is adjusted to 1.. The reason for varying the subtraction factor is to show how the characteristic patterns of clustering and scattering, as they are described in Paper I, change with the variation of g(t). Asit will be shown later, the individual characteristics of clustering and scattering may be explained by assuming individual differences in the adaptation process, which is simulated in the model using the subtraction factor g(t).inthe target-alone condition, the estimated target positions coincide with the presented target directions with one exception. Targets that are presented from ;75 are localized at ;. However, in Paper I the listeners responses for the ;75 angle are usually very similar to those of the ; angle, too. The reason for this estimation error is that the ITDs of the HRTFs become nearly independent of the azimuth for the outer angles. estimated left/right[ ] target alone 5 db db 1 db presented left/right[ ] Figure 14. Model localization performance in a reverberant environment for ten target/distracter repetitions in the L/R dimension for different T/D-ratios. The distracter is placed at azimuth, the subtraction factor g(t) is set to 1.7. In the following, the data for the condition with g(t)=1. are discussed (Figure 13). At a T/D-ratio of db, the target directions can still be estimated quite accurately. Only at the outer angles, the estimations are slightly scattered, due to the interference and distracter amplitude fluctua- 967

13 estimated left/right[ ] target alone -5 db db -1 db presented left/right[ ] Figure 15. Same as Figure 14, but for g(t)=1.. tions described in section 6.3. These variations increase and spread toward the inner angles when the T/D-ratio is decreased to ;5 db, and the effects become more apparent when the T/D-ratio is decreased to ;1 db. The localization performance of the model deteriorates dramatically when the T/D-ratio decreases further. At a T/D-ratio of ;15 db, the distribution functions of the estimations can be grouped in three groups: left (45 to ), frontal (; to ) and right (;45 to ; ). Only in the frontal group the target positions vary slightly with the presented target position. The estimated angles move toward the side, if the subtraction factor is set to 1.7 (Figure 12). At a T/D-ratio of ;1 db, the estimated responses group to two clusterings, one for the left and the other one for the right angles. In contrast to the psychoacoustic data, a cluster at for the target positions near the median plane does not show up, but it was demonstrated that the listeners are no longer able to even detect the target within the distracter in those cases. One must keep in mind that the model does not yet have an implemented detection stage. The results for the model simulation in the reverberant environment are presented in Figure 14 (g(t)=1.7) and Figure 15 (g(t)=1.). In the target-alone condition, the target lateralization is systematically underestimated. This effect, which is caused by the room reflections, is also found in the psychoacoustic experiments. When the distracter is added at -db T/D-ratio, there is not much change except for a greater variance when g(t)=1.7. When the subtraction factor g(t) is set to 1. (Figure 15), there is less variance, but increased saturation at smaller angles in responses. With a further decrease in the T/D-ratio to ;5 db, however, localization performance becomes worse. For g(t)=1. (Figure 15), the distribution functions of the target estimation can be divided into three different groups: left ( to ), frontal (;15 to 15 ) and right (; to ; ). Within each group the distribution functions remain similar. At a T/D-ratio of ;1 db, the estimation of the target is almost near for all cases, independent of the direction of incidence. Also in the reverberant environment (Figure 14), the estimated target angles shift to the side, if g(t) is moved from 1. to 1.7. At a T/D-ratio of ;1 db, the responses are widely scattered for all presented target angles. 8. Discussion Several effects can be simulated with the ICCD model, even though it does not take ILD cues into account, like the model of Lindemann [7] or complex weighting functions like those of Stern et al. [9]. It enables the target to be adequately estimated at a T/D-ratio of db. At very low T/D-ratios (< ;1 db), the performance of the model is similar to human localization performance and shows the effects of clustering and scattering, as found in Paper I. We could observe a difference of approximately ;5 db between similar human localization performances in the anechoic environment and localization in the reverberant environment. This effect could be reproduced in the model simulation. Introducing a subtraction factor g(t) enables the model to simulate the dependence of lateralization shifts on the onset delay t. The results are similar to those of Kashino and Nishida [22], who simulated the localization aftereffect by introducing a gain control to the cross-correlation model. Besides showing lateralization effects, g(t) also demonstrates the robustness of the ICCD model. Even if the cross-correlation function of the distracter D is subtracted at -db T/D-ratio using a factor of 1.7, the esti- mated position of the target only changes by a few degrees. Several suggestions have been made as to how humans localize in reverberant environments. Here, the precedence effect and the law of the first wavefront are often used as explanations [26]. The position of a sound source in a room is localized based on the direction of its first wave front while the directions of the reflecting sources are suppressed by the precedence effect. However, the majority of investigations that support this theory are based on clicks or sinusoidal sounds [27]. In both cases onsets become very important. In the case of the sinusoidal sound, a strong onset will lead to a broad frequency spectrum in the onset area. For the noise bursts that are used in our investigation, the precedence effect does not seem to have much influence on localizing the target. Otherwise, the influence of the reflection would not shift the listeners auditory events toward the center. A stationary model that simply averages over the direct sound and the reflections produces very similar results to those of the listeners. Acknowledgement This work was supported by the Deutsche Forschungsgemeinschaft within the graduate program KOGNET. I 968

Localization in the Presence of a Distracter and Reverberation in the Frontal Horizontal Plane. I. Psychoacoustical Data

942 955 Localization in the Presence of a Distracter and Reverberation in the Frontal Horizontal Plane. I. Psychoacoustical Data Jonas Braasch, Klaus Hartung Institut für Kommunikationsakustik, Ruhr-Universität