EE E6820: Speech & Audio Processing & Recognition
Lecture 8: Spatial sound

1 Spatial acoustics
2 Binaural perception
3 Synthesizing spatial audio
4 Extracting spatial sounds

Dan Ellis <dpwe@ee.columbia.edu>
http://www.ee.columbia.edu/~dpwe/e6820/

E6820 SAPR - Dan Ellis - L08: Spatial sound - 2002-04-01
1 Spatial acoustics

Received sound = source + channel
- so far we have considered only the ideal source waveform

Sound carries information about its spatial origin
- e.g. ripples in the lake
- great evolutionary significance

The basis of scene analysis?
- yes and no - try blocking an ear
Ripples in the lake

[Figure: source and listener; wavefront propagating at c m/s; energy ∝ 1/r^2]

Effect of relative position on sound:
- delay = r/c
- energy decay ~ 1/r^2
- frequency-dependent absorption ~ G(f)^r
- direct energy plus reflections

These give cues for recovering the source position
- describe a wavefront by its normal
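The delay and decay relations above can be checked numerically; a minimal sketch, assuming c = 343 m/s and a free field with no reflections:

```python
C = 343.0  # speed of sound in air, m/s (room temperature)

def propagation(r):
    """Delay and relative energy for a source at distance r metres.

    Delay follows r/c; energy follows the inverse-square law
    (free field, direct path only, no reflections)."""
    delay = r / C          # seconds
    energy = 1.0 / r ** 2  # relative to unit energy at 1 m
    return delay, energy

delay, energy = propagation(10.0)  # ~29 ms away, 1/100 of the energy at 1 m
```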
Recovering spatial information

Source direction as wavefront normal
- a moving plane can be recovered from its arrival times at 3 points
- for two points A, B: path difference s = AB cos θ, so the time difference is t = s/c
- need to solve the correspondence problem between sensors

Space: need 3 parameters to fix a position
- e.g. azimuth θ, elevation φ, and range r
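The timing relation s = AB cos θ can be inverted to estimate a bearing from a measured time difference; a small sketch for two sensors (the function name and spacing are illustrative):

```python
import math

C = 343.0  # speed of sound, m/s

def azimuth_from_tdoa(tau, d):
    """Bearing theta (radians) of a plane wavefront from the
    time-difference-of-arrival tau between two sensors d metres apart:
    tau = (d * cos(theta)) / c  =>  theta = acos(c * tau / d)."""
    x = C * tau / d
    if not -1.0 <= x <= 1.0:
        raise ValueError("TDOA inconsistent with sensor spacing")
    return math.acos(x)

theta = azimuth_from_tdoa(0.0, 0.2)  # zero delay -> broadside (90 degrees)
```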
The effect of the environment

Reflection causes additional wavefronts
- plus diffraction & shadowing, scattering, absorption
- many paths → many echoes

Reverberant effect
- causal smearing of signal energy

[Spectrograms, 0-8 kHz, 0-1.5 s: dry speech (airvib16) vs. the same speech plus reverb from hlwy16]
Reverberation impulse response

Exponential decay of reflections: h_room(t) ~ e^(-t/T)
[Spectrogram of the hlwy16 impulse response (128-pt window): energy decaying from -10 to -70 dB over ~0.7 s]

Frequency-dependent
- greater absorption at high frequencies → faster decay

Size-dependent
- larger rooms → longer delays → slower decay

Sabine's equation: RT60 = 0.049 V / (S α)
- time constant increases with room size, decreases with absorption
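Sabine's equation as given (the 0.049 constant implies imperial units: V in cubic feet, S in square feet) can be applied directly; the room dimensions below are made up for illustration:

```python
def rt60_sabine(volume_ft3, surface_ft2, alpha):
    """Sabine's reverberation-time formula as on the slide:
    RT60 = 0.049 * V / (S * alpha), with V in cubic feet, S in square
    feet, and alpha the average absorption coefficient (0..1)."""
    return 0.049 * volume_ft3 / (surface_ft2 * alpha)

# a ~10 x 15 x 20 ft room with fairly absorbent surfaces
rt60 = rt60_sabine(volume_ft3=3000.0, surface_ft2=1300.0, alpha=0.3)
```

Doubling the volume doubles RT60; doubling the absorption halves it, matching the size/absorption trends on the slide.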
Outline

1 Spatial acoustics
2 Binaural perception
- The sound at the two ears
- Available cues
- Perceptual phenomena
3 Synthesizing spatial audio
4 Extracting spatial sounds
2 Binaural perception

[Diagram: source to L/R ears; path-length difference; head shadow attenuates the far ear at high frequencies]

What is the information in the 2 ear signals?
- the sound of the source(s) (L+R)
- the position of the source(s) (L-R)

[Example left/right waveforms from the ShATR database (shatr78m3), 2.2-2.235 s]
Main cues to spatial hearing

Interaural time difference (ITD)
- from different path lengths around the head
- dominates at low frequencies (< 1.5 kHz)
- max ~750 µs → ambiguous for frequencies above ~600 Hz

Interaural intensity difference (IID)
- from head shadowing of the far ear
- negligible at low frequencies; increases with frequency

Spectral detail (from pinna reflections)
- useful for elevation & range

Direct-to-reverberant energy ratio
- useful for range
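For intuition about the ~750 µs maximum ITD, a common approximation (not named on the slide) is Woodworth's spherical-head formula; the head radius here is an assumed typical value:

```python
import math

C = 343.0             # speed of sound, m/s
HEAD_RADIUS = 0.0875  # assumed typical head radius, metres

def itd_woodworth(azimuth_rad):
    """Woodworth's spherical-head approximation of the interaural time
    difference: ITD = (a/c) * (theta + sin(theta)), where theta is the
    source azimuth (0 = straight ahead, pi/2 = directly to one side)."""
    return HEAD_RADIUS / C * (azimuth_rad + math.sin(azimuth_rad))

itd_side = itd_woodworth(math.pi / 2)  # maximum ITD, a bit under a millisecond
```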
Head-Related Transfer Functions (HRTFs)

Capture the source-to-ear coupling as impulse responses h_l(t; θ, φ, r) and h_r(t; θ, φ, r)

Collection: CIPIC database (http://phosphor.cipic.ucdavis.edu/)
[Plots: left and right HRIRs for subject HRIR_021 at 0° elevation, azimuths -45° to +45°, first 1.5 ms]

Highly individual!
Cone of confusion

Interaural timing cue dominates (below ~1 kHz)
- arises from the differing path lengths to the two ears

But: ITD only resolves the source to a cone of approximately equal ITD around the interaural axis
- up/down? front/back?
Further cues

Pinna causes elevation-dependent coloration
- monaural perception: can we separate the coloration from the source spectrum?

Head motion
- spectral changes synchronized with movement
- also disambiguates ITD (front/back) etc.
Combining multiple cues

Both ITD and IID influence azimuth; what happens when they disagree?
- identical signals at both ears → image is centered
- delaying the right channel (e.g. by 1 ms) moves the image to the left
- attenuating the left channel then returns the image to the center
- time-intensity trading at around 0.1 ms per dB
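The 0.1 ms/dB trading figure can be turned into a trivial rule-of-thumb calculation (a coarse simplification of the perceptual data, for illustration only):

```python
TRADING_RATIO_MS_PER_DB = 0.1  # approximate time-intensity trading ratio

def trading_db(delay_ms):
    """Intensity difference (dB, favouring the delayed ear) that roughly
    re-centers an image displaced by an interaural delay, using the
    ~0.1 ms/dB trading ratio quoted on the slide."""
    return delay_ms / TRADING_RATIO_MS_PER_DB

db = trading_db(0.5)  # a 0.5 ms delay trades against ~5 dB
```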
Binaural position estimation

Imperfect results (Arruda, Kistler & Wightman 1992):
[Scatter plot: judged azimuth vs. target azimuth, -180° to +180°]
- listening through the wrong (non-individualized) HRTFs → errors
- front/back reversals stay on the cone of confusion
The Precedence Effect

Reflections give misleading spatial cues
[Diagram: direct and reflected paths; the reflection arrives later by its extra path length over c]

But: spatial impression is based on the first wavefront
- localization then "switches off" for ~50 ms
- ... even if the reflections are louder
- ... the later reflections instead contribute to the impression of the room
Binaural Masking Release

Adding noise can reveal a target:
- tone + noise in one ear: the tone is masked
- add identical noise to the other ear: the tone becomes audible
- why does this make sense?

Binaural Masking Level Difference of up to 12 dB
- greatest for noise in phase, tone in anti-phase
Outline

1 Spatial acoustics
2 Binaural perception
3 Synthesizing spatial audio
- Position
- Environment
4 Extracting spatial sounds
3 Synthesizing spatial audio

Goal: recreate a realistic soundfield
- hi-fi experience
- synthetic environments (VR)

Constraints
- resources
- information (individual HRTFs)
- delivery mechanism (headphones)

Source material types
- live recordings (actual soundfields)
- synthetic (studio mixing, virtual environments)
Classic stereo

Intensity panning: no timing modifications, just vary the relative level over ±20 dB
- works as long as the listener is equidistant from the two speakers

Surround sound: extra channels in center, sides, ...
- same basic effect
- pan between adjacent pairs
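The slide specifies level-only panning; one common concrete realization (not necessarily the one intended here) is the constant-power sin/cos pan law:

```python
import math

def constant_power_pan(x, pan):
    """Level-only stereo panning: no timing changes, just channel gains.

    pan in [-1, 1]: -1 = hard left, 0 = center, +1 = hard right.
    Uses the common constant-power (sin/cos) law, so the total power
    L^2 + R^2 stays constant as the image moves across the stage."""
    angle = (pan + 1.0) * math.pi / 4.0  # map [-1, 1] -> [0, pi/2]
    gl, gr = math.cos(angle), math.sin(angle)
    return [gl * s for s in x], [gr * s for s in x]

left, right = constant_power_pan([1.0, 0.5, -0.25], pan=0.0)  # centered image
```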
Simulating reverberation

Can characterize reverb by its impulse response
- spatial cues are important → record in stereo
- IRs of ~1 sec → very long convolutions

Image model: reflections behave as duplicate (image) sources
[Diagram: source, listener, reflected path, virtual image sources beyond the walls]

Early echoes in the room impulse response h_room(t): direct path followed by discrete early echoes
- an actual reflection may be a filter h_ref(t), not a pure δ(t)
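A sketch of the tapped-delay-line view of early echoes: each image source contributes a tap at delay r/c with amplitude 1/r, and reverberation is the (long) convolution of the dry signal with this IR. The path distances below are invented:

```python
C = 343.0  # speed of sound, m/s

def echo_impulse_response(paths, fs):
    """Room IR from a list of propagation path lengths in metres
    (direct path first, then image sources): each contributes a tap at
    delay r/c with amplitude 1/r, sampled at fs Hz."""
    taps = [(round(r / C * fs), 1.0 / r) for r in paths]
    ir = [0.0] * (max(n for n, _ in taps) + 1)
    for n, g in taps:
        ir[n] += g
    return ir

def convolve(x, h):
    """Direct convolution: reverberant signal = dry signal * room IR."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

# direct path at 3 m plus two image-source reflections
ir = echo_impulse_response([3.0, 5.2, 7.9], fs=8000)
wet = convolve([1.0, 0.0, 0.0], ir)
```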
Artificial reverberation

Reproduce the perceptually salient aspects
- early echo pattern (→ room size impression)
- overall decay tail (→ wall materials ...)
- interaural coherence (→ spaciousness)

Nested allpass filters (Gardner '92)
- allpass section: H(z) = (z^-k - g) / (1 - g z^-k)
- impulse response: -g at n = 0, then (1-g^2), g(1-g^2), g^2(1-g^2), ... at n = k, 2k, 3k, ...
[Diagram: nested + cascaded allpasses with (k, g) = (50, 0.5), (30, 0.7), (20, 0.3); synthetic reverb from allpasses AP0-AP2 with a lowpass-filtered feedback loop (gain g) and output taps a0, a1, a2]
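A minimal implementation of the allpass building block H(z) = (z^-k - g)/(1 - g z^-k); its impulse response reproduces the tap pattern on the slide (-g, then (1-g^2), g(1-g^2), ... spaced k samples apart):

```python
def allpass(x, k, g):
    """Schroeder allpass section, H(z) = (z^-k - g) / (1 - g z^-k):
    flat magnitude response, but smears energy in time - the building
    block of the nested/cascaded reverberators on the slide.

    Recursion: v[n] = x[n] + g*v[n-k];  y[n] = v[n-k] - g*v[n]."""
    state = [0.0] * k  # delay line holding v[n-k] ... v[n-1]
    y = []
    for xn in x:
        vk = state[0]          # v[n-k]
        v = xn + g * vk
        y.append(vk - g * v)
        state = state[1:] + [v]
    return y

# impulse response of a single section with k=3, g=0.5
h = allpass([1.0] + [0.0] * 9, k=3, g=0.5)
```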
Synthetic binaural audio

Convolving a source with {L,R} HRTFs gives precise positioning
- ... for headphone presentation
- multiple sources can be combined by adding

Where to get HRTFs?
- a measured set; but it is specific to one individual and to discrete positions
- interpolate by linear crossfade, or use a PCA basis set
- or: a parametric model - delay, head shadow, pinna echoes (after Brown & Duda '97)
[Block diagram, per ear: azimuth-dependent delay z^-t_D(θ); one-pole head-shadow filter (1 - b(θ) z^-1) / (1 - a z^-1); sum of pinna-echo taps p_k(θ,φ) z^-t_Pk(θ,φ); plus a room-echo path K_E z^-t_E]

Head motion cues?
- head tracking + fast updates
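The core rendering step - convolve each source with its left/right HRIRs and sum - can be sketched directly; the HRIR coefficients below are toy numbers, not measured data:

```python
def fir_filter(x, h):
    """Direct-form FIR convolution."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def binaural_render(sources):
    """Mix mono sources into a binaural pair: convolve each source with
    its left/right head-related impulse responses and sum over sources,
    as on the slide. Assumes each source's two HRIRs have equal length."""
    n = max(len(s) + len(hl) - 1 for s, hl, hr in sources)
    left, right = [0.0] * n, [0.0] * n
    for s, hl, hr in sources:
        for i, v in enumerate(fir_filter(s, hl)):
            left[i] += v
        for i, v in enumerate(fir_filter(s, hr)):
            right[i] += v
    return left, right

src = [1.0, 0.5]
hl = [1.0, 0.3, 0.1]  # toy HRIR: nearer ear, stronger & earlier
hr = [0.0, 0.6, 0.2]  # toy HRIR: farther ear, delayed & shadowed
left, right = binaural_render([(src, hl, hr)])
```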
Transaural sound

Binaural signals without headphones?

Can cross-cancel the wrap-around paths
- speakers S_{L,R}, ears E_{L,R}, binaural signals B_{L,R}, speaker-to-ear transfer functions H_{LL}, H_{LR}, H_{RL}, H_{RR}
- S_L = (1 / H_LL) (B_L - H_RL S_R)
- S_R = (1 / H_RR) (B_R - H_LR S_L)

Narrow sweet spot
- head motion?
Soundfield reconstruction

Stop thinking about ears - just reconstruct the pressure field and its spatial derivatives
- p(t), ∂p/∂x, ∂p/∂y, ∂p/∂z → p(x, y, z, t)
- ears placed in the reconstructed field receive the same sounds

Complex reconstruction setup (Ambisonics)
- able to preserve head-motion cues?
Outline

1 Spatial acoustics
2 Binaural perception
3 Synthesizing spatial audio
4 Extracting spatial sounds
- Microphone arrays
- Modeling binaural processing
4 Extracting spatial sounds

Given access to the soundfield, can we recover the separate components?
- degrees of freedom: recovering >N signals from N sensors is hard
- but: people can do it (somewhat)

Information-theoretic approach
- use only very general constraints
- rely on precision measurements

Anthropic approach
- examine human perception
- attempt to use the same information
Microphone arrays

Signals from multiple microphones can be combined to enhance or cancel particular sources

Coincident mics with different directional gains
- mixing: m1 = a11 s1 + a12 s2, m2 = a21 s1 + a22 s2, i.e. m = A s
- recover source estimates by inversion: ŝ = A^-1 m

Microphone arrays (endfire)
[Beam patterns for a 4-element endfire array with spacing D, at wavelengths λ = 4D, 2D, D; 0 to -40 dB]
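For coincident mics, the 2x2 unmixing ŝ = A^-1 m is just a matrix inversion; a sketch with invented gains (real mixtures also involve delays, which this instantaneous model ignores):

```python
def unmix2x2(a, mics):
    """Recover two sources from two coincident microphones by inverting
    the 2x2 instantaneous mixing matrix A (m = A s  =>  s = A^-1 m)."""
    (a11, a12), (a21, a22) = a
    det = a11 * a22 - a12 * a21
    if abs(det) < 1e-12:
        raise ValueError("mixing matrix is singular")
    out = []
    for m1, m2 in mics:
        s1 = (a22 * m1 - a12 * m2) / det
        s2 = (-a21 * m1 + a11 * m2) / det
        out.append((s1, s2))
    return out

A = [[1.0, 0.5], [0.3, 1.0]]  # illustrative directional gains
# synthesize mic signals from known sources, then recover them
srcs = [(1.0, -1.0), (0.25, 0.5)]
mics = [(A[0][0] * s1 + A[0][1] * s2, A[1][0] * s1 + A[1][1] * s2)
        for s1, s2 in srcs]
recovered = unmix2x2(A, mics)
```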
Adaptive Beamforming & Independent Component Analysis (ICA)

Formulate mathematical criteria to optimize

Beamforming: drive the interference to zero
- cancel energy during non-target intervals

ICA: maximize the mutual independence of the outputs
- adapt the unmixing weights via higher-order moments during overlap (∂MutInfo/∂a)

Limited by the separation model's parameter space
- only N×N?
Binaural models

Can human listeners do better?
- certainly given only 2 channels

Extract ITD and IID cues?
[Plot: interaural cross-correlation vs. lag (±6 ms) and center frequency (100-3200 Hz); the peak marks the target azimuth]
- cross-correlation finds interaural timing differences
- coincidence detectors consume counter-moving pulses
- how to incorporate IID and time-intensity trading?
- vertical cues ...
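The cross-correlation cue extraction can be sketched as a search for the lag that best aligns the two ear signals (a bare-bones version, without the per-frequency-band analysis shown in the plot):

```python
def itd_by_crosscorr(left, right, max_lag):
    """Estimate the interaural time difference as the lag (in samples)
    maximizing the cross-correlation sum_n left[n - lag] * right[n].
    Positive lag means the right-ear signal arrives later."""
    best_lag, best_val = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        val = sum(left[n - lag] * r
                  for n, r in enumerate(right)
                  if 0 <= n - lag < len(left))
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag

# right channel is the left channel delayed by 3 samples
left = [0.0, 1.0, 0.5, -0.3, 0.0, 0.0, 0.0, 0.0]
right = [0.0, 0.0, 0.0, 0.0, 1.0, 0.5, -0.3, 0.0]
lag = itd_by_crosscorr(left, right, max_lag=5)
```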
Nonlinear filtering

How to separate sounds based on direction?
- estimate direction locally (per time-frequency cell)
- choose a target direction
- remove energy arriving from other directions

E.g. Kollmeier, Peissig & Hohmann '93:
[Block diagram: windowed FFT analysis of L and R; cross-products |L_w|^2, L_w R_w*, |R_w|^2 smoothed by one-pole filters (1-a)/(1-a z^-1) to give S_LL, S_LR, S_RR; a gain factor g per time-frequency cell X_w(mH, 2πk/NT); overlap-add FFT resynthesis of l' and r']
- IID from |L_w| / |R_w|; ITD (as interaural phase difference) from arg{L_w R_w*}
- match against the IID/IPD template for the desired direction
- what about reverberation?
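The per-cell gain logic of this style of processing can be sketched on complex STFT values directly (the tolerances and the hard 0/1 mask are simplifications; a real system would use the smoothed statistics S_LL, S_LR, S_RR):

```python
import cmath
import math

def direction_gain(Lw, Rw, target_ild_db, target_ipd,
                   ild_tol_db=3.0, ipd_tol=0.5):
    """Per-frequency-cell gain in the spirit of Kollmeier et al.:
    keep cells whose interaural level difference (ILD, dB) and phase
    difference (IPD, radians) match the target direction's template,
    zero out the rest."""
    gains = []
    for l, r in zip(Lw, Rw):
        ild = 20 * math.log10((abs(l) + 1e-12) / (abs(r) + 1e-12))
        ipd = cmath.phase(l * r.conjugate())
        ok = (abs(ild - target_ild_db) < ild_tol_db
              and abs(ipd - target_ipd) < ipd_tol)
        gains.append(1.0 if ok else 0.0)
    return gains

# two cells: one from straight ahead (equal level, in phase),
# one from the side (louder in the left ear, phase-shifted)
Lw = [1.0 + 0.0j, 2.0 + 0.0j]
Rw = [1.0 + 0.0j, 0.5 * cmath.exp(-1j * 1.5)]
g = direction_gain(Lw, Rw, target_ild_db=0.0, target_ipd=0.0)
```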
Summary

Spatial sound
- sampling at more than one point gives information on the direction of origin

Binaural perception
- time & intensity cues used between/within ears

Sound rendering
- conventional stereo
- HRTF-based

Spatial analysis
- optimal linear techniques
- elusive auditory models
References

B.C.J. Moore, An Introduction to the Psychology of Hearing (4th ed.), Academic Press, 1997.
J. Blauert, Spatial Hearing (revised ed.), MIT Press, 1996.
R.O. Duda, Sound Localization Research, http://www.engr.sjsu.edu/~duda/duda.research.frameset.html