EE E682: Speech & Audio Processing & Recognition Lecture 4: Auditory Perception 1 2 3 4 5 6 Motivation: Why & how Auditory physiology Psychophysics: Detection & discrimination Pitch perception Speech perception Auditory organization & Scene analysis Dan Ellis <dpwe@ee.columbia.edu> http://www.ee.columbia.edu/~dpwe/e682/ Columbia University Dept. of Electrical Engineering Spring 26 E682 SAPR - Dan Ellis L4 - Perception 26-2-9-1 1 Why study perception? Perception is messy: Can we avoid it? No! Audition provides the ground truth in audio - what is relevant and irrelevant - subjective importance of distortion (coding etc.) - (there could be other information in sound...) Some sounds are designed for audition - co-evolution of speech and hearing The auditory system is very successful - we would do extremely well to duplicate it We are now able to model complex systems - faster computers, bigger memories E682 SAPR - Dan Ellis L4 - Perception 26-2-9-2
How to study perception? Three different approaches: Analyze the example: physiology - dissection & nerve recordings Black box input/output: psychophysics - fit simple models of simple functions Information processing models - investigate and model complex functions - e.g. scene analysis, speech perception E682 SAPR - Dan Ellis L4 - Perception 26-2-9-3 Outline 1 2 3 4 5 6 Motivation Physiology - Outer, middle & inner ear - The Auditory Nerve and beyond - Models Psychophysics Pitch perception Speech perception Scene analysis E682 SAPR - Dan Ellis L4 - Perception 26-2-9-4
2 Physiology Processing chain from air to brain: Middle ear Auditory nerve Outer ear Inner ear (cochlea) Midbrain Cortex Study via: - anatomy - nerve recordings Signals flow in both directions E682 SAPR - Dan Ellis L4 - Perception 26-2-9-5 Outer & middle ear Pinna Ear canal Middle ear bones Cochlea (inner ear) Eardrum (tympanum) Pinna horn - complex reflections give spatial (elevation) cues Ear canal - acoustic tube Middle ear - bones provide impedance matching E682 SAPR - Dan Ellis L4 - Perception 26-2-9-6
Inner ear: Cochlea Oval window (from ME bones) Travelling wave Basilar Membrane (BM) Cochlea 16 khz Resonant frequency 5 Hz Position 35mm Mechanical input from middle ear starts traveling wave moving down Basilar Membrane Varying stiffness and mass of BM gives results in continuous variation of resonant frequency At resonance, traveling wave energy is dissipated in BM vibration Frequency (Fourier) analysis E682 SAPR - Dan Ellis L4 - Perception 26-2-9-7 Cochlea hair cells Ear converts sound to BM motion; Each point on BM corresponds to a frequency Cochlea Tectorial membrane Basilar membrane Inner Hair Cell (IHC) Auditory nerve Outer Hair Cell (OHC) Hair cells on BM convert motion into nerve impulses (firings) Inner Hair Cells detect motion Outer Hair Cells? Variable damping? [BM animation] E682 SAPR - Dan Ellis L4 - Perception 26-2-9-8
Inner Hair Cells IHCs convert BM vibration into nerve firings Human hear has ~35 IHCs; Each IHC has ~7 connections to Auditory Nerve Each nerve fires (somes) near peak displacement: Local BM displacement Typical nerve signal (mv) 5 / ms Histogram to get firing probability: Firing count Cycle angle E682 SAPR - Dan Ellis L4 - Perception 26-2-9-9 Auditory nerve (AN) signals Single nerve measurements: Tone burst histogram Spike count 1 Frequency threshold db SPL 8 6 4 Tone burst Time 1 ms 3 One fiber: ~ 25 db dynamic range 2 1 Hz 1 khz 1 khz (log) frequency (approx. constant-q) Rate vs. intensity Spikes/sec 2 1 2 Intensity / db SPL 4 6 8 1 Hearing dynamic range > 1 db Hard to measure: probe living ANs? E682 SAPR - Dan Ellis L4 - Perception 26-2-9-1
AN population response All the information the brain has about sound: - average rate & spike timings on 3, fibers freq / 8ve re 1 Hz Not unlike a (constant-q) spectrogram? 5 4 3 2 1 p ( ) 1 2 3 4 5 6 / ms E682 SAPR - Dan Ellis L4 - Perception 26-2-9-11 Beyond the auditory nerve Ascending and descending Tonotopic? - modulation - position - source?? (from lloydwatts.com) E682 SAPR - Dan Ellis L4 - Perception 26-2-9-12
Periphery models IHC Sound Outer/middle ear filtering Cochlea filterbank IHC Modeled aspects: - outer/middle ear - cochlea filtering - hair cell transduction - efferent feedback? Result: neurogram / cochleagram channel 6 5 4 3 2 1 SlaneyPatterson 12 chans/oct from 18 Hz, BBC1tmp (21218).1.2.3.4.5 / s E682 SAPR - Dan Ellis L4 - Perception 26-2-9-13 Outline 1 2 3 4 5 6 Motivation Physiology Psychophysics - Detection theory modeling - Intensity perception - Masking Pitch perception Speech perception Scene analysis E682 SAPR - Dan Ellis L4 - Perception 26-2-9-14
3 Psychophysics Physiology looks at the implementation; Psychology looks at the function/behavior Analyze audition as signal detection: - psychological tests reflect internal decisions - assume optimal decision process - infer nature of internal representations, noise,... lower bounds on more complex functions Different aspects to measure -, frequency, intensity - tones, complexes, noise - binaural - pitch, detuning p( ω O) E682 SAPR - Dan Ellis L4 - Perception 26-2-9-15 Log(loudness rating) 2.4 2.2 2. 1.8 1.6 Basic psychophysics Relate physical and perceptual variables - e.g. intensity loudness frequency pitch Methodology: subject tests - just noticeable difference (jnd) - magnitude scaling e.g. adjust to twice as loud Results for Intensity vs. Loudness: Weber s law I α I log(l) = k log(i) Hartmann(1993) Classroom loudness scaling data 2.6 Textbook figure: L α I.3 1.4-2 -1 1 Power law fit: L α I.22 Sound level / db log2 ( L ) =.3log 2 ( I) log.3 1 I = ------------- log 1 2.3 db = ------------- ------ log 1 2 1 = db Ú 1 E682 SAPR - Dan Ellis L4 - Perception 26-2-9-16
Loudness as a function of frequency Fletcher-Munson equal-loudness curves: Intensity / db SPL 12 1 8 6 4 2 1 1 1, freq / Hz Equivalent loudness @ 1kHz 1 8 6 4 2 Equivalent loudness @ 1kHz 1 1 Hz 1 khz rapid loudness growth 2 4 6 8 Intensity / db 2 4 6 8 6 4 2 8 Intensity / db Hearing impairment: exaggerates E682 SAPR - Dan Ellis L4 - Perception 26-2-9-17 Loudness as a function of bandwidth Same total energy, different distribution: freq mag mag I Same total energy I B I 1 B freq B 1 freq Loudness... but wider perceived as louder Critical Bandwidth B bandwidth - e.g. 2 chans at -6 db (not -1 db) Critical bands: independent freq. channels - ~ 25 total (4-6 / octave) [sndex] E682 SAPR - Dan Ellis L4 - Perception 26-2-9-18
Simultaneous masking A louder tone can mask the perception of a second tone nearby in frequency: masking absolute tone threshold Intensity / db Suggests an internal noise model: p(x I) p(x I) p(x I+ I) masked threshold log freq σ n internal noise decision variable x E682 SAPR - Dan Ellis L4 - Perception 26-2-9-19 I Sequential masking Backward/forward in : simultaneous masking ~1 db masker envelope Intensity / db masked threshold backward masking ~5 ms forward masking ~1 ms - suggests temporal envelope of decision var. Time-frequency masking skirt : Masking tone intensity freq Masked threshold E682 SAPR - Dan Ellis L4 - Perception 26-2-9-2
What we do and don t hear A B X two-interval forced-choice : X = A or B? Timing: 2ms attack resolution, 2ms discrim - but: spectral splatter Tuning: ~ 1% discrimination - but: beats Spectrum: profile changes, formants - variable -frequency resolution Harmonic phase? Noisy signals & texture (Trace vs. categorical memory) E682 SAPR - Dan Ellis L4 - Perception 26-2-9-21 Outline 1 2 3 4 5 6 Motivation Physiology Psychophysics Pitch perception - Place models - Time models - Multiple cues & competition Speech perception Scene analysis E682 SAPR - Dan Ellis L4 - Perception 26-2-9-22
4 Pitch perception: A classic argument in psychophysics Harmonic complexes are a pattern on AN 7 freq. chan. 6 5 4 3 2 1.5.1 /s -.. but give a fused percept (ecological) What determines the pitch percept? - not the fundamental How is it computed? Two competing models: place and E682 SAPR - Dan Ellis L4 - Perception 26-2-9-23 AN excitation Place model of pitch AN excitation pattern shows individual peaks Pattern matching method to find pitch: broader HF channels cannot resolve harmonics resolved harmonics Correlate with harmonic sieve : frequency channel Pitch strength frequency channel Support: Low harmonics are very important But: Flat-spectrum noise can carry pitch E682 SAPR - Dan Ellis L4 - Perception 26-2-9-24
Time model of pitch Timing information is preserved in AN down to ~ 1ms scale Extract periodicity by e.g. autocorrelation & combine across frequency chans: freq per-channel autocorrelation autocorrelation Summary autocorrelation 1 2 3 common period (pitch) lag / ms But: HF gives weak pitch (in practice) E682 SAPR - Dan Ellis L4 - Perception 26-2-9-25 Alternate & competing cues Pitch perception could rely on various cues - average excitation pattern - summary autocorrelation - more complex pattern matching Relying on just one cue is brittle - e.g. missing fundamental Perceptual system appears to use a flexible, opportunistic combination Optimal detector justification? argmax p( ω o) ω p( o ω) p( ω) = argmax ---------------------------------- ω p( o) = argmax po ( 1 ω) po ( 2 ω) p( ω) ω if o 1 and o 2 are conditionally independent E682 SAPR - Dan Ellis L4 - Perception 26-2-9-26
Outline 1 2 3 4 5 6 Motivation Physiology Psychophysics Pitch perception Speech perception - The sounds of speech - Phoneme perception - Context and top-down influences Scene Analysis E682 SAPR - Dan Ellis L4 - Perception 26-2-9-27 5 Speech perception Highly specialized function - subsequent to source organization? -.. but also can interact Kinds of speech sounds: glide nasal vowel fricative stop burst freq / Hz 4 3 2 6 5 4 3 1 1.4 1.6 1.8 2 2.2 2.4 2.6 /s has a watch thin as a dime - vowels - glides - nasals - stops - fricatives... 2 level/db E682 SAPR - Dan Ellis L4 - Perception 26-2-9-28
Cues to phoneme perception Linguists describe speech with phonemes: h ε z has e a w t cl c ^ θ watch thin as a dime - phonemes define minimal word contrasts Acoustic-phoneticians describe phonemes by: formants & transitions I n I z I d a y bursts & onset s m freq transition stop burst vowel formants voicing onset E682 SAPR - Dan Ellis L4 - Perception 26-2-9-29 Categorical perception (Some) speech sounds perceived categorically rather than analogically - e.g. stop-burst & timing: freq f b stop burst vowel formants burst freq f b / Hz 4 3 2 T P K P 1 P i e ε a - tokens within category are hard to distinguish - category boundaries are very sharp Categories are learned for native tongue - merry / mary / marry c following vowel o u E682 SAPR - Dan Ellis L4 - Perception 26-2-9-3
Where is the information in speech? Articulation of high/low-pass filtered speech: high-pass low-pass Articulation / % 8 6 4 2 1 2 3 4 freq / Hz - sums to more than 1... Speech message is highly redundant - e.g. constraints of language, context listeners can understand with very few cues E682 SAPR - Dan Ellis L4 - Perception 26-2-9-31 Top-down influences: Phonemic restoration (Warren 197) freq / Hz 4 3 What if a noise burst obscures speech? 2 1 1.4 1.6 1.8 2 2.2 2.4 2.6 / s - auditory system restores the missing phoneme... based on semantic context... even in retrospect! Subjects are typically unaware of which sounds are restored E682 SAPR - Dan Ellis L4 - Perception 26-2-9-32
freq / Hz freq / Hz 5 4 3 2 1 5 4 3 2 1 A predisposition for speech: Sinewave replicas (Remez et al. 1994) Replace each formant with a single sinusoid:.5 1 1.5 2 2.5 3 - speech is (somewhat) intelligible - people hear both whistles and speech ( duplex ) - processed as speech despite un-speech-like What does it take to be speech? / s Speech Sines E682 SAPR - Dan Ellis L4 - Perception 26-2-9-33 Simultaneous vowels Mix synthetic vowels with different f s: /iy/ @ 1 Hz db /ah/ @ 125 Hz freq + = Pitch difference helps (though not necessary): % both vowels correct 75 5 25 1 / 1 4 / 2 1 2 4 f (semitones) E682 SAPR - Dan Ellis L4 - Perception 26-2-9-34
Computational models of speech perception Various theoretical - practical models of speech comprehension, e.g. : Speech Phoneme recognition Lexical access Words Grammar constraints Open questions: - mechanism of phoneme classification - mechanism of lexical recall - mechanism of grammar constraints ASR is a practical implementation (?) E682 SAPR - Dan Ellis L4 - Perception 26-2-9-35 Outline 1 2 3 4 5 5 Motivation Physiology Psychophysics Pitch perception Speech perception Scene analysis - Events and sources - Fusion and streaming - Continuity & restoration - Simultaneous vowels E682 SAPR - Dan Ellis L4 - Perception 26-2-9-36
6 Auditory Organization frq/hz Detection model is huge simplification Real role of hearing is much more general: Recover useful information from outside world Sound organization into events and sources: 4 Voice 2 2 4 /s Rumble Stab Research questions: - what determines perception of sources? - how do humans separate mixtures? - how much can we tell about a source? E682 SAPR - Dan Ellis L4 - Perception 26-2-9-37 Auditory scene analysis: Simultaneous fusion Harmonics are distinct on AN, but perceived as one sound ( fused ): freq - depends on common onset - depends on harmonicity (common period) Methodologies: - ask subject how many objects - match attributes e.g. object pitch - manipulate higher level e.g. vowel identity E682 SAPR - Dan Ellis L4 - Perception 26-2-9-38
Sequential grouping: streaming Pattern / rhythm: property of set of objects - subsequent to fusion employs fused events? TRT: 6-15 ms frequency 1 khz f: 2 octaves Measure by relative timing judgments - cannot compare between streams Separate coherence and fusion boundaries Can interact and compete with fusion [sndex] E682 SAPR - Dan Ellis L4 - Perception 26-2-9-39 Continuity & restoration Tone is interrupted by noise burst: What happened? freq? + + + - masking makes tone undetectable during noise Need to infer most probable real-world events - observation equally likely for either explanation - prior on continuous tone much higher choose Top-down influence on perceived events... pulsation threshold [sndex] E682 SAPR - Dan Ellis L4 - Perception 26-2-9-4
Models of auditory organization Psychological accounts suggest bottom-up: input mixture Front end signal features (maps) Object formation discrete objects Grouping rules Source groups freq onset period frq.mod - (Brown 1991) Complications in practice: - formation of separate elements - contradictory cues - influence of top-down constraints (context, expectations)... E682 SAPR - Dan Ellis L4 - Perception 26-2-9-41 Summary Auditory perception provides the ground truth underlying audio processing Physiology specifies information available Psychophysics measures basic sensitivities Sound sources requires further organization Strong contextual effects in speech perception Sound Transduce Multiple represent'ns Scene analysis High-level recognition Parting thought: Is pitch central to communication? Why? E682 SAPR - Dan Ellis L4 - Perception 26-2-9-42