SPEAKER RECOGNITION AT OREGON GRADUATE INSTITUTE
June & 6, 997
Sarel van Vuuren and Narendranath Malayath
Hynek Hermansky and Pieter Vermeulen
Oregon Graduate Institute, Portland, Oregon
1. Oregon Graduate Institute
  - Speaker Recognition at OGI
  - Research Group
  - Goals
2. Competitive System
  - Architecture
  - Results
3. Initial Robust System
  - Architecture
  - Preliminary Results and Conclusions
  - Planned Extensions
- People
  - Faculty: Hynek Hermansky, Pieter Vermeulen
  - Post Doc: Nobu Kanedera, Carlos Avendano
  - PhD Students: Sarel van Vuuren, Sangita Tibrewala, Narendranath Malayath
- Speech processing by emulating relevant properties of speech perception
- Collaboration with
  - CSLU at OGI
  - ICSI Berkeley
  - IIT Madras
  - IDIAP Martigny
  - KTH Stockholm
- Activities
  - Speaker identification
  - Acoustic modeling for ASR
  - Enhancement of degraded speech and speech processing for handicapped
  - Human speech perception
Speaker Recognition at OGI
- Speech Signal
  - linguistic message
  - speaker characteristics
  - environment
- Task
  - find out how these information sources are coded into the signal
- Applications
  - speaker ID
  - speaker independent ASR
  - voice mimic
Requirements of a Speaker Verification System
- Invariant to channel
- Invariant to session
- Invariant to noise
- Minimal training data
- Minimal verification data
- Adapt to speaker styles
Goals
- Be familiar with state of the art
  - Build an up-to-date competitive system following the state of the art
  - Analyze and understand abilities and limitations
  - Contribute to research system
  - Incorporate ideas from research system
Goals
- Research novel ideas
  - Knowledge driven
  - Analyze and understand
  - Report results
  - Contribute to state of the art system
  - Incorporate knowledge from state of the art system
- Address robustness
  - Invariance vs modeling
  - Channels and noise
- Address data requirements
  - Training
  - Verification
Initial Robust System
[Block diagram. Blocks: Preprocessing, Representation (two similar representations), Speaker Specific Mapping, Distance, Frame Integration, Likelihood Estimator. Signal labels: L+E, L+S+E, Features. Information sources: L: Linguistic, S: Speaker, E: Environment.]
- Residue invariant to extraneous information and noise
- Preprocessing: segmentation, such as silence removal, voiced segments
- Representation: differing speaker information, such as low order PLP vs high order PLP
- Speaker Specific Mapping: such as Neural Net or Pseudo Inverse
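The "Pseudo Inverse" option for the speaker specific mapping can be read as a least-squares linear map between two frame-level representations. The sketch below assumes that reading; the function names, the absence of a bias term, and the toy vectors are illustrative and do not come from the slides. It solves the normal equations in plain Python, which matches the pseudo-inverse solution only when the input frames have full column rank.

```python
def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("normal equations are singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fit_linear_map(low_frames, high_frames):
    """Least-squares W such that low_frames @ W approximates high_frames."""
    lt = transpose(low_frames)
    return solve(matmul(lt, low_frames), matmul(lt, high_frames))


def apply_map(w, frame):
    """Map one low-order frame to the high-order space."""
    return matmul([frame], w)[0]


# Toy data (hypothetical): 3 frames, 2-dim input, 3-dim output.
low = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
high = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [5.0, 7.0, 9.0]]
W = fit_linear_map(low, high)
```

In a per-speaker setting, one such map would be fitted on that speaker's training frames; whether the original system used a bias term or regularization is not stated in the slides.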
Initial Robust System
- Speaker Specific Distance Measure: Euclidean, likelihood estimator, Bhattacharyya
- Frame Integrator: average, voting
- Likelihood Estimator
- Adding other information (pitch, formants)
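As a rough illustration of the distance measures and frame integrators named on this slide, the sketch below implements a Euclidean distance, the Bhattacharyya distance between two Gaussians with diagonal covariances, and average and voting integration. The diagonal-covariance restriction, the voting threshold, and all function names are assumptions made for the example; the slides do not say how these components were parameterized.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def bhattacharyya_diag(mean1, var1, mean2, var2):
    """Bhattacharyya distance between two diagonal-covariance Gaussians."""
    dist = 0.0
    for m1, v1, m2, v2 in zip(mean1, var1, mean2, var2):
        v = 0.5 * (v1 + v2)  # averaged variance for this dimension
        dist += 0.125 * (m1 - m2) ** 2 / v
        dist += 0.5 * math.log(v / math.sqrt(v1 * v2))
    return dist


def integrate_average(frame_distances):
    """Frame integration by averaging per-frame distances."""
    return sum(frame_distances) / len(frame_distances)


def integrate_voting(frame_distances, threshold):
    """Frame integration by voting: fraction of frames below a threshold.

    The threshold is a hypothetical tuning parameter.
    """
    votes = sum(1 for d in frame_distances if d < threshold)
    return votes / len(frame_distances)
```

Averaging keeps the magnitude of every frame's distance, while voting only counts how many frames pass, which makes it less sensitive to a few outlier frames.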
Initial Robust Implementation
[Block diagram. Blocks: Remove Silence, PLP Representation (PLP-7, PLP-4), Speaker Specific NN, difference (+/-), Euclidean, Frame Average.]
- Preprocessing: Silence deletion
- Representation: PLP-7 vs PLP-4
- Speaker Specific Mapping: Neural Net
- Distance Measure: Euclidean
- Frame Integrator: Average
- Likelihood Estimator: None
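The chain of components on this slide can be sketched end to end as follows. This is a minimal illustration under several assumptions: silence deletion is approximated by a per-frame energy threshold, the speaker specific neural net is replaced by an arbitrary callable `speaker_map`, the map runs from the lower-order to the higher-order representation, and a smaller average distance is taken to favor the claimed speaker. None of these details, nor the names used, are confirmed by the slides.

```python
import math


def score_utterance(frames, speaker_map, energy_threshold):
    """Average Euclidean residue over non-silent frames.

    frames: list of (energy, low_order_vector, high_order_vector) tuples.
    speaker_map: hypothetical callable standing in for the speaker
        specific neural net; maps a low-order vector to the
        high-order space.
    energy_threshold: hypothetical silence threshold.
    """
    # Preprocessing: drop frames treated as silence.
    voiced = [(low, high) for energy, low, high in frames
              if energy > energy_threshold]
    if not voiced:
        raise ValueError("no frames left after silence deletion")

    # Mapping and distance: residue between predicted and observed frame.
    distances = []
    for low, high in voiced:
        predicted = speaker_map(low)
        residue = math.sqrt(sum((p - h) ** 2
                                for p, h in zip(predicted, high)))
        distances.append(residue)

    # Frame integration: plain average.
    return sum(distances) / len(distances)


# Toy example with a made-up map that doubles each input and appends a zero.
toy_map = lambda low: [2.0 * x for x in low] + [0.0]
toy_frames = [
    (1.0, [1.0, 2.0], [2.0, 4.0, 0.0]),   # residue 0
    (0.01, [0.0, 0.0], [9.0, 9.0, 9.0]),  # dropped as silence
    (1.0, [1.0, 1.0], [2.0, 2.0, 3.0]),   # residue 3
]
score = score_utterance(toy_frames, toy_map, energy_threshold=0.1)
```

A verification decision would then compare `score` against a threshold chosen on held-out data.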
Preliminary Studies
- Map from speaker independent to speaker rich representation
- Evidence of discrimination
- Evidence of low data requirements for verification
- No handset robustness: mapping not invariant due to training methodology
Results: GMM baseline
- DET curve: handset training; 3 sec test; female; training handset
  - mdcf 0.047, hdcf 0.00, eer 9.80 %
  - mdcf (.8,9.4), hdcf (.7,4.)
  - mdcf 0.08, eer 73
- DET curve: handset training; 3 sec test; female; non training handset
  - mdcf 0.064, hdcf 0.068, eer 4.86 %
  - mdcf (.9,4.), hdcf (.7,4.)
  - mdcf 0.073, eer 0.394
Results: GMM baseline
- DET curve: handset training; 0 sec test; female; training handset
  - mdcf 0.09, hdcf 0.030, eer .04 %
  - mdcf (.3,.7), hdcf (.,7.4)
  - mdcf 0.009, eer 07
- DET curve: handset training; 0 sec test; female; non training handset
  - mdcf 0.048, hdcf 0.00, eer 9.60 %
  - mdcf (.,3.8), hdcf (.,37.8)
  - mdcf 0.030, eer 0.3
Results: GMM baseline
- DET curve: handset training; 30 sec test; female; training handset
  - mdcf 0.0, hdcf 0.06, eer .80 %
  - mdcf (0.6,8.7), hdcf (0.7,9.0)
  - mdcf 0.0, eer 7
- DET curve: handset training; 30 sec test; female; non training handset
  - mdcf 0.034, hdcf 0.037, eer 6.9 %
  - mdcf (.,9.), hdcf (0.7,30.0)
  - mdcf 0.09, eer 94
Results: PLP system
- DET curve: handset training; 3 sec test; female; training handset
  - mdcf 0.086, eer 9.0 %
  - mdcf (.9,0.0)
  - mdcf 0.84, eer 0.93
- DET curve: handset training; 3 sec test; female; non training handset
  - mdcf 0.098, eer 33.3 %
  - mdcf (0.8,0.0)
  - mdcf 0.830, eer 0.960
Results: PLP system
- DET curve: handset training; 0 sec test; female; training handset
  - mdcf 0.06, eer .69 %
  - mdcf (.4,.9)
  - mdcf 0.88, eer 0.933
- DET curve: handset training; 0 sec test; female; non training handset
  - mdcf 0.09, eer 30.03 %
  - mdcf (0.9,0.0)
  - mdcf 0.87, eer 0.99
Results: PLP system
- DET curve: handset training; 30 sec test; female; training handset
  - mdcf 0.069, eer 4.8 %
  - mdcf (.6,4.6)
  - mdcf 0.88, eer 0.93
- DET curve: handset training; 30 sec test; female; non training handset
  - mdcf 0.093, eer 9.4 %
  - mdcf (0.8,0.0)
  - mdcf 0.83, eer 0.99
Results: Subspace system
- DET curve: handset training; 3 sec test; female; training handset
  - mdcf 0.07, eer .4 %
  - mdcf (.,0.0)
  - mdcf 0.796, eer 0.890
- DET curve: handset training; 3 sec test; female; non training handset
  - eer 3.8 %
Results: Subspace system
- DET curve: handset training; 0 sec test; female; training handset
  - mdcf 0.060, eer .76 %
  - mdcf (.,39.)
  - mdcf 0.809, eer 0.88
- DET curve: handset training; 0 sec test; female; non training handset
  - eer 30.96 %
Results: Subspace system
- DET curve: handset training; 30 sec test; female; training handset
  - mdcf 0.0, eer 9.4 %
  - mdcf (.6,38.7)
  - mdcf 0.809, eer 0.877
- DET curve: handset training; 30 sec test; female; non training handset
  - eer 9.4 %
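For readers unfamiliar with the "eer" figures quoted on the results slides, the sketch below shows one common way to trace DET points (false-alarm rate against miss rate) and estimate an equal error rate from trial scores. It assumes that higher scores favor acceptance and uses made-up scores; it is not the scoring tool used for these results, and the function names are illustrative.

```python
def det_points(target_scores, impostor_scores):
    """Return (threshold, false_alarm_rate, miss_rate) for each candidate threshold.

    Assumes a trial is accepted when its score is >= threshold.
    """
    points = []
    for t in sorted(set(target_scores) | set(impostor_scores)):
        miss = sum(1 for s in target_scores if s < t) / len(target_scores)
        fa = sum(1 for s in impostor_scores if s >= t) / len(impostor_scores)
        points.append((t, fa, miss))
    return points


def equal_error_rate(target_scores, impostor_scores):
    """Estimate the EER as the point where miss and false-alarm rates are closest."""
    best_gap, best_eer = None, None
    for _, fa, miss in det_points(target_scores, impostor_scores):
        gap = abs(miss - fa)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (miss + fa) / 2.0
    return best_eer


# Made-up scores for illustration only.
targets = [0.9, 0.8, 0.7, 0.3]
impostors = [0.6, 0.4, 0.2, 0.1]
eer = equal_error_rate(targets, impostors)  # 0.25 for these toy scores
```

With few trials the miss and false-alarm rates may never cross exactly, so the estimate takes the midpoint at the closest threshold.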
Future Work: Speaker Verification
- Understand each component
  - Preprocessing
  - Representation
  - Environment Invariant Mapping
  - Distance Measure