Adaptation of Classification Model for Improving Speech Intelligibility in Noise

1: (Junyoung Jung et al.: Adaptation of Classification Model for Improving Speech Intelligibility in Noise) (Regular Paper) 23 4, 2018 7 (JBE Vol. 23, No. 4, July 2018) https://doi.org/10.5909/jbe.2018.23.4.511 ISSN 2287-9137 (Online) ISSN 1226-7953 (Print) a), a) Adaptation of Classification Model for Improving Speech Intelligibility in Noise Junyoung Jung a) and Gibak Kim a) - 0 1. - 0, 1..... Abstract This paper deals with improving speech intelligibility by applying binary mask to time-frequency units of speech in noise. The binary mask is set to 0 or 1 according to whether speech is dominant or noise is dominant by comparing signal-to-noise ratio with pre-defined threshold. Bayesian classifier trained with Gaussian mixture model is used to estimate the binary mask of each time-frequency signal. The binary mask based noise suppressor improves speech intelligibility only in noise condition which is included in the training data. In this paper, speaker adaptation techniques for speech recognition are applied to adapt the Gaussian mixture model to a new noise environment. Experiments with noise-corrupted speech are conducted to demonstrate the improvement of speech intelligibility by employing adaption techniques in a new noise environment. Keyword : speech intelligibility, noise suppression, binary mask, Gaussian mixture model, adaptation a) (School of Electrical Engineering, Soongsil University) Corresponding Author : (Gibak Kim) E-mail: imkgb27@ssu.ac.kr Tel: +82-2-828-7266 ORCID:https://orcid.org/0000-0001-5114-4117 No.10048474,. This material is based upon work supported by the Ministry of Trade, Industry & Energy(MOTIE, Korea) under Industrial Technology Innovation Program. No.10048474, 'Development of a Service Robot based on Big Data for Providing Aging Generation with Personalized Welfare Services' Manuscript received March 16, 2018; Revised May 8, 2018; Accepted May 8, 2018. Copyright 2016 Korean Institute of Broadcast and Media Engineers. All rights reserved. This is an Open-Access article distributed under the terms of the Creative Commons BY-NC-ND (http://creativecommons.org/licenses/by-nc-nd/3.0) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited and not altered.

(JBE Vol. 23, No. 4, July 2018)......... 5 db SNR (Signal-to-Noise Ratio) 100%. 5~15 db SNR (5~0 db SNR). [1-3]. 2000, CASA (Computational Auditory Scene Analysis) [4-6]. - - 0, 1. 0 1 0dB. -., -, [7-11]... (incremental training) [12]. (Gaussian Mixutre Model: GMM) (quasi-bayes),,,.... eigenvoice [13]. Eigenvoice

1: (Junyoung Jung et al.: Adaptation of Classification Model for Improving Speech Intelligibility in Noise). PCA (Principal Component Analysis) eigenvoice. eigenvoice. eigenvoice [14]. Eigenvoice.,.. eigenvoice MAP (Maximum a posterior) MLLR (Maximum Likelihood Linear Regression) [15,16]. (full covariance matrix) MLLR MAP, MLLR MAP.. 2, 3 MAP, MLLR. 4... - - 1. Fig. 1. Extraction of amplitude modulation spectrogram. Tchorz Kollmeier (AMS: Amplitude Modulation Spectrogram) [17,18] 1., 4ms Hann ( Hamming). 4ms 48 (12kHz ), 80 128, 128 FFT (Fast Fourier Transform). FFT (power spectrum). 0.25ms Hann FFT, 4kHz (1/0.25ms)., mel 25

(JBE Vol. 23, No. 4, July 2018). 25 mel 4kHz. 32ms (128) Hann, 16ms. Hann 128 128 256, 256 FFT.,. 15 15. 15 25 45. 0 1 (Bayesian). 0 1 ( ), - (AMS) 0 1 (posterior probability),.,. if otherwise o,. GMM... 1. MAP (Maximum a posteriori) MAP. ML (Maximum Likelihood). ML likelihood. MAP, (posterior distribution)., ML.,,.. [16].,,. ML. ML. ().

1: (Junyoung Jung et al.: Adaptation of Classification Model for Improving Speech Intelligibility in Noise). w i n w i n G i l i w i n W, l i L. 1/2 G i. 2. MLLR (Maximum Likelihood Linear Regression) M G i m c m m m T mi MLLR MAP (linear regression) [15]. m. W W EM (Expectation Maximization) ML W. [7]. MLLR Povey Saon (iterative) W [15]., (6). m m m (covariance matrix). W (gradient) W m W. 1.. IEEE [19], 25kHz, 12kHz. 16kHz 12kHz. 64. 34, babble, factory, speech-shaped. Babble 10, factory NOISEX [20]. NOISEX 20kHz 12kHz. Speech-shaped IEEE stationary. IEEE W m W

(JBE Vol. 23, No. 4, July 2018) 10 2~3. 200, 5, 0, 5 db.. 5dB 10 10 50. 2. 1 MAP, MLLR. III MAP MLLR. MAP MLLR MLLR MAP.,... (Hit) (False Alarm:FA) (Hit-FA). Hit 1 1, FA 0 1. Hit-FA Hit-FA [7]. 2. MAP, MLLR MAP+MLLR, MLLR+MAP. eigenvoice(k=5). 2. ( - : Hit - FA, NA: No Adaptation) Fig. 2. Performance of model adaptation in new noises 20% Hit-FA. [7], 40%., MLLR MAP, MLLR MAP.

1: (Junyoung Jung et al.: Adaptation of Classification Model for Improving Speech Intelligibility in Noise).,., eigenvoice MAP, MLLR.. (References) [1] Y. Hu and P. Loizou, Subjective comparison and evaluation of speech enhancement algorithms, Speech communication, vol. 49, no. 7, pp. 588601, Jul. 2007. https://doi.org/10.1016/j.specom.2006.12.006 [2] Y. Hu and P. Loizou, Evaluation of objective quality measures for speech enhancement, IEEE Transactions on Speech and Audio Processing, vol. 16, no. 1, pp. 229238, 2008. https://doi.org/10.1109/tasl.2007.911054 [3] Y. Hu and P. Loizou, A comparative intelligibility study of single-microphone noise reduction algorithms. The Journal of the Acoustical Society of America, vol. 122, no. 3, p. 1777, Sep. 2007. https://doi.org/10.1121/1.2766778 [4] G. Brown and M. Cooke, Computational auditory scene analysis, Computer speech and language, vol. 8, pp. 297336, 1994. https://doi.org/10.1006/csla.1994.1016 [5] D. Wang and G. Brown, Computational Auditory Scene Analysis : Principles, Algorithms, and Applications, Wiley, Hoboken, NJ, 2006. https://doi.org/10.1109/tnn.2007.913988 [6] D. Wang, On ideal binary mask as the computational goal of auditory scene analysis, In Divenyi P. (ed.), Speech Separation by Humans and Machines, pp. 181-197, Kluwer Academic, Norwell MA, 2005. https://doi.org/10.1007/0-387-22794-6_12 [7] G. Kim. Y. Lu, Y. Hu and P. Loizou, An algorithm that improves speech intelligibility in noise for normal-hearing listeners, The Journal of the Acoustical Society of America, vol. 126, no. 3, pp 1486-1494, 2009. https://doi.org/10.1121/1.3184603 [8] Y. Hu, P. Loizou, Environment-specific noise suppression for improved speech intelligibility by cochlear implant users, The Journal of the Acoustical Society of America, vol. 127, no. 6, pp 3689-3695, 2010. https://doi.org/10.1121/1.3365256 [9] K. Han, D. Wang, A classification based approach to speech segregation, The Journal of the Acoustical Society of America, vol. 132, no. 5, pp 3475-3483, 2012. https://doi.org/10.1121/1.4754541 [10] Y. Wang, K. Han, D. Wang, Exploring monaural features for classification-based speech segregation, IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 2, pp 270-279, 2013. https://doi.org/10.1109/tasl.2012.2221459 [11] G. Kim, A Post-processing for Binary Mask Estimation Toward Improving Speech Intelligibility in Noise, Journal of Broadcast Engineering, Vol. 18, No.2, pp.311-318, March, 2013. https://doi.org/10.5909/jbe.2013.18.2.311 [12] G. Kim, P. Loizou, Improving speech intelligibility in noise using environment-optimized algorithms, IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 8, pp 2080-2090, 2010. https://doi.org/10.1109/tasl.2010.2041116 [13] G. Kim, Eigenvoice Adaptation of Classification Model for Binary Mask Estimation, Journal of Broadcast Engineering, Vol.20, No.1, pp.164-170, 2015. https://doi.org/10.5909/jbe.2015.20.1.164 [14] R. Kuhn, J. Junqua, P. Nguyen, N. Niedzielski, "Rapid Speaker Adaptation in Eigenvoice Space," IEEE Transactions on Speech and Audio Proceeding, vol. 8, no. 6, pp. 695-707, November 2000. https://doi.org/10.1109/89.876308 [15] D. Povey and G. Saon, Feature and model space speaker adaptation with full covariance Gaussians, Proceedings of Interspeech, USA, september 2006. [16] K. Shinoda, Speaker Adaptation Techniques for Speech Recognition Using Probabilistic Models, Electronics and Communications in Japan, Part 3, Vol. 88, No. 12, pp.25-42, 2005. https://doi.org/10.1002/ecjc.20207 [17] J. Tchorz and B. Kollmeier, Estimation of the signal-to-noise ratio with amplitude modulation spectrograms, Speech Communication, vol. 38, no. 12, pp. 117, Sep. 2002. https://doi.org/10.1016/s0167-6393(01)00040-1 [18] J. Tchorz and B. Kollmeier, SNR estimation based on amplitude modulation analysis with applications to noise suppression, IEEE Transactions on Speech and Audio Processing, vol. 11, no. 3, pp. 184 192, May 2003. https://doi.org/10.1109/tsa.2003.811542 [19] IEEE, IEEE recommended practice for speech quality measurements", IEEE Trans. Audio Electroacoust., vol. 17, pp. 225-246, 1969. https://doi.org/10.1109/tau.1969.1162058 [20] A. Varga and H. J. M. Steeneken, "Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems," Speech Communication, vol. 12, pp. 247-251, 1993. https://doi.org/10.1016/0167-6393(93)90095-3

(JBE Vol. 23, No. 4, July 2018) - 2017 : - 2017 ~ : - ORCID : https://orcid.org/0000-0002-8327-7181 - :,, - 1994 : - 1996 : - 2007 : - 1996 ~ 2000 : LG - 2000 ~ 2003 : () - 2008 ~ 2010 : Univ. of Texas at Dallas, Research Associate - 2011 ~ : - ORCID : https://orcid.org/0000-0001-5114-4117 - :,,