Voice Detection using Speech Energy Maximization and Silence Feature Normalization

Similar documents
Noise-Robust Speech Recognition in a Car Environment Based on the Acoustic Features of Car Interior Noise

CHAPTER 1 INTRODUCTION

HCS 7367 Speech Perception

Near-End Perception Enhancement using Dominant Frequency Extraction

Adaptation of Classification Model for Improving Speech Intelligibility in Noise

Acoustic Signal Processing Based on Deep Neural Networks

FREQUENCY COMPRESSION AND FREQUENCY SHIFTING FOR THE HEARING IMPAIRED

Noise-Robust Speech Recognition Technologies in Mobile Environments

Robust Speech Detection for Noisy Environments

A Study on the Degree of Pronunciation Improvement by a Denture Attachment Using an Weighted-α Formant

Effects of Cochlear Hearing Loss on the Benefits of Ideal Binary Masking

LATERAL INHIBITION MECHANISM IN COMPUTATIONAL AUDITORY MODEL AND IT'S APPLICATION IN ROBUST SPEECH RECOGNITION

Speech Enhancement Based on Deep Neural Networks

CURRENTLY, the most accurate method for evaluating

Psychoacoustical Models WS 2016/17

Speech Enhancement Based on Spectral Subtraction Involving Magnitude and Phase Components

Combination of Bone-Conducted Speech with Air-Conducted Speech Changing Cut-Off Frequency

CONSTRUCTING TELEPHONE ACOUSTIC MODELS FROM A HIGH-QUALITY SPEECH CORPUS

Speech Enhancement, Human Auditory system, Digital hearing aid, Noise maskers, Tone maskers, Signal to Noise ratio, Mean Square Error

9/29/14. Amanda M. Lauer, Dept. of Otolaryngology- HNS. From Signal Detection Theory and Psychophysics, Green & Swets (1966)

Sound Texture Classification Using Statistics from an Auditory Model

AUDL GS08/GAV1 Signals, systems, acoustics and the ear. Pitch & Binaural listening

HCS 7367 Speech Perception

SPEECH TO TEXT CONVERTER USING GAUSSIAN MIXTURE MODEL(GMM)

A Determination of Drinking According to Types of Sentence using Speech Signals

UvA-DARE (Digital Academic Repository) Perceptual evaluation of noise reduction in hearing aids Brons, I. Link to publication

! Can hear whistle? ! Where are we on course map? ! What we did in lab last week. ! Psychoacoustics

A Judgment of Intoxication using Hybrid Analysis with Pitch Contour Compare (HAPCC) in Speech Signal Processing

Speech Enhancement Using Deep Neural Network

An active unpleasantness control system for indoor noise based on auditory masking

Computational Perception /785. Auditory Scene Analysis

Real-Time Implementation of Voice Activity Detector on ARM Embedded Processor of Smartphones

Advanced Audio Interface for Phonetic Speech. Recognition in a High Noise Environment

TESTS OF ROBUSTNESS OF GMM SPEAKER VERIFICATION IN VoIP TELEPHONY

Issues faced by people with a Sensorineural Hearing Loss

Single-Channel Sound Source Localization Based on Discrimination of Acoustic Transfer Functions

Voice Pitch Control Using a Two-Dimensional Tactile Display

BINAURAL DICHOTIC PRESENTATION FOR MODERATE BILATERAL SENSORINEURAL HEARING-IMPAIRED

Comparison of Lip Image Feature Extraction Methods for Improvement of Isolated Word Recognition Rate

ACOUSTIC AND PERCEPTUAL PROPERTIES OF ENGLISH FRICATIVES

Frequency Tracking: LMS and RLS Applied to Speech Formant Estimation

Heart Murmur Recognition Based on Hidden Markov Model

INTEGRATING THE COCHLEA S COMPRESSIVE NONLINEARITY IN THE BAYESIAN APPROACH FOR SPEECH ENHANCEMENT

EEL 6586, Project - Hearing Aids algorithms

A. SEK, E. SKRODZKA, E. OZIMEK and A. WICHER

Speech recognition in noisy environments: A survey

Role of F0 differences in source segregation

Lecture 9: Speech Recognition: Front Ends

Frequency refers to how often something happens. Period refers to the time it takes something to happen.

Best Practice Protocols

Shaheen N. Awan 1, Nancy Pearl Solomon 2, Leah B. Helou 3, & Alexander Stojadinovic 2

Performance Study of Objective Speech Quality Measurement for Modern Wireless-VoIP Communications

Individualizing Noise Reduction in Hearing Aids

Acoustics, signals & systems for audiology. Psychoacoustics of hearing impairment

Juan Carlos Tejero-Calado 1, Janet C. Rutledge 2, and Peggy B. Nelson 3

International Journal of Scientific & Engineering Research, Volume 5, Issue 3, March ISSN

Providing Effective Communication Access

Performance of Gaussian Mixture Models as a Classifier for Pathological Voice

Telephone Based Automatic Voice Pathology Assessment.

A Study on Recovery in Voice Analysis through Vocal Changes before and After Speech Using Speech Signal Processing

Chapter 40 Effects of Peripheral Tuning on the Auditory Nerve s Representation of Speech Envelope and Temporal Fine Structure Cues

Jitter, Shimmer, and Noise in Pathological Voice Quality Perception

A Sleeping Monitor for Snoring Detection

SoundRecover2 the first adaptive frequency compression algorithm More audibility of high frequency sounds

Speech Enhancement Using Temporal Masking in the FFT Domain

Masking release and the contribution of obstruent consonants on speech recognition in noise by cochlear implant users

ADVANCES in NATURAL and APPLIED SCIENCES

LIST OF FIGURES. Figure No. Title Page No. Fig. l. l Fig. l.2

A case study of construction noise exposure for preserving worker s hearing in Egypt

Voice activity detection algorithm based on long-term pitch information

19 th INTERNATIONAL CONGRESS ON ACOUSTICS MADRID, 2-7 SEPTEMBER 2007 THE DUPLEX-THEORY OF LOCALIZATION INVESTIGATED UNDER NATURAL CONDITIONS

An Interesting Property of LPCs for Sonorant Vs Fricative Discrimination

Single channel noise reduction in hearing aids

Simulations of high-frequency vocoder on Mandarin speech recognition for acoustic hearing preserved cochlear implant

Perception Optimized Deep Denoising AutoEncoders for Speech Enhancement

Analysis of Emotion Recognition using Facial Expressions, Speech and Multimodal Information

Using Source Models in Speech Separation

Hearing. Juan P Bello

A Study on CCTV-Based Dangerous Behavior Monitoring System

Speech Intelligibility Measurements in Auditorium

Chapter 4. MATLAB Implementation and Performance Evaluation of Transform Domain Methods

Quarterly Progress and Status Report. Masking effects of one s own voice

Implementation of Spectral Maxima Sound processing for cochlear. implants by using Bark scale Frequency band partition

Auditory scene analysis in humans: Implications for computational implementations.

Hearing Lectures. Acoustics of Speech and Hearing. Auditory Lighthouse. Facts about Timbre. Analysis of Complex Sounds

Fig. 1 High level block diagram of the binary mask algorithm.[1]

Efficient voice activity detection algorithms using long-term speech information

EEG Signal Description with Spectral-Envelope- Based Speech Recognition Features for Detection of Neonatal Seizures

Sound localization psychophysics

A Single-Channel Noise Reduction Algorithm for Cochlear-Implant Users

Speech quality evaluation of a sparse coding shrinkage noise reduction algorithm with normal hearing and hearing impaired listeners

RESPIRATORY AIRFLOW ESTIMATION FROM LUNG SOUNDS BASED ON REGRESSION. Graz University of Technology, Austria. Medical University of Graz, Austria

The role of periodicity in the perception of masked speech with simulated and real cochlear implants

Improving the Noise-Robustness of Mel-Frequency Cepstral Coefficients for Speech Processing

The effect of wearing conventional and level-dependent hearing protectors on speech production in noise and quiet

Automatic Live Monitoring of Communication Quality for Normal-Hearing and Hearing-Impaired Listeners

Research Article Automatic Speaker Recognition for Mobile Forensic Applications

What Is the Difference between db HL and db SPL?

The Institution of Engineers Australia Communications Conference Sydney, October

Verification of soft speech amplification in hearing aid fitting: A comparison of methods

Transcription:

, pp.25-29 http://dx.doi.org/10.14257/astl.2014.49.06 Voice Detection using Speech Energy Maximization and Silence Feature Normalization In-Sung Han 1 and Chan-Shik Ahn 2 1 Dept. of The 2nd R&D Institute, Agency for Defense Development 460,Songpa-gu, Seoul, 138-600, South Korea ishan78@add.re.kr 2 Dept. of Computer Engineering, Kwangwoon University 20, Gwangun-ro, Nowon-gu, Seoul, South Korea coolsahn@gmail.com Abstract. The estimation of noises is very important to remove noises from voices with background noises. This paper proposed a voice detection method using a silence feature normalization and voice energy maximize. In the high signal-to-noise ratio for the proposed method was used to maximize the characteristics receive less characterized the effects of noise by the voice energy. Cepstral feature distribution of voice/non-voice characteristics in the low signal-to-noise ratio and improves the recognition performance. Result of the recognition experiment, recognition performance improved compared to the conventional method. Keywords: Speech Recognition, Voice Detection, Noise Reduction, Speech Energy Maximization, Silence Feature Normalization. 1 Introduction In case of feature parameters used for voice detecting, in an environment that SNR is high, as the characteristics of feature parameter on speech and non-speech are relatively distinctive, their recognition function doesn t remarkably go down. However, in an actual noise environment that there are various kinds of noise or in one that SNR is low, as feature parameter is sensitive against noise signal, its voice detecting function decline may arise as a problem. Therefore, this paper suggests voice detecting method that has advantages in noise environment by using speech maximization and silence feature normalization. Suggestion made use of characteristic of the silence feature s being less affected by maximizing speech energy in high SNR environment. On the other hand, in low SNR, recognition function got improved using the characteristics of cepstrum feature distribution of speech and non-speech. Function improvement compared to existing methods was verified through recognition experiments. Corresponding author ISSN: 2287-1233 ASTL Copyright 2014 SERSC

This paper consists of 5 parts in total. Related studies will be mentioned in the second chapter and voice detecting method that has advantages against noise environment using speech energy maximization and silence feature normalization will be explained in detail in the third chapter. Systematic assessment is conducted in chapter 4 and this paper is concluded with chapter 5. 2 Related Work 2.1 Spectral Energy In low SNR energy spectrum, speech band shows relatively high energy spectrum compared to non-speech band. Therefore, it can be assumed that speech energy spectrum has comparatively higher energy spectrum than non-speech energy spectrum. Entropy can be explained with below formula in spectral energy band [1]. Spectral energy probability can be expressed in below formula (1) k shows Frequency bin Index, l shows frame index. Spectral energy probability against Frequency bin can be deduced from Frame l. Deduced each frequency bin probability is calculated to entropy by formula 2.2 Critical Band Masking effect is the phenomenon of weaker sound s being blocked by stronger sound and stronger sound s called masker, and the weaker sound is called maskee(the weaker sound). Maskee can be hear when it s over masking threshold calculated by the masker(the stronger sound) [2]. Frequency masking is that the maskee is being masked when the masker and the maskee simultaneously occur and the masking degree can be calculated through frequency analysis. The threshold of the sound being masked differ from frequency bandwidth, and frequency bandwidth with equivalent masking threshold is called critical band. The width of critical band is constant as 100Hz when the frequency is below 1kHz, and when the frequency is over 1kHz, it proportionally increases according to the frequency[3]. 26 Copyright 2014 SERSC

3 Voice Detection That Excel in Noise Environment 3.1 Voice energy maximization Voice has pitch frequency for vowels, and the pitch frequency is called Basic frequency. Basic frequency has the characteristics of showing the biggest energy throughout the full-band of voice band, so it s expressed as the maximum energy and minimum energy is shown as noise signal unrelated to voice signal. Voice energy has relatively bigger energy compared to noise and noise has smaller amount of energy, so calculation of the ratio of voice energy against noise energy and energy ratio on the output are shown in below formula. (2) 3.2 Silence feature normalization Silence feature normalization is the method of finding the silence interval with weak log energy and then cutting the value down under a certain value. Voice interval contaminated by noise has broader frequency bandwidth compared to the band only with voice and the interval with big log energy is less affected by noise compared to the interval with small log energy, so used weighting function. The weighting function ω(n) can be shown like below formula. (3) 4 Experiment Result This paper conducted the experiment on voice detection method that excel in noise environment by using voice energy maximization and silence feature normalization that this paper suggested For assessment, Aurora 2.0 database was used. Aurora 2.0 consists of noise environment and each-noise level(including white gaussian noise, Copyright 2014 SERSC 27

babble noise) and sorts each noise environment (street, airport, car noise etc.) so it s used to verify voice improvement algorism[4]. The experiment was conducted in each noise environment with 15dB, 10dB, 5dB and 0dB separately to verify its performance on SNR change, also to evaluate endpoint detection performance PHR(Pause Hit Ratio) and FAR(False Alarm Ratio) in sound area were used. As sound source, 8kHz sampling rate, 16bits were used and FFT size is 256 samples and 1/2 overlapping interval was used. Hamming Window was also used[5]. PHR used as performance scale shows 98% accuracy in low SNR of noise environment and showed 100% accuracy in high SNR. FAR(False Alarm Ratio) in sound area was 0% by showing outstanding performance in SNR of 15dB and 10dB and 2% performance were marked for each in low SNR interval of 5dB and 0dB Fig. 1. Result of (a) Input signal (b) MMSE-STSA (c) Proposed method at SNR 10dB Airport Noise 5 Conclusion This paper evaluated voice detection and recognition performance that excel in noise environment by using speech energy maximization and silence feature normalization. As feature parameter is sensitive against noise signal in actual noise environment with various kinds of environmental noise or for low SNR voice, so voice detection performance deteriorates. Therefore, voice detection method that excel in noise environment using voice energy maximization and silence feature normalization was suggested. Suggested method made use of characteristics of silence feature s being less affected by noise by maximizing voice energy in high SNR and voice and nonvoice cepstrum feature dispersion in low SNR. In result, recognition performance has accuracy 99% in char noise environment. FAR performance is good (0%) in SNR 15dB and 10dB. However, in low level 5dB 28 Copyright 2014 SERSC

and 0dB it just improved 2% of it's performance. Improved recognition performance compared to existing method was verified through experiment result. References 1. Shen, G., Chung, H. Y.: Cepstral Distance and Log-Energy Based Silence Feature Normalization for Robust Speech Recognition. The Journal of the Acoustical Society of Korea. Vol 29, No. 4, pp. 278-285 (2010). 2. Sohn, J., Kim, N. S., Sung, W.: A statistical model-based voice activity detection, IEEE Signal Processing Letters. Vol. 6, No. 1, pp. 1-3 (1999) 3. Wang, K. C., Tsai, Y. H.: Voice activity detection algorithm with low signal-to-noise ratios based on spectrum entropy, Second International Symposium on Universal Communication, pp. 423-428(2008) 4. Rix, A. W, Beerends, J. G, Hollier, M. P, Hekstra, A. P.: Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs, in Proc. IEEE Int. Conf. Acoust. Speech Signal Process. No. 2, pp. 749-752(2001) Copyright 2014 SERSC 29