CHAPTER 1 INTRODUCTION
1.1 BACKGROUND

Speech is the most natural form of human communication, and it has also become an important means of human-machine interaction as advances in technology have modernized the ways of communication. Speech processing is the study of speech signals and of the methods used to process them. Since these signals are usually handled in digital form, speech processing is regarded as a special case of digital signal processing applied to speech. It covers the enhancement, compression, synthesis and recognition of speech signals.

In recent times, digital signal processing has gained importance and wide application, since its techniques are more sophisticated and advanced than their analog equivalents. The ease and speed of representing, storing, retrieving and processing speech data have contributed to the development of efficient and effective speech processing techniques.

1.2 OBJECTIVE OF THE PRESENT WORK

The speech processing systems used to communicate or store speech signals are usually designed for noise-free environments, but in the real world the presence of background interference, in the form of additive background noise and channel noise, drastically degrades the performance of these systems, causing inaccurate information exchange and listener
fatigue. Speech enhancement is a field of digital speech processing that aims to improve the intelligibility and/or perceptual quality of the speech signal, much as audio noise reduction does for general audio signals. Restoring the desired speech signal from a mixture of speech and background noise is among the oldest, and still elusive, goals of speech processing and communication system research. Because the presence of noise degrades the performance of various speech processing systems, speech enhancement is commonly incorporated as a preprocessing step in systems operating in noisy environments.

Removing the various types of noise is difficult because of their random nature and the inherent complexity of speech. Noise reduction techniques usually involve a trade-off between the amount of noise removed and the speech distortion introduced by processing the signal. Computational complexity and ease of implementation of the noise reduction algorithm are also of concern, especially in applications related to portable devices such as mobile phones and digital hearing aids.

The main objective of a speech enhancement or noise reduction technique is to improve one or more perceptual aspects of speech, such as its quality or intelligibility. This matters in a variety of contexts: environments with interfering background noise (e.g. offices, streets, automobiles and factories), speech recognition systems, hands-free car kits, hearing aid devices, etc. The ultimate goal of this work is to eliminate the additive noise present in the speech signal and restore the speech signal to its original form.
1.3 SPEECH ENHANCEMENT TECHNIQUES

Most methods have been developed with one or another auditory, perceptual or statistical constraint placed on the speech and noise signals. In real-world situations, however, it is very difficult to reliably predict the characteristics of the interfering noise or the exact characteristics of the speech waveform. Hence speech enhancement methods are suboptimal and reduce the noise in the signal only to some extent. As a consequence, some portions of the speech signal are distorted during the noise reduction process, and there is a trade-off between distortion in the processed speech and the amount of noise suppressed. The effectiveness of a speech enhancement system is therefore measured by how well it performs in light of this trade-off.

Speech enhancement systems are classified in a number of ways, based on the criteria used or the application of the system. In general, they are classified by i) the number of input channels (one, two or multiple), ii) the domain of processing (time or frequency), iii) the type of algorithm (adaptive or non-adaptive) and iv) additional constraints (speech production or perception). The speech signal is acquired from single- or multiple-channel sensors, and non-stationarity of the noise process further complicates the enhancement effort. The approach commonly referred to as a multichannel system consists of more than one channel: one channel carries the noisy signal to be processed, while another carries a reference signal. Adaptive noise cancellation is one such powerful speech enhancement technique, based on the availability of an auxiliary channel, known as the reference path, in which a correlated sample or reference of the contaminating noise is present. Such a system is relatively easy to build compared with a single-channel system.
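The adaptive noise cancellation idea above can be sketched with a normalized LMS filter. This is a minimal illustration, not the algorithm proposed in this thesis; the filter order and step size below are illustrative assumptions.

```python
import numpy as np

def nlms_noise_canceller(primary, reference, order=8, mu=0.2):
    """Dual-channel adaptive noise canceller (normalized LMS sketch).

    primary   : noisy speech, i.e. speech plus filtered noise
    reference : noise-only channel correlated with the noise in `primary`

    The filter learns to predict the noise component of the primary
    channel from the reference channel; the prediction error is the
    enhanced speech.
    """
    w = np.zeros(order)
    out = np.zeros(len(primary))
    for n in range(order - 1, len(primary)):
        x = reference[n - order + 1:n + 1][::-1]  # most recent sample first
        y = w @ x                                 # noise estimate
        e = primary[n] - y                        # error = enhanced sample
        w += mu * e * x / (x @ x + 1e-8)          # NLMS weight update
        out[n] = e
    return out
```

Normalizing the step size by the reference power keeps the adaptation stable regardless of the noise level, which is why NLMS is a common baseline for such dual-channel systems.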
In the most common real-time scenario, a second channel is not available. A single microphone input (single channel) makes speech enhancement difficult, since speech and noise are present in the same channel, and separating the two requires relatively good models of both. Single-channel systems are easy to build and comparatively less expensive than multiple-input systems, yet they constitute one of the most difficult speech enhancement situations: no reference signal for the noise is available, and the clean speech cannot be preprocessed before it is affected by the noise. Single-channel systems instead exploit the different statistics of speech and the unwanted noise. The performance of these methods is limited in the presence of non-stationary noise, as most of them assume that the noise is stationary during speech intervals, and it also degrades drastically at low signal-to-noise ratios.

The best trade-off between speech distortion and noise reduction in a perceptual sense is achieved using properties closely related to human perception and masking effects. Masking is a fundamental aspect of the human auditory system and a basic element of perceptual coding systems, and it is also exploited in enhancement systems. Masking is defined either as the process by which the threshold of audibility of one sound is raised by the presence of another sound, or as the amount by which that threshold is raised. The auditory threshold is the minimum Sound Pressure Level (SPL) necessary to detect a pure tone in a quiet environment, as a function of the frequency of the tone. The hearing threshold level in such conditions is representative of the average of values obtained from different people; below this level the human ear cannot perceive sound. Masking is typically expressed in decibels (dB), and its effect is known as the masked or masking threshold. Masking comes into play when
the activity caused by one signal is not detected because of the activity caused by the presence of another, neighboring signal. Masking occurs in both the frequency and time domains. The masking effect of a signal is greatest within a critical bandwidth of the signal, although there are some effects in other critical bands too. Masking in the frequency domain is typically determined for a noise masking a tone or a tone masking a noise. Temporal masking includes simultaneous masking, premasking (also called backward masking) and postmasking (also called forward masking). Simultaneous masking occurs when a masker is present throughout the duration of a relatively long signal. Postmasking occurs when a masking signal masks a signal occurring after it, and is caused by a reduction in the sensitivity of recently stimulated cells. A simplified explanation of the mechanism underlying the simultaneous masking phenomenon is as follows: the presence of a strong noise or tone masker creates an excitation of sufficient strength on the basilar membrane at the critical band location to effectively block the transmission of a weaker signal. Hence a speech enhancement algorithm needs to be developed that takes into account both the non-stationarity of the noise and the perceptual properties of the human ear.

1.4 SUBJECTIVE AND OBJECTIVE QUALITY MEASURES FOR PERFORMANCE ANALYSIS

In this research, dual-channel and single-channel speech enhancement algorithms are proposed to enhance speech degraded by additive background noise. In general, the environment is categorized as a stationary or non-stationary noisy environment. Most real-world noises are non-stationary in nature; speech degradation by such noise often arises from sources like air conditioning units, fans,
cars, city streets, factory environments, helicopters, computer systems, restaurants, etc. The core objective of this research is to develop a speech enhancement algorithm that works well in any real-world environment. The performance of speech enhancement techniques is analyzed using the subjective and objective quality measures explained in this section.

The two main aspects of speech enhancement algorithms are speech intelligibility and quality (Yi Hu and Philipos C. Loizou 2006a, 2006b, 2007, 2008); these are quantified using objective and subjective measures. Objective quality measures predict the perceived speech quality by computing a numerical distance, or distortion, between the original and the processed speech, while speech intelligibility measures the effectiveness of speech. Objective quality measures are evaluated automatically from the speech signal, its spectrum, or parameters derived from them. Since they do not require listening tests, these measures give an immediate estimate of the perceptual quality of a speech enhancement algorithm.

Subjective quality measures are obtained from listening tests in which human participants rate the quality of the speech on a predetermined opinion scale, expressed in terms of how pleasant the signal sounds or how much effort is required of the listeners to understand the message. Subjective distortion measures provide the most accurate assessment of performance, since the degree of perceptual quality and intelligibility is ultimately determined by the human auditory system. In general, the performance of a speech enhancement algorithm with reference to speech
intelligibility and quality is evaluated using these subjective and objective distortion measures.

Objective Quality Measures

The objective quality measures (Bayya and Vis 1996), (Philipos C. Loizou 2007) are primarily based on the idea that speech quality can be modeled in terms of differences in loudness, or simply differences in the spectral envelopes, between the original and processed signals. Objective speech quality measures are categorized into:

A. Signal to Noise Ratio (SNR)
B. Itakura-Saito (IS) distance measure
C. Perceptual Evaluation of Speech Quality (PESQ)
D. Mean Square Error (MSE)

Signal to Noise Ratio (SNR)

The SNR is the ratio of signal power to noise power, expressed in decibels (dB), and is given in Equation (1.1) as

SNR_dB = 10 log_10 [ Σ_n s(n)^2 / Σ_n (s(n) - ŝ(n))^2 ]    (1.1)

where s(n) is the clean speech signal and ŝ(n) is the enhanced speech signal.
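As a minimal sketch, the overall SNR of Equation (1.1) can be computed directly from the clean and enhanced waveforms:

```python
import numpy as np

def snr_db(clean, enhanced):
    """Overall SNR of Equation (1.1): clean-signal power over the power
    of the residual error between clean and enhanced signals, in dB."""
    clean = np.asarray(clean, dtype=float)
    enhanced = np.asarray(enhanced, dtype=float)
    error = clean - enhanced
    return 10.0 * np.log10(np.sum(clean ** 2) / np.sum(error ** 2))
```

Note that this global measure needs the clean reference signal, so it is usable only in simulation, where the noise is added artificially.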
Itakura-Saito (IS) Distance Measure

The Itakura-Saito distance is a measure of the perceptual difference between an original spectrum P(ω) and an approximation P̂(ω) of that spectrum. The distance is defined as:

D_IS(P(ω), P̂(ω)) = (1/2π) ∫_{-π}^{π} [ P(ω)/P̂(ω) - log( P(ω)/P̂(ω) ) - 1 ] dω    (1.2)

The IS distance captures the spectral difference between the enhanced and clean signals; lower values indicate better quality.

Perceptual Evaluation of Speech Quality (PESQ)

The PESQ measure (Rix et al 2001) is one of the most commonly used measures for predicting the subjective opinion score of degraded or enhanced speech, and is recommended by the International Telecommunication Union (ITU) for speech quality assessment. In the PESQ measure, a reference signal and the enhanced signal are first aligned in both time and level. For normal subjective test material the PESQ score ranges from 1 to 5, with a higher score indicating better quality.

Mean Square Error

The Mean Square Error (MSE) metric is frequently used in signal processing and is defined in Equation (1.3) as:

MSE = (1/L) Σ_{i=1}^{L} [ S(i) - Ŝ(i) ]^2    (1.3)
where S(i) denotes the power spectrum of the clean speech signal and Ŝ(i) denotes the power spectrum of the enhanced speech signal.

Subjective Quality Measures

ITU-T P.835 Standard

The ITU-T standard P.835 was designed for evaluating the subjective quality of speech in noise and is particularly appropriate for the evaluation of noise suppression algorithms. The parameters used to evaluate the subjective quality of a speech signal are CSIG, the scale of signal distortion; CBAK, the scale of background intrusiveness; and COVL, the scale of the mean opinion score. The details of these parameters are given in Tables 1.1, 1.2 and 1.3.

a. Speech signal alone, using the five-point scale of signal distortion (CSIG) shown in Table 1.1.

Table 1.1 Scale of signal distortion

Rating  Speech quality
5       Very natural, no degradation
4       Fairly natural, little degradation
3       Somewhat natural, somewhat degraded
2       Fairly unnatural, fairly degraded
1       Very unnatural, very degraded

b. Background noise alone, using the five-point scale of background intrusiveness (CBAK) given in Table 1.2.
Table 1.2 Scale of background intrusiveness

Rating  Background intrusiveness
5       Not noticeable
4       Somewhat noticeable
3       Noticeable, somewhat intrusive
2       Fairly conspicuous, somewhat intrusive
1       Very conspicuous, very intrusive

c. Overall effect, using the scale of the mean opinion score (COVL) given in Table 1.3.

Table 1.3 Scale of mean opinion score

Rating  Overall quality
5       Excellent
4       Good
3       Fair
2       Poor
1       Unsatisfactory

These values are obtained by linearly combining the existing objective measures through the relations given in Equation (1.4):

CSIG = 3.093 - 1.029 LLR + 0.603 PESQ - 0.009 WSS
CBAK = 1.634 + 0.478 PESQ - 0.007 WSS + 0.063 segSNR
COVL = 1.594 + 0.805 PESQ - 0.512 LLR - 0.007 WSS    (1.4)

where LLR is the Log-Likelihood Ratio and WSS is the Weighted Spectral Slope distance.
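A minimal sketch of computing the composite scores from the component measures, assuming the regression coefficients published by Hu and Loizou (2008); the resulting scores are clipped to the 1-5 opinion scale:

```python
def composite_scores(llr, pesq, wss, seg_snr):
    """Composite objective measures, assuming the Hu & Loizou (2008)
    regression weights. Inputs are the LLR, PESQ, WSS and segmental
    SNR values obtained for one utterance."""
    csig = 3.093 - 1.029 * llr + 0.603 * pesq - 0.009 * wss
    cbak = 1.634 + 0.478 * pesq - 0.007 * wss + 0.063 * seg_snr
    covl = 1.594 + 0.805 * pesq - 0.512 * llr - 0.007 * wss
    clip = lambda v: max(1.0, min(5.0, v))  # keep scores on the 1-5 scale
    return clip(csig), clip(cbak), clip(covl)
```

Because the weights were fitted against subjective ratings, these composite scores correlate with listener opinion better than any single component measure alone.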
Log-Likelihood Ratio

The Log-Likelihood Ratio (LLR) measure is also referred to as the Itakura distance measure. It is based on the dissimilarity between the all-pole models of the original (clean) and enhanced speech, and is computed between sets of linear prediction coefficients over synchronous frames of the original and enhanced speech. The LLR measure is given by

d_LLR(a_d, a_s) = log( (a_d^T R_s a_d) / (a_s^T R_s a_s) )    (1.5)

In Equation (1.5), a_s and a_d are the linear prediction coefficient vectors of the clean and the degraded or enhanced speech segments respectively, and R_s denotes the autocorrelation matrix of the clean speech segment for which the optimal predictor coefficients a_s have been computed.

Weighted Spectral Slope distance (WSS)

The WSS distance measure is based on critical-band filter analysis (an auditory model) in which overlapping filters of progressively larger bandwidth are used to estimate the smoothed short-time speech spectrum (Yi Hu and Philipos C. Loizou 2006a, 2006b, 2008). The measure computes a weighted difference between the spectral slopes in each band, where the magnitude of each weight reflects whether the band is near a spectral peak or valley and whether that peak is the largest in the spectrum.
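The per-frame LLR of Equation (1.5) can be sketched as follows. The LPC order of 10 is a typical choice for 8 kHz speech and is an assumption here, as is solving the normal equations directly rather than by Levinson-Durbin recursion:

```python
import numpy as np

def _autocorr(x, order):
    # Autocorrelation lags r[0..order] of one frame.
    return np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])

def _lpc(x, order):
    """LPC coefficient vector [1, a1..ap] from the autocorrelation
    normal equations (autocorrelation method)."""
    r = _autocorr(x, order)
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, -r[1:])
    return np.concatenate(([1.0], a))

def llr(clean_frame, enhanced_frame, order=10):
    """Log-likelihood ratio of Equation (1.5) for one synchronous frame pair."""
    a_s = _lpc(clean_frame, order)
    a_d = _lpc(enhanced_frame, order)
    r = _autocorr(clean_frame, order)
    R_s = np.array([[r[abs(i - j)] for j in range(order + 1)]
                    for i in range(order + 1)])
    return np.log((a_d @ R_s @ a_d) / (a_s @ R_s @ a_s))
```

Since a_s minimizes the quadratic form a^T R_s a over monic coefficient vectors, the ratio is at least one and the LLR is non-negative, with zero meaning identical all-pole models.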
Test Samples

The test samples used to examine the performance of the proposed speech enhancement algorithms are taken from the SpEAR database of CSLU (Center for Spoken Language Understanding), the NOIZEUS database, and samples recorded in a room environment.

SpEAR: Speech Enhancement and Assessment Resource. This database provides speech corrupted by different classes of noise, such as white stationary, pink stationary, car and cellular noise; the speech is sampled at 8 kHz (16 bits per sample).

NOIZEUS: A noisy speech corpus for the evaluation of speech enhancement algorithms (Varga and Steeneken 1993). The database contains 30 IEEE sentences (produced by three male and three female speakers) corrupted by eight different real-world noises at different SNRs. The noise was taken from the AURORA database and includes suburban train, babble, car, exhibition hall, restaurant, street, airport and train station noise. The sentences were originally sampled at 25 kHz and downsampled to 8 kHz, and the noise signals are added to the speech signals at SNRs of 0 dB, 5 dB and 10 dB.

In the subsequent chapters, dual-channel and single-channel speech enhancement algorithms are proposed, and the proposed methods are analyzed on these test samples using the performance measures discussed above.

1.5 APPLICATIONS OF SPEECH ENHANCEMENT

The following are some of the applications where speech enhancement plays a vital role in improving the performance of speech processing systems.
Speech enhancement has found many applications, particularly with the growth of Automatic Speech Recognition (ASR) and mobile communications. The performance of ASR systems degrades badly in adverse environments with very low SNR, and the recognition rate improves when a speech enhancement algorithm is applied to the degraded speech. In mobile communication, the speech signal is degraded by different types of noise in the communication channel, so speech enhancement is needed at the receiver. A feature common to most speech enhancement methods is the estimation of the power spectrum of the clean speech from the power spectrum of the noisy speech and the spectrum of the noise.

Hearing instruments have come a long way from the first analog electroacoustic devices to state-of-the-art, full-fledged digital hearing systems. Besides providing the amplification necessary for the successful rehabilitation of a hearing-impaired person, modern hearing instruments encompass a variety of functions that improve the user experience, including noise suppression, feedback control and wireless communication. Under noisy conditions, hearing-impaired persons typically have greater difficulty in understanding speech than those with normal hearing; this disadvantage translates into a requirement of an additional 2.5 to 12 dB of SNR improvement to reach speech discrimination scores similar to those of normal hearing. As the capabilities of digital signal processing migrate to smaller devices, it is natural to consider its use as a front-end speech enhancement technique for future generations of hearing aids. Undisputedly, the most important function is the restoration of speech intelligibility in acoustically adverse conditions, which requires sophisticated acoustic signal (speech) processing algorithms.
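The power-spectrum estimation common to these front-end methods is, in its simplest form, spectral subtraction. The following is a minimal sketch, assuming for illustration that the leading frames of the recording are speech-free so the noise spectrum can be estimated from them, and using non-overlapping rectangular frames to keep the code short:

```python
import numpy as np

def spectral_subtraction(noisy, frame_len=256, noise_frames=8):
    """Basic power spectral subtraction: estimate the noise power
    spectrum from the first `noise_frames` frames (assumed noise-only),
    subtract it from every frame's power spectrum, and resynthesize
    with the noisy phase."""
    n_frames = len(noisy) // frame_len
    frames = noisy[:n_frames * frame_len].reshape(n_frames, frame_len)
    spectra = np.fft.rfft(frames, axis=1)
    noise_power = np.mean(np.abs(spectra[:noise_frames]) ** 2, axis=0)
    power = np.abs(spectra) ** 2
    # Spectral floor avoids negative power and limits musical noise.
    clean_power = np.maximum(power - noise_power, 0.01 * power)
    enhanced = np.fft.irfft(np.sqrt(clean_power) * np.exp(1j * np.angle(spectra)),
                            n=frame_len, axis=1)
    return enhanced.ravel()
```

Production systems use overlap-add windowing and adaptive noise tracking instead of a fixed leading-segment estimate; the residual fluctuations of the subtracted spectrum are what listeners perceive as musical noise.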
The rapid evolution and convergence of wireless communication and consumer electronics technologies have advanced the information industry. However, using a handheld cellular phone in a car is restricted and poses a potential risk to the driver, so hands-free devices have been developed to overcome this problem. The driver's speech, though, is corrupted severely by ambient noise, which affects subsequent operations such as speech coding and ASR for voice dialing. To solve this problem, hands-free car kits must provide some means of noise reduction at the front end of the speech coder and recognizer.

Regarding speech recognizers, the objective of ASR systems is to recognize human speech, such as words and sentences, using algorithms executed by a computer without human intervention. Under noisy conditions, however, recognition performance degrades significantly because of the statistical mismatch between the noisy speech features and the clean-trained acoustic model of the recognition system. The mismatch occurs when the testing condition differs from the training condition, as acoustic interferences such as additive background noise change the statistics of the speech. It is necessary to address this problem so that recognition accuracy can be improved to a level applicable to real-world problems. Speech enhancement techniques attempt to reduce the effect of the mismatched acoustic environment by estimating the clean speech features.

These are some of the real-time speech processing applications where speech enhancement is necessary for better performance, and the proposed speech enhancement techniques address the problems of the existing methods.
1.6 ORGANIZATION OF THE THESIS

The thesis is organized as follows:

Chapter 2 reviews the existing dual-channel and single-channel speech enhancement algorithms and the noise estimation methods related to the present work. Section 2.2 reviews the existing speech enhancement methods and various noise estimation algorithms, Section 2.3 discusses the drawbacks of the existing speech enhancement algorithms, and Section 2.4 establishes the need for the present work.

Chapter 3 presents the dual-channel algorithms. A Hadamard Least Mean Square algorithm with DCT preprocessing and a Hadamard Recursive Least Square algorithm with DCT preprocessing are proposed in this chapter; time-domain and frequency-domain plots are obtained and the results are analyzed.

In Chapter 4, single-channel algorithms are proposed. In Section 4.2, an enhancement technique using a Partial Differential Equation (PDE) is proposed for stationary noisy conditions, operating on time-domain speech samples. Section 4.3 deals with speech enhancement using variance and a modified gain function, where the modified gain function is calculated over both time index and frequency bin to compensate for speech distortion; performance is analyzed by comparison with the existing method.

Chapter 5 provides the speech enhancement algorithms using a subband approach. Section 5.2 presents a subband spectral subtraction method using an adaptive noise estimation algorithm, which is used for
non-stationary noisy environments to improve the quality of the enhanced speech signal while reducing musical noise compared with the conventional spectral subtraction method. In Section 5.3, a Subband Two Step Decision Directed (SBTSDD) approach with an adaptive weighting factor and a perceptual gain factor is proposed; this section considers the masking property and the perceptual quality of the human ear, using masking thresholds to obtain the enhanced signal. Results for the proposed methods are analyzed and compared with those of Chapter 4, and time-domain and frequency-domain output plots are given.

Further, in Chapter 6, speech enhancement for a digital hearing aid is discussed based on the results obtained by the proposed methods. The conclusion and suggestions for future work are presented in Chapter 7.

1.7 SUMMARY

In this chapter, speech processing systems and speech enhancement were introduced, and an overview of speech enhancement techniques and their applications was given briefly. The next chapter deals with the literature review and the need for the present work.
White Paper: micon Directivity and Directional Speech Enhancement www.siemens.com Eghart Fischer, Henning Puder, Ph. D., Jens Hain Abstract: This paper describes a new, optimized directional processing
More informationTechnical Discussion HUSHCORE Acoustical Products & Systems
What Is Noise? Noise is unwanted sound which may be hazardous to health, interfere with speech and verbal communications or is otherwise disturbing, irritating or annoying. What Is Sound? Sound is defined
More informationFIR filter bank design for Audiogram Matching
FIR filter bank design for Audiogram Matching Shobhit Kumar Nema, Mr. Amit Pathak,Professor M.Tech, Digital communication,srist,jabalpur,india, shobhit.nema@gmail.com Dept.of Electronics & communication,srist,jabalpur,india,
More informationFrequency Tracking: LMS and RLS Applied to Speech Formant Estimation
Aldebaro Klautau - http://speech.ucsd.edu/aldebaro - 2/3/. Page. Frequency Tracking: LMS and RLS Applied to Speech Formant Estimation ) Introduction Several speech processing algorithms assume the signal
More informationSpeech Enhancement Based on Spectral Subtraction Involving Magnitude and Phase Components
Speech Enhancement Based on Spectral Subtraction Involving Magnitude and Phase Components Miss Bhagat Nikita 1, Miss Chavan Prajakta 2, Miss Dhaigude Priyanka 3, Miss Ingole Nisha 4, Mr Ranaware Amarsinh
More informationResearch Article The Acoustic and Peceptual Effects of Series and Parallel Processing
Hindawi Publishing Corporation EURASIP Journal on Advances in Signal Processing Volume 9, Article ID 6195, pages doi:1.1155/9/6195 Research Article The Acoustic and Peceptual Effects of Series and Parallel
More informationAmbiguity in the recognition of phonetic vowels when using a bone conduction microphone
Acoustics 8 Paris Ambiguity in the recognition of phonetic vowels when using a bone conduction microphone V. Zimpfer a and K. Buck b a ISL, 5 rue du Général Cassagnou BP 734, 6831 Saint Louis, France b
More informationCONSTRUCTING TELEPHONE ACOUSTIC MODELS FROM A HIGH-QUALITY SPEECH CORPUS
CONSTRUCTING TELEPHONE ACOUSTIC MODELS FROM A HIGH-QUALITY SPEECH CORPUS Mitchel Weintraub and Leonardo Neumeyer SRI International Speech Research and Technology Program Menlo Park, CA, 94025 USA ABSTRACT
More informationThe Effect of Analysis Methods and Input Signal Characteristics on Hearing Aid Measurements
The Effect of Analysis Methods and Input Signal Characteristics on Hearing Aid Measurements By: Kristina Frye Section 1: Common Source Types FONIX analyzers contain two main signal types: Puretone and
More informationEvidence base for hearing aid features:
Evidence base for hearing aid features: { the ʹwhat, how and whyʹ of technology selection, fitting and assessment. Drew Dundas, PhD Director of Audiology, Clinical Assistant Professor of Otolaryngology
More informationAccessibility Standards Mitel MiVoice 8528 and 8568 Digital Business Telephones
Accessibility Standards Mitel products are designed with the highest standards of accessibility. Below is a table that outlines how Mitel MiVoice 8528 and 8568 digital business telephones conform to section
More informationNear-End Perception Enhancement using Dominant Frequency Extraction
Near-End Perception Enhancement using Dominant Frequency Extraction Premananda B.S. 1,Manoj 2, Uma B.V. 3 1 Department of Telecommunication, R. V. College of Engineering, premanandabs@gmail.com 2 Department
More informationJuan Carlos Tejero-Calado 1, Janet C. Rutledge 2, and Peggy B. Nelson 3
PRESERVING SPECTRAL CONTRAST IN AMPLITUDE COMPRESSION FOR HEARING AIDS Juan Carlos Tejero-Calado 1, Janet C. Rutledge 2, and Peggy B. Nelson 3 1 University of Malaga, Campus de Teatinos-Complejo Tecnol
More informationNoise-Robust Speech Recognition Technologies in Mobile Environments
Noise-Robust Speech Recognition echnologies in Mobile Environments Mobile environments are highly influenced by ambient noise, which may cause a significant deterioration of speech recognition performance.
More informationSUPPRESSION OF MUSICAL NOISE IN ENHANCED SPEECH USING PRE-IMAGE ITERATIONS. Christina Leitner and Franz Pernkopf
2th European Signal Processing Conference (EUSIPCO 212) Bucharest, Romania, August 27-31, 212 SUPPRESSION OF MUSICAL NOISE IN ENHANCED SPEECH USING PRE-IMAGE ITERATIONS Christina Leitner and Franz Pernkopf
More informationSpeech recognition in noisy environments: A survey
T-61.182 Robustness in Language and Speech Processing Speech recognition in noisy environments: A survey Yifan Gong presented by Tapani Raiko Feb 20, 2003 About the Paper Article published in Speech Communication
More informationImplementation of Spectral Maxima Sound processing for cochlear. implants by using Bark scale Frequency band partition
Implementation of Spectral Maxima Sound processing for cochlear implants by using Bark scale Frequency band partition Han xianhua 1 Nie Kaibao 1 1 Department of Information Science and Engineering, Shandong
More informationMasker-signal relationships and sound level
Chapter 6: Masking Masking Masking: a process in which the threshold of one sound (signal) is raised by the presentation of another sound (masker). Masking represents the difference in decibels (db) between
More informationLATERAL INHIBITION MECHANISM IN COMPUTATIONAL AUDITORY MODEL AND IT'S APPLICATION IN ROBUST SPEECH RECOGNITION
LATERAL INHIBITION MECHANISM IN COMPUTATIONAL AUDITORY MODEL AND IT'S APPLICATION IN ROBUST SPEECH RECOGNITION Lu Xugang Li Gang Wang Lip0 Nanyang Technological University, School of EEE, Workstation Resource
More informationADVANCES in NATURAL and APPLIED SCIENCES
ADVANCES in NATURAL and APPLIED SCIENCES ISSN: 1995-0772 Published BYAENSI Publication EISSN: 1998-1090 http://www.aensiweb.com/anas 2016 December10(17):pages 275-280 Open Access Journal Improvements in
More informationA. SEK, E. SKRODZKA, E. OZIMEK and A. WICHER
ARCHIVES OF ACOUSTICS 29, 1, 25 34 (2004) INTELLIGIBILITY OF SPEECH PROCESSED BY A SPECTRAL CONTRAST ENHANCEMENT PROCEDURE AND A BINAURAL PROCEDURE A. SEK, E. SKRODZKA, E. OZIMEK and A. WICHER Institute
More informationAssistive Listening Technology: in the workplace and on campus
Assistive Listening Technology: in the workplace and on campus Jeremy Brassington Tuesday, 11 July 2017 Why is it hard to hear in noisy rooms? Distance from sound source Background noise continuous and
More informationWhat you re in for. Who are cochlear implants for? The bottom line. Speech processing schemes for
What you re in for Speech processing schemes for cochlear implants Stuart Rosen Professor of Speech and Hearing Science Speech, Hearing and Phonetic Sciences Division of Psychology & Language Sciences
More informationThe Use of a High Frequency Emphasis Microphone for Musicians Published on Monday, 09 February :50
The Use of a High Frequency Emphasis Microphone for Musicians Published on Monday, 09 February 2009 09:50 The HF microphone as a low-tech solution for performing musicians and "ultra-audiophiles" Of the
More informationBest Practice Protocols
Best Practice Protocols SoundRecover for children What is SoundRecover? SoundRecover (non-linear frequency compression) seeks to give greater audibility of high-frequency everyday sounds by compressing
More informationC H A N N E L S A N D B A N D S A C T I V E N O I S E C O N T R O L 2
C H A N N E L S A N D B A N D S Audibel hearing aids offer between 4 and 16 truly independent channels and bands. Channels are sections of the frequency spectrum that are processed independently by the
More information2/16/2012. Fitting Current Amplification Technology on Infants and Children. Preselection Issues & Procedures
Fitting Current Amplification Technology on Infants and Children Cindy Hogan, Ph.D./Doug Sladen, Ph.D. Mayo Clinic Rochester, Minnesota hogan.cynthia@mayo.edu sladen.douglas@mayo.edu AAA Pediatric Amplification
More informationAdaptation of Classification Model for Improving Speech Intelligibility in Noise
1: (Junyoung Jung et al.: Adaptation of Classification Model for Improving Speech Intelligibility in Noise) (Regular Paper) 23 4, 2018 7 (JBE Vol. 23, No. 4, July 2018) https://doi.org/10.5909/jbe.2018.23.4.511
More informationEffects of Cochlear Hearing Loss on the Benefits of Ideal Binary Masking
INTERSPEECH 2016 September 8 12, 2016, San Francisco, USA Effects of Cochlear Hearing Loss on the Benefits of Ideal Binary Masking Vahid Montazeri, Shaikat Hossain, Peter F. Assmann University of Texas
More informationSound localization psychophysics
Sound localization psychophysics Eric Young A good reference: B.C.J. Moore An Introduction to the Psychology of Hearing Chapter 7, Space Perception. Elsevier, Amsterdam, pp. 233-267 (2004). Sound localization:
More informationAUDL GS08/GAV1 Signals, systems, acoustics and the ear. Pitch & Binaural listening
AUDL GS08/GAV1 Signals, systems, acoustics and the ear Pitch & Binaural listening Review 25 20 15 10 5 0-5 100 1000 10000 25 20 15 10 5 0-5 100 1000 10000 Part I: Auditory frequency selectivity Tuning
More informationSpeech perception in individuals with dementia of the Alzheimer s type (DAT) Mitchell S. Sommers Department of Psychology Washington University
Speech perception in individuals with dementia of the Alzheimer s type (DAT) Mitchell S. Sommers Department of Psychology Washington University Overview Goals of studying speech perception in individuals
More informationLinguistic Phonetics Fall 2005
MIT OpenCourseWare http://ocw.mit.edu 24.963 Linguistic Phonetics Fall 2005 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms. 24.963 Linguistic Phonetics
More informationEffects of speaker's and listener's environments on speech intelligibili annoyance. Author(s)Kubo, Rieko; Morikawa, Daisuke; Akag
JAIST Reposi https://dspace.j Title Effects of speaker's and listener's environments on speech intelligibili annoyance Author(s)Kubo, Rieko; Morikawa, Daisuke; Akag Citation Inter-noise 2016: 171-176 Issue
More informationSystems for Improvement of the Communication in Passenger Compartments
Systems for Improvement of the Communication in Passenger Compartments Tim Haulick/ Gerhard Schmidt thaulick@harmanbecker.com ETSI Workshop on Speech and Noise in Wideband Communication 22nd and 23rd May
More informationPower Instruments, Power sources: Trends and Drivers. Steve Armstrong September 2015
Power Instruments, Power sources: Trends and Drivers Steve Armstrong September 2015 Focus of this talk more significant losses Severe Profound loss Challenges Speech in quiet Speech in noise Better Listening
More informationEssential feature. Who are cochlear implants for? People with little or no hearing. substitute for faulty or missing inner hair
Who are cochlear implants for? Essential feature People with little or no hearing and little conductive component to the loss who receive little or no benefit from a hearing aid. Implants seem to work
More informationModulation and Top-Down Processing in Audition
Modulation and Top-Down Processing in Audition Malcolm Slaney 1,2 and Greg Sell 2 1 Yahoo! Research 2 Stanford CCRMA Outline The Non-Linear Cochlea Correlogram Pitch Modulation and Demodulation Information
More informationINTRODUCTION TO PURE (AUDIOMETER & TESTING ENVIRONMENT) TONE AUDIOMETERY. By Mrs. Wedad Alhudaib with many thanks to Mrs.
INTRODUCTION TO PURE TONE AUDIOMETERY (AUDIOMETER & TESTING ENVIRONMENT) By Mrs. Wedad Alhudaib with many thanks to Mrs. Tahani Alothman Topics : This lecture will incorporate both theoretical information
More informationSingle channel noise reduction in hearing aids
Single channel noise reduction in hearing aids Recordings for perceptual evaluation Inge Brons Rolph Houben Wouter Dreschler Introduction Hearing impaired have difficulty understanding speech in noise
More informationLIST OF FIGURES. Figure No. Title Page No. Fig. l. l Fig. l.2
LIST OF FIGURES Figure No. Title Page No. Fig. l. l Fig. l.2 Fig. l.3 A generic perceptual audio coder. 3 Schematic diagram of the programme of the present work on perceptual audio coding. 18 Schematic
More informationThe role of periodicity in the perception of masked speech with simulated and real cochlear implants
The role of periodicity in the perception of masked speech with simulated and real cochlear implants Kurt Steinmetzger and Stuart Rosen UCL Speech, Hearing and Phonetic Sciences Heidelberg, 09. November
More informationLinguistic Phonetics. Basic Audition. Diagram of the inner ear removed due to copyright restrictions.
24.963 Linguistic Phonetics Basic Audition Diagram of the inner ear removed due to copyright restrictions. 1 Reading: Keating 1985 24.963 also read Flemming 2001 Assignment 1 - basic acoustics. Due 9/22.
More informationIMPROVING CHANNEL SELECTION OF SOUND CODING ALGORITHMS IN COCHLEAR IMPLANTS. Hussnain Ali, Feng Hong, John H. L. Hansen, and Emily Tobey
2014 IEEE International Conference on Acoustic, Speech and Signal Processing (ICASSP) IMPROVING CHANNEL SELECTION OF SOUND CODING ALGORITHMS IN COCHLEAR IMPLANTS Hussnain Ali, Feng Hong, John H. L. Hansen,
More informationBINAURAL DICHOTIC PRESENTATION FOR MODERATE BILATERAL SENSORINEURAL HEARING-IMPAIRED
International Conference on Systemics, Cybernetics and Informatics, February 12 15, 2004 BINAURAL DICHOTIC PRESENTATION FOR MODERATE BILATERAL SENSORINEURAL HEARING-IMPAIRED Alice N. Cheeran Biomedical
More informationCONTACTLESS HEARING AID DESIGNED FOR INFANTS
CONTACTLESS HEARING AID DESIGNED FOR INFANTS M. KULESZA 1, B. KOSTEK 1,2, P. DALKA 1, A. CZYŻEWSKI 1 1 Gdansk University of Technology, Multimedia Systems Department, Narutowicza 11/12, 80-952 Gdansk,
More informationWIDEXPRESS. no.30. Background
WIDEXPRESS no. january 12 By Marie Sonne Kristensen Petri Korhonen Using the WidexLink technology to improve speech perception Background For most hearing aid users, the primary motivation for using hearing
More informationUsing Source Models in Speech Separation
Using Source Models in Speech Separation Dan Ellis Laboratory for Recognition and Organization of Speech and Audio Dept. Electrical Eng., Columbia Univ., NY USA dpwe@ee.columbia.edu http://labrosa.ee.columbia.edu/
More informationGENERALIZATION OF SUPERVISED LEARNING FOR BINARY MASK ESTIMATION
GENERALIZATION OF SUPERVISED LEARNING FOR BINARY MASK ESTIMATION Tobias May Technical University of Denmark Centre for Applied Hearing Research DK - 2800 Kgs. Lyngby, Denmark tobmay@elektro.dtu.dk Timo
More informationProviding Effective Communication Access
Providing Effective Communication Access 2 nd International Hearing Loop Conference June 19 th, 2011 Matthew H. Bakke, Ph.D., CCC A Gallaudet University Outline of the Presentation Factors Affecting Communication
More informationSupporting Features. Criteria. Remarks and Explanations
Date: vember 23, 2016 Name of Product: Contact for More Information: section508@necam.com Summary Table Voluntary Product Accessibility 1194.23(a) Telecommunications products or systems which provide a
More informationSummary Table Voluntary Product Accessibility Template. Supporting Features Not Applicable Not Applicable. Supports with Exceptions.
Plantronics/ Clarity Summary Table Voluntary Product Accessibility Template Criteria Section 1194.21 Software Applications and Operating Systems Section 1194.22 Web-based intranet and Internet Information
More informationCriteria Supporting Features Remarks and Explanations
Date: August 31, 2009 Name of Product: (Models DTL-2E, DTL-6DE, DTL- 12D, DTL-24D, DTL32D, DTL-8LD) Contact for More Information: section508@necam.com or 214-262-7095 Summary Table Voluntary Product Accessibility
More informationProceedings of Meetings on Acoustics
Proceedings of Meetings on Acoustics Volume 19, 13 http://acousticalsociety.org/ ICA 13 Montreal Montreal, Canada - 7 June 13 Engineering Acoustics Session 4pEAa: Sound Field Control in the Ear Canal 4pEAa13.
More informationInfant Hearing Development: Translating Research Findings into Clinical Practice. Auditory Development. Overview
Infant Hearing Development: Translating Research Findings into Clinical Practice Lori J. Leibold Department of Allied Health Sciences The University of North Carolina at Chapel Hill Auditory Development
More informationWho are cochlear implants for?
Who are cochlear implants for? People with little or no hearing and little conductive component to the loss who receive little or no benefit from a hearing aid. Implants seem to work best in adults who
More information3M Center for Hearing Conservation
3M Center for Hearing Conservation Key Terms in Occupational Hearing Conservation Absorption A noise control method featuring sound-absorbing materials that are placed in an area to reduce the reflection
More informationTopic 4. Pitch & Frequency
Topic 4 Pitch & Frequency A musical interlude KOMBU This solo by Kaigal-ool of Huun-Huur-Tu (accompanying himself on doshpuluur) demonstrates perfectly the characteristic sound of the Xorekteer voice An
More informationSUMMARY TABLE VOLUNTARY PRODUCT ACCESSIBILITY TEMPLATE
Date: 1 August 2009 Voluntary Accessibility Template (VPAT) This Voluntary Product Accessibility Template (VPAT) describes accessibility of Polycom s Polycom CX200, CX700 Desktop IP Telephones against
More informationSUMMARY TABLE VOLUNTARY PRODUCT ACCESSIBILITY TEMPLATE
Date: 2 November 2010 Updated by Alan Batt Name of Product: Polycom CX600 IP Phone for Microsoft Lync Company contact for more Information: Ian Jennings, ian.jennings@polycom.com Note: This document describes
More informationSupporting Features Remarks and Explanations
Date: August 27, 2009 Name of Product: (Models ITL- 2E, ITL-6DE, ITL- 12D, ITL-24D, ITL-32D, ITL-8LD, ITL-320C) Contact for More Information: section508@necam.com or 214-262-7095 Summary Table Voluntary
More informationIndividualizing Noise Reduction in Hearing Aids
Individualizing Noise Reduction in Hearing Aids Tjeerd Dijkstra, Adriana Birlutiu, Perry Groot and Tom Heskes Radboud University Nijmegen Rolph Houben Academic Medical Center Amsterdam Why optimize noise
More informationIMPROVING THE PATIENT EXPERIENCE IN NOISE: FAST-ACTING SINGLE-MICROPHONE NOISE REDUCTION
IMPROVING THE PATIENT EXPERIENCE IN NOISE: FAST-ACTING SINGLE-MICROPHONE NOISE REDUCTION Jason A. Galster, Ph.D. & Justyn Pisa, Au.D. Background Hearing impaired listeners experience their greatest auditory
More informationSlow compression for people with severe to profound hearing loss
Phonak Insight February 2018 Slow compression for people with severe to profound hearing loss For people with severe to profound hearing loss, poor auditory resolution abilities can make the spectral and
More information