Genesis of wearable DSP structures for selective speech enhancement and replacement to compensate severe hearing deficits

Axel PLINGE, Dieter BAUER
Leibniz Institute for Occupational Health, Dortmund, Germany

Abstract: Conventional hearing instruments often cannot compensate hearing losses to a sufficient degree. Digital signal processing offers new opportunities to provide speech enhancement and replacement functions in a wearable device. With such a device as the target, a working laboratory prototype based on DSPs was built.

Keywords: hearing impairment, frication, transposition, selectivity, DSP, wearable, classification

Introduction

In cases of severe sensory hearing deficits, conventional hearing instruments are often insufficient to compensate the hearing loss well enough to enable proper communication. Too many sounds are too weak or inaudible, leading to misclassification and confusion. Even worse, the residual speech recognition abilities are further disturbed by environmental noise and competing speakers. Thus, beyond very good noise reduction, the users in our target group urgently need support in the form of enhancement or replacement of otherwise inaudible speech features.

Digital signal processing offers new opportunities for delivering such functions. Modern low-power digital processors are increasingly powerful and can serve as an inexpensive basis (as opposed to custom-made circuitry) for implementing more and more functionality in wearable equipment powered by small rechargeable batteries....

Figure 1. Basic structure (block labels: front end; DSP1; multi-band speech enhancement; delay (temporal alignment); Σ; transmitter; DSP3/4 or radio link; DSP2; phonetic spotter; stimulus generation; parallel data bus)

1. System Genesis

1.1. Basic Structure

Many simulations and evaluations of available low-power digital integrated circuits have led to the basic processing structure shown in figure 1. The processing core consists of two coupled subunits, each with one DSP at its centre. It provides baseband processing with controlled compression as well as enhancement and replacement of speech features controlled by a phoneme spotter. This unit can be coupled to either of two possible front ends that provide substantial noise reduction. One solution would be radio-link microphones worn by the communication partner. This requires the willingness and acceptance of the partner, which cannot always be taken for granted. The second solution is an intelligent microphone array that adapts to the characteristics of the noise field. We expect such a solution to be feasible with two further DSPs.

1.2. Laboratory Prototype

The laboratory prototype consists of two coupled ADSP 2189 DSPs with surrounding interface hardware (figure 2). The functional blocks of one of the DSPs are sketched in figure 3. Each evaluation board can be equipped with both RAM and flash memory. The flash memory is needed to hold the DSP software and boot loader when no PC is connected. An RS232 connection to a PC is used to upload the DSP program into RAM or flash. The same interface is used to communicate with the DSPs while the program is running, which allows on-line monitoring and modification of the processing parameters.

Figure 2. Laboratory prototype
Figure 3. Blocks of one DSP module
(block labels in figures 2 and 3: power supply (5.0 V / 3.0 V / 1.8 V etc.); microcontroller PIC16F877; bus multiplexer; control interface; diagnostics; RAM 128 kByte; PC; 2nd DSP; PLD; RS232 interface; serial data exchange; DSP module ADSP-2189M, 75 MIPS; flash memory 512 kByte; I/O module; codec module)
1.3. Wearable Solution

The wearable device will be based on the current laboratory prototype, but stripped of many components. Essentially, the processor, the flash memory and one codec will remain. The RS232 interface will be replaced by IrDA circuitry for wireless coupling to the PC (marked grey in figure 3).

2. Current Implementation

The first aim of this implementation was to demonstrate that the full functionality of the rather complex design (previously evaluated in simulation) for spotter-controlled transposition of /s, z/, /C/ and /t/ is feasible using selected low-power circuitry that can easily be transformed into a wearable design powered by lithium-ion batteries [3, 4]. The second aim was to implement improved baseband processing that surpasses the previous design [1] through intelligent control of the compression.

2.1. Baseband

Figure 3 shows a simplified block diagram of the baseband processing. The input filter bank consists of three linear (finite impulse response) filters with individually different pass-bands to allow for speech-mode pre-equalizing processing. The subsequent multi-band compression uses three different temporal characteristics, which again may differ between bands. Within the higher second-formant range (the third channel), additional processing of the temporal envelope may be introduced as a novel, speech-specific enhancement of second-formant features (SEF).

Another new feature is the pair of external control lines. The spotter control can be used to introduce phoneme-dependent compression gain or specific SEF; all processing parameters can be modified selectively according to the spotted phoneme class. The spatial control may be used when the microphone array processor serves as front end: a reliable speaker identification is transmitted to modulate the compression of ambient noise to a predetermined, non-masking level. Table 1 gives an account of the processing power used by the current implementation.
We can conclude that all functionality whose salience was pre-established in simulations fits well into one 75 MIPS (million instructions per second) DSP.

Figure 3. Baseband processing (block labels: spatial control; controlled multi-band compression (3 temporal characteristics, look-ahead, band coupling); spotter control; post; Σ)

Table 1. Baseband MIPS
20 Compression 15 Control 10 Communication 15 Management 60
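As a rough illustration of the band-wise compression described in section 2.1, the following Python sketch shows a single compressor channel with its own attack and release time constants. It is not the prototype's DSP code: the one-pole envelope follower, the static threshold/ratio curve, and all names and parameter values (envelope, compression_gain, compress_band, attack_ms, release_ms, threshold, ratio) are assumptions chosen for illustration. Look-ahead, band coupling and the spotter and spatial control inputs are left out.

```python
import math


def envelope(samples, fs, attack_ms, release_ms):
    """One-pole level follower with separate attack and release times.

    Hypothetical helper: the time constants stand in for one of the
    'temporal characteristics' of a band; values are illustrative only.
    """
    a_att = math.exp(-1.0 / (fs * attack_ms / 1000.0))
    a_rel = math.exp(-1.0 / (fs * release_ms / 1000.0))
    env = 0.0
    out = []
    for x in samples:
        level = abs(x)
        # Rising level: fast attack coefficient; falling level: slow release.
        coeff = a_att if level > env else a_rel
        env = coeff * env + (1.0 - coeff) * level
        out.append(env)
    return out


def compression_gain(env, threshold, ratio):
    """Static curve: unity gain up to the threshold, then the output level
    rises by only 1/ratio dB per dB of input level (assumed curve shape)."""
    if env <= threshold:
        return 1.0
    return (env / threshold) ** (1.0 / ratio - 1.0)


def compress_band(samples, fs, attack_ms, release_ms, threshold, ratio):
    """Apply the envelope-controlled gain to one band signal."""
    env = envelope(samples, fs, attack_ms, release_ms)
    return [x * compression_gain(e, threshold, ratio)
            for x, e in zip(samples, env)]
```

In a three-band arrangement, compress_band would be called once per filter-bank output, with band-specific time constants, before the bands are summed. A control input such as the spotter control could then be modelled by changing threshold or ratio while the signal is processed.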
2.2. Transposer

The transposition unit can be roughly divided into three main functional blocks, as shown in figure 4. The phoneme spotter extracts a set of speech features in the feature extraction block and then classifies the feature vector as one (or none) of the predefined phoneme classes (classifier). The detection of a speech feature or phoneme to be replaced then triggers the generation of replacement stimuli in the third block (and modifies the processing parameters of the baseband processing). From the first simulations to the present working prototype, the whole design was made under severe constraints on processing power (since a wearable device is targeted) and on time (to allow perceptual integration).

Figure 4. Transposer (block labels: band filters >4.5 kHz, 2.4-3.8 kHz, 1.2-2.4 kHz, 0.6-1.2 kHz; downsampling; average; ratio; NCCF; prefilter, ROR; prefilter, pause; features (36 MIPS); µ distance classifier (20 MIPS); range & post; stimulus generation (10 MIPS); #ZC; sin; s', C', t'; modulator; soft switch)

2.2.1. Feature Extraction

Under the aforementioned constraints, only a small number of features can be used to reliably detect the fricatives and plosives in question. Dedicated evaluation led to a refinement of the spectral features to two ratios of three bands for best separation of /S/ and /C/, and a comparison of the energy values with an /s/ band situated beyond 4.5 kHz [3, 4]. To avoid temporal asynchrony, four linear-phase filters are used in conjunction with a moving average and 16-bit division. This branch requires 24 MIPS of calculation effort.

To separate voiced from unvoiced speech, the maximum value of the normalized cross-correlation is used [6]. In order to use this very salient feature, the calculation was hand-optimised in assembler code, under constant control of the resulting quality, down to 6 MIPS (from about 100).

Since the need for special treatment of /t/ became evident, plosive features had to be added [5]. For plosion burst detection, several energy deviation measures with different bands were tested. A single ROR (rate-of-rise) feature with just one pre-filter was found to yield good significance. A pause detector was added to account for the plosive closure.

2.2.2. Classification

To classify the feature vector thus derived, a threefold phoneme recognition scheme was devised [3, 4]. A Gaussian distance measure is evaluated using prototypes that are calculated
using the PC simulation of the classifier and hundreds of labelled speech samples. After omitting covariance, the distance function to be evaluated reduces to equation 1, which requires just 11 MIPS for 6 features and 6 classes.

$$ d_K(x) := -2\log p_K + 2\log\sigma_K + \sum_i \left(\frac{x_i-\mu_{Ki}}{\sigma_{Ki}}\right)^2 = C_K + \sum_i \bigl((x_i-\mu_{Ki})\,S_{Ki}\bigr)^2 \qquad (1) $$

Here the class constant $C_K = -2\log p_K + 2\log\sigma_K$ and the scale factors $S_{Ki} = 1/\sigma_{Ki}$ can be computed in advance, so that only subtractions, multiplications and accumulations remain at run time.

To accommodate asymmetric deviations and exclude unwanted phonemes, a range check of the feature vector was introduced. It may also be used to adjust the transposer's selectivity, as discussed in [8]. For temporal smoothing of the recognition result, a post-correction is added.

2.2.3. Replacement Stimulus Generation

After careful evaluation to find optimised replacement stimuli [7, 8], a high-quality but low-cost generation was implemented: no more than 10 MIPS are needed for concatenating stored data, sine modulation and zero-crossing rate measurement.

3. Conclusion

Given the successful implementation within the laboratory prototype, the construction of a wearable test device that provides high-quality assistance in speech understanding can be considered feasible.

Acknowledgements: We would like to thank W.H. Ehrenstein for revising the English text.

References

[1] A. Plinge, D. Bauer, M. Finke (2001): Intelligibility enhancement of human speech for severely hearing impaired persons by dedicated digital processing. In: Crt Marincek et al. (eds.): Assistive Technology - Added Value to the Quality of Life. IOS Press
[2] L. Arslan, J. H. L. Hansen (1994): Minimum cost based phoneme class detection for improved iterative speech enhancement. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, Vol. 2, pp. 45-48
[3] D. Bauer, A. Plinge, M. Finke (2002): Selective Phoneme Spotting for Realization of an /s, z, C, t/ Transposer. In: Miesenberger et al. (eds.): Computers Helping People with Special Needs, 8th ICCHP Proceedings, Lecture Notes in Computer Science 2398. Springer, Heidelberg
[4] A. Plinge, D. Bauer (2003): Introducing Restoration of Selectivity in Hearing Instrument Design through Phoneme Spotting. In: G. M. Craddock et al. (eds.): Assistive Technology - Shaping the Future. IOS Press
[5] B. Plannerer et al. (1996): A continuous speech recognition system integrating additional acoustic knowledge sources. Technical report, TU München
[6] D. Talkin (1995): A Robust Algorithm for Pitch Tracking. In: W. B. Kleijn, K. K. Paliwal (eds.): Speech Coding and Synthesis. Elsevier Science
[7] D. Bauer, A. Plinge, W. H. Ehrenstein (2003): Compensation of Severe Sensory Hearing Deficits. Two Different Approaches to Replace Inaudible Speech Elements: Re-Sampling Versus Re-Synthesis. In: G. M. Craddock et al. (eds.): Assistive Technology - Shaping the Future. IOS Press
[8] D. Bauer, A. Plinge (2005): Tools and Strategies for Fitting a Wearable Frication Transposer to the Needs of Severely Hearing Impaired People (this volume)
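As a supplementary illustration of the classification step of section 2.2.2, the following Python sketch evaluates a distance of the form of equation 1, read as a Gaussian distance with the covariance terms omitted, and combines it with a simple range check. It is a minimal sketch, not the prototype's assembler code: the function names (precompute, distance, classify), the per-class lower and upper feature bounds used for the range check, and the treatment of σ_K as the product of the per-feature deviations are assumptions made for this example.

```python
import math


def precompute(means, sigmas, priors):
    """Build per-class constants so that run time needs only
    subtract, multiply and accumulate operations.

    Assumed reading: C_K = -2 log p_K + 2 * sum_i log sigma_Ki,
    S_Ki = 1 / sigma_Ki.
    """
    table = []
    for mu, sigma, p in zip(means, sigmas, priors):
        c = -2.0 * math.log(p) + 2.0 * sum(math.log(s) for s in sigma)
        scale = [1.0 / s for s in sigma]
        table.append((c, mu, scale))
    return table


def distance(x, c, mu, scale):
    """d_K(x) = C_K + sum_i ((x_i - mu_Ki) * S_Ki)^2."""
    return c + sum(((xi - mi) * si) ** 2
                   for xi, mi, si in zip(x, mu, scale))


def classify(x, table, ranges):
    """Return the index of the nearest class that passes its range check,
    or None if the feature vector lies outside every class range.

    ranges[k] is a hypothetical (lower_bounds, upper_bounds) pair per class.
    """
    best, best_d = None, float("inf")
    for k, (c, mu, scale) in enumerate(table):
        lower, upper = ranges[k]
        if any(not (lo <= xi <= hi)
               for xi, lo, hi in zip(x, lower, upper)):
            continue  # excluded by the range check
        d = distance(x, c, mu, scale)
        if d < best_d:
            best, best_d = k, d
    return best
```

Narrowing or widening the bounds in ranges changes how readily a class is accepted, which is one possible way to picture the adjustable selectivity mentioned in section 2.2.2. The None result corresponds to the case in which the feature vector is assigned to none of the predefined phoneme classes.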