Advanced Audio Interface for Phonetic Speech Recognition in a High Noise Environment


DISTRIBUTION STATEMENT A: Approved for Public Release; Distribution Unlimited

SBIR 99.1, Topic AF99-1Q3, Phase I Summary Report. Prepared by Standard Object Systems, Inc., January 2000, for the Air Force Research Laboratory, Contract No. F41624-99-C-6019.

ABSTRACT

Standard Object Systems, Inc. (SOS) has used its existing technology in phonetic speech recognition, audio signal processing, and multilingual language translation to design and demonstrate an advanced audio interface for speech recognition in a high noise military environment. The Phase I result was a design for a Phase II prototype system with unique dual-microphone hardware and adaptive digital filtering for noise cancellation that interfaces to speech recognition software. It uses auditory features in speech recognition training and supports multilingual spoken language translation applications. As a future Phase III commercial product, this system will perform real-time multilingual speech recognition in noisy vehicles, offices, and factories. The potential market for this technology includes any commercial speech and translation application in noisy environments.

To reduce the effect of noise on speech recognition, SOS created an adaptive digital filter to remove noise from speech prior to recognition processing. Six different filters were programmed and demonstrated using a unique dual-microphone input computer system. The results were evaluated using three commercial speech recognition software systems and the SOS phonetic speech recognition (SPSR) tool kit. The maximum improvement in spoken words recognized was over 300% at the highest noise level. In addition, SOS researched computational models for noise-resistant speech features using three auditory models: the auditory physiology simulation (APS), the ensemble interval histogram (EIH), and the auditory image model (AIM). These models will be implemented and tested in Phase II and will be adopted in the speech recognition systems only when they improve existing performance. In conjunction, SOS has developed a unique multilingual speech translation method that will be implemented in Phase II for use in noisy spoken communication applications. During Phase II, a prototype unit will be developed in the first year that includes dual microphones, adaptive filtering, auditory features, speech recognition, and multilingual translation. In the second year this prototype will be tested in operational environments and improved to become a commercial product in Phase III.

Purpose of the Research Work

Both the Department of Defense and commercial users have identified the failure of computer speech recognition in noisy real-world situations as a critical problem. This Phase I SBIR research report describes the creative concept, the component designs, and the experimental analysis of a dual head-mounted noise-canceling microphone with an adaptive digital filter for speech recognition in noisy environments, designed by Standard Object Systems, Inc. (SOS). The primary research goal is to improve computer speech recognition in noisy environments through innovative adaptive filtering of dual microphone inputs processed through dual sound cards. The figure below illustrates the sound signal processing and adaptive digital filtering designed to improve noisy speech recognition, as demonstrated in the proof of concept presentation.

[Figure: Advanced Audio Interface for Speech Recognition in a High Noise Environment, Phase II prototype signal chain. A head-mounted boom microphone (speech plus noise signal) and an ear-mounted noise microphone (independent noise signal) feed, through a dual sound card, low-cost PC-based computer system for two microphones, an adaptive digital noise-canceling filter with speech detection and classification, followed by auditory feature modeling, a modified speech recognition system, and a domain-based translation system producing a multilingual spoken language translator.]

SOS has applied its existing technology in phonetic speech recognition, digital signal processing, software development, audio hardware engineering, and acoustic speech science to perform this research. Numerous experiments were conducted with adaptive filters, noise models, speech recognizers, auditory feature models, and PC dual-microphone hardware designs. The results of the Phase I effort are documented in the Phase I report, in the accompanying CD-ROM of sound files and data, and in the proof of concept demonstration conducted at AFRL, Wright-Patterson AFB.
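
To make the dual-microphone concept concrete, the following sketch shows a textbook normalized-LMS noise canceller in Python (NumPy): the ear-mounted reference microphone is assumed to capture noise only, an adaptive FIR filter estimates the noise component reaching the boom microphone, and the residual is the cleaned speech. This is a minimal illustration under those assumptions, not the SOS Phase II filter design; the function name and parameters are illustrative only.

    import numpy as np

    def nlms_noise_canceller(primary, reference, n_taps=64, mu=0.5, eps=1e-8):
        # primary: boom-mic samples (speech + noise); reference: noise-mic samples.
        # Both are assumed to be float arrays captured synchronously, as with the
        # dual sound card arrangement described above.
        w = np.zeros(n_taps)                        # adaptive FIR weights
        cleaned = np.zeros(len(primary))
        for n in range(n_taps, len(primary)):
            x = reference[n - n_taps:n][::-1]       # most recent reference (noise) samples
            noise_est = np.dot(w, x)                # noise as it appears at the boom mic
            e = primary[n] - noise_est              # residual = cleaned speech sample
            w += mu * e * x / (np.dot(x, x) + eps)  # normalized LMS weight update
            cleaned[n] = e
        return cleaned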

Research Conducted During the Phase I Project

The Phase I report provides an in-depth presentation of the design of the adaptive digital filters and the performance results from the SOS experiments for noisy speech recognition. It addresses hardware design and software interface conditions for using multiple PC microphones and low-cost sound cards. The research compares several available PC speech recognition software products used to test the filtered speech signals and presents an analysis of the test results. In addition, SOS researched three physiological and auditory feature models of hearing for Phase II incorporation within the SPSR tool kit and other commercial speech recognition systems. The report describes the operation of the SPSR tool kit and the modifications necessary to use auditory feature data for noisy speech recognition.

During the Phase I design, SOS constructed a number of adaptive filter experiments to prove and demonstrate the feasibility of this noise-canceling technology. The following table classifies the six prototype filters by linear or nonlinear computation (L/NL), time-domain or frequency-domain processing (TD/FD), the number of microphone inputs (MIC), and the speech signal reconstruction method.

NUM  NAME                                L/NL  TD/FD  MIC  RECONSTRUCTION
1    Linear Adaptive Filter Bank         L     FD     2    Triangular IFFT, Coherent
2    Magnitude FFT                       NL    FD     2    Original Phase + Mag
3    Log Magnitude FFT                   NL    FD     2    Original Phase + Mag
4    LMS ALE                             L     TD     1    None, Time-Shifted Output
5    Mag FFT with Iterative Recon        NL    FD     2    Phase Iteration
6    Iterative Recon with Spectral Sub   NL    FD     1    Noise Est. by Scale Function

The Phase I report discusses each of these prototype adaptive filters in detail. Filter 1 is a linear adaptive filter that adjusts FIR coefficients per frequency bin over overlapping signal data blocks. In general, the signal filtered with Filter 1 performs better for speech recognition programs than it sounds to the human ear. Filter 2 is a magnitude FFT experiment that accepts two sound signal inputs and produces a filtered speech signal as output. Filter 3 is a log magnitude FFT version of Filter 1 that scales the noise by the logarithm of the magnitude to approximate the response characteristic of the human ear. In general, both of these filters sound clear but are not easily classified by the speech recognizer programs. Filter 4 is a least mean square adaptive line enhancer filter that uses N speech samples to predict 2N samples ahead. It is a linear time-domain computation that separates voiced speech from noise. The filter is limited to voiced speech signals and performs poorly with the speech recognition programs. Filter 5 is a magnitude FFT adaptive filter that scales by the noise magnitude and uses iterative reconstruction for overlapping dual signal stream data blocks. Filter 6 uses the Filter 5 computation to remove the estimated noise from a single signal input stream. In general, speech sounds reconstructed by Filters 5 and 6 are perceived more clearly by the human ear and are easily classified by the speech recognition algorithms.
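
As a simplified illustration of the magnitude-domain filtering behind Filters 5 and 6, the sketch below performs single-channel spectral subtraction: a noise magnitude spectrum is estimated from frames assumed to contain no speech, a scaled version is subtracted from each frame's magnitude FFT, and the original phase is retained for reconstruction. The iterative phase reconstruction actually used in Filters 5 and 6 is not reproduced here; the frame sizes, over-subtraction factor, and spectral floor are illustrative assumptions.

    import numpy as np

    def spectral_subtract(noisy, frame=512, hop=256, noise_frames=10, alpha=2.0, floor=0.05):
        # Assumes the first noise_frames frames of `noisy` contain noise only.
        win = np.hanning(frame)
        noise_mag = np.mean(
            [np.abs(np.fft.rfft(noisy[i * hop:i * hop + frame] * win))
             for i in range(noise_frames)], axis=0)
        out = np.zeros(len(noisy))
        for start in range(0, len(noisy) - frame, hop):
            spec = np.fft.rfft(noisy[start:start + frame] * win)
            mag = np.abs(spec) - alpha * noise_mag           # subtract scaled noise estimate
            mag = np.maximum(mag, floor * np.abs(spec))      # spectral floor limits musical noise
            clean = np.fft.irfft(mag * np.exp(1j * np.angle(spec)), frame)
            out[start:start + frame] += clean * win          # overlap-add reconstruction
        return out
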
SOS investigated three separate auditory models for application to noisy speech recognition. Each uses a physiological model of hearing to enhance speech recognition features, but each has a different computational basis. These contrast with the statistical methods that use signal processing to transform speech into features for training and testing a pattern-based recognition system. The first method is the auditory physiology simulation (APS), developed by SOS; this model uses continuous differential equations to represent the coupled components of the hearing process. The second is a published ensemble interval histogram (EIH) model of the cochlea and hair cell transduction that creates numerical features. The third is the auditory image model (AIM), developed by Roy Patterson and others at Cambridge University.
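
The sketch below is not an implementation of APS, EIH, or AIM; it only illustrates, under simplifying assumptions, the general idea of deriving recognition features from cochlea-like bandpass channels: a bank of gammatone-shaped filters with ERB-scaled bandwidths, followed by log channel energies per frame. All function names and parameter values are illustrative.

    import numpy as np

    def erb(fc):
        # Equivalent rectangular bandwidth approximation for center frequency fc (Hz).
        return 24.7 * (4.37 * fc / 1000.0 + 1.0)

    def gammatone_ir(fc, fs, duration=0.032, order=4):
        # Impulse response of a single gammatone channel centered at fc.
        t = np.arange(int(duration * fs)) / fs
        b = 1.019 * erb(fc)
        return t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)

    def auditory_features(speech, fs, n_channels=20, frame=400, hop=160):
        # Log channel energies per frame: a crude stand-in for auditory feature vectors.
        centers = np.geomspace(100.0, 0.45 * fs, n_channels)
        feats = []
        for fc in centers:
            y = np.convolve(speech, gammatone_ir(fc, fs), mode="same")
            feats.append([np.log(np.sum(y[i:i + frame] ** 2) + 1e-10)
                          for i in range(0, len(y) - frame, hop)])
        return np.array(feats).T      # shape: (n_frames, n_channels)
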

In each case, SOS has developed or acquired a computer program used to analyze numerical feature data in phonetic speech recognition. A major task in Phase II will be to program and analyze these models to create phonetic speech feature data. This data will be used to train speech recognition systems in the presence of noise. If replacing the existing feature models improves speech recognition accuracy, then auditory-based features will be utilized in the adaptive noise filter design. The auditory-based features research is low risk and will be implemented only if it enhances speech recognition accuracy; therefore, auditory-based features were not prototyped for the Phase I proof of concept demonstration.

[Figure: Adaptive Filter Performance Gain. Word recognition improvement for Filters 1 through 6 (Adaptive Filter Models) at noise levels Noise 1 through Noise 4; the largest gains exceed 300% at the highest noise level.]

Results of the Phase I Research

The goal of Phase I is to create a baseline design for the Phase II noise-canceling filter prototype development. Each of the six prototypes was tested with the same set of speech and noise signal files to produce a sound input file for speech recognition, as shown in the figure above. Four speech recognizers were selected for testing against each of the sound files to determine the percentage of words correctly recognized. The figure shows the performance gain in word recognition using the SRI Nuance speech recognition system for each filter and each noise level.
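
The report does not state its exact scoring procedure; as one common way to compute a words-recognized percentage, the sketch below aligns a reference transcript against a recognizer's output with a word-level edit distance and reports accuracy as one minus the word error rate. The function name and scoring details are illustrative assumptions.

    import numpy as np

    def word_recognition_rate(reference, hypothesis):
        # Percentage of words recognized, computed as 100 * (1 - word error rate).
        ref, hyp = reference.split(), hypothesis.split()
        d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
        d[:, 0] = np.arange(len(ref) + 1)
        d[0, :] = np.arange(len(hyp) + 1)
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i, j] = min(d[i - 1, j] + 1,          # deletion
                              d[i, j - 1] + 1,          # insertion
                              d[i - 1, j - 1] + cost)   # substitution or match
        errors = d[len(ref), len(hyp)]
        return 100.0 * max(1.0 - errors / len(ref), 0.0)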

Other systems and other data were also tested. Based on the speech recognition results and on subjective listening to the sound files, a baseline design was created for a Phase II adaptive noise-canceling filter to enhance speech recognition performance. The implementation of the adaptive filter in Phase II will be based on the Phase I prototypes and test results. Two configurations are planned for delivery during year one of the program: first, an alpha configuration that executes on a desktop computer to provide test and evaluation data; second, a beta version implemented for real-time execution in an operational environment for test and evaluation during year two of the project.

[Figure: Phase II adaptive filter design components. Single or dual microphone signal inputs (synchronization of speech signals, transfer function between microphones); speech activity detection (time-domain processing, energy and zero crossings, voiced and unvoiced classification, frequency-domain processing, first-order LPC computation); voiced period processing (linear algorithm using Filter 1 or 4, experimental selection and tuning); unvoiced period processing (nonlinear algorithm using Filter 5 or 6, experimental selection and tuning); speech signal reconstruction (combine three sound periods, smooth reconstruction parameters); direct input feature computation (time- and frequency-domain data, SPSR feature computation interface).]

The design combines the results of the Phase I proof of concept filter experiments into six processing components. The first two are the speech activity detector and a voiced/unvoiced state classifier. The third is a linear filter for voiced period speech processing. The fourth is a nonlinear filter for unvoiced period speech processing. In the fifth, a signal reconstruction stage recombines the three sound segments (silence, voiced, and unvoiced) into a low noise speech signal. Alternatively, in the sixth, the sound period data can be accessed directly to compute speech recognition feature data. This can be accomplished by direct access to the recognition process, as in the SOS SPSR tool kit, or through specialized programming for systems such as HTK or Sphinx II. Most commercial speech recognition programs do not allow access to this level of training detail. The overall filter design is similar to the front-end processing used in low-bandwidth sound compression systems. The combination of these six component stages will result in the development of an innovative and unique noise cancellation system targeted specifically to speech recognition enhancement.
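
A minimal sketch of the first two components, under assumed frame sizes and thresholds: frame-based speech activity detection and voiced/unvoiced classification from short-time energy, zero-crossing rate, and a first-order LPC (lag-1 autocorrelation) measure. The threshold values are illustrative, not values from the report.

    import numpy as np

    def classify_frames(signal, frame=320, hop=160, energy_thresh=1e-4, zcr_thresh=0.25):
        # Label each frame as silence, voiced, or unvoiced.
        labels = []
        for start in range(0, len(signal) - frame, hop):
            x = signal[start:start + frame]
            energy = np.mean(x ** 2)                              # short-time energy
            zcr = np.mean(np.abs(np.diff(np.sign(x)))) / 2.0      # zero-crossing rate
            a1 = np.dot(x[:-1], x[1:]) / (np.dot(x, x) + 1e-12)   # first-order LPC / lag-1 autocorrelation
            if energy < energy_thresh:
                labels.append("silence")
            elif zcr < zcr_thresh and a1 > 0.5:                   # low ZCR and strong correlation: voiced
                labels.append("voiced")
            else:
                labels.append("unvoiced")
        return labels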

Applications of this Research

Commercial speech recognition applications have existed for over twenty years, while the underlying computer research dates back more than forty years. Industries as numerous and diverse as banking, dictation, inspection, mail, and medical data entry use speech recognition, as shown in Figure 4-1. The most prevalent noisy environments are vehicles such as airplanes, ships, autos, and tanks, and specialized industrial equipment. These environments usually require a hands-free and eyes-free speech recognition application so the operator can enter data. One example would be a police officer speaking license numbers into the digital radio system of the national crime database while driving in traffic.

Figure 4-1 Applications for Phonetic Speech Recognition with Noise

APPLICATION                                    REQUIREMENT              NOISE
Critical Communication (911, ATC, Police)      Intuitive operation      Line, caller environment
Routing (touch tone menus, telephones)         Speaker independent      Bandwidth limited
Control (wheelchairs, VCRs, PCs)               Reliable and accurate    Outdoor, commercial
Dictation (letters, reports, records, forms)   Large vocabulary         Office, multiple speakers
Translation (languages, understanding)         Multilingual             Office environment
Training (simulations, vocal procedures)       Complex interaction      Vehicles, machinery

Phonetic speech recognition systems, such as the SPSR Tool Kit, can be commercially applied in each of these applications. Robust speaker-independent recognition is already replacing the frustrating touch tone voice menus in many telephone answering systems. As speech recognition becomes more widespread, specialized digital speech microphones will become a standard input tool alongside the mouse and keyboard. Operation in noisy environments with multiple speakers and multiple languages will be taken for granted. Figure 4-2 lists the requirements identified for an advanced audio interface for noisy-environment speech recognition and spoken language translation applications, together with the SOS approach to each.

Figure 4-2 Advanced Audio Interface for Noisy Environments Requirements

REQUIREMENT                 SOS APPROACH
Noisy Environments          Tested up to 90 dB live noise and speech inputs
Noise Cancellation          Adaptive digital filtering with two independent inputs
Multiple Sound Inputs       Dual microphones using two low-cost PC sound cards
Speech Recognition          Utterance processing from silence period to silence period
Noise Resistant Features    Auditory-based feature models and computations for training
Speaker Independent         Phonetic based, with multilingual dialects and accents
Multilingual Capability     Phonetic alphabet covers over 350 spoken languages
Spoken Translation          Automatic translator generation for specific domains
Interactive Visual Tools    Grammar definition, phonetic word entry, noise models
Speaker Adaptation          Session-to-session file to improve recognition performance
Speaker Enrollment          Simple enrollment process for speaker-dependent recognition