Speech and Sound Use in a Remote Monitoring System for Health Care

Similar documents
Noise-Robust Speech Recognition Technologies in Mobile Environments

CONSTRUCTING TELEPHONE ACOUSTIC MODELS FROM A HIGH-QUALITY SPEECH CORPUS

A Sound Database For Health Smart Home

A HMM recognition of consonant-vowel syllables from lip contours: the Cued Speech case

Automatic Voice Activity Detection in Different Speech Applications

Exploiting visual information for NAM recognition

Single-Channel Sound Source Localization Based on Discrimination of Acoustic Transfer Functions

SPEECH TO TEXT CONVERTER USING GAUSSIAN MIXTURE MODEL(GMM)

Telephone Based Automatic Voice Pathology Assessment.

Speech as HCI. HCI Lecture 11. Human Communication by Speech. Speech as HCI(cont. 2) Guest lecture: Speech Interfaces

Robust Speech Detection for Noisy Environments

Recognition & Organization of Speech & Audio

General Soundtrack Analysis

Effects of speaker's and listener's environments on speech intelligibili annoyance. Author(s)Kubo, Rieko; Morikawa, Daisuke; Akag

Advanced Audio Interface for Phonetic Speech. Recognition in a High Noise Environment

The MIT Mobile Device Speaker Verification Corpus: Data Collection and Preliminary Experiments

Noise-Robust Speech Recognition in a Car Environment Based on the Acoustic Features of Car Interior Noise

Recognition & Organization of Speech and Audio

Amplitude Compression: Timing is Everything

Acoustic Signal Processing Based on Deep Neural Networks

Performance of Gaussian Mixture Models as a Classifier for Pathological Voice

Broadband Wireless Access and Applications Center (BWAC) CUA Site Planning Workshop

Research Article Automatic Speaker Recognition for Mobile Forensic Applications

Price list Tools for Research & Development

A Multi-Sensor Speech Database with Applications towards Robust Speech Processing in Hostile Environments

A Smart Texting System For Android Mobile Users

Providing Effective Communication Access

Smart Gloves for Hand Gesture Recognition and Translation into Text and Audio

VIRTUAL ASSISTANT FOR DEAF AND DUMB

Adaptation of Classification Model for Improving Speech Intelligibility in Noise

Lip shape and hand position fusion for automatic vowel recognition in Cued Speech for French

Speech Enhancement Based on Spectral Subtraction Involving Magnitude and Phase Components

Combination of Bone-Conducted Speech with Air-Conducted Speech Changing Cut-Off Frequency

Speech Intelligibility Measurements in Auditorium

Development of Audio Sensing Technology for Ambient Assisted Living: Applications and Challenges

April 23, Roger Dynamic SoundField & Roger Focus. Can You Decipher This? The challenges of understanding

Dutch Multimodal Corpus for Speech Recognition

The SRI System for the NIST OpenSAD 2015 Speech Activity Detection Evaluation

Lecture 9: Speech Recognition: Front Ends

Ambiguity in the recognition of phonetic vowels when using a bone conduction microphone

Speech Enhancement Based on Deep Neural Networks

Using the Soundtrack to Classify Videos

Gender Based Emotion Recognition using Speech Signals: A Review

GENERALIZATION OF SUPERVISED LEARNING FOR BINARY MASK ESTIMATION

Sound, Mixtures, and Learning

Assistive Listening Technology: in the workplace and on campus

Modulation and Top-Down Processing in Audition

Tandem modeling investigations

CHAPTER 1 INTRODUCTION

Study about Software used in Sign Language Recognition

Magazzini del Cotone Conference Center, GENOA - ITALY

Review of SPRACH/Thisl meetings Cambridge UK, 1998sep03/04

Voice Detection using Speech Energy Maximization and Silence Feature Normalization

Categorical Perception

Assistive Technologies

Using Source Models in Speech Separation

EEL 6586, Project - Hearing Aids algorithms

VITHEA: On-line word naming therapy in Portuguese for aphasic patients exploiting automatic speech recognition

INTELLIGENT LIP READING SYSTEM FOR HEARING AND VOCAL IMPAIRMENT

The Institution of Engineers Australia Communications Conference Sydney, October

Priya Rani 1, A N Cheeran 2, Vaibhav D Awandekar 3 and Rameshwari S Mane 4

Masking release and the contribution of obstruent consonants on speech recognition in noise by cochlear implant users

Audio-Visual Integration in Multimodal Communication

Computational Perception /785. Auditory Scene Analysis

HCS 7367 Speech Perception

Comparative Analysis of Vocal Characteristics in Speakers with Depression and High-Risk Suicide

Sound Interfaces Engineering Interaction Technologies. Prof. Stefanie Mueller HCI Engineering Group

ERO SCAN. Otoacoustic Emission Testing

Effects of Aircraft Noise on Student Learning

Chapter 7. Communication Elements and Features

Note: This document describes normal operational functionality. It does not include maintenance and troubleshooting procedures.

HISTOGRAM EQUALIZATION BASED FRONT-END PROCESSING FOR NOISY SPEECH RECOGNITION

Perceptual Effects of Nasal Cue Modification

KALAKA-3: a database for the recognition of spoken European languages on YouTube audios

TESTS OF ROBUSTNESS OF GMM SPEAKER VERIFICATION IN VoIP TELEPHONY

Speech recognition in noisy environments: A survey

Using Speech Models for Separation

GAUSSIAN MIXTURE MODEL BASED SPEAKER RECOGNITION USING AIR AND THROAT MICROPHONE

Noise reduction in modern hearing aids long-term average gain measurements using speech

Recognition & Organization of Speech and Audio

Detection and Classification of Acoustic Events for In-Home Care

Networx Enterprise Proposal for Internet Protocol (IP)-Based Services. Supporting Features. Remarks and explanations. Criteria

Research Proposal on Emotion Recognition

International Journal of Scientific & Engineering Research, Volume 5, Issue 3, March ISSN

HCS 7367 Speech Perception

Presenter. Community Outreach Specialist Center for Sight & Hearing

LIE DETECTION SYSTEM USING INPUT VOICE SIGNAL K.Meena 1, K.Veena 2 (Corresponding Author: K.Veena) 1 Associate Professor, 2 Research Scholar,

Human-Robotic Agent Speech Interaction

Section Web-based Internet information and applications VoIP Transport Service (VoIPTS) Detail Voluntary Product Accessibility Template

Speaker Independent Isolated Word Speech to Text Conversion Using Auto Spectral Subtraction for Punjabi Language

FIR filter bank design for Audiogram Matching

Avaya IP Office 10.1 Telecommunication Functions

Networx Enterprise Proposal for Internet Protocol (IP)-Based Services. Supporting Features. Remarks and explanations. Criteria

I. INTRODUCTION. OMBARD EFFECT (LE), named after the French otorhino-laryngologist

Microphone Input LED Display T-shirt

Accessible Computing Research for Users who are Deaf and Hard of Hearing (DHH)

International Forensic Science & Forensic Medicine Conference Naif Arab University for Security Sciences Riyadh Saudi Arabia

Thrive Hearing Control Application

Effects of Cochlear Hearing Loss on the Benefits of Ideal Binary Masking

Smart Speaking Gloves for Speechless

Transcription:

Speech and Sound Use in a Remote System for Health Care M. Vacher J.-F. Serignat S. Chaillol D. Istrate V. Popescu CLIPS-IMAG, Team GEOD Joseph Fourier University of Grenoble - CNRS (France) Text, Speech and Dialogue 2006 11 th to 15 th of September Page 1/28 TSD 2006 CLIPS-IMAG

Outline Page 2/28 TSD 2006 CLIPS-IMAG

Outline Page 3/28 TSD 2006 CLIPS-IMAG

Medical Remote : various sensors Position Medical Sound infrared tensiometer microphones contact heart rate LOCAL UNDER MONITORING DATA FUSION SYSTEM SOUND MEDICAL ANALYSIS SYSTEM SENSORS EMERGENCY CALL Medical Center Extracted informations through Sound : patient s activity: door lock, phone, "It s hot!"... patient s physiology: cough... scream, glass breaking, distress situation: "Help me!"... Page 4/28 TSD 2006 CLIPS-IMAG

Outline Page 5/28 TSD 2006 CLIPS-IMAG

Speech Corpus The "Normal-Distress" corpus : 21 speakers 11 men, 10 women 20 years age 65 years French 64 normal situation sentences "Bonjour" (Good morning) "Où est le sel?" (Where is the salt) 64 distress sentences "Au secours!" (Help me) "Un médecin vite!" (Doctor quickly) 2,646 audio speech files duration: 38 mn sampling rate: 16 khz Page 6/28 TSD 2006 CLIPS-IMAG

Sound Corpus 7 sound classes: normal sounds: door clapping, phone ringing, dishes... abnormal sounds: breaking glasses, falls, screams 1,577 audio sound files duration: 20 mn sampling rate: 16 khz Class of sound % of the corpus Average of (duration) (number) duration Dishes 6% 10.4% 402 ms Door lock 1% 12.7% 36 ms Door slap 33% 33.2% 737 ms Glass breaking 6% 5.6% 861 ms Ringing phone 40% 32.8% 928 ms Scream 12% 4.6% 1,930 ms Step sound 2% 0.8% 2,257 ms Page 7/28 TSD 2006 CLIPS-IMAG

Noisy Corpus Signal to noise ratio (SNR) : 0, +10, +20, +40dB Experimental HIS Noise : non stationary recorded inside experimental apartment Page 8/28 TSD 2006 CLIPS-IMAG

Outline Page 9/28 TSD 2006 CLIPS-IMAG

Global System (1/2) HALL EMITTER SENNHEISER ew500 Micro. RECEIVER Sound/Speech System Data Acquisition 8 Channels KITCHEN BATH ROOM Micro+Em Micro+Em RECEIVER RECEIVER XML Speech Recognition System Micro+Em RECEIVER Key Word List BED ROOM Sound List Data Acquisition Card: 200 ksamples/s 8 differential channels Sampling rate: 16 ksamples/s for each channel Page 10/28 TSD 2006 CLIPS-IMAG

Global System (2/2) 1st application 8 inputs SOUND/SPEECH SYSTEM Acquisition/Detection Event Detection 5 to 8 chanels Sound 3a Classification if sound Speech/Sound Segmentation 2 Keyword Recognition Message Formatting 4 HYP 1 Acquisition Driving 8 chanels 16 khz WAV if speech 2nd application SPEECH RECOGNITION SYSTEM 3b () XML Output Page 11/28 TSD 2006 CLIPS-IMAG

Outline Page 12/28 TSD 2006 CLIPS-IMAG

Sound System 1st application: SOUND/SPEECH SYSTEM 8 inputs Sequencer Timer() Acquisition/Detection Event Detection 5 to 8 chanels Sound 3a Classification if sound Speech/Sound Segmentation 2 Keyword Recognition Message Formatting 4 HYP 1 Acquisition Driving 8 chanels 16 khz WAV if speech 2nd application SPEECH RECOGNITION SYSTEM 3b () XML Output Page 13/28 TSD 2006 CLIPS-IMAG

Timer and Signal Acquisition Page 14/28 TSD 2006 CLIPS-IMAG

Detection + Speech/Sound Segmentation Detection [1]: Wavelet Tree Detection Equal Error Rate: (ROC curves) EER = 6.5% if SNR = 0dB, 0% if SNR +10dB Segmentation frame 16 ms - overlap 8 ms GMM, 24 Gaussian models 16 LFCC coupled with energy Error Segmentation Rate "cross validation protocol": SNR ESR gobal Speech Sound +40dB 4.1% 0.8% 9.5% +20dB 3.9% 0.8% 9.1% +10dB 3.9% 1.5% 7.9% 0dB 14.5% 21.0% 3.6% [1] M. Vacher et al., Sound Detection and Classification through Transient Models using Wavelet Coefficient Trees, EUSIPCO, 20 th September 2004. Page 15/28 TSD 2006 CLIPS-IMAG

Sound Classification for Medical Remote (1/2) a more complete sound corpus, a new class: object falls adapted to HMM training and classification: one sound per file 1,315 files, 25 mn Class of sound % of the corpus Average of (duration) (number) duration Dishes 5.3% 12.4% 380 ms Door lock 27.2% 12.5% 3,150 ms Door slap 22% 26.2% 950 ms Glass breaking 12.3% 8.2% 1,690 ms Object falls 5.4% 5.5% 1,150 ms Ringing phone 11.7% 19.4% 700 ms Scream 13% 7.2% 2,110 ms Step sound 3.7% 8.4% 450 ms Page 16/28 TSD 2006 CLIPS-IMAG

Sound Classification for Medical Remote (2/2) analysis window 16 ms - overlap 8 ms 12 Gaussian models 16 LFCC with and GMM or 3 state HMM "cross validation protocol" Error classification rate: SNR GMM HMM +50dB 3.2% 2% +40dB 10.2% 9% +20dB 16.5% 10.8% +10dB 12.6% 15.4% 0dB 23.6% 30% Page 17/28 TSD 2006 CLIPS-IMAG

XML Output Speech: analysis is initiated <appli:segmentation description="appli audio"> Module 2: segmentation </pièce>cuisine</pièce> <horodate>1 12 2005 à 15:19:20</horodate> <résultat>parole</résultat> <information>probabilité de son= 20.2018, Probabilité de parole= 17.2258</information> </appli:segmentation> <appli:reconnaissance description="appli audio"> </pièce>cuisine</pièce> <horodate>1 12 2005 à 15:19:20</horodate> <résultat>un docteur vite</résultat> </appli:reconnaissance> Module 3b: sentence recognized by Page 18/28 TSD 2006 CLIPS-IMAG

Acquiring Front Panel Page 19/28 TSD 2006 CLIPS-IMAG

(1/3) 1st application: SOUND/SPEECH SYSTEM 8 inputs Sequencer Timer() Acquisition/Detection Event Detection 5 to 8 chanels Sound 3a Classification if sound Speech/Sound Segmentation 2 Keyword Recognition Message Formatting 4 HYP 1 Acquisition Driving 8 chanels 16 khz WAV if speech 2nd application SPEECH RECOGNITION SYSTEM 3b () XML Output Page 20/28 TSD 2006 CLIPS-IMAG

00010110 11010110 01011010 00010110 01000110 00010110 11110110 01010111 Speech and Sound (2/3) Recognized Sentences "Il fait chaud" Speech Signal Auditory Front End Acoustical Module Phoneme matching Word Matching _ Sentence Matching Feature Extraction 16 ms 50% 16 MFCC + E + + Lexicon (HMM) Dictionary Language Model (3 gram) a1 a2 a3 b1 b2 b3 {chaud} {{S WB} {o WB}} {chaude} {{S WB} o d {& WB}} {chaudes} {{S WB} o {d WB}} Il fait chaud 0.002 Il fait beau 0.00123 Il fait froid 0.00076 Acoustic models of phoneme {chauffage} {{S WB} oo f aa {Z WB}} Phonetic models of words Page 21/28 TSD 2006 CLIPS-IMAG

(3/3) Acoustic models: training with large corpora Braf80, Braf100, Braf120 large number of French speakers more than 200 Dictionary: 11,000 words (French) (Speech Assessment Methods Phonetic Alphabet) Language model: n - gram, n {1, 2, 3} extraction from the WEB and "Le Monde" corpora optimization for distress sentences Page 22/28 TSD 2006 CLIPS-IMAG

Recognition Results all the sentences of our corpus 5 speakers Corpus Recognition Error Normal Sentences False Alarm Sentence 6% "Quel temps fait-il dehors" / "Quel temps fait-il de mort" Distress Sentences Missed Alarm Sentence 16% "Faîtes vite" / "Equilibre" "A moi" / "Un grand" Page 23/28 TSD 2006 CLIPS-IMAG

Sound classification for distress detection. Speech recognition allowing call for help. Real-Time operation on an operating system without real-time capacity. Speaker independence of the ASR system. For 10dB and upper: errorless detection and segmentation error below 5%. Outlook Improvement of sound classification through HMM Development of a complete acoustical analysis system for life-sized tests. Page 24/28 TSD 2006 CLIPS-IMAG

Detection and Speech/Sound Segmentation in a Smart Room Environment Thank you for your attention. Page 25/28 TSD 2006 CLIPS-IMAG

Appendix For Further Reading Other For Further Reading I D. Istrate et al. Information Extraction From Sound for Medical Telemonitoring, IEEE Trans. on Information Tech. in Biomedicine Vol. 10, issue 2, pp. 264-274, 2006. M. Vacher, D. Istrate. Sound detection and classification for medical telesurveillance, BIOMED 2004 Proceedings, ACTA PRESS, pp. 395 398, 2004. D. Istrate, M. Vacher. Détection et classification des sons : application aux sons de la vie courante et à la parole, 20 th GRETSI Proceedings, Vol. 1, pp. 485-488, 2005. Page 26/28 TSD 2006 CLIPS-IMAG

Acknowledgement I Appendix For Further Reading Other This study is a part of the DESDHIS-ACI "Technologies pour la Santé" project of the French Research Ministry. This project is a collaboration between the CLIPS laboratory, in charge of the sound analysis, and the TIMC laboratory, in charge of the medical sensor analysis and data fusion. CLIPS and TIMC are 2 laboratories of the IMAG institute. IMAG has funded the implementation of the habitat used for our studies through the RESIDE-HIS project. Page 27/28 TSD 2006 CLIPS-IMAG

Appendix For Further Reading Other SAMPA Assessment Methods Phonetic Alphabet (SAMPA) is a computer-readable phonetic script using 7-bit printable ASCII characters, based on the International Phonetic Alphabet (IPA). It was originally developed in the late 1980s for six European languages by the EEC ESPRIT information technology research and development program. As many symbols as possible have been taken over from the IPA; where this is not possible, other signs that are available are used, e.g. [@] for schwa (IPA [9]), [2] for the vowel sound found in French deux (IPA ø), and [9] for the vowel sound found in French neuf (IPA œ). [Wells, J.C.], 1997. SAMPA computer readable phonetic alphabet. In Gibbon, D., Moore, R. and Winski, R. (eds.), 1997. Handbook of Standards and Resources for Spoken Language Systems. Berlin and New York: Mouton de Gruyter. Part IV, section B. Page 28/28 TSD 2006 CLIPS-IMAG