Dutch Multimodal Corpus for Speech Recognition

Dutch Multimodal Corpus for Speech Recognition
A.G. Chiţu and L.J.M. Rothkrantz
E-mails: {A.G.Chitu,L.J.M.Rothkrantz}@tudelft.nl
Website: http://mmi.tudelft.nl

Outline:
- Background and goal of the paper
- How is the data recorded?
- What type of data is recorded?
- Conclusions

What is the background and the goal of the paper?
- ICIS: crisis management. Supported by the Dutch Ministry of Economic Affairs, grant nr. BSIK03024.
- Facilitating communication among actors.
- Visually enhanced speech recognition: people do it all the time (the McGurk effect, Nature, 1976).
- There are not sufficient data resources.

What is the background and the goal of the paper?
- The lipreading process is insufficiently understood.
- Start a discussion about a set of guiding rules for building the audio-visual databases used in multimodal speech-recognition research.
- Outline a first set of directions and point out aspects that could, or need to, be improved.

How is the data recorded? Issues with the current audio-visual datasets:
- Small number of respondents: unbalanced w.r.t. gender, age, race, dialect, etc.
- Poor language coverage: limited coverage of phonemes/visemes.
- Poor quality of the recordings, especially video: controlled or uncontrolled environment?
- Not (freely) available.

How is the data recorded? Video quality:
- Difficult to handle; stronger requirements on the recording device.
- Grayscale/color.
- Resolution: 80x60 to 720x576.
- Frame rate: 24 fps to 29.97 fps.
- Illumination, background.
- Region of interest (ROI)?

How is the data recorded? Video quality: poor coverage of the visemes; fast speech rate vs. slow speech rate.
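Why a standard frame rate covers fast speech poorly can be made concrete with a quick calculation (a rough sketch; the phoneme rates below are illustrative assumptions, not figures from the slides):

```python
# Frames captured per phoneme at different frame rates and speech rates.
# The phoneme rates (8/s slow, 15/s fast) are rough illustrative assumptions.
for label, phonemes_per_s in (("slow speech", 8), ("fast speech", 15)):
    for fps in (25, 100):
        frames = fps / phonemes_per_s
        print(f"{label} at {fps} fps: ~{frames:.1f} frames per phoneme")
```

At 25 fps, fast speech leaves fewer than 2 frames per phoneme, so many visemes are barely sampled; a 100 Hz camera gives several frames per phoneme even at a fast rate.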

How is the data recorded?
Video device: Allied Vision Technology Pike F032C, a pure computer-vision camera (max. 206 Hz in b/w). We recorded color images at 100 Hz, with 4:2:2 chroma subsampling and a resolution of ½ PAL (384x288). Controlled environment, chroma keying, ROI at face level.
Audio device: high-fidelity NT2-A studio condenser microphones. Stereo signal at 48 kHz with 16-bit samples. Controlled environment.
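These settings imply a substantial raw data rate; as a back-of-the-envelope sketch (assuming 8 bits per sample and no container overhead, which the slides do not state):

```python
# Back-of-the-envelope raw data rate for the recording setup described above.
# Assumptions (not from the slides): 8 bits per sample, no container overhead.
width, height, fps = 384, 288, 100          # 1/2 PAL resolution at 100 Hz

# 4:2:2 chroma subsampling stores, per 2 pixels, 2 luma + 2 chroma samples,
# i.e. 2 bytes per pixel on average at 8 bits per sample.
bytes_per_pixel = 2
video_rate = width * height * bytes_per_pixel * fps   # bytes per second

# Stereo audio at 48 kHz, 16-bit (2-byte) samples, 2 channels.
audio_rate = 48_000 * 2 * 2                           # bytes per second

print(f"video: {video_rate / 1e6:.1f} MB/s")   # 22.1 MB/s
print(f"audio: {audio_rate / 1e6:.3f} MB/s")   # 0.192 MB/s
```

Roughly 22 MB/s of raw video per speaker, which explains why such recordings are "difficult to handle" compared with standard-rate corpora.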

How is the data recorded? Video quality: dual view. Solutions:
1. Two cameras (synchronization required)
2. One camera + mirror

How is the data recorded? Prompter; speaker control of the devices.

What type of data is recorded? Language content for lipreading:
- Start from the DUTAVSC database, based on POLYPHONE for Dutch, plus newspaper clips and a banking application.
- Enriched by adding: short interview questions, "klinkers" (Dutch vowel items), many new connected-digit series, many new random words.
- Recorded in 4 styles: normal rate, fast rate, whispering, free-answer interview.
- Enrich the language coverage!

What type of data is recorded? Collected data: statistics
- 1966 unique words
- 427 phonetically rich sentences
- 91 context-aware sentences (banking application)
- 72 different "klinkers"
- 41 simple open-answer questions
- 64 items per speaker
- 16 categories w.r.t. content and speech style
- ~45 min per speaker

What type of data is recorded? Collected data: statistics
- Native Dutch speakers; we hope for more than 100 speakers (we already have 60).
- More than 5000 utterances.
Collected data: demographics
- Age, gender, education, address (region).
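Combining the per-speaker figures with the speaker counts gives a feel for the overall corpus size (a simple extrapolation from the slide's numbers, not an official figure):

```python
# Extrapolating corpus size from the per-speaker statistics above.
items_per_speaker = 64      # items per speaker (16 content/style categories)
minutes_per_speaker = 45    # approximate recording time per speaker

for speakers in (60, 100):  # 60 already recorded, more than 100 hoped for
    items = speakers * items_per_speaker
    hours = speakers * minutes_per_speaker / 60
    print(f"{speakers} speakers: ~{items} items, ~{hours:.0f} h of recordings")
```

With 100 speakers this would come to roughly 6400 items and about 75 hours of audio-visual material, consistent with the "more than 5000 utterances" figure.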

What type of data is recorded? Emotional speech
- Acted emotion, based on reading emotional stories; a second data corpus.
- The respondents were both naïve speakers and professional actors.
- A passport view is captured to observe all facial expressions and gestures.

What type of data is recorded? Emotional speech
Motivation: the influence of the affective state on the shown visemes, and vice versa.
Problems:
- In real life there is no such thing as a clean emotion: we get an amalgam of emotions.
- Assessing the quality of the acted emotions. We plan to use questionnaires to elicit the speakers' opinions, and to weigh naïve speakers against professional actors.

What type of data is recorded? Different environmental conditions
While the recordings were performed in laboratory conditions with a reduced noise level, we want to expose the speakers to different noise conditions. It is not clear to us how the speaking style changes with the environment. It seems common sense that under heavy noise a speaker will change his speaking style to make his lips easier to read, but how, in what sense, and to what degree?

What type of data is recorded? Different environmental conditions
Hence we are considering asking speakers to record under the following conditions:
- Very little feedback possible, but silence.
- Very little feedback possible, but heavy noise.
- Half feedback and half noisy input.

Take-home facts
- Lipreading is a very active subject at the moment.
- At TU Delft a new, advanced data corpus for multimodal speech recognition will become available.
- Frontal- and side-view lipreading.
- High-speed video recording.
- Multiple recording conditions are simulated.
- Emotional speech is included; facial expressions and gestures are recorded together with speech.

Comments
Thank you very much for your attention!
A.G. Chiţu and L.J.M. Rothkrantz
E-mails: {A.G.Chitu,L.J.M.Rothkrantz}@tudelft.nl
Website: http://mmi.tudelft.nl