Research Proposal on Emotion Recognition

Colin Grubb

June 3, 2012

Abstract

In this paper I will introduce my thesis question: to what extent can emotion recognition be improved by combining audio and visual information? In addition to covering background on audio information, I will introduce new information on image processing and some work that has been done in the field. I will then discuss methodologies for combining the two sources of information and evaluating them.

Introduction

Robots and computers have already become a prominent aspect of our lives, and their presence will only continue to grow, giving way to unique technologies. However, there are numerous obstacles to overcome before robots can interact fluidly with humans on a day-to-day basis. Imagine a robot that can act as a psychiatrist: it interprets a patient's emotions and formulates an appropriate response. Reading emotions is a complicated process, but one that humans are very good at. Humans can fuse visual information (a scowl on a person's face) and audio information (loud and intense speech) in order to gauge an emotion such as anger. If robots and computers are to interact with humans effectively in scenarios such as the one suggested above, they need to be able to process both audio and visual information in order to produce a single output.

Audio Information

One of the major tasks in spoken dialogue systems is speech recognition, the act of converting spoken words into text that a system can then interpret. The speech begins as an acoustic signal, which is converted into digital form and ultimately into phonemes that the system uses to build words [4]. Like other aspects of natural language processing, speech recognition presents many difficulties. It also has many useful applications, one of which is emotion recognition. The ability to recognize a speaker's emotional state has many potential applications, and numerous projects have been undertaken in the area, both as real-world systems and as research.

The most obvious application of emotion recognition is to classify a user's emotional state; a more specific form of this research divides that state between two categories. One study of this nature was conducted as early as 1999 by researcher Valery Petrushin, in which recognizers were constructed that could label a speaker as agitated or calm. Emotion recognition is particularly important in call centers, where monitoring the caller's frustration level matters for quality of service; this system was used in an automated call center that could prioritize calls [6]. In this system, neural networks were trained on a small corpus of telephone messages, a portion of which contained angry sentences. A later study, conducted by researcher Chul Min Lee, involved collecting speech data from a call center and creating recognizers that accounted for language and discourse information in addition to acoustic information [3]. Another field of application is online emotion recognition: a system called EmoVoice has been used in numerous applications, such as Greta, a virtual agent that recognizes a user's emotion and mirrors it [9].
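As a concrete illustration of the kind of two-class recognizer described above, the sketch below trains a small neural network to separate "agitated" from "calm" utterances using a handful of utterance-level acoustic features. This is only a minimal sketch in Python (NumPy and scikit-learn), assuming 16 kHz mono signals already loaded as arrays; it is not a reconstruction of Petrushin's system, and the feature set, frame size, and network shape are placeholder choices of mine rather than values taken from [6].

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

SR = 16000     # assumed sample rate (Hz)
FRAME = 512    # ~32 ms analysis frames

def frame_pitch(frame, sr=SR, fmin=75.0, fmax=400.0):
    """Very rough fundamental-frequency estimate via autocorrelation."""
    frame = frame - frame.mean()
    if not frame.any():
        return 0.0
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), min(int(sr / fmin), len(ac) - 1)
    lag = lo + np.argmax(ac[lo:hi])
    return sr / lag

def utterance_features(signal):
    """Summarize one utterance (1-D float array, at least a few frames long)
    as [mean energy, std energy, mean pitch, std pitch, zero-crossing rate]."""
    frames = [signal[i:i + FRAME] for i in range(0, len(signal) - FRAME, FRAME)]
    energy = np.array([np.sqrt(np.mean(f ** 2)) for f in frames])
    pitch = np.array([frame_pitch(f) for f in frames])
    zcr = np.mean(np.abs(np.diff(np.sign(signal)))) / 2.0
    return np.array([energy.mean(), energy.std(), pitch.mean(), pitch.std(), zcr])

def train_recognizer(corpus):
    """`corpus` is a hypothetical list of (signal, label) pairs with labels
    "agitated" or "calm", e.g. drawn from a labeled telephone-message corpus."""
    X = np.vstack([utterance_features(sig) for sig, _ in corpus])
    y = [label for _, label in corpus]
    clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
    clf.fit(X, y)
    return clf

# Usage (with a corpus in hand):
# clf = train_recognizer(corpus)
# print(clf.predict([utterance_features(new_signal)]))
```

Real systems of this kind use richer features (formants, speaking rate, spectral measures), but the overall shape (per-utterance features fed to a trained classifier) is the same.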

Many commonalities exist between research projects and applications in emotion recognition, including the features of voice used to classify emotion, the way those features are extracted, and how the recognizers themselves are constructed. Numerous features of speech can be analyzed in order to classify emotion. Prosodic information is important for both humans and computers in identifying a particular emotional state; prosody refers to information such as pitch, loudness, and rhythm, and can convey attitude [4]. One of the most common features used to classify emotion is the pitch of a speaker's voice: a study conducted by researcher Björn Schuller in 2003 used pitch features to classify a speaker's emotion, and pitch carries a large amount of information about emotional state [8]. While prosodic information has long been central to emotion recognition, the 2005 study by Chul Min Lee introduced a method for identifying certain words as markers of particular emotions, and found that adding lexical and discourse information improved the system's ability to identify an emotional state correctly [3].

To create a speech recognizer, the system must be trained to recognize particular emotions using the chosen features. Typically, a corpus of sentences is gathered in which the sentences are pronounced with emotion. Structurally, Hidden Markov Models have been widely used in the construction of speech recognition systems [4], and neural networks have also been trained via backpropagation to recognize particular emotions. When building a system that classifies a speaker's emotional state, the simplest way to judge performance is to track how often the system identifies the correct emotion. It is important to record not only whether the recognizer identified the correct emotion, but also which emotions are misidentified more often than others. A prominent commonality across previous studies is that anger is the easiest emotion to recognize, whereas fear is the hardest for recognizers (and humans) to identify correctly [6].

Visual Information

Visual processing has two main commonalities with audio recognition: systems in both fields must extract important features from the input source in order to form a judgment about the input's emotion, and systems in both areas must undergo training in order to give appropriate outputs for a given input. As with audio recognition, a large amount of research has been conducted on image evaluation and on improving the processing, particularly for faces, and numerous databases are freely available on the internet. A study conducted at Union College by Shane Cotter used the Japanese Female Facial Expression (JAFFE) database as its input [1] [2]. The study focused on analyzing regions of the face individually, rather than the face as a whole, and then combining information from the selected regions in order to classify emotions; this method was found to be an improvement over analyzing the face as a whole.
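To make the region-by-region idea concrete, the snippet below only illustrates the decomposition step: it cuts a grayscale face image (a 2-D NumPy array) into a grid of blocks and computes simple per-block intensity statistics that a downstream classifier could consume. It is not Cotter's sparse-representation classifier, and the grid size and the choice of statistics are arbitrary placeholders of mine.

```python
import numpy as np

def region_features(face, rows=4, cols=4):
    """Split a 2-D grayscale face image into a rows x cols grid and return
    per-region (mean, std) intensity statistics as one feature vector.
    A region-based recognizer would fit a separate classifier per region and
    fuse their decisions; here the statistics are simply concatenated."""
    h, w = face.shape
    feats = []
    for r in range(rows):
        for c in range(cols):
            block = face[r * h // rows:(r + 1) * h // rows,
                         c * w // cols:(c + 1) * w // cols]
            feats.extend([block.mean(), block.std()])
    return np.array(feats)

# Toy example on a random array; a real input would be a face from JAFFE [2].
print(region_features(np.random.rand(256, 256)).shape)   # -> (32,)
```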
Some basic hands-on research has also been conducted on image processing. For my project in CSC333 - Introduction to Parallel Computing, I am writing a program that takes in a series of image files and analyzes each picture, calculating the center of mass in the X and Y directions. I obtained a freely downloadable database of faces from the University of Sheffield's Image Engineering Laboratory [5]. The files are in PGM (Portable Gray Map) format, which is designed to be easy to edit; the pixel information is contained in a 2-D array within the file [7]. For the parallel computing class, I intend to analyze this corpus of faces, counting the number of black, white, and greyscale pixels, and also to analyze the concentration of black pixels in the images (a short illustrative sketch of this kind of computation is given after the process description below). While this project is not quite on par with some of the research being done in image processing, it is an interesting beginning.

The Process

The analysis process will involve several steps. A video feed will be taken of a user speaking with an emotional undertone. The video stream will be split into two separate inputs: a sound clip of the user's speech, and one or more frames chosen from the video stream. How the particular frame or frames are chosen is yet to be determined; a selection method could be developed, or they could be chosen at random. After the two inputs have been chosen, two separate recognizer systems, one for audio recognition and one for visual recognition, will be applied to the inputs to extract important features and produce an output. The EmoVoice framework will be used for audio recognition; the visual processing software and algorithm have yet to be selected. Another possibility to consider is combining the two systems in some way, so that instead of producing two separate outputs and comparing them, they would produce a single output. This possibility is only speculation at this point.

Figure 1.1: The process to analyze emotional state (face images from [2]).
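As referenced in the parallel-computing discussion above, the following is a minimal sketch of the kind of bookkeeping that class project involves: it reads a plain-text (P2) PGM file into a 2-D array, counts black, white, and intermediate grey pixels, and computes an intensity-weighted center of mass in the X and Y directions. It assumes the ASCII P2 variant of PGM with "#" comments; binary (P5) files, which the Sheffield database may use, would need a different reader or a format conversion, and the filename in the usage comment is hypothetical.

```python
import numpy as np

def read_plain_pgm(path):
    """Read an ASCII (P2) PGM file into a 2-D uint16 array plus its maxval."""
    with open(path) as f:
        tokens = []
        for line in f:
            tokens.extend(line.split("#", 1)[0].split())   # strip comments
    if tokens[0] != "P2":
        raise ValueError("expected a plain (P2) PGM file")
    width, height, maxval = int(tokens[1]), int(tokens[2]), int(tokens[3])
    pixels = np.array(tokens[4:4 + width * height], dtype=np.uint16)
    return pixels.reshape(height, width), maxval

def summarize(image, maxval):
    """Count black/white/grey pixels and compute the intensity-weighted
    center of mass in the X and Y directions."""
    black = int((image == 0).sum())
    white = int((image == maxval).sum())
    grey = image.size - black - white
    ys, xs = np.indices(image.shape)
    total = image.sum()
    cx = float((xs * image).sum()) / total if total else 0.0
    cy = float((ys * image).sum()) / total if total else 0.0
    return {"black": black, "white": white, "grey": grey,
            "center_of_mass": (cx, cy)}

# Usage (hypothetical filename):
# image, maxval = read_plain_pgm("face_0001.pgm")
# print(summarize(image, maxval))
```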

Testing and Evaluation

To train the systems, a large amount of video data will have to be gathered and fed to them, and a similar process will be followed to test them. Several issues have to be considered when evaluating the system, one of which is the form of output a system can produce. One approach, such as the method Shane Cotter uses in his occluded-face study [1], reports the success rates of several methods of facial analysis. Another, such as the output used by the virtual agent Greta, which implements EmoVoice [9], reports the emotion the system identifies. It will be important to keep track of failure rates to see which emotions the systems have trouble identifying. Another issue to consider is conflicting output: if the two systems assign different emotions to their inputs, several questions must be asked. Which system was right? Are they both wrong? If one system is wrong, which emotion did it identify? Is one system, or both, misidentifying particular emotions more than others? Audio recognition might be better than visual processing at recognizing certain emotions, and visual processing could perform better in other cases. As previously stated, certain emotions have been easier (and harder) for humans and systems to recognize, so it will be interesting to see whether this study follows those trends (a short illustrative sketch of such a per-emotion tally is given at the end of this proposal). Another consideration when analyzing the data and comparing the performance of the two systems is the personalization of emotion expression: a particular user might express anger strongly in their voice but not in their facial expression, or vice versa.

Conclusion

The combination of audio and visual recognition is a fascinating task. A great deal of research has been conducted in both areas, giving a good foundation upon which to start. Overall, there is still a good amount of research to be done and design choices to flesh out, particularly in the visual processing realm and in the selection and use of existing recognizers. While the basic process of analysis has been laid out, there is still great potential for change and modification, though the research question is likely to remain the same. The project should present some interesting challenges and should also produce some interesting data. At this point, the research has gone quite well, and hopefully it will continue to proceed smoothly as the main portion of the thesis begins.
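As a small illustration of the per-emotion tally described under Testing and Evaluation, the sketch below builds a confusion matrix for a recognizer and counts how often the audio and visual predictions disagree, and which of them was right when they do. The emotion label set and the toy predictions are invented purely for illustration and are not results from this project.

```python
import numpy as np
from collections import Counter

EMOTIONS = ["anger", "fear", "happiness", "sadness", "neutral"]  # placeholder label set

def confusion_matrix(true_labels, predicted):
    """Rows = true emotion, columns = predicted emotion."""
    index = {e: i for i, e in enumerate(EMOTIONS)}
    m = np.zeros((len(EMOTIONS), len(EMOTIONS)), dtype=int)
    for t, p in zip(true_labels, predicted):
        m[index[t], index[p]] += 1
    return m

def disagreement_report(true_labels, audio_pred, visual_pred):
    """Tally cases where the two recognizers disagree and which one was right."""
    outcomes = Counter()
    for t, a, v in zip(true_labels, audio_pred, visual_pred):
        if a == v:
            outcomes["agree"] += 1
        elif a == t:
            outcomes["disagree, audio right"] += 1
        elif v == t:
            outcomes["disagree, visual right"] += 1
        else:
            outcomes["disagree, both wrong"] += 1
    return outcomes

# Toy example with invented labels:
truth  = ["anger", "fear", "anger", "neutral"]
audio  = ["anger", "sadness", "anger", "neutral"]
visual = ["anger", "fear", "neutral", "neutral"]
print(confusion_matrix(truth, audio))
print(disagreement_report(truth, audio, visual))
```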

References

[1] Shane Cotter. Recognition of occluded facial expressions using a fusion of localized sparse representation classifiers. In 2011 IEEE Digital Signal Processing Workshop and IEEE Signal Processing Education Workshop (DSP/SPE), pages 437-442, 2011.
This paper is a recent study on analyzing regions of faces in order to combine information from each region and classify the facial expression of the image. I still have only a basic understanding of visual processing, so I will likely need to read additional sources as well as examine this one in more detail.

[2] Miyuki Kamachi. The Japanese Female Facial Expression (JAFFE) database. http://www.kasrl.org/jaffe.html
This is the database of images of various facial expressions used by Shane Cotter in his research on occluded facial expressions. The database is freely downloadable.

[3] Chul Min Lee. Toward detecting emotions in spoken dialogs. IEEE Transactions on Speech and Audio Processing, 13:293-303, 2005.
This paper stood out from the others because the study attempted to analyze more than just acoustic information (lexical and discourse) in order to classify emotions, for example after finding that certain words were often associated with a particular emotion. The study showed improved performance when combining these other information categories. It is certainly interesting, but I am not sure whether I will have the time to look at more than acoustic signals.

[4] Michael F. McTear. Spoken Dialogue Technology: Toward the Conversational User Interface. Springer, 2004.
This book's section on speech recognition offers a good overview of the procedures and difficulties of recognizing speech, as well as touching upon Hidden Markov Models and how they can be used to structure a speech recognizer.

[5] The University of Sheffield: Image Engineering Laboratory. Face database, 2012. http://www.sheffield.ac.uk/eee/research/iel/research/face
I acquired the face database from this laboratory; it is free to use so long as I do not publish commercially and I let them know if I were to make a publication. I plan on sending the head of the department an email explaining how I plan to use the database.

[6] Valery A. Petrushin. Emotion in speech: Recognition and application to call centers. In Proceedings of Artificial Neural Networks in Engineering (ANNIE '99), pages 7-10, 1999.
This article discusses experiments in which people's ability to judge certain types of emotions was gauged, as well as the specific aspects of the spoken word deemed most important for recognizing certain emotions. It was found that certain emotions were easier to recognize than others, and the aspects of speech found to be important were used to train neural networks. The article also discusses an application to a call center in which a caller's emotional state could be classified.

[7] Jef Poskanzer. pgm, 2003. http://netpbm.sourceforge.net/doc/pgm.html
This is where I learned about the structure of PGM files and how I could acquire data on the greyscale value of individual pixels, leading to more calculation possibilities and a hands-on introduction to basic image analysis.

[8] Bjoern Schuller, Gerhard Rigoll, and Manfred Lang. Hidden Markov model-based speech emotion recognition. In 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 2, pages II-1 to II-4, 2003.

The article was an interesting read on another method of training recognizers, via Hidden Markov Models. Like other experiments, the training data and recognizers worked with a set of predefined emotions and used certain aspects of speech to train the system. I'm a little confused by all of the statistics jargon; I'm no stranger to statistics, but I could use a refresher.

[9] Thurid Vogt, Elisabeth André, and Nikolaus Bee. EmoVoice - a framework for online recognition of emotions from voice. In Perception in Multimodal Dialogue Systems: 4th IEEE Tutorial and Research Workshop on Perception and Interactive Technologies for Speech-Based Systems, volume 5078, 2008.
This paper introduces an online emotion recognition system called EmoVoice. The article describes how the system works and shows several examples of EmoVoice implemented in other applications. There is a strong possibility that my thesis will be some sort of application or system (a robot, perhaps) that uses EmoVoice for emotion recognition.