Research Proposal on Emotion Recognition
Colin Grubb
June 3, 2012

Abstract

In this paper I will introduce my thesis question: to what extent can emotion recognition be improved by combining audio and visual information? In addition to covering background on audio information, I will introduce new information on image processing and some of the work that has been done in the field. I will then discuss methodologies for combining the two sources of information and evaluating them.

Introduction

Robots and computers have already become a prominent aspect of our lives, and their presence will only continue to grow, giving way to unique technologies. However, there are numerous obstacles to overcome before robots can interact fluidly with humans on a day-to-day basis. Imagine a robot that can act as a psychiatrist: this robot can interpret a patient's emotions and formulate an appropriate response. Reading emotions is a complicated process, but one that humans are very good at. Humans can fuse visual information (a scowl on a person's face) and audio information (loud and intense speech) to gauge an emotion such as anger. If robots and computers are to interact with humans effectively in scenarios such as the one suggested above, they need to be able to process both audio and visual information in order to produce a single output.

Audio Information

One of the major tasks in spoken dialogue systems is speech recognition, the act of converting spoken words into text that a system can then interpret. The speech begins as an acoustic signal, which is converted into a digital form and ultimately turned into phonemes that the system uses to create words [4]. Like other aspects of natural language processing, speech recognition presents many difficulties. Speech recognition has many useful applications, one of which is emotion recognition.
The ability to recognize a speaker's emotional state has many potential applications, and numerous projects have been undertaken in the area, both as real-world systems and as ongoing research. The most obvious application of emotion recognition is to classify a user's emotional state. A more specific form of this research divides a user's emotional state between two categories. One study of this nature was conducted as early as 1999 by researcher Valery Petrushin, in which recognizers were constructed that could classify a speaker as agitated or calm. Emotion recognition is particularly important for call centers; monitoring a caller's frustration level is important for quality service, and this system was used for an automated call center that could prioritize calls [6]. In this system, neural networks were trained using a small corpus of telephone messages, a portion of which contained angry sentences. A later study, conducted by researcher Chul Min Lee, involved collecting speech data from a call center and creating recognizers that accounted for language and discourse information in addition to acoustic information [3]. Another field of applications lies in online emotion recognition; a system called EmoVoice has been used in numerous applications, such as Greta, a virtual agent which recognizes a user's emotion and mirrors it [9].
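To make the two-category idea above concrete, the following is a minimal sketch of an agitated-versus-calm recognizer: a single-neuron network (logistic regression) trained by gradient descent on pre-extracted acoustic features. The feature values are synthetic stand-ins for real acoustic measurements (e.g. normalized mean pitch and energy), not data from any cited study.

```python
# A two-class (agitated vs. calm) recognizer sketch: one neuron trained by
# gradient descent on hand-made feature vectors. Illustrative only.
import math

def train(samples, labels, epochs=200, lr=0.5):
    """Learn logistic-regression weights mapping features -> 0 (calm) / 1 (agitated)."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            p = 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
            err = p - y                                  # gradient of the log-loss
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def classify(w, b, x):
    p = 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
    return "agitated" if p >= 0.5 else "calm"

# Synthetic corpus: each sample is [normalized mean pitch, normalized energy].
calm     = [[0.2, 0.1], [0.3, 0.2], [0.1, 0.3]]
agitated = [[0.8, 0.9], [0.9, 0.7], [0.7, 0.8]]
w, b = train(calm + agitated, [0, 0, 0, 1, 1, 1])
print(classify(w, b, [0.85, 0.8]))  # high pitch and energy -> "agitated"
```

The studies cited above used full neural networks trained by backpropagation on richer feature sets; this single-neuron version is only the smallest trainable instance of the same supervised-classification idea.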
Many commonalities exist between research projects and applications in the area of emotion recognition, including the features of voice used to classify emotion, the way in which those features are extracted, and how the recognizers themselves are constructed. When analyzing emotional state, there are numerous features of speech that can be used to classify emotion. Prosodic information is important for both humans and computers in identifying a particular emotional state. Prosody refers to information such as pitch, loudness, and rhythm, and can carry information about attitude [4]. One of the most common features used to classify emotion is the pitch of a speaker's voice; a study conducted by researcher Björn Schuller in 2003 used features of the pitch of a speaker's voice to classify the speaker's emotion, as pitch contains a large amount of information about emotional state [8]. While prosodic information has long been important to emotion recognition, the 2005 study by Chul Min Lee introduced a method for identifying certain words as being important to particular emotions. That study found that the addition of lexical and discourse information improved the system's ability to correctly identify an emotional state [3].

In creating a speech emotion recognizer, the system must be trained to recognize particular emotions using the features selected for the study. Typically, a corpus of sentences pronounced with emotion is gathered. Structurally, Hidden Markov Models have been widely used in the construction of speech recognition systems [4]. Neural networks have also been trained via backpropagation to recognize particular emotions. When creating a system that can classify a speaker's emotional state, the simplest way to judge the system's performance is to keep track of how often it identifies the correct emotion.
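As a small illustration of prosodic feature extraction, the sketch below estimates pitch from one voiced frame by autocorrelation. This is a common textbook approach, not the specific method of the cited studies, and the "speech frame" here is a synthetic 200 Hz tone; the sample rate is an assumption.

```python
# Pitch estimation by autocorrelation: the lag that best matches the signal
# against a shifted copy of itself gives the fundamental period.
import math

SAMPLE_RATE = 8000  # Hz (assumed for this sketch)

def estimate_pitch(frame, sample_rate, lo=50, hi=400):
    """Return the fundamental frequency (Hz) whose lag maximizes autocorrelation."""
    best_lag, best_score = 0, 0.0
    for lag in range(sample_rate // hi, sample_rate // lo + 1):
        score = sum(frame[i] * frame[i - lag] for i in range(lag, len(frame)))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else 0.0

# A synthetic "voiced frame": a pure 200 Hz tone, 0.1 s long.
frame = [math.sin(2 * math.pi * 200 * t / SAMPLE_RATE) for t in range(800)]
print(round(estimate_pitch(frame, SAMPLE_RATE)))  # 200 (Hz)
```

In a real recognizer, statistics of this pitch track over an utterance (mean, range, contour slope) would form part of the feature vector fed to the classifier.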
It is important to keep track not only of whether the recognizer identified the correct emotion, but also of which emotions are misidentified more often than others. A prominent commonality among the results of previous studies is that anger is the easiest emotion to recognize, whereas fear is the hardest for recognizers (and humans) to correctly identify [6].

Visual Information

Visual processing has two main commonalities with audio recognition: systems in both fields must extract important features from the input source in order to formulate an answer about the inputted emotion, and systems from both areas must undergo training in order to give the appropriate outputs for a given input. As with audio recognition, a large amount of research has been conducted in image evaluation and means of improving the processing, particularly with faces. Numerous databases are freely available over the internet. A study conducted at Union College by Shane Cotter utilized a database called the Japanese Female Facial Expression Database as a means of input [1] [2]. This study focused on analyzing regions of the face individually, rather than the face as a whole, and then combining the information from selected regions in order to classify emotions. The study found that this method was an improvement over analyzing the face as a whole.

Some basic hands-on research has also been conducted on image processing. For my project in CSC333 - Introduction to Parallel Computing, I am writing a program that takes in a series of image files and analyzes each picture, calculating the center of mass in the X and Y directions. I obtained a freely downloadable database of faces from the University of Sheffield's Image Processing Laboratory [5]. The files are in PGM (Portable Gray Map) format, which is designed to be easy to edit; the pixel information is contained within a 2-D array within the file [7].
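The kind of analysis described for the CSC333 project can be sketched briefly: parse a plain ("P2", ASCII) PGM file, count black, white, and grey pixels, and compute the center of mass of darkness in the X and Y directions. The tiny inline image is invented for illustration; real inputs would come from the Sheffield database [5], and raw ("P5") PGM files would need a binary reader instead.

```python
# Parse a plain P2 PGM and compute simple statistics over its pixel grid.
def parse_pgm(text):
    """Return (width, height, maxval, rows) for a plain P2 PGM string."""
    tokens = [t for line in text.splitlines()
              for t in line.split('#')[0].split()]     # strip '#' comments
    assert tokens[0] == 'P2', 'only the ASCII variant is handled here'
    w, h, maxval = int(tokens[1]), int(tokens[2]), int(tokens[3])
    vals = [int(t) for t in tokens[4:4 + w * h]]
    return w, h, maxval, [vals[r * w:(r + 1) * w] for r in range(h)]

def pixel_stats(rows, maxval):
    flat = [p for row in rows for p in row]
    return {'black': flat.count(0),
            'white': flat.count(maxval),
            'grey': sum(0 < p < maxval for p in flat)}

def center_of_mass(rows, maxval):
    """Center of mass of darkness (maxval - p), as (x, y) in pixel coordinates."""
    total = sx = sy = 0.0
    for y, row in enumerate(rows):
        for x, p in enumerate(row):
            m = maxval - p
            total += m; sx += m * x; sy += m * y
    return (sx / total, sy / total) if total else (0.0, 0.0)

sample = """P2
# 4x2 toy image: a black left edge on a white background
4 2
255
0 255 255 255
0 255 255 255
"""
w, h, maxval, rows = parse_pgm(sample)
print(pixel_stats(rows, maxval))     # {'black': 2, 'white': 6, 'grey': 0}
print(center_of_mass(rows, maxval))  # (0.0, 0.5): all darkness in column 0
```

Because each pixel is processed independently, the counting and summation steps parallelize naturally, which is what makes this a fitting exercise for a parallel computing course.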
For the parallel computing class, I intend to analyze this corpus of faces, counting the numbers of black, white, and greyscale pixels, and also to analyze the concentration of black pixels in the images. While this project is not quite on par with some of the research being done in image processing, it is an interesting beginning to image processing.

The Process

The analysis process will involve several steps. A video feed will be taken of a user speaking with an emotional undertone. The video stream will be split into two separate inputs: a sound clip of the user's speech, and one or more frames chosen from the video stream. How the particular frame or frames are chosen is yet to be determined; a selection method could be developed, or they could be chosen at random. After the two inputs have been chosen, two separate recognizer systems, one for audio recognition and one for visual recognition, will be applied to the inputs to extract important features and produce an output. The EmoVoice framework will be used for audio recognition; the visual processing software/algorithm has yet to be selected. Another possibility to consider is the combination of the two systems in some way, so that instead of producing two separate outputs and comparing them, they would
produce a single output. This possibility is only a speculation at this point.

Figure 1.1: The process to analyze emotional state (face images from [2])

Testing and Evaluation

To train the systems, a large amount of video data will have to be gathered and fed to them, and a similar process will be followed to test them. There are several issues to consider when evaluating the system, one of which is the form of the output a system can produce. One method of analyzing output, such as the one Shane Cotter implements in his occluded facial study [1], reports the success rates of several methods of facial analysis. Another, such as the output used by the virtual agent Greta, which implements EmoVoice [9], reports what the system identifies the emotion to be. It will be important to keep track of failure rates to see which emotions the systems have trouble identifying.

Another issue to consider is conflicting output; if the systems identify their inputs as having different emotions, several questions must be asked: Which system was right? Are they both wrong? If one system is wrong, which emotion did it identify? Is one system, or both, misidentifying particular emotions more than others? Audio recognition might be better than visual processing at recognizing certain emotions, and visual processing could perform better in other cases. As previously stated, certain emotions have proven easier (and harder) for humans and systems to identify, so it will be interesting to see whether this study follows those trends. Another consideration when analyzing the data and comparing the performance of the two systems is the personalization of emotion expression; for example, a particular user might express anger strongly in their voice but not in their facial expression, and vice versa.

Conclusion

The combination of audio and visual recognition is a fascinating task.
A great deal of research has been conducted in both areas, giving a good foundation upon which to start. Overall, there is still a good amount of research to be done and design choices to flesh out, particularly in the visual processing realm and in the selection and usage of existing recognizers. While the basic process of analysis has been laid out, there is still great potential for change and modification, though the research question is likely to remain the same. The project should present some interesting challenges, and should also produce some interesting data. At this point, the research has gone quite well, and hopefully it will continue to proceed smoothly as the main portion of the thesis begins.
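The conflicting-output questions raised under Testing and Evaluation lend themselves to simple bookkeeping: per-recognizer confusion counts plus a tally of disagreements. The sketch below is hypothetical; the emotion labels and recognizer outputs are invented, not results from the proposed system.

```python
# Track disagreements between the audio and visual recognizers, and which
# (true, guessed) emotion pairs each recognizer confuses most often.
from collections import Counter

def tally(trials):
    """trials: list of (true_label, audio_guess, visual_guess) tuples."""
    confusions = {"audio": Counter(), "visual": Counter()}
    conflicts = 0
    for truth, audio, visual in trials:
        if audio != visual:
            conflicts += 1                       # the two systems disagree
        for name, guess in (("audio", audio), ("visual", visual)):
            if guess != truth:
                confusions[name][(truth, guess)] += 1
    return conflicts, confusions

# Invented labelled trials for illustration.
trials = [
    ("anger", "anger", "anger"),           # both right
    ("fear", "sadness", "fear"),           # audio wrong -> conflict
    ("fear", "sadness", "sadness"),        # both wrong in the same way
    ("happiness", "happiness", "anger"),   # visual wrong -> conflict
]
conflicts, confusions = tally(trials)
print(conflicts)                                 # 2 disagreements
print(confusions["audio"][("fear", "sadness")])  # 2: audio confuses fear with sadness
```

Counts of this kind would directly answer which recognizer errs on which emotions, and would also show whether the fear-is-hardest trend from earlier studies reappears.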
References

[1] Shane Cotter. Recognition of occluded facial expressions using a fusion of localized sparse representation classifiers. In Digital Signal Processing Workshop and IEEE Signal Processing Education Workshop (DSP/SPE), 2011 IEEE, pages 437-442, 2011.
This paper was a recent study on analyzing regions of faces in order to combine information from each region and classify the facial expression of the image. I still have only a basic understanding of visual processing, so I will likely need to read additional sources as well as examine this source in more detail.

[2] Miyuki Kamachi. The Japanese Female Facial Expression (JAFFE) Database. http://www.kasrl.org/jaffe.html
This is the database of images of various facial expressions used by Shane Cotter in his research on occluded facial expressions. This database is freely downloadable.

[3] Chul Min Lee. Toward detecting emotions in spoken dialogs. In IEEE Transactions on Speech and Audio Processing, volume 13, pages 293-303, 2005.
This paper stood out from the others because the study attempted to analyze more than just acoustic information (lexical and discourse information) in order to classify emotions, for example after finding that certain words were often associated with a particular emotion. The study showed improved performance when combining these information categories. It is certainly interesting, but I am not sure whether I will have the time to look at more than acoustic signals.

[4] Michael F. McTear. Spoken Dialogue Technology: Toward the Conversational User Interface. Springer, 2004.
This book's section on speech recognition offers a good overview of the procedures and difficulties of recognizing speech, as well as touching upon Hidden Markov Models and how they can be used to structure a speech recognizer.

[5] The University of Sheffield: Image Engineering Laboratory. Face database, 2012.
I acquired the face database from this laboratory; it is free to use so long as I do not publish commercially, and if I were to make a publication I must let them know. I plan on sending the head of the department an email explaining how I plan to use the database. http://www.sheffield.ac.uk/eee/research/iel/research/face

[6] Valery A. Petrushin. Emotion in speech: Recognition and application to call centers. In Engr, pages 7-10, 1999.
This article discussed experiments in which people's ability to judge certain types of emotions was gauged, as well as the specific aspects of the spoken word deemed most important for recognizing certain emotions. It was found that certain emotions were easier to recognize than others. The aspects of speech found to be important were used to train neural networks. The article also discussed applications to a call center in which a caller's emotional state could be classified.

[7] Jef Poskanzer. pgm, 2003. http://netpbm.sourceforge.net/doc/pgm.html
This is where I learned about the structure of PGM files and how I could acquire data on the greyscale values of individual pixels, leading to more calculation possibilities and a hands-on introduction to basic image analysis.

[8] Bjoern Schuller, Gerhard Rigoll, and Manfred Lang. Hidden Markov model-based speech emotion recognition. In 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 2, pages II:1-II:4, 2003.
The article was an interesting read on another method of training recognizers, via Hidden Markov Models. Like other experiments, the training data and recognizers worked with a set of predefined emotions and used certain aspects of speech to train the system. I am a little confused by all of the statistics jargon; I am no stranger to statistics, but I could use a refresher.

[9] Thurid Vogt, Elisabeth Andre, and Nikolaus Bee. EmoVoice - a framework for online recognition of emotions from voice. In Perception in Multimodal Dialogue Systems - 4th IEEE Tutorial and Research Workshop on Perception and Interactive Technologies for Speech-Based Systems, volume 5078, 2008.
This paper introduces an online emotion recognition system called EmoVoice. The article describes how the system works and shows several examples of EmoVoice implemented in other applications. There is a strong possibility that my thesis will be some sort of application or system (a robot, perhaps) that uses EmoVoice for emotion recognition.