Divide-and-Conquer based Ensemble to Spot Emotions in Speech using MFCC and Random Forest

Published as a conference paper in The 2nd International Integrated Conference & Concert on Convergence (2016)

Divide-and-Conquer based Ensemble to Spot Emotions in Speech using MFCC and Random Forest

Abdul Malik Badshah, Jamil Ahmad, Mi Young Lee, Sung Wook Baik *
College of Electronics and Information Engineering, Sejong University, Seoul, South Korea
* Corresponding author email: sbaik@sejong.ac.kr

Abstract

Besides spoken words, speech signals also carry information about speaker gender, age, and emotional state, which can be used in a variety of speech analysis applications. In this paper, a divide-and-conquer strategy for ensemble classification is proposed to recognize emotions in speech. The intrinsic hierarchy in emotions is used to construct an emotion tree, which breaks the emotion recognition task into smaller sub-tasks. The proposed framework generates predictions in three phases. First, emotion is detected in the input speech signal by classifying it as neutral or emotional. If the speech is classified as emotional, it is further classified in the second phase into positive and negative classes. Finally, individual positive or negative emotions are identified based on the outcomes of the previous stages. Several experiments were performed on a widely used benchmark dataset. The proposed method achieved improved recognition rates compared with several other approaches.

Keywords: speech emotions, divide-and-conquer, emotion classification, random forest

1. Introduction

Body poses, facial expressions, and speech are the most common ways to express emotions [1]. However, certain emotions such as disgust and boredom cannot be identified from gestures or facial expressions, yet can be recognized effectively from speech tone owing to differences in energy, speaking rate, and the linguistic and semantic content of speech [2]. Speech emotion recognition (SER) has attracted increasing attention in the past few decades due to the growing use of emotions in affective computing and human-computer interaction, and it has played a major role in changing the way we interact with computers over the last few years [3]. Affective computing techniques use emotions to enable more natural interaction. Similarly, automatic voice response (AVR) systems can become more adaptive and user-friendly if the affective states of users can be identified during interaction. The performance of such systems depends heavily on the ability to recognize emotions accurately. SER is a challenging task due to the inherent complexity of emotional expression.

Information about a person's emotional state can help in several speech analysis applications such as human-computer interaction, humanoid robots, mobile communication, and call centers [4]. Emergency call centers around the world deal with fake calls on a daily basis, and it has become necessary to avoid wasting precious resources in responding to them. Using information such as age, gender, emotional state, environmental sounds, and the speech transcript, the seriousness of a situation can be assessed effectively. If a caller reports an abnormal situation to an emergency call center, SER can be used to assess whether the caller is under stress or fear, which helps gauge the credibility of the call and supports the call center in making an effective decision. The emotion detection work presented here is part of a larger project on lie detection from speech in emergency calls.
Manuscript received: MM/DD/YYYY / revised: MM/DD/YYYY / Corresponding Author: sbaik@sejong.ac.kr Tel:+82-02-3408-3797 Sejong University

This paper describes a divide-and-conquer (DC) approach to recognize emotions from speech using acoustic features. Three classifiers, namely support vector machine (SVM) [5], random forest (RF) [6], and decision tree (DT) [7], were evaluated. For each classifier type, four models were trained, one for each decision node of the proposed hierarchy.

2. Related Work

Speech emotion recognition has been studied extensively in the past decade, and several strategies involving a variety of feature extraction methods and classification schemes have been proposed. Some recent works in this area are presented here.

Vogt and Andre [8] proposed a gender-dependent emotion recognition model for emotion detection from speech. For the gender-independent model they used 20 features (17 MFCCs, 2 energy coefficients, and 1 pitch value), and for the male and female models they used 30 features (22 MFCCs, 3 energy coefficients, and 5 pitch values) to train Naïve Bayes classifiers. They evaluated their framework on two datasets, Berlin and SmartKom, and achieved improvements of 2% and 4% in overall recognition rate on the Berlin and SmartKom datasets, respectively.

Lalitha et al. [9] developed a speaker emotion recognition system to recognize different emotions with a generalized feature set. They compared two classifiers, a continuous hidden Markov model (HMM) and a support vector machine. Their algorithm achieved higher recognition rates for sad (83.33%), anger (91.66%), and fear (%) on the validation set. However, emotions such as boredom, disgust, and happy were not recognized accurately, achieving accuracy rates of 50%, 23.07%, and 26.31%, respectively.

Shah et al. [10] conducted an experiment using Discrete Wavelet Transforms (DWTs) and Mel Frequency Cepstral Coefficients (MFCCs) for speaker-independent automatic emotion recognition from speech. They achieved 68.5% recognition accuracy using DWT and 55% using MFCC, which is not satisfactory; moreover, they used only four emotions, namely neutral, happy, sad, and anger. Krishna et al. [11] also used MFCC and wavelet features, with a Gaussian Mixture Model (GMM) as the classifier for emotion recognition. They concluded that the Sub-band based Cepstral Parameter (SBC) method provides higher recognition rates than MFCC, achieving 70% recognition with SBC compared to 51% with MFCC on their dataset.

Despite these efforts to construct efficient SER systems using various features and classification frameworks, there is still significant room for improvement. Implementing SER for real-world applications is a very challenging task which requires careful design strategies [12].

3. Proposed Method

This section presents the overall architecture of the proposed framework, explains the feature extraction method, and highlights key design aspects of the classification scheme.

3.1 Feature Extraction

Feature extraction is an important task in recognizing emotions from speech, since the entire classification framework depends on the representational capability of the features being used. The proposed method uses acoustic features to recognize emotions. Mel Frequency Cepstral Coefficients (MFCCs) are robust to noise, possess significant representational capability, and perform well in many speech analysis applications.
Speech signals are divided into short frames of several milliseconds, and 16 MFCC coefficients are extracted from each frame. These features are then used to train the classifiers for emotion modeling. MFCC features are well known; readers are referred to [13, 14] for further details.
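As a rough illustration of this step, the following Python sketch computes 16 MFCCs per frame using librosa. It is not the authors' implementation (which was written in C++), and the 25 ms frame length and 10 ms frame shift are assumptions, since the paper does not specify the framing parameters.

import librosa

def extract_mfcc(wav_path, n_mfcc=16, frame_ms=25, hop_ms=10):
    # Load the utterance at its original sampling rate.
    signal, sr = librosa.load(wav_path, sr=None)
    n_fft = int(sr * frame_ms / 1000)        # samples per frame (assumed 25 ms)
    hop_length = int(sr * hop_ms / 1000)     # frame shift (assumed 10 ms)
    # One 16-dimensional MFCC vector per frame; result has shape (n_mfcc, n_frames).
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop_length)
    return mfcc.T                            # frames as rows

# Example (hypothetical file name): average the frame-level vectors to obtain
# a single utterance-level feature vector for classification.
# frames = extract_mfcc("savee_anger_01.wav")
# feature_vector = frames.mean(axis=0)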

3.2 Classification Framework

The proposed method uses the DC strategy to detect and classify seven different emotions in speech. The emotion recognition system is divided into three stages. The first stage detects whether the speech signal contains any emotion or is neutral. The second stage classifies emotional signals into positive and negative emotions. The third stage further classifies positive emotions into happy or surprise, and negative emotions into angry, disgust, fear, or sad, as shown in Fig. 1. This breakdown of the emotions into several stages offers several advantages. First, it simplifies the emotion recognition task by dividing it into phases, which allows an efficient classifier to be trained for each phase. Second, it reduces the effort needed to detect emotions: if no emotion is spotted in the first stage, no further processing is required. Third, heterogeneous classification schemes can be combined to form a high-performance ensemble. Finally, the confidence of the outcomes at the different phases can be used as evidence when computing the final result.

Figure 1. Breakdown of emotion modeling for DC-based classification (Stage 1: emotion detection; Stage 2: positive vs. negative; Stage 3: individual emotions)

For each stage in the classification framework, three different classifiers were evaluated. In phase 1, the objective is to detect emotion in speech. The training data were divided into neutral and emotional sets, and the classifiers were trained on them. Speech signals identified as neutral are not processed any further, and the classification process ends immediately after phase 1; moreover, the classification schemes returned almost perfect recognition rates for neutral speech. The second phase attempts to further determine the nature of the emotion being expressed by categorizing the emotional speech as positive or negative. Positive emotions include happy and surprise, whereas negative emotions include fear, anger, disgust, and sad. Finally, the positive or negative emotions are classified into their respective base emotions. The DC-based approach achieved better results than using a single classifier for SER.

4. Experiments and Results

This section presents the results of various experiments performed on the speech dataset for emotion recognition. It also highlights the major strengths and weaknesses observed during the evaluation process.

4.1 Database Description

The Surrey Audio-Visual Expressed Emotion (SAVEE) database [15] is used to evaluate the performance of the proposed method. The SAVEE database was recorded from four native English male speakers at the University of Surrey. It consists of seven emotion classes, namely anger, disgust, fear, happiness, sadness, surprise, and neutral. The sound files vary from 2 to 7 seconds in duration depending on the sentence uttered. A total of 120 utterances per speaker were recorded.
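Before turning to the experimental setup, the following minimal Python sketch (assuming scikit-learn) illustrates the three-stage cascade of Section 3.2 with a random forest at each decision node. It is an illustrative reconstruction rather than the authors' C++ implementation; the number of trees and the use of utterance-level feature vectors are assumptions.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

POSITIVE = {"happy", "surprise"}
NEGATIVE = {"angry", "disgust", "fear", "sad"}

# One model per node of the emotion tree: stage 1 (neutral vs. emotional),
# stage 2 (positive vs. negative), stage 3 (individual emotions).
rf_stage1 = RandomForestClassifier(n_estimators=100, random_state=0)
rf_stage2 = RandomForestClassifier(n_estimators=100, random_state=0)
rf_positive = RandomForestClassifier(n_estimators=100, random_state=0)
rf_negative = RandomForestClassifier(n_estimators=100, random_state=0)

def fit_cascade(X, y):
    # X: (n_samples, n_features) utterance-level features; y: emotion label strings.
    X, y = np.asarray(X), np.asarray(y)
    rf_stage1.fit(X, np.where(y == "neutral", "neutral", "emotional"))
    emotional = y != "neutral"
    polarity = np.where(np.isin(y[emotional], list(POSITIVE)), "positive", "negative")
    rf_stage2.fit(X[emotional], polarity)
    pos, neg = np.isin(y, list(POSITIVE)), np.isin(y, list(NEGATIVE))
    rf_positive.fit(X[pos], y[pos])
    rf_negative.fit(X[neg], y[neg])

def predict_emotion(x):
    # Early exit after stage 1 if the utterance is judged neutral.
    x = np.asarray(x).reshape(1, -1)
    if rf_stage1.predict(x)[0] == "neutral":
        return "neutral"
    if rf_stage2.predict(x)[0] == "positive":
        return rf_positive.predict(x)[0]      # happy or surprise
    return rf_negative.predict(x)[0]          # angry, disgust, fear, or sad

A neutral decision at stage 1 terminates processing immediately, which is the early-exit property noted in Section 3.2, and each stage 3 model is trained only on samples of its own polarity.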

4.2 Experimental Setup

The experiments were performed on a desktop PC with an Intel Core i7-3770 CPU @ 3.40 GHz, 16.0 GB RAM, and 64-bit Windows 7 Professional. The feature extraction and classification modules were implemented in C++, and analysis of the experimental results was performed in Matlab 2015a. Two main types of experiments were performed. The first compared the performance of the DC-based approach with a single multi-class classifier. The second evaluated the proposed method with different classifiers, including SVM, decision tree, and random forest. The results of these experiments are described in detail in the following sections.

4.3 Stage-wise Classification Performance

Stage-wise classification performance is evaluated and discussed in this section. Fig. 2 shows the classification performance of DT, SVM, and RF at stage 1. All these schemes achieved almost perfect recognition rates for neutral speech. However, recognition rates for identifying emotional speech were relatively low with DT and SVM, which achieved 81.11% and % accuracy, respectively. The RF classifier achieved 92.2% accuracy for identifying emotional speech, and its overall accuracy at this stage was the highest at 96.11%.

Figure 2. Stage 1 classifier performance (accuracy of DT, SVM, and RF for neutral and emotional speech)

Figure 3. Stage 2 classifier performance (accuracy of DT, SVM, and RF for negative and positive emotions)

At stage 2, the classifiers are evaluated on identifying the emotional speech obtained from the previous phase as negative or positive. Accuracy rates are shown in Fig. 3. The SVM classifier has the lowest accuracy among the classifiers used at this stage for both negative and positive emotions, achieving around 70% and 76.67%, respectively. The decision tree achieved better performance in identifying positive emotions; however, its

performance with negative emotions is relatively low. RF achieved steady performance for both emotion classes, with 90% overall accuracy at this stage.

Figure 4. Classification performance in stage 3 for identifying positive emotions (happy vs. surprise, per classifier)

Positive emotions from stage 2 are then identified as happy or surprise; similarly, negative emotions are classified into angry, disgust, fear, and sad. The results in Fig. 4 reveal that the RF classifier achieves higher emotion recognition accuracy than DT and SVM, reaching 87% accuracy for identifying positive emotions. Although DT and SVM offer higher accuracy in recognizing the happy emotion, their performance in recognizing surprise was relatively low, with accuracy rates of 73.33% for SVM and 46.67% for DT.

Figure 5. Classification performance in stage 3 for identifying negative emotions (angry, disgust, fear, and sad, per classifier)

A multi-class emotion recognition model was trained to classify negative emotions from stage 2 into angry, disgust, fear, and sad. The results in Fig. 5 show that the SVM classifier failed to classify fear, achieving only 37% accuracy, while its accuracy for angry, disgust, and sad was %, 33%, and 86.6%, respectively. DT achieved high accuracy in recognizing the sad emotion, whereas its performance for angry, disgust, and fear was 93%, %, and 46.6%, respectively. In the case of RF, the angry, disgust, and sad emotions were recognized successfully with respective accuracies of %, 93.33%, and 73.3%. However, the accuracy for fear was relatively low at 46.67%; most of the test samples with the fear emotion were misclassified as disgust, as shown in Table 2. Comparatively, RF achieved high accuracy in classifying the four

different negative emotions, with an overall recognition rate of 78.3%; the decision tree achieved 75%, and SVM achieved an accuracy of only %.

The overall classification results obtained by using RF as a single multi-class classifier are provided in Table 1, and the results of the proposed DC scheme are given in Table 2. With RF as a single multi-class classifier for SER, accuracy rates were very low for the disgust, happy, and sad emotions, each of which was classified with less than 50% accuracy. Similarly, other emotions such as angry, fear, and surprise achieved accuracies of about 66% each. The proposed DC-based scheme improved the recognition accuracy for almost all emotions. From the performance comparison of the different classifiers, it can be concluded that RF is the best-suited classifier for the proposed method.

Table 1. Classification results for RF used as a single classifier (rows: actual emotion, columns: predicted emotion; values in %)

            Angry   Disgust  Fear    Happy   Neutral  Sad    Surprise
Angry       66      7        14      0       0        0      13
Disgust     0       46.6     6.8     0       46.6     0      0
Fear        0       0        66.6    0       26.4     0      7
Happy       7       0        13      46.6    0        0      33.4
Neutral     0       0        0       0       %        0      0
Sad         0       0        0       0       54       46     0
Surprise    13      0        21      0       0        0      66

Table 2. Overall classification results using the DC-based scheme (rows: actual emotion, columns: predicted emotion; values in %)

            Angry   Disgust  Fear    Happy   Neutral  Sad    Surprise
Angry       %       0        0       0       0        0      0
Disgust     0       73.33    13      0       0        13.7   0
Fear        13      27       47      0       0        13     0
Happy       0       0        0       %       0        0      20
Neutral     0       0        0       0       %        0      0
Sad         0       0        6.7     0       0        93.3   0
Surprise    0       0        0       6.7     0        0      93.3

4.4 Comparison with other SER approaches

The performance of the proposed scheme is compared with two similar approaches in Table 3. The results suggest that the proposed classification approach outperforms the other methods in identifying the anger, disgust, neutral, and sad emotions. In the case of fear, the proposed method performed better than [11] but worse than the method described in [9]. Similarly, for the happy emotion, our method achieved significantly better performance than [9] and slightly lower accuracy than [11]. Overall, the proposed method outperformed methods [9] and [11] by 23% and 17% in relative terms, respectively.

5. Conclusion

In this paper, we presented a divide-and-conquer based approach to detect and classify emotions from speech signals. MFCC features are extracted from the speech signals and used to train classifiers at the three stages of the proposed scheme. At the first stage, emotion is detected in speech by a simple binary classifier which separates neutral speech from emotional speech. At the second stage, emotions are further identified as positive or negative. Finally, individual emotions are identified at the last stage of the classification scheme. Three different classification schemes, namely DT, SVM, and RF,

were evaluated at each of these three stages. From the evaluations, it is concluded that the proposed DC-based emotion recognition scheme achieves high accuracy with the RF classifier at each stage. In the future, we plan to incorporate Dempster-Shafer fusion to combine the decisions of the intermediate classifiers before making a final decision. We expect this fusion to improve accuracy even further.

Table 3. Performance comparison with other relevant approaches (recognition accuracy, %)

Emotion     Proposed Method   Method [9]   Method [11]
Anger                         91           90
Boredom     -                 50           -
Disgust     73                23.7         70
Fear        47                             20
Happy                         26.31        90
Neutral                       75
Sad         93.3              83.33        90
Surprise    93.3              -            -
Overall     82.21             66.5         70.0
(Overall excludes Boredom and Surprise)

Acknowledgements

This work was supported by the ICT R&D program of MSIP/IITP (No. R0126-15-1119, Development of a solution for situation awareness based on the analysis of speech and environmental sounds).

References

[1] J. A. Russell, J.-A. Bachorowski, and J.-M. Fernandez-Dols, "Facial and vocal expressions of emotion," Annual Review of Psychology, vol. 54, pp. 329-349, 2003.
[2] S. E. Duclos, J. D. Laird, E. Schneider, M. Sexter, L. Stern, and O. Van Lighten, "Emotion-specific effects of facial expressions and postures on emotional experience," Journal of Personality and Social Psychology, vol. 57, p., 1989.
[3] R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, et al., "Emotion recognition in human-computer interaction," IEEE Signal Processing Magazine, vol. 18, pp. 32-, 2001.
[4] J. Ahmad, K. Muhammad, S. i. Kwon, S. W. Baik, and S. Rho, "Dempster-Shafer fusion based gender recognition for speech analysis applications," in 2016 International Conference on Platform Technology and Service (PlatCon), 2016, pp. 1-4.
[5] V. Vapnik, The Nature of Statistical Learning Theory: Springer Science & Business Media, 2013.
[6] L. Breiman, "Random forests," Machine Learning, vol. 45, pp. 5-32, 2001.
[7] C.-C. Lee, E. Mower, C. Busso, S. Lee, and S. Narayanan, "Emotion recognition using a hierarchical binary decision tree approach," Speech Communication, vol. 53, pp. 1162-1171, 2011.
[8] T. Vogt and E. André, "Improving automatic emotion recognition from speech via gender differentiation," in Proc. Language Resources and Evaluation Conference (LREC 2006), Genoa, 2006.
[9] S. Lalitha, S. Patnaik, T. Arvind, V. Madhusudhan, and S. Tripathi, "Emotion recognition through speech signal for human-computer interaction," in Electronic System Design (ISED), 2014 Fifth International Symposium on, 2014, pp. 217-218.
[10] A. Firoz Shah, V. Vimal Krishnan, A. Raji Sukumar, A. Jayakumar, and P. Babu Anto, "Speaker independent automatic emotion recognition from speech: a comparison of MFCCs and discrete

wavelet transforms," in Advances in Recent Technologies in Communication and Computing, 2009. ARTCom'09. International Conference on, 2009, pp. 528-531.
[11] K. Krishna Kishore and P. Krishna Satish, "Emotion recognition in speech using MFCC and wavelet features," in Advance Computing Conference (IACC), 2013 IEEE 3rd International, 2013, pp. 842-847.
[12] M. El Ayadi, M. S. Kamel, and F. Karray, "Survey on speech emotion recognition: Features, classification schemes, and databases," Pattern Recognition, vol. 44, pp. 572-587, 2011.
[13] F. Zheng, G. Zhang, and Z. Song, "Comparison of different implementations of MFCC," Journal of Computer Science and Technology, vol. 16, pp. 582-589, 2001.
[14] J. Ahmad, M. Fiaz, S.-i. Kwon, M. Sodanil, B. Vo, and S. W. Baik, "Gender identification using MFCC for telephone applications - A comparative study," arXiv preprint arXiv:11.01577, 2016.
[15] S. Haq and P. J. Jackson, "Multimodal emotion recognition," Machine Audition: Principles, Algorithms and Systems, IGI Global, Hershey, pp. 398-423, 2010.

Authors:

Abdul Malik Badshah received his BCS degree in Computer Science from Islamia College, Peshawar, Pakistan in 2013. He is currently pursuing a Master's degree in digital contents at Sejong University, Seoul, Korea. His research interests include speech analysis, machine learning, and applications of machine learning.

Jamil Ahmad received his BCS degree in Computer Science from the University of Peshawar, Pakistan in 2008, and his Master's degree in Computer Science with a specialization in image processing from Islamia College, Peshawar, Pakistan. He is currently pursuing a PhD degree in digital contents at Sejong University, Seoul, Korea. His research interests include image analysis, machine intelligence, semantic image representation, and content-based multimedia retrieval.

Mi Young Lee is a research professor at Sejong University. She received her MS and PhD degrees in Image and Information Engineering from Pusan National University. Her research interests include interactive contents, UI, UX, and digital content development.

Sung Wook Baik is a Professor in the Department of Digital Contents at Sejong University. He received the B.S. degree in computer science from Seoul National University, Seoul, Korea, in 1987, the M.S. degree in computer science from Northern Illinois University, DeKalb, in 1992, and the Ph.D. degree in information technology engineering from George Mason University, Fairfax, VA, in 1999. In 2002, he joined the faculty of the School of Electronics and Information Engineering, Sejong University, Seoul, Korea, where he is currently a Full Professor and Dean of Digital Contents. His research interests include computer vision, multimedia, pattern recognition, machine learning, data mining, virtual reality, and computer games.