Enhanced Autocorrelation in Real World Emotion Recognition

Sascha Meudt, Institute of Neural Information Processing, University of Ulm
Friedhelm Schwenker, Institute of Neural Information Processing, University of Ulm

ABSTRACT

Multimodal emotion recognition in real world environments is still a challenging task of affective computing research. Recognizing the affective or physiological state of an individual is difficult for humans as well as for computer systems, and thus finding suitable discriminative features is the most promising approach in multimodal emotion recognition. In the literature numerous features have been developed or adapted from related signal processing tasks. Still, classifying emotional states in real world scenarios is difficult, and the performance of automatic classifiers is rather limited. This is mainly due to the fact that emotional states cannot be distinguished by a well defined set of discriminating features. In this work we present the enhanced autocorrelation, a multi pitch detection feature, and compare its performance to well known, state-of-the-art features from signal and speech processing. The results of the evaluation show that the enhanced autocorrelation outperforms the other state-of-the-art features on the challenge data set. The complexity of this benchmark data set lies in between real world data sets showing naturalistic emotional utterances and the widely applied and well-understood acted emotional data sets.

Keywords: enhanced autocorrelation, audio features, emotion recognition, human computer interaction, affective computing

1. INTRODUCTION

The human affective state has attracted researchers in medical, psychological and cognitive science since the mid 1960s [8, 24]. Because of the dramatically increasing computational power of modern information technology, numerous technical devices such as smart phones, tablets, laptops, personal computers and many more have become more and more part of our daily life, and, even more important, the way we use these computing devices has totally changed: Human-computer interaction (HCI) in the 1980s and 1990s was dominated by keyboard and mouse interaction following a strict input-operation-output work flow. Recently, more and more options for multimodal interaction have been developed, e.g. control via speech or hand gestures, and in the future the analysis of the user's intentions or affective state will come into the focus of HCI research. Examples are the analysis of para-linguistic patterns or facial expressions, or, more generally, of social signals that humans implicitly produce as a by-product when interacting or communicating with a computing device. This change in HCI caused computer scientists to focus on the recognition of social signals, such as emotions, with the intention of using this information to alter subsequent actions in HCI scenarios.
Humans express their emotions in many different ways and modalities, which makes automatic emotion recognition a very challenging task in computer science. In this emerging field, called affective computing, computers are seen as intelligent companions offering context sensitive advice and support to the human user, and techniques based on machine learning and pattern recognition have come more and more into the focus of research. In the past, pattern recognition research in emotion recognition was done mostly on recognizing acted, usually over-expressed, well defined emotional categories, and on segmented data (isolated video clips, audio files) [33, 3]; typical datasets consisted of audio [2], video [16] or audio-video recordings [1]. The classification of such data is simple, and thus the achieved classification results are accurate [9, 26]. However, such acted emotions do not appear in everyday HCI situations, so the community shifted its scope more towards non-acted scenarios and benchmark data [2, 25] containing more realistic emotional appearances. Famous benchmark data sets are the data base introduced in the AVEC challenge [31], the EmoRec data set [36] and the Last-Minute data set [23]. Unfortunately, the ground truth is somewhat unclear and the data are very noisy in these datasets [15]. In case of the EmotiW challenge the ground truth is given by the storyboard of the movies the videos have been taken from. Hence, it is difficult even for humans to categorize the acted emotion without context. Recognition systems thus have to deal with unreliable data containing various levels of expressiveness. In order to overcome those problems, different approaches have been proposed [1, 19, 11]. The focus can, for example, be set on a single modality like video or speech signals, and classification systems can thus be tailored to the special characteristics of those modalities.

Advances in affective computing in recent years have come from better classifiers and fusion systems, but also from the adaptation and development of novel, discriminating features. The EmotiW challenge [6, 4] in 2014 [5] again offers a challenging bimodal benchmark data set. EmotiW 2014 and 2013 are data sets derived from cinema movies, and thus the complexity of the emotional utterances lies in between acted and realistic emotions. For this reason the EmotiW database is an important benchmark for developing new features for emotion recognition.

Speech signals are appealing for emotion recognition because they are simple to process and their analysis presents promising ways for future research [7, 27, 28, 29]. One of the main issues in designing an automatic emotion recognition system is the choice of features that can represent the corresponding emotions. In recent years, several different feature types proved to be useful in the context of emotion recognition from speech: Nicholson et al. [21] used feature extraction based on pitch and linear predictive coding (LPC). In earlier work, two feature types, comprising the modulation spectrum (ModSpec) and the relative spectral transform - perceptual linear prediction (RASTA-PLP), were used to recognize seven different emotions with an accuracy of approximately 79% on the Berlin database of emotional speech (EMO-DB [2]) [28, 22, 3, 32]. Other feature choices include the Mel Frequency Cepstral Coefficients (MFCC) as used in [17]. Developing powerful features which discriminate speech well with respect to the underlying emotional categories is still a hot topic in current research.

In this paper we present the usage of the enhanced autocorrelation (EAC) feature in the field of emotion recognition. The feature was originally developed by Tolonen and Karjalainen [34] as a multi pitch detector for music classification; to the best of our knowledge this feature has so far never been used in emotion estimation. We compare the performance of the EAC feature to a set of commonly applied features in emotion recognition from speech. The results show that EAC yields comparable and slightly better results on the AFEW 4.0 dataset. Like many other features, the EAC computation is based on an analysis of the audio signal in the frequency domain, but it results in a much higher output dimension, which gives the classifier a better chance to discriminate the emotional classes. In the following we give a short overview of the commonly used features, followed by a detailed introduction of the EAC feature. Afterwards we compare the features based on the challenge dataset and present the results of our approach on the test dataset. A final conclusion and discussion is given in the closing section of the paper.

2. FEATURES

In the following four subsections we give a brief introduction to commonly used features in emotion recognition from speech. This is followed by a more detailed description of the EAC method. To the best of our knowledge, EAC is used in the field of emotion recognition for the first time. Figure 1 gives an overview of the dimensions and shape of the utilized features. All features have in common that the audio signal is divided into overlapping windows; subsequently a Hamming window function

w(i) = 0.54 + 0.46 cos(2 pi i / M),  i = -M/2, ..., M/2 - 1,

with window size M and sample index i is applied to prevent edge effects.
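As a concrete illustration of this windowing step, the following sketch (our own illustration, not code from the paper) cuts a mono signal into overlapping frames and applies the centred Hamming window given above; NumPy and an already loaded signal array are assumed, and the 16 kHz sampling rate in the usage comment is only an example.

```python
import numpy as np

def frame_signal(signal, win_size, hop):
    """Split a mono signal into overlapping frames and apply the centred
    Hamming window w(i) = 0.54 + 0.46*cos(2*pi*i/M), i = -M/2 .. M/2-1."""
    i = np.arange(-(win_size // 2), win_size // 2)
    window = 0.54 + 0.46 * np.cos(2.0 * np.pi * i / win_size)
    starts = range(0, len(signal) - win_size + 1, hop)
    return np.stack([signal[s:s + win_size] * window for s in starts])

# Example: 40 msec windows with a 20 msec hop at an assumed 16 kHz sampling rate
# frames = frame_signal(audio, win_size=640, hop=320)
```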
Finally, in subsection 2.6 we propose a feature fusion method to reduce the number of instances to one per film sequence.

Figure 1: Shape comparison of all features on one avi file of class angry. From top to bottom: Mel Frequency Cepstral Coefficients, Modulation Spectrum, Linear Predictive Coding, Relative Spectral Perceptual Linear Predictive, Enhanced Autocorrelation, audio signal in frequency domain and audio signal in time domain.

2.1 Mel Frequency Cepstral Coefficients

The MFCC representation of a signal is motivated by the way humans perceive audio in the ear. A filter bank with linearly spaced filters in the lower frequencies and logarithmically scaled filters in the upper frequencies, called the MEL filter bank, is used during feature extraction. MFCC are commonly used short-term features because of their compact ability to represent the important information in speech processing applications while retaining most of the phonetically significant acoustics [18]. On each window a short-term fast Fourier transformation (FFT) is computed, and on the result the MEL filter bank of triangular filters is applied. The discrete cosine transformation (DCT) on each filter output yields the de-correlated cepstral coefficients of each window. A typical filter bank order ranges from 8 to 24. In this work we extracted the coefficients on 40 msec windows with an overlap of 20 msec and a filter bank order of 20.
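A minimal sketch of the MFCC pipeline just described (FFT, MEL filter bank, log, DCT), assuming the librosa library is available; the parameter values mirror the 40 msec / 20 msec / order 20 setup above, but the function is our own illustration rather than the extraction code used in the paper.

```python
import librosa

def mfcc_features(path, n_mfcc=20, win_sec=0.040, hop_sec=0.020):
    """Per-window MFCCs: FFT -> triangular MEL filter bank -> log -> DCT."""
    y, sr = librosa.load(path, sr=None, mono=True)
    return librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=n_mfcc,
        n_fft=int(win_sec * sr), hop_length=int(hop_sec * sr),
        window="hamming",
    ).T  # shape: (number of windows, n_mfcc)
```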

Figure 2: A window out of the audio signal (top) and the corresponding EAC analysis (bottom) of a film sequence. In the middle the zero-clipped autocorrelation signal and its time-doubled curve are shown.

2.2 Modulation Spectrum

To reduce the influence of environmental noise, Hermansky [12] introduced the modulation spectrum of speech to capture the temporal dynamics of a speech segment. To obtain the long-term temporal modulations in the range from 2 Hz to 16 Hz, window sizes of 200 msec up to several seconds are applied. Due to the short audio segments of the EmotiW challenge we fixed the window size at 200 msec. As for MFCC, first an FFT is computed on overlapping subwindows and the MEL filter bank is applied as well (we used 8 filters in this work). Instead of applying the DCT on each subwindow, the responses of all subwindows are aggregated and an FFT over the energy response per filter is calculated, yielding the final feature vector. Strong rates of change of the vocal tract, which hold important linguistic information, are represented in the corresponding dimensions of the feature.

2.3 Linear Predictive Coding

Instead of analysing the signal with FFT-based approaches, in LPC the speech sample s(t) is approximated by a linear combination of the past p samples [14]:

s(t) ~ a_1 s(t-1) + ... + a_p s(t-p),   (1)

where the coefficients a_1, ..., a_p are assumed to be constant within a short signal segment. This results in a p-dimensional vector corresponding to a curve fitted around the peaks of the short-term log magnitude spectrum of the signal. As with the previous features the information is compressed, here while avoiding a transformation from the time to the frequency domain. In this work we use p = 12, extracted from 40 msec windows with 20 msec overlap.

Figure 3: Audio signal and the corresponding EAC analysis of a film sequence. A falling multi pitch in the very beginning, a low-frequency single pitch from window 12 to 16 and a very noisy and wide pitch in the end can be observed.

2.4 Relative Spectral Perceptual Linear Predictive Coding

The perceptually and biologically motivated concepts of critical bands and the equal loudness curve build the basis of perceptual linear prediction (PLP) [13]. The sound pressure (dB) that is required to perceive a sound of any frequency as loud as a reference sound of 1 kHz is approximated by

E(w) = ((w^2 + 56.8 * 10^6) w^4) / ((w^2 + 6.3 * 10^6)^2 (w^2 + 0.38 * 10^9)).   (2)

As a result of this equation, frequencies below 1 kHz need higher sound pressure levels than the reference sound, while sounds between 2 and 5 kHz need less pressure. As for MFCC, a critical band filtering is done (usually with about 21 filters), but in contrast to MFCC the filtering is linear on the Bark scale and the filter shape is not triangular. PLP is sensitive to spectral changes caused by transmission channels, e.g. by different microphones or telephone voice compression algorithms. Therefore, after band filtering, Hermansky suggested equal loudness conversion by transforming the spectrum to the logarithmic domain, applying a relative spectral (Rasta) filtering, and using an exponential re-transformation. Finally, LPC coefficients are calculated similarly to subsection 2.3 over the critical band energies. In this work features with a filter order of 21, a window length of 25 msec and an offset of 10 msec have been applied.
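To make the linear prediction step of subsection 2.3 (which also closes the RASTA-PLP chain) concrete, here is a small sketch that estimates the coefficients a_1, ..., a_p of Eq. (1) by solving the autocorrelation (Yule-Walker) normal equations. It assumes NumPy and a non-silent, already windowed frame, and is our own illustration rather than the authors' implementation.

```python
import numpy as np

def lpc_coefficients(frame, p=12):
    """LPC via the autocorrelation method: solve R a = r, where
    R[i, j] = r(|i - j|) and r(k) is the frame autocorrelation at lag k."""
    n = len(frame)
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(p + 1)])
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    a = np.linalg.solve(R, r[1:p + 1])
    return a  # s(t) is approximated by a[0]*s(t-1) + ... + a[p-1]*s(t-p)
```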
2.5 Enhanced Auto Correlation

The EAC feature was originally introduced by Tolonen and Karjalainen [34] as a multi pitch analysis method for various fields of audio signal processing. In our work we use a slightly modified version of their extraction process in order to extract the main harmonics of the vocal cords and the glottal source of the speech signal. Different emotional states can cause varying muscle tension in the vocal tract, which influences the produced sound signal. Extracting the multi pitch harmonics based on the periodicity of the signal could therefore be a reliable feature containing a lot of emotional information independent of the underlying spoken content. In addition, the EAC feature is a very high-dimensional feature in the field of audio processing, which might help a classifier to distinguish between the distributions of the emotional classes.
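The extraction steps detailed in the remainder of this section (the generalized autocorrelation of Eq. (3), zero clipping, and subtraction of the time-doubled curve) can be sketched per window as follows. This is our own reading of the procedure, not the authors' code; it assumes NumPy, a frame that has already been Hamming-windowed as described at the beginning of Section 2, and the parameter values reported below (k = 1/3, 1024-sample windows yielding 512 EAC dimensions).

```python
import numpy as np

def eac_frame(frame, k=1.0 / 3.0):
    """Enhanced autocorrelation of one Hamming-windowed frame."""
    m = len(frame)
    spec = np.abs(np.fft.fft(frame)) ** k            # magnitude compression (cf. Eq. 3)
    acf = np.real(np.fft.ifft(spec))                 # generalized autocorrelation
    clipped = np.clip(acf[:m // 2], 0.0, None)       # keep positive half, clip at zero
    lags = np.arange(m // 2)
    doubled = np.interp(lags / 2.0, lags, clipped)   # curve stretched by two in time
    return np.clip(clipped - doubled, 0.0, None)     # e.g. 512 values for a 1024-sample frame
```

Using linear interpolation for the time-doubled curve is only one possible discrete realization of the time stretching; the description in the text leaves the exact interpolation open.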

Feature type | instance wise (voting - late fusion) | window average (early fusion) | window variance (early fusion)
MFCC     | 24.7 (2.44) | 25.6 (1.56) | 22.8 (2.6)
ModSpec  | 28.4 (1.71) | 28.0 (3.86) | 25.1 (1.95)
LPC      | 21.9 (0.8)  | 20.6 (2.18) | 20.6 (3.23)
RastaPLP | 22.8 (1.61) | 19.9 (1.36) | 29.1 (1.6)
EAC      | 29.2 (1.13) | 30.7 (3.82) | 30.7 (2.8)

Table 1: Results for emotion recognition with the utilized features, based on a 10-fold cross-validation on the union of the train and validation set (accuracy in percent, standard deviation in brackets).

As in most other feature extraction procedures, we first divide the signal into overlapping Hamming windows. For each window containing the input signal S the autocorrelation function

acf(S) = Re( IFFT( |FFT(Hamming(S))|^k ) )   (3)

is applied in order to detect periodic signal parts. The parameter k allows control of the periodicity detection via a non-linear processing in the frequency domain. Peaks in the autocorrelation curve are an indicator of pitch periods. Normally, a lot of redundant information and noise is part of this curve. To improve the reliability of the detection, the autocorrelation result is clipped at zero, i.e. all negative values are set to zero. In order to remove the peaks at multiples of the fundamental period, which are caused by the autocorrelation function, the time-doubled signal of the autocorrelation function is subtracted from the autocorrelation signal, and again all negative values are set to zero, yielding the final EAC feature vector.

Figure 2 shows an audio signal window after applying the Hamming function (upper part). In the middle plot, the red curve shows the zero-clipped autocorrelation curve of the window and the green curve its time-doubled version. Finally, the bottom plot displays the EAC curve, showing a wide single peak at 31. Keeping in mind that this procedure is repeated for every sliding window of the audio signal (Figure 3, top), the EAC result can be drawn as in Figure 3 (bottom), where the x-axis displays the time, the y-axis denotes the dimensions of the feature, and the color corresponds to the value at a given point (from blue = zero to red = one). One can see a falling multi pitch in the very beginning, a low-frequency single pitch from window 12 to 16 and a very noisy and wide pitch in the end. In the following, the parameters of the EAC were set to a window size of 1024 samples (47 msec) with an overlap of 512 samples, resulting in 512 EAC dimensions per window. The parameter k was set to 1/3, a choice based on empirical experience.

2.6 Per film feature representation

All features described above are derived from short time windows of a signal, resulting in a large number (200-300) of feature vectors per utterance or, in case of the challenge, per film sequence. Keeping in mind that a typical EmotiW challenge subset contains about 400 sequences, the total number of instances presented to the classifier is above 100,000. In order to reduce the number of training instances, we applied two different early fusion techniques which reduce the number of instances to one per film sequence. First we compute the average vector x_mu = (1/T) sum_{t=1..T} x_t of all t = 1, ..., T feature vectors per sequence. Computing the average loses the variability within a sequence; in order to reduce this disadvantage we also compute the variance x_sigma2 = (1/T) sum_{t=1..T} (x_t - x_mu)^2 of a vector sequence. In the following we build separate classifiers on the windowed sequences, the sequence averages and the sequence variances.

3. RESULTS

The evaluation is divided into two parts.
First, the results of the feature comparison, ignoring the underlying challenge protocol, are presented. Second, we chose the most promising features, namely EAC and ModSpec, as well as a fusion of all features, to participate in the challenge. The other feature types were evaluated on the test set without taking them into account in the challenge.

3.1 Feature comparison

To compare the performance of the features we rearranged the train and validation sets by combining them into a single dataset containing 962 film sequences. We extracted all features presented in section 2 in the windowed sequence, sequence average and sequence variance options and applied a Support Vector Machine (SVM) with Gaussian kernel to each feature type. Each SVM was optimized with respect to its specific parameters C and gamma based on the unweighted accuracy measure. A 10-fold cross-validation was applied. Table 1 gives an overview of the achieved results. For the single window based features as well as for the sequence average and variance features, the EAC outperforms all other features. The most commonly used MFCC feature performs only moderately on the EmotiW dataset, which is not very surprising given our results from last year's challenge participation [20]. The ModSpec feature performs second best on window instances and on sequence average vectors, and third best on the variance of the window based vectors.

3.2 Challenge Results

Based on the previous results we decided to use the ModSpec and EAC features in all three (full, average and variance) options for the participation in the challenge. The six SVMs were trained and optimized on the train and validation set according to the challenge guidelines. Table 2 gives an overview of the results and confirms the dominance of the EAC feature family. Finally, a fusion architecture was applied by summing the probabilistic outputs of the SVM classifiers, which form an ensemble of six members. The fusion architecture yields an overall accuracy of 40.1%, which is slightly higher than each single feature. As in the feature comparison results, the EAC features are again better than the ModSpec feature.
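For illustration, the per-feature classification setup of subsection 3.1 and the probabilistic sum fusion of subsection 3.2 could look roughly as follows with scikit-learn; the grid values and function names are our own assumptions, not the configuration actually used in the paper.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

def train_feature_svm(X, y):
    """One Gaussian-kernel SVM per feature type; C and gamma are tuned by
    grid search with 10-fold cross-validation (grid values are illustrative)."""
    grid = {"C": [1.0, 10.0, 100.0], "gamma": [1e-4, 1e-3, 1e-2]}
    search = GridSearchCV(SVC(kernel="rbf", probability=True), grid,
                          cv=StratifiedKFold(n_splits=10))
    return search.fit(X, y).best_estimator_

def sum_fusion(classifiers, feature_sets):
    """Ensemble decision by summing the class probabilities of all members."""
    probs = sum(clf.predict_proba(X) for clf, X in zip(classifiers, feature_sets))
    return np.argmax(probs, axis=1)  # indices into the shared class list
```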

Table 2: Results comparison for emotion recognition, trained on the train set according to the challenge guidelines (classification accuracy in percent on the test set), listing the MFCC, ModSpec, LPC, RastaPLP and EAC features in the instance wise, window average and window variance options. Fusion of all features: 40.1; challenge baseline: 26.2.

Table 3: Confusion matrix of the average EAC feature results on the test set for the classes angry, disgust, fear, happy, neutral, sad and surprise (accuracy in percent, confusion matrix in absolute numbers).

Table 3 shows the confusion matrix, which reveals a strong imbalance of the classification accuracies across the different classes. Neutral and angry could be recognized very well, while disgust and surprise could not be recognized at all. Compared to last year's results, where we also observed an imbalance of the classification accuracy towards the classes angry, neutral and happy, this imbalance is not very surprising. The poor results for disgust, sadness and surprise may result from the smaller amount of training data for these classes and their lower a priori probability in the test data. In the context of the challenge this result is somewhat unsatisfying, whereas in the scope of affective computing, where neutral and angry are very important classes, it is a promising result.

4. CONCLUSIONS

Following our work on feature selection last year, we focused more on the development and usage of new or alternative features for speech based emotion recognition on datasets which are recorded under mostly realistic and not strictly controlled conditions, such as the AFEW 4.0 dataset. In this work we presented the EAC as a new feature and compared it to several state-of-the-art features. All common features were outperformed in all cases. The best result on the challenge dataset (40% accuracy) was achieved by using a fusion of all features together and was slightly better than the best single feature (EAC average with 37% accuracy). In our opinion it is very important to search for new features which can discriminate speech with respect to emotions better than the existing ones, which have mostly been developed for speech recognition, speaker identification or music classification. Based on better features, the underlying classification problem could become much easier, which would then result in higher classification accuracy based on single frame or short-term signal analysis. It may also be worthwhile to analyse features with respect to their discrimination ability using correlation dimension analyses [35]. Improving these basic input results could then dramatically improve the results obtained after a well designed multimodal fusion architecture.

5. REFERENCES

[1] T. Bänziger, H. Pirker, and K. Scherer. GEMEP - Geneva multimodal emotion portrayals: A corpus for the study of multimodal emotional expressions. In Proceedings of LREC, volume 6, pages 15-19, 2006.
[2] F. Burkhardt, A. Paeschke, M. Rolfes, W. F. Sendlmeier, and B. Weiss. A database of German emotional speech. In Interspeech, volume 5, 2005.
[3] F. Dellaert, T. Polzin, and A. Waibel. Recognizing emotion in speech. In Proceedings of the Fourth International Conference on Spoken Language Processing (ICSLP 96), volume 3. IEEE, 1996.
[4] A. Dhall, R. Goecke, J. Joshi, M. Wagner, and T. Gedeon. Emotion recognition in the wild challenge. ACM ICMI, 2013.
[5] A. Dhall, R. Goecke, J. Joshi, M. Wagner, and T. Gedeon. Emotion recognition in the wild challenge 2014: Baseline, data and protocol. ACM ICMI, 2014.
[6] A. Dhall, R. Goecke, S. Lucey, and T. Gedeon. Collecting large, richly annotated facial-expression databases from movies. IEEE Multimedia, 19:34-41, 2012.
[7] N. Fragopanagos and J. Taylor. Emotion recognition in human-computer interaction. Neural Networks, 18:389-405, 2005.
[8] N. H. Frijda. Recognition of emotion. Advances in Experimental Social Psychology, 4, 1969.
[9] M. Glodek, M. Schels, G. Palm, and F. Schwenker. Multi-modal fusion based on classifiers using reject options and Markov fusion networks. In Proceedings of the International Conference on Pattern Recognition (ICPR). IEEE, 2012.
[10] M. Glodek, S. Scherer, F. Schwenker, and G. Palm. Conditioned hidden Markov model fusion for multimodal classification. In INTERSPEECH, 2011.
[11] M. Glodek, S. Tschechne, G. Layher, M. Schels, T. Brosch, S. Scherer, M. Kächele, M. Schmidt, H. Neumann, G. Palm, et al. Multiple classifier systems for the classification of audio-visual emotional states. In Affective Computing and Intelligent Interaction. Springer, 2011.
[12] H. Hermansky. The modulation spectrum in the automatic recognition of speech. In Proceedings of the 1997 IEEE Workshop on Automatic Speech Recognition and Understanding. IEEE, 1997.
[13] H. Hermansky, N. Morgan, A. Bayya, and P. Kohn. RASTA-PLP speech analysis technique. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 1. IEEE, 1992.
[14] F. Itakura. Line spectrum representation of linear predictor coefficients of speech signals. The Journal of the Acoustical Society of America, 57(S1):S35, 1975.
[15] M. Kächele, M. Glodek, D. Zharkov, S. Meudt, and F. Schwenker. Fusion of audio-visual features using hierarchical classifier systems for the recognition of affective states and the state of depression. In Proceedings of ICPRAM, 2014.
[16] T. Kanade, J. F. Cohn, and Y. Tian. Comprehensive database for facial expression analysis. In Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition. IEEE, 2000.
[17] C. M. Lee, S. Yildirim, M. Bulut, A. Kazemzadeh, C. Busso, Z. Deng, S. Lee, and S. S. Narayanan. Emotion recognition based on phoneme classes. In Proceedings of ICSLP 2004, 2004.
[18] B. Logan et al. Mel frequency cepstral coefficients for music modeling. In ISMIR, 2000.
[19] S. Meudt and F. Schwenker. On instance selection in audio based emotion recognition. In Artificial Neural Networks in Pattern Recognition. Springer, 2012.
[20] S. Meudt, D. Zharkov, M. Kächele, and F. Schwenker. Multi classifier systems and forward backward feature selection algorithms to classify emotional coloured speech. In Proceedings of the 15th ACM International Conference on Multimodal Interaction. ACM, 2013.
[21] J. Nicholson, K. Takahashi, and R. Nakatsu. Emotion recognition in speech using neural networks. Neural Computing and Applications, 9:290-296, 2000.
[22] G. Palm and F. Schwenker. Sensor-fusion in neural networks. In E. Shahbazian, G. Rogova, and M. J. DeWeert, editors, Harbour Protection Through Data Fusion Technologies. Springer, 2009.
[23] D. Rösner, J. Frommer, R. Friesen, M. Haase, J. Lange, and M. Otto. LAST MINUTE: a multimodal corpus of speech-based user-companion interactions. In LREC, 2012.
[24] S. Schachter. The interaction of cognitive and physiological determinants of emotional state. Advances in Experimental Social Psychology, 1:49-80, 1964.
[25] M. Schels, M. Glodek, S. Meudt, S. Scherer, M. Schmidt, G. Layher, S. Tschechne, T. Brosch, D. Hrabal, S. Walter, G. Palm, H. Neumann, H. Traue, and F. Schwenker. Multi-modal classifier-fusion for the recognition of emotions. In M. Rojc and N. Campbell, editors, Coverbal Synchrony in Human-Machine Interaction, chapter 4. CRC Press, 2013.
[26] M. Schels, M. Kächele, M. Glodek, D. Hrabal, S. Walter, and F. Schwenker. Using unlabeled data to improve classification of emotional states in human computer interaction. Journal on Multimodal User Interfaces, 8(1):5-16, 2014.
[27] K. R. Scherer, T. Johnstone, and G. Klasmeyer. Vocal expression of emotion. In Handbook of Affective Sciences, chapter 23. Oxford University Press, 2003.
[28] S. Scherer, M. Oubbati, F. Schwenker, and G. Palm. Real-time emotion recognition from speech using echo state networks. In Artificial Neural Networks in Pattern Recognition. Springer Berlin Heidelberg, 2008.
[29] S. Scherer, F. Schwenker, and G. Palm. Emotion recognition from speech using multi-classifier systems and RBF-ensembles. In Speech, Audio, Image and Biomedical Signal Processing using Neural Networks. Springer Berlin Heidelberg, 2008.
[30] S. Scherer, F. Schwenker, and G. Palm. Classifier fusion for emotion recognition from speech. In W. Minker, M. Weber, H. Hagras, V. Callagan, and A. D. Kameas, editors, Advanced Intelligent Environments. Springer, 2009.
[31] B. Schuller, M. Valstar, F. Eyben, G. McKeown, R. Cowie, and M. Pantic. AVEC 2011 - the first international audio/visual emotion challenge. In Affective Computing and Intelligent Interaction. Springer, 2011.
[32] F. Schwenker, S. Scherer, M. Schmidt, M. Schels, and M. Glodek. Multiple classifier systems for the recognition of human emotions. In N. E. Gayar, J. Kittler, and F. Roli, editors, Proceedings of the 9th International Workshop on Multiple Classifier Systems (MCS 2010), LNCS 5997. Springer, 2010.
[33] Y.-L. Tian, T. Kanade, and J. F. Cohn. Facial expression analysis. In Handbook of Face Recognition. Springer, 2005.
[34] T. Tolonen and M. Karjalainen. A computationally efficient multipitch analysis model. IEEE Transactions on Speech and Audio Processing, 8(6):708-716, 2000.
[35] C. Traina, A. Traina, L. Wu, and C. Faloutsos. Fast feature selection using fractal dimension. 2000.
[36] S. Walter, S. Scherer, M. Schels, M. Glodek, D. Hrabal, M. Schmidt, R. Böck, K. Limbrecht, H. C. Traue, and F. Schwenker. Multimodal emotion classification in naturalistic user behavior. In Human-Computer Interaction. Towards Mobile and Intelligent Interaction Environments. Springer, 2011.


More information

Vital Responder: Real-time Health Monitoring of First- Responders

Vital Responder: Real-time Health Monitoring of First- Responders Vital Responder: Real-time Health Monitoring of First- Responders Ye Can 1,2 Advisors: Miguel Tavares Coimbra 2, Vijayakumar Bhagavatula 1 1 Department of Electrical & Computer Engineering, Carnegie Mellon

More information

Modeling and Recognizing Emotions from Audio Signals: A Review

Modeling and Recognizing Emotions from Audio Signals: A Review Modeling and Recognizing Emotions from Audio Signals: A Review 1 Ritu Tanwar, 2 Deepti Chaudhary 1 UG Scholar, 2 Assistant Professor, UIET, Kurukshetra University, Kurukshetra, Haryana, India ritu.tanwar2012@gmail.com,

More information

TESTS OF ROBUSTNESS OF GMM SPEAKER VERIFICATION IN VoIP TELEPHONY

TESTS OF ROBUSTNESS OF GMM SPEAKER VERIFICATION IN VoIP TELEPHONY ARCHIVES OF ACOUSTICS 32, 4 (Supplement), 187 192 (2007) TESTS OF ROBUSTNESS OF GMM SPEAKER VERIFICATION IN VoIP TELEPHONY Piotr STARONIEWICZ Wrocław University of Technology Institute of Telecommunications,

More information

Exploiting visual information for NAM recognition

Exploiting visual information for NAM recognition Exploiting visual information for NAM recognition Panikos Heracleous, Denis Beautemps, Viet-Anh Tran, Hélène Loevenbruck, Gérard Bailly To cite this version: Panikos Heracleous, Denis Beautemps, Viet-Anh

More information

Hearing Lectures. Acoustics of Speech and Hearing. Auditory Lighthouse. Facts about Timbre. Analysis of Complex Sounds

Hearing Lectures. Acoustics of Speech and Hearing. Auditory Lighthouse. Facts about Timbre. Analysis of Complex Sounds Hearing Lectures Acoustics of Speech and Hearing Week 2-10 Hearing 3: Auditory Filtering 1. Loudness of sinusoids mainly (see Web tutorial for more) 2. Pitch of sinusoids mainly (see Web tutorial for more)

More information

ACOUSTIC AND PERCEPTUAL PROPERTIES OF ENGLISH FRICATIVES

ACOUSTIC AND PERCEPTUAL PROPERTIES OF ENGLISH FRICATIVES ISCA Archive ACOUSTIC AND PERCEPTUAL PROPERTIES OF ENGLISH FRICATIVES Allard Jongman 1, Yue Wang 2, and Joan Sereno 1 1 Linguistics Department, University of Kansas, Lawrence, KS 66045 U.S.A. 2 Department

More information

Discrete Signal Processing

Discrete Signal Processing 1 Discrete Signal Processing C.M. Liu Perceptual Lab, College of Computer Science National Chiao-Tung University http://www.cs.nctu.edu.tw/~cmliu/courses/dsp/ ( Office: EC538 (03)5731877 cmliu@cs.nctu.edu.tw

More information

Comparison of selected off-the-shelf solutions for emotion recognition based on facial expressions

Comparison of selected off-the-shelf solutions for emotion recognition based on facial expressions Comparison of selected off-the-shelf solutions for emotion recognition based on facial expressions Grzegorz Brodny, Agata Kołakowska, Agnieszka Landowska, Mariusz Szwoch, Wioleta Szwoch, Michał R. Wróbel

More information

Linguistic Phonetics Fall 2005

Linguistic Phonetics Fall 2005 MIT OpenCourseWare http://ocw.mit.edu 24.963 Linguistic Phonetics Fall 2005 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms. 24.963 Linguistic Phonetics

More information

! Can hear whistle? ! Where are we on course map? ! What we did in lab last week. ! Psychoacoustics

! Can hear whistle? ! Where are we on course map? ! What we did in lab last week. ! Psychoacoustics 2/14/18 Can hear whistle? Lecture 5 Psychoacoustics Based on slides 2009--2018 DeHon, Koditschek Additional Material 2014 Farmer 1 2 There are sounds we cannot hear Depends on frequency Where are we on

More information

An active unpleasantness control system for indoor noise based on auditory masking

An active unpleasantness control system for indoor noise based on auditory masking An active unpleasantness control system for indoor noise based on auditory masking Daisuke Ikefuji, Masato Nakayama, Takanabu Nishiura and Yoich Yamashita Graduate School of Information Science and Engineering,

More information

Multimodal emotion recognition from expressive faces, body gestures and speech

Multimodal emotion recognition from expressive faces, body gestures and speech Multimodal emotion recognition from expressive faces, body gestures and speech Ginevra Castellano 1, Loic Kessous 2, and George Caridakis 3 1 InfoMus Lab, DIST - University of Genova Viale Causa 13, I-16145,

More information

A Review on Dysarthria speech disorder

A Review on Dysarthria speech disorder A Review on Dysarthria speech disorder Miss. Yogita S. Mahadik, Prof. S. U. Deoghare, 1 Student, Dept. of ENTC, Pimpri Chinchwad College of Engg Pune, Maharashtra, India 2 Professor, Dept. of ENTC, Pimpri

More information

Audio-visual Classification and Fusion of Spontaneous Affective Data in Likelihood Space

Audio-visual Classification and Fusion of Spontaneous Affective Data in Likelihood Space 2010 International Conference on Pattern Recognition Audio-visual Classification and Fusion of Spontaneous Affective Data in Likelihood Space Mihalis A. Nicolaou, Hatice Gunes and Maja Pantic, Department

More information

Lecture 3: Perception

Lecture 3: Perception ELEN E4896 MUSIC SIGNAL PROCESSING Lecture 3: Perception 1. Ear Physiology 2. Auditory Psychophysics 3. Pitch Perception 4. Music Perception Dan Ellis Dept. Electrical Engineering, Columbia University

More information

Smart Multifunctional Digital Content Ecosystem Using Emotion Analysis of Voice

Smart Multifunctional Digital Content Ecosystem Using Emotion Analysis of Voice International Conference on Computer Systems and Technologies - CompSysTech 17 Smart Multifunctional Digital Content Ecosystem Using Emotion Analysis of Voice Alexander I. Iliev, Peter Stanchev Abstract:

More information

MSAS: An M-mental health care System for Automatic Stress detection

MSAS: An M-mental health care System for Automatic Stress detection Quarterly of Clinical Psychology Studies Allameh Tabataba i University Vol. 7, No. 28, Fall 2017, Pp 87-94 MSAS: An M-mental health care System for Automatic Stress detection Saeid Pourroostaei Ardakani*

More information

Hearing the Universal Language: Music and Cochlear Implants

Hearing the Universal Language: Music and Cochlear Implants Hearing the Universal Language: Music and Cochlear Implants Professor Hugh McDermott Deputy Director (Research) The Bionics Institute of Australia, Professorial Fellow The University of Melbourne Overview?

More information

Error Detection based on neural signals

Error Detection based on neural signals Error Detection based on neural signals Nir Even- Chen and Igor Berman, Electrical Engineering, Stanford Introduction Brain computer interface (BCI) is a direct communication pathway between the brain

More information