SPEECH EMOTION RECOGNITION: ARE WE THERE YET?


1 SPEECH EMOTION RECOGNITION: ARE WE THERE YET?
CARLOS BUSSO
Multimodal Signal Processing (MSP) lab
The University of Texas at Dallas
Erik Jonsson School of Engineering and Computer Science

2 Why study emotion or attitude?
- Emotions play a crucial role in human interaction
- Emotional (vs. cognitive) reasoning
- Emotion is reflected in our body
- Our emotions change the minds of others
- People rely on emotion for making decisions
- Knowing the user's emotional state should help to adjust system performance
- Users can be more engaged and have a more effective interaction with the system

3 Speech: a multimodal signal

4 Emotion Recognition in the Lab
- Databases: acted data, categorical representation of emotions, few speakers, limited data
- Features: many features are selected, then the feature set is reduced (PCA, Fisher linear discriminant, sequential forward feature selection, etc.)
- Results: from 50% to 85% depending on the task [Pantic_2003, Cowie_2001]

5 Emotion Recognition in Real Applications
- Too much variability: speaker dependency, emotional descriptors, differences in acoustic environments
- Emotional models do not generalize!
- Results are strongly dependent on the recording conditions
- Models are not easily generalized to other databases or to online recognition tasks
- Speaker-dependent models give better performance than speaker-independent models [Austermann et al. 2005]
- Cross-corpus testing resulted in a drop in performance [Shami and Verhelst, 2007]
- How can we build models that generalize across problems?

6 Examples
Sample 1: [fru; ()] [ang; ()] [neu; ()] [fru; ()] [oth; (exasperated)] [neu; ()]
Sample 2:
Sample 3: [ang; ()] [ang; ()] [ang; ()]

7 Robustness and Generalization
We have made important progress, but challenges remain. At MSP:
- Databases: big corpora, reliable labels, natural behaviors
- Features: feature normalization, feature selection, feature representation
- Models: model adaptation, specialized detectors, temporal/contextual modeling

8 OUTLINE
- Introduction
- MSP-PODCAST: The Largest Speech Emotional Database
- Case Study 1: Multi-Task Learning
- Case Study 2: Training with Soft Labels
- Conclusions

Reza Lotfian and Carlos Busso, "Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings," IEEE Transactions on Affective Computing, vol. To appear,
Alec Burmania, Srinivas Parthasarathy, and Carlos Busso, "Increasing the reliability of crowdsourcing evaluations using online quality assessment," IEEE Transactions on Affective Computing, vol. 7, no. 4, October-December
Soroosh Mariooryad, Reza Lotfian, and Carlos Busso, "Building a naturalistic emotional speech corpus by retrieving expressive behaviors from existing speech corpora," in Interspeech 2014, Singapore, September 2014, pp

9 Current Emotional Corpora
- Lack of naturalness
- Limited in size
- Limited number of speakers
- Unbalanced emotional content

Corpus         Size            # Spkr.   Type     Lang.
IEMOCAP        12h26m          10        acted    English
MSP-IMPROV     9h35m           12        acted    English
CREMA-D        7,442 samples   91        acted    English
Chen Bimodal   9,900 samples   100       acted    English
Emo-DB         22m             10        acted    German
GEMEP          1,260 samples   10        acted    -
VAM-Audio      48m             47        spont.   German
TUM AVIC       10h23m          21        spont.   English
SEMAINE        6h21m           20        spont.   English
FAU-AIBO       9h12m           51        spont.   German
RECOLA         2h50m           46        spont.   French

10 Unbalanced Emotional Content
- Categorical labels: anger, happiness, sadness, neutral
- Dimensional or attribute-based labels: valence (negative vs. positive), arousal (calm vs. active)
- More accurate emotion descriptors (intensity)
[Figure: emoticons from acted databases; panels: IEMOCAP, MSP-IMPROV, SEMAINE, RECOLA, VAM]

11 Contribution: MSP-PODCAST
- Use existing podcast recordings
- Divide into speaker turns
- Emotion retrieval to balance the emotional content
- Annotate using a crowdsourcing framework
[Figure: podcast recording]

12 MSP-PODCAST Corpus
[Pipeline diagram blocks: Audio sharing website; Podcast audio (16 kHz, 16-bit PCM, mono); Diarization; Duration filter (2.75 s < duration < 11 s); SNR filter; Perceptual evaluation; Manual screening; Emotion retrieval; Remove telephone quality; Speech-only audio; Music detection; High-quality audio]
- Collection of audio recordings (podcasts)
- Naturalness and diversity of emotions
- Creative Commons copyright licenses
- Interviews, talk shows, news, discussions, education, storytelling, comedy, science, technology, politics, economics, business, arts, culture, sports

13 MSP-PODCAST Corpus
[Pipeline diagram repeated from slide 12]
- Automatic speaker diarization: single-speaker segments
- Duration longer than 2.75 s: long enough for annotators and for extracting reliable features
- Duration shorter than 11 s: emotional content does not change significantly
- High SNR, no music, no phone quality
- The speaker attribution intelligent service,
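The duration filter on this slide can be sketched in a few lines. This is only an illustration: the `Turn` structure and the `duration_filter` helper are hypothetical names, and only the two bounds (2.75 s and 11 s) come from the slide.

```python
from dataclasses import dataclass

# Bounds taken from the slide; everything else here is an assumption.
MIN_DURATION_S = 2.75  # long enough for annotators and for reliable features
MAX_DURATION_S = 11.0  # short enough that the emotion stays roughly constant


@dataclass
class Turn:
    """A single-speaker segment as produced by diarization (hypothetical)."""
    start_s: float
    end_s: float

    @property
    def duration_s(self) -> float:
        return self.end_s - self.start_s


def duration_filter(turns):
    """Keep turns whose duration lies strictly between the two bounds."""
    return [t for t in turns if MIN_DURATION_S < t.duration_s < MAX_DURATION_S]


turns = [Turn(0.0, 2.0), Turn(2.0, 7.5), Turn(7.5, 20.0), Turn(20.0, 23.0)]
kept = duration_filter(turns)
print([t.duration_s for t in kept])
```

In this toy input, the 2.0 s and 12.5 s turns are discarded and the 5.5 s and 3.0 s turns are kept.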

14 MSP-PODCAST Corpus
[Pipeline diagram repeated from slide 12]
- Retrieve samples that convey the desired emotion
- Develop and optimize different machine learning frameworks using existing databases
- Balance the emotional content

15 MSP-PODCAST Corpus
[Pipeline diagram repeated from slide 12]
- Subjective annotation is costly: only the retrieved samples are screened before being uploaded for annotation
- Diarization fails on overlapping speech and interrupting speakers

16 Perceptual Evaluation
- Use Amazon Mechanical Turk crowdsourcing
- Verify in real time whether a worker is spamming
- Phase A: collect reference set (gold standard)
- Phase B: interleave reference set with data (online quality assessment)
- Trace performance in real time
[Figure: sequence of data videos interleaved with reference-set videos (R)]

Alec Burmania, Srinivas Parthasarathy, and Carlos Busso, "Increasing the reliability of crowdsourcing evaluations using online quality assessment," IEEE Transactions on Affective Computing, vol. 7, no. 4, October-December
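The idea of interleaving reference items with data and tracing a worker's performance in real time can be illustrated with a simplified sketch. The agreement measure (fraction of exact matches with the gold label), the threshold, and the `monitor_worker` function are all assumptions made for illustration; the slide does not specify how agreement is measured or when a session is stopped.

```python
def monitor_worker(responses, gold, min_agreement=0.5, min_refs=2):
    """Trace a worker's agreement with the reference set during a session.

    responses: (item_id, label) pairs in presentation order.
    gold: dict mapping reference item ids to their gold-standard labels.
    Returns (accepted_data, stopped_early).
    Hypothetical rule: stop once at least `min_refs` reference items have
    been answered and the exact-match rate falls below `min_agreement`.
    """
    accepted = []
    hits = 0
    refs = 0
    for item_id, label in responses:
        if item_id in gold:
            # Reference item: update the running agreement.
            refs += 1
            hits += int(label == gold[item_id])
            if refs >= min_refs and hits / refs < min_agreement:
                return accepted, True
        else:
            # Regular data item: keep the annotation.
            accepted.append((item_id, label))
    return accepted, False


gold = {"r1": "hap", "r2": "neu"}
careful = [("d1", "hap"), ("r1", "hap"), ("d2", "sad"), ("r2", "neu")]
spammer = [("d1", "hap"), ("r1", "ang"), ("d2", "hap"), ("r2", "ang"), ("d3", "hap")]
print(monitor_worker(careful, gold))
print(monitor_worker(spammer, gold))
```

A real system would also have to decide what to do with annotations collected before the session was stopped; this sketch simply returns them.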

17 Status of the MSP-PODCAST: Ongoing Work
- With emotion labels: 20,988 sentences (35h, 1m)
- Segmented turns: 199,836 sentences from 1,019 podcasts
[Figure: arousal versus valence]

18 Status of the MSP-PODCAST: Ongoing Work
- Natural recordings
- The largest database
- Multiple speakers
- Rich emotional content
[Figure: arousal versus valence, with regions labeled hot anger, cold anger, happiness, and neutral]

19 MSP-PODCAST: Power of Data
- Effect of the amount of data on the performance of emotion recognition models
- Network: IS-2013 input (6,372 features, input nodes), hidden layers (N nodes), linear output layer (arousal)
- Number of layers: [2, 3, 4, 5]
- Number of nodes: [128, 256, 512, 1,024]
- Data: add 1,000 sentences at a time
- Example: arousal, evaluated with the concordance correlation coefficient (CCC)
[Figure: CCC for two layers and five layers, 256 nodes per layer]
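For readers unfamiliar with the metric on this slide, the concordance correlation coefficient can be computed as below. This is a minimal sketch using the standard definition (twice the covariance divided by the sum of the two variances and the squared difference of the means, with population statistics); the function name `ccc` is illustrative and is not taken from the talk.

```python
from statistics import fmean


def ccc(y_true, y_pred):
    """Concordance correlation coefficient between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mu_t = fmean(y_true)
    mu_p = fmean(y_pred)
    var_t = fmean([(t - mu_t) ** 2 for t in y_true])
    var_p = fmean([(p - mu_p) ** 2 for p in y_pred])
    cov = fmean([(t - mu_t) * (p - mu_p) for t, p in zip(y_true, y_pred)])
    denom = var_t + var_p + (mu_t - mu_p) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined for two identical constant sequences")
    return 2 * cov / denom


print(ccc([1, 2, 3], [1, 2, 3]))  # perfect agreement
print(ccc([1, 2, 3], [2, 3, 4]))  # perfectly correlated but biased by +1
```

Unlike Pearson correlation, CCC penalizes a constant offset: the second call is lower than the first even though the two sequences are perfectly correlated.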

20 OUTLINE
- Introduction
- MSP-PODCAST: The Largest Speech Emotional Database
- Case Study 1: Multi-Task Learning
- Case Study 2: Training with Soft Labels
- Conclusions

Srinivas Parthasarathy and Carlos Busso, "Jointly predicting arousal, valence and dominance with multi-task learning," in Interspeech 2017, Stockholm, Sweden, August
Nominated for Best Student Paper at Interspeech 2017!

21 Multi-Task Learning
- Prediction of arousal, valence and dominance
- Previous studies have considered these dimensions as orthogonal descriptors
- Interrelation between these emotional attributes
- Goal: predicting emotional attributes with a unified framework
- Multi-task learning (MTL) implemented with deep neural networks (DNN)

Srinivas Parthasarathy and Carlos Busso, "Jointly predicting arousal, valence and dominance with multi-task learning," in Interspeech 2017, Stockholm, Sweden, August

22 Multi-Task Emotion Recognition
- Leverage the relationship between attributes
- Arousal (calm versus active)
- Valence (negative versus positive)
- Dominance (weak versus strong)
[Figure: two network architectures, MTL-1 and MTL-2]

23 Multi-Task Emotion Recognition
- Weights α and β learned using the development set
- MTL: general α and β
- STL, arousal: α = 1, β = 0
- STL, valence: α = 0, β = 1
- STL, dominance: α = 0, β = 0
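The three single-task settings listed on this slide are consistent with a combined loss in which arousal is weighted by α, valence by β, and dominance by the remainder 1 - α - β. The slide does not show the formula itself, so the sketch below assumes that form; the function name `mtl_loss` and the numeric per-task losses are purely illustrative.

```python
def mtl_loss(loss_aro, loss_val, loss_dom, alpha, beta):
    """Weighted multi-task loss (assumed form, inferred from the STL cases).

    alpha weights arousal, beta weights valence, and dominance receives
    the remaining weight 1 - alpha - beta.
    """
    if alpha < 0 or beta < 0 or alpha + beta > 1 + 1e-9:
        raise ValueError("alpha and beta must be non-negative and sum to at most 1")
    return alpha * loss_aro + beta * loss_val + (1 - alpha - beta) * loss_dom


# Illustrative per-task losses (made-up numbers, not results from the talk).
l_aro, l_val, l_dom = 0.2, 0.5, 0.3

print(mtl_loss(l_aro, l_val, l_dom, alpha=1, beta=0))      # STL on arousal
print(mtl_loss(l_aro, l_val, l_dom, alpha=0, beta=1))      # STL on valence
print(mtl_loss(l_aro, l_val, l_dom, alpha=0, beta=0))      # STL on dominance
print(mtl_loss(l_aro, l_val, l_dom, alpha=0.5, beta=0.25))  # one MTL setting
```

With this parameterization, single-task learning is a corner case of the multi-task objective, which is why the same network can be compared under both regimes by changing only α and β.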

24 Multi-Task Emotion Recognition
Within-corpus evaluation
- Multi-task learning (MTL) is always better than single-task learning (STL)
- Performance increases as we increase the number of nodes
- Training set: rest of the corpus, 6,710 sentences
- Validation set: 10 speakers, 887 sentences
- Testing set: 50 speakers, 5,024 sentences
[Table: concordance correlation coefficient for arousal, valence, and dominance, by nodes / layers and type of task (STL, MTL, MTL)]

25 Multi-Task Emotion Recognition
Cross-corpus evaluation
- Performance drops with respect to within-corpus evaluations
- Benefit of multi-task learning increases (14%)
- Best performance with a lower number of nodes per layer
- Training set: IEMOCAP, MSP-IMPROV
- Validation set: 887 sentences
- Testing set: 50 speakers, 5,024 sentences
[Table: concordance correlation coefficient for arousal, valence, and dominance, by nodes / layers (256 / 2, 512 / 2, 1024 / 2) and type of task (STL, MTL, MTL)]

26 OUTLINE
- Introduction
- MSP-PODCAST: The Largest Speech Emotional Database
- Case Study 1: Multi-Task Learning
- Case Study 2: Training with Soft Labels
- Conclusions

R. Lotfian and C. Busso. Formulating emotion perception as a probabilistic model with application to categorical emotion classification. In Affective Computing and Intelligent Interaction (ACII 2017), San Antonio, TX, USA, October 2017

27 Training with Soft Labels
- Emotional labels often come from perceptual evaluations by multiple evaluators
- Expressive behaviors tend to be ambiguous, with blended emotions
- Evaluators disagree on the perceived emotion: noise or information?
- Assigning a single emotion per sentence oversimplifies the subjectivity in emotion perception
- Goal: leverage the information provided by multiple evaluators
- Training emotion recognition with soft labels
[Figure: evaluators assigning happy, sad, and angry]

28 Training with Soft Labels
Straightforward approach
- Use the distribution of emotions assigned by evaluators [Fayek et al., 2016]
- Sentence 1: happiness, happiness, neutral, happiness; soft label: neu = 0.25, hap = 0.75
- Sentence 2: neutral, happiness, neutral, neutral; soft label: neu = 0.75, hap = 0.25
- This approach ignores the relationship between emotional classes
- Prioritize separation of unrelated categories (e.g., anger versus happiness) over related emotions (anger versus disgust)
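The straightforward approach on this slide amounts to normalizing the evaluators' votes into a distribution. The sketch below reproduces the two sentences from the slide; the helper name `soft_label` is illustrative and is not taken from the cited work.

```python
from collections import Counter


def soft_label(annotations, classes):
    """Turn a list of evaluator votes into a distribution over `classes`."""
    if not annotations:
        raise ValueError("at least one annotation is required")
    counts = Counter(annotations)
    unknown = set(counts) - set(classes)
    if unknown:
        raise ValueError(f"annotations outside the class set: {sorted(unknown)}")
    total = len(annotations)
    return {c: counts[c] / total for c in classes}


classes = ["neutral", "happiness"]
sentence_1 = soft_label(["happiness", "happiness", "neutral", "happiness"], classes)
sentence_2 = soft_label(["neutral", "happiness", "neutral", "neutral"], classes)
print(sentence_1)
print(sentence_2)
```

A majority-vote label would collapse sentence 1 to "happiness" and sentence 2 to "neutral", discarding the minority vote in each case; the soft label keeps it. Note that this construction treats every pair of classes as equally distant, which is the limitation the slide points out.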

29 Emotion Perception as a Probabilistic Model
- Each speech segment has a non-observable multivariate Gaussian distribution
- The task is to derive the distribution for a speech segment
- Use the expected value of the distribution as a soft label
- Sentence 1: happiness, happiness, neutral, happiness; soft label: neu = 0.3, hap = 0.7
- Sentence 2: neutral, happiness, neutral, neutral; soft label: neu = 0.55, hap =
[Figure: evaluations (H, N) of Sentence 1 and Sentence 2 plotted on happiness and neutral axes]

30 Experimental Evaluations
- MSP-Podcast
- Test set: data from 50 speakers (4,283 segments)
- Development set: data from 10 speakers (1,860 segments)
- Training set: rest of the corpus (7,289 segments)
- Seven-class problem: anger, sadness, happiness, surprise, disgust, contempt, and neutral (chance is 14%)
- Acoustic features: eGeMAPS set [Eyben et al., 2016]
- DNN: 2 hidden layers, 512 nodes, softmax layer
[Figure: network with eGeMAPS input (88D), 2 hidden layers, and a softmax layer with outputs such as Hap, Neu, Sad]

31 Results
- Performance metrics: average recall, average precision, and F1-score
- Human performance is only 39.6% (hard problem)
- Soft labels from the expected intensity of emotion (SL-EIE) improved performance over majority-vote labels and over the soft labels proposed in previous work
[Table: Rec [%], Pre [%], and F1-Score for Human Performance, Majority vote, Soft-label [Fayek, 2016], and SL-EIE [proposed]]
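The three metrics named on this slide can be computed as follows. This is a sketch under one stated assumption: the F1-score is taken as the harmonic mean of the class-averaged precision and recall, since the slide does not say whether F1 is computed this way or averaged per class. The function name `macro_metrics` and the toy labels are illustrative.

```python
def macro_metrics(y_true, y_pred, classes):
    """Return (average recall, average precision, F1) over `classes`.

    Assumption: F1 is the harmonic mean of the two class-averaged rates.
    A class with no true (or no predicted) samples contributes 0 to the
    corresponding average.
    """
    recalls, precisions = [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        n_true = sum(1 for t in y_true if t == c)
        n_pred = sum(1 for p in y_pred if p == c)
        recalls.append(tp / n_true if n_true else 0.0)
        precisions.append(tp / n_pred if n_pred else 0.0)
    avg_rec = sum(recalls) / len(classes)
    avg_pre = sum(precisions) / len(classes)
    total = avg_rec + avg_pre
    f1 = 2 * avg_rec * avg_pre / total if total else 0.0
    return avg_rec, avg_pre, f1


y_true = ["ang", "ang", "neu", "neu"]
y_pred = ["ang", "neu", "neu", "neu"]
rec, pre, f1 = macro_metrics(y_true, y_pred, ["ang", "neu"])
print(rec, pre, f1)
```

Averaging over classes rather than over samples keeps a frequent class such as neutral from dominating the score, which matters for an unbalanced seven-class problem.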

32 OUTLINE
- Introduction
- MSP-PODCAST: The Largest Speech Emotional Database
- Case Study 1: Multi-Task Learning
- Case Study 2: Training with Soft Labels
- Conclusions

33 Summary
- Important contributions to increase the robustness of emotion recognition systems: at the resource level, and at the model level using deep learning
- Are we there yet? Not yet, but soon
- Temporal dynamic modeling
- Understanding and modeling the impact of emotion on other tasks
- Multimodal fusion using deep architectures
- Next step: use these models in real applications

34 Potential Impact
- Instrumental tools for health care
- Distance learning
- Security and defense (credibility assessment)

35 CARLOS BUSSO
Tel: (972) 883-4351
Email:
Web: http://utdallas.edu/~busso/
Multimodal Signal Processing (MSP)

- NAJMEH SADOUGHI, Ph.D. Student, Virtual characters
- REZA LOTFIAN, Ph.D. Student, Affective computing
- SRINIVAS PARTHASARATHY, Ph.D. Student, Affective computing
- KUSHA MURTHY, Ph.D. Student, Affective computing
- FEI TAO, Ph.D. Student, Audiovisual ASR
- MOHAMMED ABD EL-WAHAB, Ph.D. Student, Affective computing
- SUMIT JHA, Ph.D. Student, In-vehicle safety systems
- MICHELLE BANCROFT, Undergraduate Student, Emotion and speaker verification
- DOROTHY MANTLE, Undergraduate Student, In-vehicle safety system
- ASIM GAZI, Undergraduate Student, Human robot interaction
