Using the Soundtrack to Classify Videos


Using the Soundtrack to Classify Videos
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
dpwe@ee.columbia.edu  http://labrosa.ee.columbia.edu/

1. Machine Listening
2. Global Classification
3. Foreground Classification
4. Outstanding Issues

Soundtrack Classification - Dan Ellis 2011-5-4 1/16

1. Machine Listening

Task: extracting useful information from sound
- Describe: automatic narration, emotion
- Classify: environment awareness, speech/music, environmental sound, speech detection (VAD)
- Music domain: music recommendation, music transcription
- ASR, sound intelligence

The Information in Audio

Environmental recordings contain info on:
- location: type (restaurant, street, ...) and specifics
- activity: talking, walking, typing, ...
- people: generic (2 males), specific (Chuck & John)
- spoken content (sometimes)

but not:
- what people and things looked like
- day/night ...
... except when correlated with audible features

Applications

- Audio lifelog diarization (figure: days of personal-audio recordings segmented into episodes such as preschool, cafe, lecture, office, outdoor, lab, and meetings)
- Consumer video classification

Consumer Video Dataset

- 25 concepts from Kodak user study: boat, crowd, cheer, dance, ...
- top videos from YouTube search, filtered for raw, unedited material = 1873 videos
- manually relabeled with the concepts
(figure: concept co-occurrence matrix over the labels museum, picnic, wedding, animal, birthday, sunset, ski, graduation, sports, boat, parade, playground, baby, park, beach, dancing, show, group of two, night, one person, singing, cheer, crowd, music, group of 3+)

2. Global Classification

Baseline for soundtrack classification:
- divide the sound into short frames (e.g. 30 ms)
- calculate features (e.g. MFCC) for each frame
- describe the clip by statistics of the frames (mean, covariance) = "bag of features"
- classify by e.g. Mahalanobis distance + SVM
(figure: video soundtrack spectrogram, MFCC features, and MFCC covariance matrix)
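The whole-clip baseline above can be sketched in a few lines. This is an illustrative sketch only: synthetic random frames stand in for real MFCCs, and the feature extraction itself is assumed done elsewhere.

```python
import numpy as np

def clip_stats(frames):
    """Bag-of-features summary: mean and covariance over a clip's feature frames."""
    return frames.mean(axis=0), np.cov(frames, rowvar=False)

def mahalanobis(mu_a, mu_b, cov):
    """Mahalanobis distance between two clip means under one clip's covariance."""
    d = mu_a - mu_b
    return float(np.sqrt(d @ np.linalg.solve(cov, d)))

# stand-in for two clips' 13-dimensional MFCC frames (300 frames each)
rng = np.random.default_rng(0)
clip_a = rng.normal(0.0, 1.0, size=(300, 13))
clip_b = rng.normal(0.5, 1.0, size=(300, 13))

mu_a, cov_a = clip_stats(clip_a)
mu_b, _ = clip_stats(clip_b)
dist = mahalanobis(mu_a, mu_b, cov_a)
print(dist)
```

In the slide's pipeline these clip-level statistics (or distances between them) would then feed an SVM rather than be thresholded directly.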

Codebook Histograms

Convert the feature distribution to a multinomial:
- fit a global Gaussian mixture model to the MFCC features
- describe each clip by a histogram of mixture-component occupancy counts
- classify by distance on histograms (KL, chi-squared) + SVM
(figure: global GMM components over the MFCC(0)-MFCC(1) plane, and a per-category mixture-component histogram over 15 components)
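A minimal sketch of the codebook-histogram idea, with two substitutions to keep it self-contained: a tiny k-means codebook stands in for the slide's global GMM (hard assignment to the nearest center replaces mixture-component posteriors), and random 2-D points stand in for MFCC frames.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=20):
    """Tiny k-means codebook, standing in for the slide's global GMM."""
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
        assign = d.argmin(1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(0)
    return centers

def clip_histogram(frames, centers):
    """Normalized histogram of which codeword each frame falls nearest to."""
    d = ((frames[:, None, :] - centers[None]) ** 2).sum(-1)
    counts = np.bincount(d.argmin(1), minlength=len(centers))
    return counts / counts.sum()

def chi2(p, q, eps=1e-9):
    """Chi-squared distance between two histograms."""
    return 0.5 * float(((p - q) ** 2 / (p + q + eps)).sum())

train = rng.normal(size=(600, 2))          # stand-in "MFCC" training frames
centers = kmeans(train, 8)
h_a = clip_histogram(rng.normal(0.0, 1.0, (200, 2)), centers)
h_b = clip_histogram(rng.normal(2.0, 1.0, (200, 2)), centers)
print(chi2(h_a, h_b))
```

As on the slide, the histogram distances (chi-squared or KL) would be plugged into an SVM kernel for the actual classification.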

Global Classification Results (Lee & Ellis '10)

MFCC + GMM classifiers:
(figure: average precision per concept for guessing, single Gaussian + KL2, 8-GMM + Bhattacharyya, and 1024-GMM histogram + log(p(z|c)/p(z)))

Wide range in performance:
- audio concepts (music, ski) vs. non-audio (group, night)
- large AP uncertainty on infrequent classes

Average Precision.9.8.7.6.5.4.3.2.1 Combining with Video Video classification by SIFT codebooks Random Baseline Video Only Audio Only A+V Fusion one person group of 3+ music cheer group of 2 sport park crowd night animal singing shows ski dancing boat baby beach wedding museum parade playground Chang et al. 7 sunset picnic birthday Audio adds most for dancing, baby, museum... Soundtrack Classification - Dan Ellis 211-5-4 9 /16 MAP

3. Foreground Classification

Global vs. local class models: tell-tale acoustics may be washed out in whole-clip statistics. Try iterative realignment of HMMs (Lee, Ellis & Loui '10):
- old way: all frames contribute to the class model
- new way: limited temporal extents, e.g. a "YT baby" clip realigned into voice / baby / laugh / background segments
- background model shared by all clips
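The core of the realignment step is a Viterbi decode that decides which frames belong to a class state and which to the shared background. A minimal sketch of that decode, with hand-made two-state log-likelihoods in place of real GMM scores (the full method iterates this alignment with model re-estimation, which is omitted here):

```python
import numpy as np

def viterbi(log_lik, log_trans):
    """Best state path given per-frame state log-likelihoods (T x S)
    and state-transition log-probabilities (S x S)."""
    T, S = log_lik.shape
    delta = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    delta[0] = log_lik[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans   # scores[i, j]: i -> j
        back[t] = scores.argmax(0)
        delta[t] = scores.max(0) + log_lik[t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path

# toy clip: state 0 = background, state 1 = foreground class,
# with the foreground only plausible in the middle frames
ll = np.array([[0.0, -3.0]] * 5 + [[-3.0, 0.0]] * 5 + [[0.0, -3.0]] * 5)
sticky = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))  # self-transitions favored
path = viterbi(ll, sticky)
print(path)  # middle frames land in the foreground state
```

The sticky transition matrix is what gives the "limited temporal extents" behavior: isolated good frames are not enough to enter the class state.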

Transient Features (Cotton, Ellis & Loui '11)

- onset detector finds energy bursts
- best-SNR PCA basis to represent each ~300 ms x auditory-frequency patch
- "bag of transients"
- object-related ...
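A hedged sketch of the onset-detection step only: half-wave-rectified energy increase with simple peak picking, applied to a hand-made frame-energy sequence. The paper's actual detector and the PCA patch basis are not reproduced here.

```python
import numpy as np

def onset_strength(frame_energy):
    """Half-wave-rectified energy increase between successive frames."""
    diff = np.diff(frame_energy, prepend=frame_energy[0])
    return np.maximum(diff, 0.0)

def pick_onsets(strength, thresh):
    """Indices where the onset strength is a local maximum above a threshold."""
    hits = []
    for i in range(1, len(strength) - 1):
        if strength[i] > thresh and strength[i] >= strength[i - 1] \
                and strength[i] > strength[i + 1]:
            hits.append(i)
    return hits

# toy per-frame energies with two bursts
energy = np.array([1., 1., 1., 6., 6., 1., 1., 8., 8., 1.])
onsets = pick_onsets(onset_strength(energy), 2.0)
print(onsets)  # -> [3, 7]
```

Each detected onset would then anchor a time-frequency patch, projected onto the PCA basis to form the "bag of transients".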

Nonnegative Matrix Factorization (Smaragdis & Brown '03; Abdallah & Plumbley '04; Virtanen '07)

Decompose the spectrogram into templates + activations: X = W H
- fast, forgiving gradient-descent algorithm
- 2-D patches, sparsity control, ...
(figure: original mixture spectrogram with activations of Basis 1 (L2), Basis 2 (L1), Basis 3 (L1))
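The factorization X = W H can be sketched with the standard Lee-Seung multiplicative updates (a stand-in for the slide's specific gradient-descent variant, and without its sparsity control), on a toy "spectrogram" built from two spectral templates:

```python
import numpy as np

def nmf(X, k, iters=1000, eps=1e-9):
    """Multiplicative-update NMF minimizing ||X - W H||_F; W, H stay nonnegative."""
    rng = np.random.default_rng(0)
    F, T = X.shape
    W = rng.random((F, k)) + eps   # spectral templates (freq x components)
    H = rng.random((k, T)) + eps   # activations (components x time)
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# toy spectrogram: two templates switching on and off over four frames
X = np.outer([1., 0., 2., 0.], [1., 1., 0., 0.]) \
  + np.outer([0., 3., 0., 1.], [0., 0., 1., 1.])
W, H = nmf(X, k=2)
err = np.abs(X - W @ H).max()
print(err)  # reconstruction error shrinks toward zero
```

Because the toy matrix is exactly rank 2 and nonnegative, the two learned columns of W recover the templates up to scaling and permutation; on real spectrograms the templates come out as recurring spectral shapes.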

4. Outstanding Issues

- How to separate foreground & background?
- How to exploit prior knowledge of sounds?
- How to make classification credible?

Classifier Insight

How can we understand classification results? ... to inspire confidence & enrich results.
- temporal breakdown: spectrogram of YT_beach_1.mpeg (manual clip-level labels: beach, group of 3+, cheer)
- manual frame-level annotation (drum, wind, voice, cheer, water wave) vs. the Viterbi path through a 27-state Markov model (concept, speaker, and 2 global background states)
- cluster examples (e.g. cluster 123, "baby")

Summary

- Machine listening: getting useful information from sound
- Environmental sound classification ... from whole-clip statistics?
- Transient energy peaks ... separate foreground from background
- Classifier insight ... for confidence & analysis

References

[Chang et al. 2007] S.-F. Chang, D. Ellis, W. Jiang, K. Lee, A. Yanagawa, A. Loui, & J. Luo, "Large-scale multimodal semantic concept detection for consumer video," Proc. ACM MIR, 255-264, 2007.
[Lee & Ellis 2010] K. Lee & D. Ellis, "Audio-Based Semantic Concept Classification for Consumer Video," IEEE Tr. Audio, Speech, Lang. Process. 18(6): 1406-1416, Aug. 2010.
[Lee, Ellis & Loui 2010] K. Lee, D. Ellis, & A. Loui, "Detecting Local Semantic Concepts in Environmental Sounds using Markov Model based Clustering," Proc. ICASSP, 2278-2281, Dallas, 2010.
[Cotton, Ellis & Loui 2011] C. Cotton, D. Ellis, & A. Loui, "Soundtrack classification by transient events," Proc. ICASSP, to appear, Prague, 2011.
[Ellis, Zheng & McDermott 2011] D. Ellis, X. Zheng, & J. McDermott, "Classifying soundtracks with audio texture features," Proc. ICASSP, to appear, Prague, 2011.