Using the Soundtrack to Classify Videos

Size: px

Start display at page:

Download "Using the Soundtrack to Classify Videos"

Reynold Clarke
5 years ago
Views:

Using the Soundtrack to Classify Videos Dan Ellis Laboratory

Electrical Eng., Columbia Univ., NY USA dpwe@ee.columbia.

Global Classification 3. Foreground Classification 4.

1 Using the Soundtrack to Classify Videos Dan Ellis Laboratory for Recognition and Organization of Speech and Audio Dept. Electrical Eng., Columbia Univ., NY USA 1. Machine Listening 2. Global Classification 3. Foreground Classification 4. Outstanding Issues Soundtrack Classification - Dan Ellis /16

1. Machine Listening Describe Automatic Narration Emotion Music Recommendation Classify Environment Awareness ASR Music Transcription Sound Intelligence VAD

2 1. Machine Listening Describe Automatic Narration Emotion Music Recommendation Classify Environment Awareness ASR Music Transcription Sound Intelligence VAD Speech/Music Environmental Sound Speech Dectect Task Extracting useful information from sound Soundtrack Classification - Dan Ellis Music Domain /16

3 The Information in Audio Environmental recordings contain info on: location type (restaurant, street,...) and specifics activity talking, walking, typing,... people generic (2 males), specific (Chuck & John) spoken content (sometimes) but not: what people and things looked like day/night except when correlated with audible features Soundtrack Classification - Dan Ellis /16

Applications Audio Lifelog Diarization 9: 9:3 1: 1:3 11: 11:3 24-9-13

lecture DSP3 12: 12:3 13: office office outdoor group outdoor meeting2

4 Applications Audio Lifelog Diarization 9: 9:3 1: 1:3 11: 11: preschool cafe preschool cafe lecture Ron L2 cafe office outdoor lecture DSP3 12: 12:3 13: office office outdoor group outdoor meeting2 compmtg 13:3 lab 14: 14:3 cafe meeting2 Manuel outdoor office cafe office Mike 15: Arroyo? office 15:3 Consumer Video Classification 16: office outdoor office postlec office Sambarta? Soundtrack Classification - Dan Ellis /16

5 Consumer Video Dataset 25 concepts from 1G+KL2 (1/15) Kodak user study boat, crowd, cheer, dance,... from YouTube search filter for raw, unedited = 1873 videos manually relabel with concepts Concept overlap: Soundtrack Classification - Dan Ellis museum picnic wedding animal birthday sunset ski graduation sports boat parade playground baby park beach dancing show group of two night one person singing cheer crowd music group of 3+ Labeled Concepts Grab top 2 videos 1 8GMM+Bha (9/15) mc c s o n g s d b p b p p b s g s s b a w pm Overlapped Concepts plsa5+lognorm (12/15) /16

2. Global Classification Baseline for soundtrack classification divide sound into short frames (e.g.

MFCC) for each frame describe clip by statistics of frames (mean, covariance) = bag of features Video

1 2 3 4 5 6 7 8 9 time / sec 1 2 3 4 5 6 7 8 9 time / sec Classify by e.g.

6 2. Global Classification Baseline for soundtrack classification divide sound into short frames (e.g. 3 ms) calculate features (e.g. MFCC) for each frame describe clip by statistics of frames (mean, covariance) = bag of features Video Soundtrack MFCC features freq / khz MFCC bin VTS_4_1 - Spectrogram time / sec time / sec Classify by e.g. Mahalanobis distance + SVM Soundtrack Classification - Dan Ellis / level / db value MFCC dimension MFCC Covariance Matrix MFCC covariance MFCC dimension 5-5

7 Codebook Histograms Convert nonplanar distributions to multinomial MFCC(1) Global Gaussian Mixture Model 7 count Per-Category Mixture Component Histogram 15 MFCC features MFCC() Classify by distance on histograms KL, Chi-squared + SVM GMM mixture Soundtrack Classification - Dan Ellis /16

8 Average Precision Global Classification Results Lee & Ellis 1 Guessing MFCC + GMM Classifier Single Gaussian + KL2 8 GMM + Bha 124 GMM Histogram + 5 log(p(z c)/pz)) Wide range in performance audio (music, ski) vs. non-audio (group, night) large AP uncertainty on infrequent classes group of 3+ music crowd cheer singing one person night group of two show dancing beach park baby playground parade boat sports graduation ski sunset birthday animal wedding picnic museum MEAN Concept class Soundtrack Classification - Dan Ellis /16

9 Average Precision Combining with Video Video classification by SIFT codebooks Random Baseline Video Only Audio Only A+V Fusion one person group of 3+ music cheer group of 2 sport park crowd night animal singing shows ski dancing boat baby beach wedding museum parade playground Chang et al. 7 sunset picnic birthday Audio adds most for dancing, baby, museum... Soundtrack Classification - Dan Ellis /16 MAP

10 3. Foreground Classification Global vs. local class models tell-tale acoustics may be washed out in statistics try iterative realignment of HMMs: Lee, Ellis & Loui 1 YT baby 2: voice baby laugh freq / khz Old Way: All frames contribute freq / khz New Way: Limited temporal extents voice bg baby time / s voice voice baby laugh bg baby laugh background model shared by all clips time / s laugh Soundtrack Classification - Dan Ellis /16

11 Transient Features Cotton, Ellis, Loui 11 Onset detector finds energy bursts best SNR PCA basis to represent each 3 ms x auditory frq bag of transients Object-related... Soundtrack Classification - Dan Ellis /16

12 Nonnegative Matrix Factorization Decompose spectrograms into 5478 templates activation X = W H freq / Hz Original mixture Smaragdis Brown 3 Abdallah Plumbley 4 Virtanen 7 fast forgiving gradient descent algorithm 2D patches sparsity control... Basis 1 (L2) Basis 2 (L1) Basis 3 (L1) time / s Soundtrack Classification - Dan Ellis /16

13 4. Outstanding Issues How to separate foreground & background? How to exploit prior knowledge of sounds? How to make classification credible? Soundtrack Classification - Dan Ellis /16

14 pts + 2 global BGs Classifier Insight How can we understand classification results?.. to inspire confidence & enrich results Temporal breakdown freq / Hz Spectrogram : YT_beach_1.mpeg, Manual Clip-level Labels : beach, group of 3+, cheer 4k 3k 2k 1k etc speaker4 speaker3 speaker2 unseen1 Background global BG2 global BG1 music cheer ski singing parade dancing wedding sunset sports show playground Manual Frame-level Annotation drum wind voice voice cheer voice cheer voice voice voice voice water wave Viterbi Path by 27 state Markov Model Cluster examples picnic Soundtrack Classification - Dan Ellis /16 5 cluster 123 (baby)

15 Summary Machine Listening: Getting useful information from sound Environmental sound classification... from whole-clip statistics? Transients energy peaks... separate foreground background Classifier insight... for confidence & analysis Soundtrack Classification - Dan Ellis /16

16 References [Chang et al. 27] S.-F. Chang, D. Ellis, W. Jiang, K. Lee, A. Yanagawa, A. Loui, & J. Luo, Large-scale multimodal semantic concept detection for consumer video, Proc. ACM MIR, , 27. [Lee & Ellis 21] K. Lee & D. Ellis, Audio-Based Semantic Concept Classification for Consumer Video, IEEE Tr. Audio, Speech, Lang. Process. 18 (6): , Aug. 21. [Lee, Ellis & Loui 21] K. Lee, D. Ellis, & A. Loui, Detecting Local Semantic Concepts in Environmental Sounds using Markov Model based Clustering, Proc. ICASSP, , Dallas, 21. [Cotton, Ellis & Loui 211] C. Cotton, D. Ellis, & A. Loui, Soundtrack classification by transient events, Proc. ICASSP, to appear, Prague, 211. [Ellis, Zhang & McDermott 211] D. Ellis, X. Zheng, & J. McDermott, Classifying soundtracks with audio texture features, Proc. ICASSP, to appear, Prague, 211. Soundtrack Classification - Dan Ellis /16

LabROSA Research Overview

LabROSA Research Overview Dan Ellis Laboratory for Recognition and Organization of Speech and Audio Dept. Electrical Eng., Columbia Univ., NY USA dpwe@ee.columbia.edu http://labrosa.ee.columbia.edu/ 1.