Using the Soundtrack to Classify Videos

Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio (LabROSA)
Dept. of Electrical Engineering, Columbia University, NY, USA
dpwe@ee.columbia.edu - http://labrosa.ee.columbia.edu/

Outline:
1. Machine Listening
2. Global Classification
3. Foreground Classification
4. Outstanding Issues

Soundtrack Classification - Dan Ellis - 2011-5-4
1. Machine Listening

Task: extracting useful information from sound.

[Figure: map of machine-listening tasks, from "detect" to "describe" — VAD, speech/music discrimination, environmental sound, ASR, music transcription, environment awareness, emotion, music recommendation, automatic narration, "sound intelligence"; speech, music, and environmental-sound domains.]
The Information in Audio

Environmental recordings contain information on:
- location: type (restaurant, street, ...) and specifics
- activity: talking, walking, typing, ...
- people: generic (2 males), specific (Chuck & John)
- spoken content (sometimes)

... but not:
- what people and things looked like
- day/night, ...
- ... except when correlated with audible features
Applications

- Audio lifelog diarization
- Consumer video classification

[Figure: two days of a lifelog (approx. 9:00-16:30) automatically segmented into episodes such as preschool, cafe, lecture, office, outdoor, lab, group meetings, and named conversations.]
Consumer Video Dataset

- 25 concepts from a Kodak user study: boat, crowd, cheer, dance, ...
- videos from YouTube search, filtered for raw, unedited material = 1873 videos
- manually relabeled with the concepts
- concept labels overlap (a clip can carry several)

[Figure: concept co-occurrence matrix over the labels: museum, picnic, wedding, animal, birthday, sunset, ski, graduation, sports, boat, parade, playground, baby, park, beach, dancing, show, group of two, night, one person, singing, cheer, crowd, music, group of 3+.]
2. Global Classification

Baseline for soundtrack classification:
- divide the sound into short frames (e.g. 30 ms)
- calculate features (e.g. MFCCs) for each frame
- describe the clip by statistics of its frames (mean, covariance) = "bag of features"
- classify by e.g. Mahalanobis distance + SVM

[Figure: a video soundtrack's spectrogram (0-8 kHz), its 20-dimensional MFCC features over time, and the clip's MFCC covariance matrix.]
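The bag-of-features baseline above can be sketched as follows: summarize each clip's frames by their mean and covariance, then compare clips by a symmetrized KL divergence between the resulting Gaussians (one plausible stand-in for the distance + SVM step; the random arrays below are hypothetical substitutes for real MFCC frames).

```python
import numpy as np

def clip_stats(frames):
    """Mean and covariance of a (n_frames, n_dims) feature array."""
    return frames.mean(axis=0), np.cov(frames, rowvar=False)

def kl2(stats_a, stats_b):
    """Symmetrized KL divergence between two Gaussian clip models."""
    (mu_a, ca), (mu_b, cb) = stats_a, stats_b
    ca_inv, cb_inv = np.linalg.inv(ca), np.linalg.inv(cb)
    d = mu_a - mu_b
    n = len(mu_a)
    kl_ab = 0.5 * (np.trace(cb_inv @ ca) + d @ cb_inv @ d - n
                   + np.log(np.linalg.det(cb) / np.linalg.det(ca)))
    kl_ba = 0.5 * (np.trace(ca_inv @ cb) + d @ ca_inv @ d - n
                   + np.log(np.linalg.det(ca) / np.linalg.det(cb)))
    return kl_ab + kl_ba

rng = np.random.default_rng(0)
clip_a = rng.normal(0.0, 1.0, (500, 13))   # stand-in for one clip's MFCC frames
clip_b = rng.normal(0.0, 1.0, (500, 13))   # a clip with similar statistics
clip_c = rng.normal(2.0, 1.5, (500, 13))   # a clip with different statistics
d_same = kl2(clip_stats(clip_a), clip_stats(clip_b))
d_diff = kl2(clip_stats(clip_a), clip_stats(clip_c))
print(d_same < d_diff)  # similar clips land closer in this distance
```

In a real system the distances (or the clip statistics themselves) would feed an SVM with a suitable kernel rather than a nearest-neighbor rule.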
Codebook Histograms

- convert non-Gaussian distributions to multinomials:
  fit a global Gaussian mixture model to all MFCC frames,
  then describe each clip by its per-mixture-component count histogram
- classify by distance between histograms: KL, chi-squared + SVM

[Figure: MFCC(0)-MFCC(1) scatter with a 15-component global GMM overlaid, and a per-category histogram over the mixture components.]
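A minimal sketch of the codebook-histogram idea: quantize each frame to its nearest codeword (hard assignment to centroids standing in for the global GMM's components), count assignments per clip, and compare clips with a chi-squared distance between normalized histograms. All data here are illustrative.

```python
import numpy as np

def histogram(frames, codebook):
    """Assign each frame to its nearest codeword; return normalized counts."""
    d2 = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    idx = d2.argmin(axis=1)
    counts = np.bincount(idx, minlength=len(codebook)).astype(float)
    return counts / counts.sum()

def chi2(h, g, eps=1e-9):
    """Chi-squared distance between two histograms."""
    return 0.5 * (((h - g) ** 2) / (h + g + eps)).sum()

rng = np.random.default_rng(1)
codebook = rng.normal(0, 3, (16, 13))                      # hypothetical 16-word codebook
clip_a = histogram(rng.normal(0, 3, (400, 13)), codebook)  # two clips from one class
clip_b = histogram(rng.normal(0, 3, (400, 13)), codebook)
clip_c = histogram(rng.normal(5, 1, (400, 13)), codebook)  # a different sound class
print(chi2(clip_a, clip_b) < chi2(clip_a, clip_c))
```

With a GMM instead of hard quantization, each frame would contribute its posterior over components ("soft counts"), but the histogram comparison is the same.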
Global Classification Results (Lee & Ellis '10)

- classifiers compared: guessing; single Gaussian + KL2; 8-component GMM + Bhattacharyya; 1024-component GMM histogram + log(p(z|c)/p(z))
- wide range in performance: "audio" concepts (music, ski) vs. "non-audio" concepts (group, night)
- large AP uncertainty on infrequent classes

[Figure: per-concept average precision for each classifier over all 25 concept classes, plus the mean.]
Average Precision.9.8.7.6.5.4.3.2.1 Combining with Video Video classification by SIFT codebooks Random Baseline Video Only Audio Only A+V Fusion one person group of 3+ music cheer group of 2 sport park crowd night animal singing shows ski dancing boat baby beach wedding museum parade playground Chang et al. 7 sunset picnic birthday Audio adds most for dancing, baby, museum... Soundtrack Classification - Dan Ellis 211-5-4 9 /16 MAP
3. Foreground Classification (Lee, Ellis & Loui '10)

Global vs. local class models:
- tell-tale acoustics may be washed out in whole-clip statistics
- try iterative realignment of HMMs: limit each concept model (e.g. voice, baby, laugh) to the frames it explains, with a background model shared by all clips
- old way: all frames contribute; new way: limited temporal extents

[Figure: spectrogram of a YouTube "baby" clip with frame labels voice / baby / laugh / background, before and after realignment.]
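The realignment idea can be illustrated with a toy 1-D stand-in for the HMM procedure (this is a sketch, not the published method): alternately assign each frame to either a concept model or a fixed shared background model, then refit the concept model only on its own frames, so it converges toward the tell-tale burst rather than the whole-clip average.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic "soundtrack": mostly background, with a brief foreground burst.
frames = np.concatenate([rng.normal(0, 1, 80),
                         rng.normal(6, 1, 20),    # the tell-tale event
                         rng.normal(0, 1, 80)])

bg_mu, bg_sd = 0.0, 1.0                       # background model shared by all clips (fixed)
fg_mu, fg_sd = frames.mean(), frames.std()    # concept model initialized from ALL frames

for _ in range(10):
    # E-like step: hard-assign each frame to the model with higher log-likelihood.
    ll_fg = -0.5 * ((frames - fg_mu) / fg_sd) ** 2 - np.log(fg_sd)
    ll_bg = -0.5 * ((frames - bg_mu) / bg_sd) ** 2 - np.log(bg_sd)
    fg = ll_fg > ll_bg
    # M-like step: refit the concept model on its assigned frames only.
    if fg.any():
        fg_mu, fg_sd = frames[fg].mean(), frames[fg].std() + 1e-3

print(fg_mu > 4.0)  # the concept model locks onto the burst, not the clip mean
```

The full system does this with HMM state sequences and Viterbi alignment, but the effect is the same: the concept model's temporal extent shrinks to the frames that actually carry the concept.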
Transient Features (Cotton, Ellis & Loui '11)

- onset detector finds energy bursts (best SNR)
- PCA basis to represent each onset's time × auditory-frequency patch
- "bag of transients"
- object-related...
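A minimal sketch of transient detection by energy novelty (frame/hop sizes and the threshold are illustrative assumptions, not the published detector): frame the signal, compute log energy, and mark frames where energy jumps sharply over the previous frame.

```python
import numpy as np

def onsets(signal, frame_len=256, hop=128, jump_db=10.0):
    """Return frame indices where log energy rises sharply (candidate transients)."""
    n = 1 + (len(signal) - frame_len) // hop
    energy_db = np.array([
        10 * np.log10(1e-10 + (signal[i * hop:i * hop + frame_len] ** 2).sum())
        for i in range(n)
    ])
    return np.flatnonzero(np.diff(energy_db) > jump_db) + 1

rng = np.random.default_rng(3)
signal = rng.normal(0, 0.01, 4000)            # quiet background
signal[2000:2300] += rng.normal(0, 1.0, 300)  # a burst, e.g. a door slam
hits = onsets(signal)
print(len(hits) >= 1)
```

In the cited system, the patch around each detected onset is then projected onto a PCA basis, and a clip is described by its bag of these transient codes.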
Nonnegative Matrix Factorization (Smaragdis & Brown '03; Abdallah & Plumbley '04; Virtanen '07)

- decompose spectrograms into templates + activations: X = W H
- fast, forgiving gradient-descent algorithm
- extensions: 2-D patches, sparsity control, ...

[Figure: an original mixture spectrogram and three recovered bases — Basis 1 (L2), Basis 2 (L1), Basis 3 (L1) — as template × activation spectrograms.]
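The plain squared-error NMF above can be sketched with the standard multiplicative updates (Lee & Seung style); a toy nonnegative "spectrogram" stands in for real data, and there is no sparsity control or 2-D patch extension here.

```python
import numpy as np

def nmf(X, rank, iters=200, eps=1e-9):
    """Factor nonnegative X (freq x time) as W (templates) @ H (activations)."""
    rng = np.random.default_rng(0)
    W = rng.random((X.shape[0], rank))
    H = rng.random((rank, X.shape[1]))
    for _ in range(iters):
        # Multiplicative updates keep W, H nonnegative while reducing ||X - WH||^2.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Build a toy spectrogram from two known templates and check we can refit it.
rng = np.random.default_rng(4)
true_W = np.abs(rng.normal(size=(40, 2)))
true_H = np.abs(rng.normal(size=(2, 100)))
X = true_W @ true_H
W, H = nmf(X, rank=2)
err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
print(err < 0.05)  # low-rank structure is recovered closely
```

Sparsity control would add an L1 penalty on H to the objective (changing the H update), and 2-D patch variants replace the per-column templates with time-frequency patches.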
4. Outstanding Issues

- How to separate foreground & background?
- How to exploit prior knowledge of sounds?
- How to make classification credible?
Classifier Insight

- How can we understand classification results? ... to inspire confidence & enrich results
- temporal breakdown: a Viterbi path through concept, speaker, and background states shows which segments drove the decision

[Figure: spectrogram of YT_beach_1.mpeg (manual clip-level labels: beach, group of 3+, cheer); manual frame-level annotation (drum, wind, voice, cheer, water wave); Viterbi path through a 27-state Markov model with concept states (music, cheer, ski, singing, parade, dancing, wedding, sunset, sports, show, playground, picnic, ...), speaker states, and two global background states; clustered examples, e.g. cluster 123 (baby).]
Summary

- Machine listening: getting useful information from sound
- environmental sound classification ... from whole-clip statistics?
- transients = energy peaks ... separate foreground & background
- classifier insight ... for confidence & analysis
References

[Chang et al. 2007] S.-F. Chang, D. Ellis, W. Jiang, K. Lee, A. Yanagawa, A. Loui, & J. Luo, "Large-scale multimodal semantic concept detection for consumer video," Proc. ACM MIR, pp. 255-264, 2007.
[Lee & Ellis 2010] K. Lee & D. Ellis, "Audio-Based Semantic Concept Classification for Consumer Video," IEEE Tr. Audio, Speech, Lang. Process. 18(6): 1406-1416, Aug. 2010.
[Lee, Ellis & Loui 2010] K. Lee, D. Ellis, & A. Loui, "Detecting Local Semantic Concepts in Environmental Sounds using Markov Model based Clustering," Proc. ICASSP, pp. 2278-2281, Dallas, 2010.
[Cotton, Ellis & Loui 2011] C. Cotton, D. Ellis, & A. Loui, "Soundtrack classification by transient events," Proc. ICASSP, to appear, Prague, 2011.
[Ellis, Zheng & McDermott 2011] D. Ellis, X. Zheng, & J. McDermott, "Classifying soundtracks with audio texture features," Proc. ICASSP, to appear, Prague, 2011.