Action Recognition. Computer Vision Jia-Bin Huang, Virginia Tech. Many slides from D. Hoiem

This section covers advanced topics: convolutional neural networks in vision, action recognition, vision and language, and 3D scenes and context.

What is an action? An action is a transition from one state to another. Who is the actor? How is the state of the actor changing? What (if anything) is being acted on? How is that thing changing? What is the purpose of the action (if any)?

How do we represent actions? Categories: walking, hammering, dancing, skiing, sitting down, standing up, jumping. Poses. Nouns and predicates: <man, swings, hammer>, <man, hits, nail, w/ hammer>.

What is the purpose of action recognition? To describe

What is the purpose of action recognition? To predict

What is the purpose of action recognition? To understand the intention and motivation Predicting Motivations of Actions by Leveraging Text, CVPR 2016

How can we identify actions? Motion, pose, held objects, nearby objects.

Representing Motion: Optical Flow with Motion History. Bobick & Davis 2001

Representing Motion Space-Time Volumes Blank et al. 2005

Representing Motion: Optical Flow with Split Channels. The optical flow field is split into positive and negative channels, and each channel is blurred. Efros et al. 2003
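A minimal sketch of this split-channel flow representation (assuming OpenCV and NumPy; the frame inputs and blur width are illustrative, and Farneback flow is just a stand-in for whichever flow method is used):

```python
import cv2
import numpy as np

def flow_channels(prev_gray, next_gray, blur_sigma=3):
    """Half-wave rectified, blurred flow channels from two grayscale frames."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    fx, fy = flow[..., 0], flow[..., 1]
    # Split each component into positive and negative parts (pos/neg channels).
    channels = [np.maximum(fx, 0), np.maximum(-fx, 0),
                np.maximum(fy, 0), np.maximum(-fy, 0)]
    # Blur each channel so the representation tolerates small misalignments.
    return [cv2.GaussianBlur(c, (0, 0), blur_sigma) for c in channels]
```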

Representing Motion Tracked Points Matikainen et al. 2009

Representing Motion: Space-Time Interest Points. Corner detectors in space-time respond to events such as a moving corner, a ball hitting a wall, or balls colliding (at different scales). Laptev 2005

Representing Motion Space-Time Interest Points Laptev 2005
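To make the space-time interest point idea concrete, here is a rough NumPy/SciPy sketch of a 3D Harris-style response over a video volume (the actual Laptev 2005 detector uses scale-adapted spatio-temporal Gaussian derivatives and non-maximum suppression; this only shows the core intuition):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def space_time_harris(video, sigma=2.0, tau=2.0, k=0.005):
    """video: float array of shape (T, H, W); returns a response volume."""
    It, Iy, Ix = np.gradient(video.astype(np.float64))   # spatio-temporal gradients
    smooth = lambda a: gaussian_filter(a, sigma=(tau, sigma, sigma))
    Sxx, Syy, Stt = smooth(Ix * Ix), smooth(Iy * Iy), smooth(It * It)
    Sxy, Sxt, Syt = smooth(Ix * Iy), smooth(Ix * It), smooth(Iy * It)
    # Determinant and trace of the 3x3 structure tensor at every voxel.
    det = (Sxx * (Syy * Stt - Syt * Syt)
           - Sxy * (Sxy * Stt - Syt * Sxt)
           + Sxt * (Sxy * Syt - Syy * Sxt))
    trace = Sxx + Syy + Stt
    # High response where intensity varies along x, y, and t (a "moving corner").
    return det - k * trace ** 3
```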

Examples of Action Recognition Systems: feature-based classification; recognition using pose and objects.

Action recognition as classification Retrieving actions in movies, Laptev and Perez, 2007

Remember image categorization: during training, features extracted from training images are paired with training labels to train a classifier; during testing, features from a test image are fed to the trained classifier to produce a prediction (e.g., "Outdoor").
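The same pipeline in a few lines of scikit-learn, with placeholder random features and labels standing in for real image descriptors:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Placeholder data: one feature row per image, binary labels (e.g. outdoor / indoor).
train_feats = np.random.rand(100, 512)
train_labels = np.random.randint(0, 2, 100)
test_feats = np.random.rand(10, 512)

clf = LinearSVC().fit(train_feats, train_labels)   # classifier training
predictions = clf.predict(test_feats)              # testing: predicted labels
```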

Remember spatial pyramids. Compute histogram in each spatial bin
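A minimal sketch of the per-bin histogram step (NumPy only; a full spatial pyramid repeats this at several grid resolutions and weights the levels). Here word_map is assumed to hold a quantized visual-word index per pixel:

```python
import numpy as np

def spatial_histogram(word_map, n_words, grid=(2, 2)):
    """word_map: 2D int array of visual-word indices; returns concatenated bin histograms."""
    H, W = word_map.shape
    hists = []
    for i in range(grid[0]):
        for j in range(grid[1]):
            cell = word_map[i * H // grid[0]:(i + 1) * H // grid[0],
                            j * W // grid[1]:(j + 1) * W // grid[1]]
            hists.append(np.bincount(cell.ravel(), minlength=n_words))
    return np.concatenate(hists)   # one histogram per spatial bin
```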

Features for Classifying Actions 1. Spatio-temporal pyramids over image gradients and optical flow.

Features for Classifying Actions 2. Spatio-temporal interest points: corner detectors in space-time, with descriptors based on Gaussian derivative filters over x, y, and time.

Classification: boosted stumps for the pyramids of optical flow and gradients; nearest neighbor for the space-time interest points (STIP).
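As a rough scikit-learn analogue (illustrative only, with random placeholder features), boosted decision stumps for the pyramid features and a 1-nearest-neighbor classifier for the interest-point histograms:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier       # default base learner is a depth-1 stump
from sklearn.neighbors import KNeighborsClassifier

pyramid_feats = np.random.rand(60, 300)               # placeholder pyramid features
stip_hists = np.random.rand(60, 100)                  # placeholder STIP histograms
labels = np.random.randint(0, 2, 60)

boosted_stumps = AdaBoostClassifier(n_estimators=200).fit(pyramid_feats, labels)
nearest_neighbor = KNeighborsClassifier(n_neighbors=1).fit(stip_hists, labels)
```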

Searching the video for an action: 1. Detect keyframes using a trained HOG detector in each frame. 2. Classify detected keyframes as positive (e.g., "drinking") or negative ("other").
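A hedged sketch of step 1, using OpenCV's built-in HOG person detector as a stand-in for the keyframe detector (the actual detector is trained on annotated action keyframes, and the filename below is illustrative):

```python
import cv2

hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

cap = cv2.VideoCapture("movie.mp4")                  # illustrative video file
candidate_keyframes, frame_idx = [], 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    rects, weights = hog.detectMultiScale(frame)     # sliding-window HOG detection
    if len(rects) > 0:
        candidate_keyframes.append((frame_idx, rects))   # pass these to the keyframe classifier
    frame_idx += 1
cap.release()
```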

Accuracy in searching video: with keyframe detection vs. without keyframe detection.

Example actions: talk on phone, get out of car. Learning realistic human actions from movies, Laptev et al. 2008

Approach: space-time interest point detectors; HOG and HOF descriptors; pyramid histograms with 3x3x2 spatio-temporal binning; SVMs with a chi-squared kernel.
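A sketch of the classifier stage with scikit-learn: an SVM with a chi-squared kernel over the concatenated pyramid histograms (the feature arrays and dimensions below are placeholders):

```python
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

train_hists = np.random.rand(80, 1000)    # placeholder HOG/HOF pyramid histograms (non-negative)
train_labels = np.random.randint(0, 8, 80)
test_hists = np.random.rand(20, 1000)

svm = SVC(kernel=chi2_kernel).fit(train_hists, train_labels)   # callable chi-squared kernel
predictions = svm.predict(test_hists)
```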

Results

Action Recognition using Pose and Objects Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities, B. Yao and Li Fei-Fei, 2010 Slide Credit: Yao/Fei-Fei

Human-Object Interaction. Holistic image-based classification vs. integrated reasoning: human pose estimation (head, torso). Slide Credit: Yao/Fei-Fei

Human-Object Interaction. Holistic image-based classification vs. integrated reasoning: human pose estimation; object detection (tennis racket). Slide Credit: Yao/Fei-Fei

Human-Object Interaction. Holistic image-based classification vs. integrated reasoning: human pose estimation; object detection; action categorization (head, torso, tennis racket; activity: tennis forehand). Slide Credit: Yao/Fei-Fei

Human pose estimation & object detection. Human pose estimation is challenging: difficult part appearance, self-occlusion, and image regions that look like a body part. Felzenszwalb & Huttenlocher, 2005; Ren et al, 2005; Ramanan, 2006; Ferrari et al, 2008; Yang & Mori, 2008; Andriluka et al, 2009; Eichner & Ferrari, 2009. Slide Credit: Yao/Fei-Fei

Human pose estimation & Object detection Human pose estimation is challenging. Felzenszwalb & Huttenlocher, 2005 Ren et al, 2005 Ramanan, 2006 Ferrari et al, 2008 Yang & Mori, 2008 Andriluka et al, 2009 Eichner & Ferrari, 2009 Slide Credit: Yao/Fei-Fei

Human pose estimation & object detection: once the object is detected, it can facilitate pose estimation. Slide Credit: Yao/Fei-Fei

Human pose estimation & object detection. Object detection is challenging: small, low-resolution, or partially occluded objects, and image regions similar to the detection target. Viola & Jones, 2001; Lampert et al, 2008; Divvala et al, 2009; Vedaldi et al, 2009. Slide Credit: Yao/Fei-Fei

Human pose estimation & Object detection Object detection is challenging Viola & Jones, 2001 Lampert et al, 2008 Divvala et al, 2009 Vedaldi et al, 2009 Slide Credit: Yao/Fei-Fei

Human pose estimation & object detection: once the pose is estimated, it can facilitate object detection. Slide Credit: Yao/Fei-Fei

Human pose estimation & Object detection Mutual Context Slide Credit: Yao/Fei-Fei

Mutual Context Model Representation. A: activity (tennis forehand, croquet shot, volleyball smash). O: object (tennis racket, croquet mallet, volleyball). H: human pose, composed of body parts P_1, ..., P_N; each part P has a location l_P, orientation θ_P, and scale s_P. f_O, f_1, ..., f_N: image evidence, represented with shape context features [Belongie et al, 2002]. Intra-class variations: there is more than one pose H for each activity A, and H is unobserved during training. Slide Credit: Yao/Fei-Fei
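Schematically, the model scores a joint configuration of activity, object, and pose with a sum of compatibility and image-evidence terms, roughly of the form below (a paraphrase of the model structure, not the exact potentials from the paper):

\Psi(A, O, H, I) = \psi(A, O) + \psi(A, H) + \sum_{n=1}^{N} \psi(O, P_n) + \sum_{m \neq n} \psi(P_m, P_n) + \psi(O, f_O) + \sum_{n=1}^{N} \psi(P_n, f_n)

The first four terms capture co-occurrence and geometric compatibility among the activity, object, and body parts; the last two tie the object and parts to the image evidence.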

Learning Results Cricket defensive shot Cricket bowling Croquet shot Slide Credit: Yao/Fei-Fei

Learning Results Tennis forehand Tennis serve Volleyball smash Slide Credit: Yao/Fei-Fei

Model Inference I The learned models Slide Credit: Yao/Fei-Fei

Model Inference. Given the learned models and an image: head detection, torso detection, and tennis racket detection feed a compositional inference [Chen et al, 2007] over the layout of the object and body parts, yielding the best configuration (A*, H*, O*, P*_1, ..., P*_n). Slide Credit: Yao/Fei-Fei

Model Inference. Output: the best configuration (A*_k, H*_k, O*_k, P*_{k,1}, ..., P*_{k,n}) for each of the K activity classes. Slide Credit: Yao/Fei-Fei

Dataset and Experiment Setup. Sport dataset [Gupta et al, 2009]: 6 classes (cricket defensive shot, cricket bowling, croquet shot, tennis forehand, tennis serve, volleyball smash); 180 training images (supervised with object and part locations) and 120 testing images. Tasks: object detection, pose estimation, activity classification. Slide Credit: Yao/Fei-Fei

Object Detection Results. Precision-recall curves for the cricket bat, cricket ball, croquet mallet, tennis racket, and volleyball, comparing a sliding-window detector [Andriluka et al, 2009], pedestrian context [Dalal & Triggs, 2006], and our method. Slide Credit: Yao/Fei-Fei

Object Detection Results (failure modes). Precision-recall curves for the cricket ball and volleyball, where background clutter and small object size hurt performance; our method is compared with pedestrian-as-context and a scanning-window detector. Slide Credit: Yao/Fei-Fei

Dataset and Experiment Setup. Sport dataset [Gupta et al, 2009]: 6 classes, 180 training and 120 testing images. Tasks: object detection, pose estimation, activity classification. Slide Credit: Yao/Fei-Fei

Human Pose Estimation Results
Method                  Torso   Upper Leg   Lower Leg   Upper Arm   Lower Arm   Head
Ramanan, 2006            .52    .22  .22    .21  .28    .24  .28    .17  .14    .42
Andriluka et al, 2009    .50    .31  .30    .31  .27    .18  .19    .11  .11    .45
Our full model           .66    .43  .39    .44  .34    .44  .40    .27  .29    .58
Slide Credit: Yao/Fei-Fei

Human Pose Estimation Results
Method                  Torso   Upper Leg   Lower Leg   Upper Arm   Lower Arm   Head
Ramanan, 2006            .52    .22  .22    .21  .28    .24  .28    .17  .14    .42
Andriluka et al, 2009    .50    .31  .30    .31  .27    .18  .19    .11  .11    .45
Our full model           .66    .43  .39    .44  .34    .44  .40    .27  .29    .58
Qualitative comparison of the tennis serve and volleyball smash models: our estimation results vs. Andriluka et al, 2009. Slide Credit: Yao/Fei-Fei

Human Pose Estimation Results
Method                  Torso   Upper Leg   Lower Leg   Upper Arm   Lower Arm   Head
Ramanan, 2006            .52    .22  .22    .21  .28    .24  .28    .17  .14    .42
Andriluka et al, 2009    .50    .31  .30    .31  .27    .18  .19    .11  .11    .45
Our full model           .66    .43  .39    .44  .34    .44  .40    .27  .29    .58
One pose per class       .63    .40  .36    .41  .31    .38  .35    .21  .23    .52
Example estimation results shown. Slide Credit: Yao/Fei-Fei

Activity Classification Results. Classification accuracy: our model 83.3%; Gupta et al, 2009 78.9%; bag-of-words (SIFT+SVM) 52.5%. Example classes: cricket shot, tennis forehand. Slide Credit: Yao/Fei-Fei

Motion features: dense trajectories. Action Recognition by Dense Trajectories, CVPR 2011; Action Recognition with Improved Trajectories, ICCV 2013
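A much-simplified sketch of the dense-trajectory idea (assuming OpenCV; the published method adds dense multi-scale sampling, median-filtered flow, trajectory pruning, and HOG/HOF/MBH descriptors along each track):

```python
import cv2
import numpy as np

def track_dense_points(frames, step=10, track_len=15):
    """frames: list of grayscale frames; returns point tracks of shape (num_points, T, 2)."""
    h, w = frames[0].shape
    ys, xs = np.mgrid[step // 2:h:step, step // 2:w:step]   # dense grid of seed points
    pts = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float64)
    tracks = [pts.copy()]
    for prev, nxt in zip(frames[:track_len], frames[1:track_len + 1]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        # Displace every point by the flow sampled at its (rounded) location.
        xi = np.clip(np.round(pts[:, 0]).astype(int), 0, w - 1)
        yi = np.clip(np.round(pts[:, 1]).astype(int), 0, h - 1)
        pts = pts + flow[yi, xi]
        tracks.append(pts.copy())
    return np.stack(tracks, axis=1)
```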

Video classification with CNNs Large-scale Video Classification with Convolutional Neural Networks, CVPR 2014
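The paper compares several ways of fusing information over time (single frame, late, early, and slow fusion). A hedged PyTorch sketch of the simplest baseline in that spirit: apply a 2D CNN to individual frames and average the per-frame class scores over the clip (the backbone, class count, and clip shape are illustrative):

```python
import torch
import torchvision

cnn = torchvision.models.resnet18(num_classes=101)   # per-frame classifier, e.g. 101 action classes

clip = torch.randn(16, 3, 224, 224)                  # 16 RGB frames from one clip
frame_scores = cnn(clip)                             # (16, 101) class scores, one row per frame
clip_scores = frame_scores.mean(dim=0)               # average the frame scores over time
```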

Video classification with CNNs Large-scale Video Classification with Convolutional Neural Networks, CVPR 2014

Two-stream CNN Two-Stream Convolutional Networks for Action Recognition in Videos, NIPS 2014
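A compact PyTorch sketch of the two-stream idea (the ResNet-18 backbones, the 10-frame flow stack, and score averaging are illustrative stand-ins; the original paper uses VGG-style networks and fuses the streams by averaging or with an SVM):

```python
import torch
import torch.nn as nn
import torchvision

class TwoStream(nn.Module):
    def __init__(self, num_classes=101, flow_stack=10):
        super().__init__()
        # Spatial stream: appearance from a single RGB frame.
        self.spatial = torchvision.models.resnet18(num_classes=num_classes)
        # Temporal stream: motion from a stack of 2*L optical flow fields (x and y channels).
        self.temporal = torchvision.models.resnet18(num_classes=num_classes)
        self.temporal.conv1 = nn.Conv2d(2 * flow_stack, 64, kernel_size=7,
                                        stride=2, padding=3, bias=False)

    def forward(self, rgb, flow):
        # Late fusion by averaging the two streams' class scores.
        return (self.spatial(rgb) + self.temporal(flow)) / 2

model = TwoStream()
scores = model(torch.randn(2, 3, 224, 224), torch.randn(2, 20, 224, 224))   # (2, 101)
```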

3D Convolutional Networks Learning Spatiotemporal Features with 3D Convolutional Networks, ICCV 2015
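A toy 3D-convolution stack in PyTorch (an illustrative stand-in for C3D, not its exact architecture), showing that the kernels and pooling extend over time as well as space:

```python
import torch
import torch.nn as nn

c3d_toy = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(),   # 3x3x3 space-time kernels
    nn.MaxPool3d(kernel_size=(1, 2, 2)),                     # pool space only in the first stage
    nn.Conv3d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool3d(kernel_size=2),                             # then pool over time and space
    nn.AdaptiveAvgPool3d(1), nn.Flatten(),
    nn.Linear(128, 101),                                     # e.g. 101 action classes
)

clip = torch.randn(1, 3, 16, 112, 112)   # (batch, channels, frames, height, width)
scores = c3d_toy(clip)                    # (1, 101)
```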

Action recognition -> semantic role labeling

Take-home messages: action recognition is an open problem. How to define actions? How to infer them? What are good visual cues? How do we incorporate higher-level reasoning?

Take-home messages: some work has been done, but it is just the beginning of exploring the problem. So far, actions are mainly categorical (though they could be framed in terms of effect or intent); most approaches are classification using simple features (spatio-temporal histograms of gradients or flow, space-time interest points, SIFT in images); only a couple of works incorporate pose and objects; and there is little idea of how to reason about long-term activities or how to describe video sequences.