Learning to Generate Long-term Future via Hierarchical Prediction: Supplementary Material

Appendix A. Motion-Based Pixel-Level Evaluation, Analysis, and Control Experiments

In this section, we evaluate the predictions by deciles of motion, similar to Villegas et al. (2017), using the Peak Signal-to-Noise Ratio (PSNR) measure, where the 10th decile contains the videos with the most overall motion. We add a modification to our hierarchical method based on a simple heuristic by which we copy the background pixels from the last observed frame, using the predicted pose heat-maps as foreground/background masks (Ours BG). Additionally, we perform experiments based on an oracle that provides our image generator with the exact future pose trajectories (Ours GT-pose*), and we also apply the previously mentioned heuristic to them (Ours GT-pose BG*). We put * marks to clarify that these are hypothetical methods, as they require ground-truth future pose trajectories. In our method, the future frames are strictly dictated by the future structure; therefore, the prediction based on the future pose oracle sheds light on how much predicting a different future structure affects PSNR scores. (Note that many future trajectories are possible given a single past trajectory.) Further, we show that our conditional image generator, given perfect knowledge of the future pose trajectory (i.e., Ours GT-pose*), produces high-quality video prediction that both matches the ground-truth video closely and achieves much higher PSNRs. These results suggest that our hierarchical approach is a step in the right direction towards solving the problem of long-term pixel-level video prediction.

A.1. Penn Action

In Figures 6 and 7, we show the evaluation on each decile of motion. The plots show that our method outperforms the baselines for long-term frame prediction. In addition, by using the future pose determined by the oracle as input to our conditional image generator, our method can achieve even higher PSNR scores. We hypothesize that predicting future frames that reflect similar action semantics as the ground truth, but with possibly different pose trajectories, causes lower PSNR scores. Figure 8 supports this hypothesis by showing that higher MSE in the predicted pose tends to correspond to lower PSNR scores.

Figure 6. Quantitative comparison on Penn Action separated by motion decile.
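To make the evaluation protocol concrete, here is a minimal sketch, in NumPy, of the three ingredients described above: the PSNR measure, the split of test videos into motion deciles, and the background-copying heuristic behind Ours BG. The motion measure (mean absolute frame difference), the heat-map threshold, and all function names are illustrative assumptions, not the exact implementation used for the paper.

```python
import numpy as np

def psnr(pred, target, max_val=255.0):
    """Peak Signal-to-Noise Ratio between a predicted and a ground-truth frame
    (values assumed to lie in [0, 255])."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

def motion_score(video):
    """Overall motion of a clip of shape (T, H, W, 3): mean absolute
    difference between consecutive frames (an assumed motion measure)."""
    return np.abs(np.diff(video.astype(np.float64), axis=0)).mean()

def split_into_deciles(videos):
    """Group clips into 10 bins by overall motion; the 10th decile
    contains the clips with the most motion."""
    order = np.argsort([motion_score(v) for v in videos])
    return np.array_split(order, 10)  # list of index arrays, low -> high motion

def copy_background(pred_frame, last_observed, heatmaps, thresh=0.3):
    """'Ours BG' heuristic: keep generated pixels where the predicted pose
    heat-maps (H, W, J) indicate foreground, and copy the remaining pixels
    from the last observed frame. The threshold value is an assumption."""
    fg = (heatmaps.max(axis=-1, keepdims=True) > thresh).astype(np.float64)
    return fg * pred_frame + (1.0 - fg) * last_observed
```

Per-decile curves such as those in Figures 6 and 7 then follow from averaging psnr over the frames of the clips in each decile at every prediction time-step.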

Figure 7. (Continued from Figure 6.) Quantitative comparison on Penn Action separated by motion decile.

Figure 8. Predicted frame PSNR vs. Mean Squared Error on the predicted pose for each motion decile in Penn Action.

The fact that PSNR can be low even when the predicted future is one of the many plausible futures suggests that PSNR may not be the best way to evaluate long-term video prediction when only a single future trajectory is predicted. This issue might be alleviated when a model can predict multiple possible future trajectories, but this investigation using our hierarchical decomposition is left as future work. In Figures 9 and 10, we show videos where PSNR is low because a future different from the ground truth is predicted (left), and videos where PSNR is high because the predicted future is close to the ground-truth future (right).
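The pose-error analysis behind Figure 8 (and Figure 14 for Human3.6M below) pairs, for each motion decile, the average PSNR of the predicted frames with the MSE of the predicted pose. A minimal sketch, reusing psnr and split_into_deciles from the sketch above and assuming poses are given as (T, J, 2) arrays of landmark coordinates (an assumed convention; the function names are likewise illustrative):

```python
import numpy as np

def pose_mse(pred_pose, gt_pose):
    """MSE between predicted and ground-truth landmark trajectories (T, J, 2)."""
    return float(np.mean((pred_pose - gt_pose) ** 2))

def psnr_vs_pose_mse_by_decile(pred_videos, gt_videos, pred_poses, gt_poses):
    """For each motion decile, average the frame PSNR and the predicted-pose MSE;
    returns one (mean pose MSE, mean PSNR) pair per decile."""
    stats = []
    for idx in split_into_deciles(gt_videos):  # helper from the sketch above
        psnrs = [np.mean([psnr(p, g) for p, g in zip(pred_videos[i], gt_videos[i])])
                 for i in idx]
        mses = [pose_mse(pred_poses[i], gt_poses[i]) for i in idx]
        stats.append((float(np.mean(mses)), float(np.mean(psnrs))))
    return stats
```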

Figure 9. Quantitative and visual comparison on Penn Action for selected time-steps for the actions of baseball pitch (top) and golf swing (bottom). Side by side video comparison can be found in our project website.

Figure 10. Quantitative and visual comparison on Penn Action for selected time-steps for the actions of jumping jacks (top) and tennis forehand (bottom). Side by side video comparison can be found in our project website.

To directly compare our image generator using the predicted future pose (Ours) and the ground-truth future pose given by the oracle (Ours GT-pose*), we present qualitative experiments in Figure 11 and Figure 12. We can see that both predicted videos contain the action performed in the ground-truth video. The oracle-based video prediction reflects the exact future very well.

Figure 11. Qualitative evaluation of our network for long-term pixel-level generation. We show the actions of baseball pitch (top row), baseball swing (middle row), and golf swing (bottom row). Side by side video comparison can be found in our project website.

Figure 12. Qualitative evaluation of our network for long-term pixel-level generation. We show the actions of tennis serve (top row), clean and jerk (middle row), and tennis forehand (bottom row). We show a different timescale for tennis forehand because the ground-truth action sequence does not reach time step 65. Side by side video comparison can be found in our project website.

A.2. Human3.6M

In Figure 13, we show evaluation (PSNRs over time) of different methods on each decile of motion.

Figure 13. Quantitative comparison on Human3.6M separated by motion decile.

As shown in Figure 13, our hierarchical approach (e.g., Ours BG) tends to achieve PSNR performance that is better than the optical-flow-based method and comparable to the convolutional LSTM. In addition, when using the future pose provided by the oracle as input to our image generator, the PSNR scores get a larger boost compared to Section A.1. This is because there is higher uncertainty about the actions being performed in the Human3.6M dataset compared to the Penn Action dataset. Therefore, even plausible future predictions can still deviate significantly from the ground-truth future trajectory, which can penalize PSNR.

Figure 14. Predicted frame PSNR vs. Mean Squared Error on the predicted pose for each motion decile in Human3.6M.

To gain further insight into this problem, we provide two additional analyses. First, we compute how the average PSNR changes as the future pose MSE increases (Figure 14). The figure clearly shows a negative correlation between the predicted pose MSE and the frame PSNR, meaning that larger deviations of the predicted future pose from the ground-truth future pose tend to cause lower PSNRs. Second, we show snapshots of video prediction from different methods along with PSNRs over time (Figures 15 and 16). Our method tends to produce plausible future pose trajectories, but they can deviate from the ground-truth trajectory; in such cases, our method tends to achieve low PSNRs. However, when the future pose predicted by our method matches the ground truth well, the PSNR is much higher and the generated frame is perceptually very similar to the ground-truth frame. In contrast, optical flow and the convolutional LSTM make predictions that often lose the structure of the foreground (e.g., the human) over time, and eventually their predicted videos tend to become static. It is interesting to note that our method is comparable to the convolutional LSTM in terms of PSNR, yet it still strongly outperforms the convolutional LSTM in terms of human evaluation, as described in Section 6.2.

Figure 15. Quantitative and visual comparison on Human3.6M for selected time-steps for the actions of walking (left) and walk together (right). Side by side video comparison can be found in our project website.
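The negative correlation noted above between predicted-pose MSE and frame PSNR can also be quantified directly from the per-decile pairs. A minimal sketch using the psnr_vs_pose_mse_by_decile helper defined earlier (again an assumed name, not the paper's code):

```python
import numpy as np

def pose_mse_psnr_correlation(stats):
    """Pearson correlation between mean predicted-pose MSE and mean frame PSNR
    across motion deciles; `stats` is the output of psnr_vs_pose_mse_by_decile.
    A value close to -1 means larger pose errors go with lower PSNR."""
    mses, psnrs = zip(*stats)
    return float(np.corrcoef(mses, psnrs)[0, 1])
```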

Figure 16. Quantitative and visual comparison on Human3.6M for selected time-steps for the actions of walk dog (top left), phoning (top right), sitting down (bottom left), and walk together (bottom right). Side by side video comparison can be found in our project website.

To directly compare our image generator using the predicted future pose (Ours) and the ground-truth future pose given by the oracle (Ours GT-pose*), we present qualitative experiments in Figure 17 and Figure 18. We can see that both predicted videos contain the action performed in the ground-truth video. However, the oracle-based video reflects the exact future very well.

Figure 17. Qualitative evaluation of our network for long-term pixel-level generation. We show the actions of giving directions (top three rows), posing (middle three rows), and walk dog (bottom three rows). Side by side video comparison can be found in our project website.

Figure 18. Qualitative evaluation of our network for long-term pixel-level generation. We show the actions of walk together (top three rows), sitting down (middle three rows), and walk dog (bottom three rows). Side by side video comparison can be found in our project website.