DEEP LEARNING BASED VISION-TO-LANGUAGE APPLICATIONS: CAPTIONING OF PHOTO STREAMS, VIDEOS, AND ONLINE POSTS

DEEP LEARNING BASED VISION-TO-LANGUAGE APPLICATIONS: CAPTIONING OF PHOTO STREAMS, VIDEOS, AND ONLINE POSTS
Gunhee Kim, Computer Science and Engineering, Seoul National University
Seoul, October 7, 2016

AGENDA
- Photo stream captioning
- Video captioning

Cesc C. Park and Gunhee Kim. Expressing an Image Stream with a Sequence of Natural Sentences. NIPS 2015.

GENERAL USERS' PHOTO STREAMS
- Suppose that you and your family visit NYC.
- A photo stream is a thread of the user's story.
- Users do not organize their photo streams for later use.
- Can we write a travelogue for a given photo stream?

PREVIOUS WORK: IMAGE CAPTIONING
Retrieve or generate a descriptive natural-language sentence for a given image.
[Socher et al., TACL 2013] [Karpathy et al., CVPR 2015] [Mao et al., ICLR 2015] [Vinyals et al., CVPR 2015] [Gong et al., ECCV 2014], and many more.

LIMITATIONS OF PREVIOUS WORK
- Much of the previous work mainly discusses the relation between a single image and a single sentence (e.g., "A kid is smiling.").
- Correlation, coherence, and story are absent for a stream of images.
- Our approach: extend both the input and output dimensions to a sequence of images and a sequence of sentences.

PROBLEM STATEMENT
Objective: express an image stream as a coherent sequence of sentences.
- Input: a query image stream.
- Output: a coherent sequence of sentences, e.g., "We took a couple days for family vacation in NYC to get away. Empire State Building right off the bat. Caeden is checking out the view. Caeden's first MLB game, and my first in a while... He might be a Mets fan. Shake Shack..."

IMAGE-TEXT PARALLEL TRAINING DATA
Use a set of blog posts to learn the relation between an image stream and a sequence of sentences.
- 19K blog posts with 150K images.
- Blogs are written in a storytelling style.
- Blog pictures are selected as the most canonical ones out of photo albums.
- Sentences associated with the pictures are informative about locations, sentiments, actors, etc.

OUR SOLUTION: CRCN
Coherence Recurrent Convolutional Network:
(1) Convolutional neural networks for image description.
(2) Bidirectional recurrent neural networks for the language model.
(3) A coherence model for a smooth flow across multiple sentences.
A sketch of how these three components can fit together is shown below.
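
The following is a minimal PyTorch sketch of how the three CRCN pathways could be wired into a single compatibility score. The dimensions, the choice of GRUs, and the fusion-and-scoring layer are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class CRCNSketch(nn.Module):
    """Hypothetical sketch: CNN features + BRNN + coherence model -> score."""
    def __init__(self, img_dim=4096, sent_dim=300, hid=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hid)          # (1) CNN image features -> joint space
        self.brnn = nn.GRU(sent_dim, hid, bidirectional=True,
                           batch_first=True)             # (2) BRNN language model
        self.coherence = nn.GRU(sent_dim, hid,
                                batch_first=True)        # (3) coherence over the sentence flow
        self.score = nn.Linear(4 * hid, 1)               # fuse the three pathways per position

    def forward(self, img_feats, sent_feats):
        # img_feats: (N, img_dim) CNN features, one per image in the stream
        # sent_feats: (N, sent_dim) embeddings, one per candidate sentence
        v = self.img_proj(img_feats)                     # (N, hid)
        h, _ = self.brnn(sent_feats.unsqueeze(0))        # (1, N, 2*hid)
        c, _ = self.coherence(sent_feats.unsqueeze(0))   # (1, N, hid)
        fused = torch.cat([v, h[0], c[0]], dim=1)        # (N, 4*hid)
        return self.score(fused).sum()                   # scalar compatibility score

# Usage on a stream of 5 images paired with 5 sentences (random features):
# model = CRCNSketch(); s = model(torch.randn(5, 4096), torch.randn(5, 300))
```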

OVERVIEW OF ALGORITHM
[Figure: overall pipeline of the algorithm]

CRCN ARCHITECTURE
The network outputs a compatibility score between an image stream and a sentence sequence: aligned (ground-truth) pairs should score high, while misaligned pairs should score low. A sketch of a standard ranking objective of this kind follows.
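
Below is a minimal sketch of a max-margin ranking loss over compatibility scores, the standard way to push aligned pairs above misaligned ones in image-sentence alignment work. The margin value and the pairwise-hinge formulation are assumptions for illustration; the paper's exact loss may differ.

```python
import torch

def ranking_loss(scores, margin=1.0):
    # scores: (B, B) matrix where scores[i, j] is the compatibility of
    # image stream i with sentence sequence j; the diagonal holds aligned pairs.
    aligned = scores.diag().unsqueeze(1)             # (B, 1) aligned-pair scores
    cost = (margin - aligned + scores).clamp(min=0)  # hinge on every misaligned pair
    cost.fill_diagonal_(0)                           # aligned pairs incur no cost
    return cost.sum() / scores.size(0)
```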

RETRIEVAL OF SENTENCE SEQUENCES
- Retrieve the best sentence sequences for a query image stream.
- Divide-and-conquer search strategy: nearly optimal in practice.
- Key observation: local fluency and coherence are required for global fluency and coherence, so a long sequence can be assembled from well-scoring shorter segments (see the sketch below).
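
Here is a hedged sketch of what such a divide-and-conquer search could look like: brute-force short segments, keep only the top-K candidates per segment, then re-score the K*K concatenations. The helper `score_fn`, the beam width `K`, and the halving scheme are hypothetical details, not the paper's exact procedure.

```python
from itertools import product

def retrieve(images, candidates, score_fn, K=5):
    # images: list of image features for the query stream
    # candidates: list of sentence sequences, each as long as `images`
    # score_fn(images, sents) -> float compatibility score (e.g., from CRCN)
    if len(images) <= 2:  # base case: brute-force the short segment
        return sorted(candidates, key=lambda s: -score_fn(images, s))[:K]
    mid = len(images) // 2
    left = retrieve(images[:mid], [s[:mid] for s in candidates], score_fn, K)
    right = retrieve(images[mid:], [s[mid:] for s in candidates], score_fn, K)
    # conquer: re-score only the K*K concatenations instead of all candidates
    merged = [l + r for l, r in product(left, right)]
    return sorted(merged, key=lambda s: -score_fn(images, s))[:K]
```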

USER STUDIES VIA AMT
- Goal: find general users' preferences between the text sequences produced by different methods for a given photo stream.
- Randomly select 100 test streams of 5 images; our method and one baseline each predict a text sequence.
- Pairwise preference test via AMT.
- Quantitative results: a preference rate above 50% validates our approach (CRCN).
- Coherence becomes more critical as the passage grows longer (4th vs. 5th columns).
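
For concreteness, a tiny sketch of how such a pairwise preference rate could be tallied; the input format is a hypothetical illustration of the protocol, not the study's actual data handling.

```python
def preference_rate(picks):
    # picks: one 'ours' or 'baseline' choice per AMT pairwise comparison
    return 100.0 * sum(p == 'ours' for p in picks) / len(picks)

# e.g., preference_rate(['ours', 'ours', 'baseline', 'ours']) -> 75.0
```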

RESULTS FOR THE NYC DATASET
[Figure: five query images (1)-(5)]
(CRCN)
(1) One of the hallway arches inside of the library.
(2) As we walked through the library I noticed an exhibit called "Lunch Hour NYC"; it captured my attention as I had also taken a tour of NYC food carts during my trip.
(3) Here is the top of the Chrysler Building, everyone's favorite skyscraper in New York.
(4) After leaving the NYPL we walked along 42nd St.
(5) We walked down Fifth Avenue from Rockefeller Centre, checking out the windows in Saks and the designer stores, and eventually making our way to the impressive New York Public Library.
(RCN)
(1) As you walk along, in some spots it looks like the buildings are sprouting up out of the High Line plants.
(2) Charlie and his aunt Donna relax on the High Line after a steamy stroll.
(3) However, navigating the New York subway system can be like trying to find your way through the Amazon jungle sans guide.
(4) We loved NYC!
(5) Getting ready for the sunny day... putting sunscreen on.

AGENDA
- Photo stream captioning
- Video captioning

WON LSMDC 2016!
- Large Scale Movie Description and Understanding Challenge (LSMDC 2016), at ECCV 2016 and MM 2016: https://sites.google.com/site/describingmovies/lsmdc-2016
- Team members: Youngjae Yu, Hyungjin Ko, Jongwook Choi, Gunhee Kim
- Tracks: movie description, movie multiple-choice test, movie fill-in-the-blank, movie retrieval
- Example description: "His vanity license plate reads 732."

ATTENTION MECHANISMS IN DEEP LEARNING
The machine decides where to attend by itself: it sequentially focuses on the most relevant part of the input over time.
- Image captioning [Xu et al., ICML 2015]
- Reading comprehension [Hermann et al., NIPS 2015]
K. Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015.
K. M. Hermann et al. Teaching Machines to Read and Comprehend. NIPS 2015.
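
As a concrete reference point, here is a minimal sketch of the soft attention used in "Show, Attend and Tell": the decoder state scores each spatial CNN feature, a softmax converts the scores into attention weights, and the context vector is their weighted sum. The dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    def __init__(self, feat_dim=512, hid_dim=512, att_dim=256):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, att_dim)   # score each image region
        self.w_hid = nn.Linear(hid_dim, att_dim)     # condition on decoder state
        self.v = nn.Linear(att_dim, 1)

    def forward(self, feats, h):
        # feats: (B, L, feat_dim) CNN features at L spatial locations
        # h: (B, hid_dim) current decoder hidden state
        e = self.v(torch.tanh(self.w_feat(feats)
                              + self.w_hid(h).unsqueeze(1)))  # (B, L, 1)
        alpha = torch.softmax(e, dim=1)                       # attention weights
        context = (alpha * feats).sum(dim=1)                  # (B, feat_dim)
        return context, alpha.squeeze(-1)
```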

ATTENTION MECHANISMS IN DEEP LEARNING (CONTINUED)
- Action recognition [Sharma et al., arXiv 2015]: e.g., golf swinging, trampoline jumping.
- Video captioning [Sharma, 2016]: generated sentence "A woman is slicing an onion." vs. ground truth "A woman is slicing a shrimp."
S. Sharma et al. Action Recognition Using Visual Attention. arXiv 2015.
S. Sharma. Action Recognition and Video Description Using Visual Attention. MS thesis, University of Toronto, 2016.

PROBLEM STATEMENT
- Although attention models simulate human attention, there has been no attempt to explicitly use human gaze labels; attention weights are instead learned implicitly in an end-to-end manner.
- Does human attention improve model performance? If so, how can we inject such supervision into the attention model?
- Target task: video captioning.

PROBLEM STATEMENT
Objective: supervise a caption-generation model to attend where humans focus.
- Input: a short movie clip.
- Predict human attention over the clip.
- Output: a human-gaze-assisted caption, e.g., "A little boy flies in the air by riding a bicycle."
A sketch of one way to add such supervision follows.
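
Below is a hedged sketch of one plausible way to inject gaze supervision: an auxiliary KL term pulls the model's attention distribution toward the human gaze map, alongside the usual captioning loss. The KL formulation and the weight `lam` are assumptions for illustration, not the authors' exact objective.

```python
import torch
import torch.nn.functional as F

def gaze_supervised_loss(caption_logits, targets, attn, gaze, lam=0.1):
    # caption_logits: (B, T, V) word scores; targets: (B, T) word indices
    # attn: (B, L) predicted attention over L regions (rows sum to 1)
    # gaze: (B, L) human gaze maps normalized to distributions
    caption_loss = F.cross_entropy(
        caption_logits.reshape(-1, caption_logits.size(-1)),
        targets.reshape(-1))
    # KL(gaze || attn): penalize attention that ignores gazed regions
    attn_loss = F.kl_div(attn.clamp(min=1e-8).log(), gaze,
                         reduction='batchmean')
    return caption_loss + lam * attn_loss
```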

RESULTS (CAPTION GENERATION)
[Figures: generated captions compared against the baselines below]
[1] Subhashini Venugopalan et al. Sequence to Sequence - Video to Text. ICCV 2015.
[2] Li Yao et al. Describing Videos by Exploiting Temporal Structure. ICCV 2015.

CONCLUSION
- Joint understanding of multiple data modalities: visual data (images/videos) + textual data.
- Deep learning models excel at jointly representing multiple data modalities and tasks.
- Relevant to robotics, VR, security, speech analysis, and more, with many possible applications in online services.

THANK YOU
Seoul, October 7, 2016