The University of Tokyo, NVAIL Partner Yoshitaka Ushiku

Recognize, Describe, and Generate: Introduction of Recent Work at MIL The University of Tokyo, NVAIL Partner Yoshitaka Ushiku

MIL: Machine Intelligence Laboratory
Beyond Human Intelligence Based on Cyber-Physical Systems
Members: one Professor (Prof. Harada), one Lecturer (me), one Assistant Professor, one Postdoc, two Office Administrators, 11 Ph.D. students, 23 Master's students, 8 Bachelor's students, 5 interns
Varying research topics: ICCV, CVPR, ECCV, ICML, NIPS, ICASSP, SIGdial, ACM Multimedia, ICME, ICRA, IROS, etc.
The most important thing: we are hiring!

Journalist Robot
Born in 2006. Objective: publishing news automatically.
Recognize: objects, people, and actions.
Describe: what is happening.
Generate: contents as humans do.

Outline
Journalist Robot: ancestor of the current work in MIL; our research originates with this robot.
Recognize
- Basic: framework for DL, domain adaptation
- Classification: single modality, multiple modalities
Describe
- Image captioning
- Video captioning
Generate
- Image reconstruction
- Video generation

Recognize

MILJS: JavaScript Deep Learning [Hidaka+, ICLR Workshop 2017]

MILJS: JavaScript Deep Learning [Hidaka+, ICLR Workshop 2017]
Supports both learning and inference.
Supports nodes with GPGPUs: currently WebCL is utilized; WebGPU support is in progress.
Supports nodes without GPGPUs: no software installation is required.
Even a ResNet with 152 layers can be trained.
Let me show you a preliminary demonstration using MNIST!

Asymmetric Tri-training for Domain Adaptation [Saito+, submitted to ICML 2017]
Unsupervised domain adaptation: does a model trained on MNIST work on SVHN?
Ground-truth labels are available for the source (MNIST), but there are no labels for the target (SVHN).

Asymmetric Tri-training for Domain Adaptation [Saito+, submitted to ICML 2017] Asymmetric Tri-training: pseudo labels for target domain

Asymmetric Tri-training for Domain Adaptation [Saito+, submitted to ICML 2017]
1st round: train on MNIST; add pseudo labels for easy target samples (e.g., "eight", "nine").
2nd round onward: train on MNIST plus the pseudo-labeled target samples; add more pseudo labels.
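The pseudo-labeling step can be sketched as follows. This is a minimal numpy sketch of the idea, not the paper's implementation: the two labeling networks are represented only by their softmax outputs, and the agreement rule and confidence threshold are simplifying assumptions.

```python
import numpy as np

def select_pseudo_labels(probs1, probs2, threshold=0.9):
    """Assign pseudo labels to target samples on which two labeling
    networks agree with high confidence (the tri-training idea).

    probs1, probs2: (n_samples, n_classes) softmax outputs of the two
    labeling networks on unlabeled target data.
    Returns (indices, labels) of the selected samples.
    """
    pred1 = probs1.argmax(axis=1)
    pred2 = probs2.argmax(axis=1)
    conf = np.maximum(probs1.max(axis=1), probs2.max(axis=1))
    # Keep only samples where both networks agree and confidence is high
    mask = (pred1 == pred2) & (conf > threshold)
    return np.where(mask)[0], pred1[mask]

# Toy example: 3 target samples, 2 classes
p1 = np.array([[0.95, 0.05], [0.60, 0.40], [0.05, 0.95]])
p2 = np.array([[0.92, 0.08], [0.30, 0.70], [0.20, 0.80]])
idx, labels = select_pseudo_labels(p1, p2)
# Samples 0 and 2 are selected; sample 1 is dropped (disagreement)
```

In each subsequent round, the selected target samples would be added to the training set with their pseudo labels, as the slide describes.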

End-to-end Learning for Environmental Sound Classification [Tokozume+, ICASSP 2017]
Existing methods for speech/sound recognition:
1. Feature extraction: Fourier transformation (log-mel features)
2. Classification: CNN on the extracted feature map
Log-mel features are suitable for human speech, but what about environmental sounds?

End-to-end Learning for Environmental Sound Classification [Tokozume+, ICASSP 2017]
Proposed approach (EnvNet): a CNN that performs both (1) feature-map extraction and (2) classification.

End-to-end Learning for Environmental Sound Classification [Tokozume+, ICASSP 2017]
Comparison of accuracy [%] on ESC-50 [Piczak, ACM MM 2015]:
- log-mel features + CNN [Piczak, MLSP 2015]: 64.5
- End-to-end CNN (ours): 64.0
- End-to-end CNN & log-mel features + CNN (ours): 71.0
EnvNet can extract discriminative features for environmental sounds.
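The core idea of learning the feature map directly from the raw waveform, instead of computing log-mel features first, can be illustrated with a single learned 1-D convolution layer. This is a numpy sketch under simplified assumptions (random filters, one layer); EnvNet itself stacks several convolution and pooling layers, which are not reproduced here.

```python
import numpy as np

def conv1d_relu(wave, kernels, stride=1):
    """One 1-D convolution layer with ReLU over a raw waveform.
    wave: (T,) raw audio samples; kernels: (n_filters, k) filters.
    Returns an (n_filters, T_out) feature map, the analogue of the
    spectrogram-like map an end-to-end network learns from data."""
    n_f, k = kernels.shape
    T_out = (len(wave) - k) // stride + 1
    out = np.empty((n_f, T_out))
    for t in range(T_out):
        seg = wave[t * stride : t * stride + k]
        out[:, t] = kernels @ seg  # correlate each filter with the segment
    return np.maximum(out, 0.0)   # ReLU

rng = np.random.default_rng(0)
wave = rng.standard_normal(16000)        # 1 s of audio at 16 kHz (toy input)
kernels = rng.standard_normal((8, 64))   # 8 filters of length 64 (assumed sizes)
fmap = conv1d_relu(wave, kernels, stride=2)
```

When the filters are learned jointly with the classifier, the network is free to discover representations that suit environmental sounds rather than human speech.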

Visual Question Answering (VQA) [Saito+, ICME 2017]
A question-answering system for an associated image and a question in natural language.
Q: Is it going to rain soon? Ground-truth A: yes
Q: Why is there snow on one side of the stream and clear grass on the other? Ground-truth A: shade

Visual Question Answering (VQA) [Saito+, ICME 2017]
VQA as multi-class classification: an image feature and a question feature are integrated into a single vector, from which the answer is predicted.
Example question: "What objects are found on the bed?" Answer: "bed sheets, pillow".
After integration, it is a usual classification problem.

Visual Question Answering [Saito+, ICME 2017]
Current advances focus on how to integrate the image feature and the question feature:
- Concatenation, e.g., [Antol+, ICCV 2015]
- Summation, e.g., image feature (with attention) + question feature [Xu+Saenko, ECCV 2016]
- Multiplication, e.g., bilinear multiplication [Fukui+, EMNLP 2016]
This work: DualNet performs summation, multiplication, and concatenation together.
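The integration schemes above, and their DualNet-style combination, can be sketched in a few lines. This is a numpy sketch with illustrative assumptions: the feature dimensions, the plain concatenation of all three results, and the final linear answer head are stand-ins, not the paper's exact architecture.

```python
import numpy as np

def fuse(v_img, v_q):
    """Combine an image feature and a question feature using all three
    common schemes at once (summation, multiplication, concatenation),
    then concatenate the results into one integrated vector."""
    summed = v_img + v_q                    # summation
    prod = v_img * v_q                      # element-wise multiplication
    concat = np.concatenate([v_img, v_q])   # concatenation
    return np.concatenate([summed, prod, concat])

rng = np.random.default_rng(0)
v_img = rng.standard_normal(512)   # toy image feature
v_q = rng.standard_normal(512)     # toy question feature
fused = fuse(v_img, v_q)

# VQA as multi-class classification: a linear head over answer classes
# (a hypothetical 1000-answer vocabulary for illustration).
W = rng.standard_normal((1000, fused.size))
answer_scores = W @ fused
```

The fused vector here has 512 + 512 + 1024 = 2048 dimensions; the predicted answer would be the argmax over the answer scores.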

Visual Question Answering (VQA) [Saito+, ICME 2017]
VQA Challenge 2016 (at CVPR 2016): won 1st place on abstract images without an attention mechanism.
Q: What fruit is yellow and brown? A: banana
Q: How many screens are there? A: 2
Q: What is the boy playing with? A: teddy bear
Q: Are there any animals swimming in the pond? A: no

Describe

Automatic Image Captioning [Ushiku+, ACM MM 2011]

Training dataset captions (examples): "A small white dog wearing a flannel warmer." "A white van parked in an empty lot." "A small gray dog on a leash." "A white cat rests head on a stone." "A small white dog standing on a leash." "White and gray kitten lying on its side." "Silver car parked on side of road." "A woman posing on a red scooter." "A black dog standing in a grassy area."
Nearest captions retrieved for the input image: "A small white dog wearing a flannel warmer." "A small gray dog on a leash." "A black dog standing in a grassy area."

Automatic Image Captioning [ACM MM 2012, ICCV 2015] Group of people sitting at a table with a dinner. Tourists are standing on the middle of a flat desert.

Image Captioning + Sentiment Terms [Andrew+, BMVC 2016] A confused man in a blue shirt is sitting on a bench. A man in a blue shirt and blue jeans is standing in the overlooked water. A zebra standing in a field with a tree in the dirty background.

Image Captioning + Sentiment Terms [Andrew+, BMVC 2016]
Two steps for adding a sentiment term:
1. Usual image captioning using CNN+RNN; the most probable noun is memorized.

Image Captioning + Sentiment Terms [Andrew+, BMVC 2016]
Two steps for adding a sentiment term:
1. Usual image captioning using CNN+RNN.
2. The model is then forced to predict a sentiment term before the memorized noun.
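The second step can be sketched as a simple insertion. This is a toy sketch only: the CNN+RNN captioner and its noun probabilities are replaced by a fixed caption, a hypothetical noun list, and a first-noun rule, none of which come from the paper.

```python
# Toy sketch: insert a sentiment term before the memorized noun of an
# already-generated caption. NOUNS and the first-noun rule are
# illustrative stand-ins for the model's noun probabilities.
NOUNS = {"man", "shirt", "bench", "dog", "water"}  # hypothetical noun list

def add_sentiment(caption, sentiment):
    """Insert `sentiment` immediately before the first known noun."""
    words = caption.split()
    for i, w in enumerate(words):
        if w.strip(".") in NOUNS:
            return " ".join(words[:i] + [sentiment] + words[i:])
    return caption  # no noun found: leave the caption unchanged

out = add_sentiment("A man in a blue shirt is sitting on a bench.", "confused")
# -> "A confused man in a blue shirt is sitting on a bench."
```

In the actual model the sentiment term is predicted by the RNN conditioned on the image, rather than looked up, but the placement before the noun is the same.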

Beyond Caption to Narrative [Andrew+, ICIP 2016] A man is holding a box of doughnuts. Then he and a woman are standing next each other. Then she is holding a plate of food.

Beyond Caption to Narrative [Andrew+, ICIP 2016] A man is holding a box of doughnuts. he and a woman are standing next each other. she is holding a plate of food. Narrative

Beyond Caption to Narrative [Andrew+, ICIP 2016] A boat is floating on the water near a mountain. And a man riding a wave on top of a surfboard. Then he on the surfboard in the water.

Generate

Image Reconstruction [Kato+, CVPR 2014]
Traditional pipeline for image classification: extracting local descriptors (d_1, d_2, ..., d_N), collecting the descriptors, calculating a global feature p(d; θ), and classifying images (e.g., "Camera", "Cat").
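The descriptor-collection step can be illustrated with a simple bag-of-features histogram: each local descriptor is quantized to its nearest codeword and the counts form the global feature. This is a numpy sketch of one common instance of this pipeline; the actual work models the descriptor distribution p(d; θ), and the codebook size and descriptors here are toy values.

```python
import numpy as np

def bag_of_features(descriptors, codebook):
    """Quantize each local descriptor to its nearest codeword and
    return the normalized histogram used as the global image feature."""
    # (n_desc, n_words) matrix of squared distances via broadcasting
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)              # nearest codeword per descriptor
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()               # L1-normalized histogram

rng = np.random.default_rng(0)
descriptors = rng.standard_normal((200, 16))  # toy local descriptors
codebook = rng.standard_normal((32, 16))      # toy visual-word codebook
feature = bag_of_features(descriptors, codebook)
```

The classifier then operates on this fixed-length global feature regardless of how many local descriptors the image produced.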

Image Reconstruction [Kato+, CVPR 2014]
Inverse problem: image reconstruction from a label (e.g., "Pot").

Image Reconstruction [Kato+, CVPR 2014]
"Pot": arrangement optimized using a global location cost + an adjacency cost.
Other examples: cat (Bombay), camera, grand piano, gramophone, headphone, pyramid, joshua tree, wheelchair.
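The two-term objective can be sketched as follows. This is a toy numpy sketch: the quadratic form of both costs, the preferred positions, and the adjacency pairs are simplifying assumptions standing in for the paper's actual arrangement optimization.

```python
import numpy as np

def arrangement_cost(positions, pref_positions, adjacency, weight=1.0):
    """Cost of an arrangement of patches: a global location cost
    (squared distance of each patch from its preferred position) plus
    an adjacency cost (squared distance between patches that should
    appear next to each other)."""
    loc = ((positions - pref_positions) ** 2).sum()
    adj = sum(((positions[i] - positions[j]) ** 2).sum()
              for i, j in adjacency)
    return loc + weight * adj

pref = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # toy preferred spots
adjacency = [(0, 1), (1, 2)]                            # patches to keep close
cost_at_pref = arrangement_cost(pref, pref, adjacency)  # location cost is zero
shifted = pref + 0.5                                    # uniform shift
cost_shifted = arrangement_cost(shifted, pref, adjacency)
```

A uniform shift leaves the adjacency term unchanged but incurs location cost, so the optimizer is pulled back toward the preferred layout while keeping adjacent patches together.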

Video Generation [Yamamoto+, ACM MM 2016]
Image generation is still challenging: it is only successful in controlled settings such as human faces (BEGAN [Berthelot+, Mar. 2017]) and birds and flowers (StackGAN [Zhang+, Dec. 2016]).
Video generation additionally requires temporal consistency, making it extremely challenging [Vondrick+, NIPS 2016].

Video Generation [Yamamoto+, ACM MM 2016]
This work: generating easy videos with
- C3D (3D convolutional neural network) for conditional generation given an input label
- tempCAE (temporal convolutional auto-encoder) for regularizing the video to improve its naturalness
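The temporal-consistency idea can be illustrated with a frame-difference penalty. This is a toy numpy sketch: tempCAE is a learned convolutional auto-encoder, which this simple hand-crafted penalty only approximates.

```python
import numpy as np

def temporal_smoothness_penalty(video):
    """Mean squared difference between consecutive frames: a simple
    stand-in for the temporal regularization a temporal auto-encoder
    imposes on generated video.
    video: (T, H, W) array of frames."""
    diffs = np.diff(video, axis=0)   # (T-1, H, W) frame-to-frame changes
    return float((diffs ** 2).mean())

static = np.ones((8, 4, 4))                        # perfectly smooth toy video
noisy = np.random.default_rng(0).random((8, 4, 4)) # temporally incoherent video
smooth_cost = temporal_smoothness_penalty(static)  # zero: no frame changes
noisy_cost = temporal_smoothness_penalty(noisy)    # positive: flickering frames
```

A generator trained with such a regularizer is discouraged from producing flickering, temporally incoherent frames.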

Video Generation [Yamamoto+, ACM MM 2016]
Examples: "Car runs to left" and "Rocket flies up", each generated with ours (C3D+tempCAE) versus only C3D.

Conclusion
MIL: Machine Intelligence Laboratory, Beyond Human Intelligence Based on Cyber-Physical Systems.
This talk introduced some of our current research:
Recognize
- Basic: framework for DL, domain adaptation
- Classification: single modality, multiple modalities
Describe: image captioning, video captioning
Generate: image reconstruction, video generation