Recognize, Describe, and Generate: Introduction of Recent Work at MIL, The University of Tokyo (NVAIL Partner)
Yoshitaka Ushiku
MIL: Machine Intelligence Laboratory
Beyond Human Intelligence Based on Cyber-Physical Systems
Members: one Professor (Prof. Harada), one Lecturer (me), one Assistant Professor, one Postdoc, two Office Administrators, 11 Ph.D. students, 23 Master's students, 8 Bachelor's students, and 5 Interns
Research topics vary widely: ICCV, CVPR, ECCV, ICML, NIPS, ICASSP, SIGdial, ACM Multimedia, ICME, ICRA, IROS, etc.
The most important thing: we are hiring!
Journalist Robot
Born in 2006. Objective: publishing news automatically.
Recognize: objects, people, actions
Describe: what is happening
Generate: content as humans do
Outline
Journalist Robot: the ancestor of current work at MIL; our research originates with this robot
Recognize
  Basics: framework for DL, domain adaptation
  Classification: single-modality, multi-modality
Describe
  Image captioning, video captioning
Generate
  Image reconstruction, video generation
Recognize
MILJS: JavaScript Deep Learning [Hidaka+, ICLR Workshop 2017]
MILJS: JavaScript Deep Learning [Hidaka+, ICLR Workshop 2017]
Supports both training and inference.
Nodes with GPGPUs: WebCL is currently used; WebGPU support is in progress.
Nodes without GPGPUs: no software installation required.
Even ResNet with 152 layers can be trained.
Let me show you a preliminary demonstration using MNIST!
Asymmetric Tri-training for Domain Adaptation [Saito+, submitted to ICML 2017]
Unsupervised domain adaptation: a model is trained on MNIST; does it work on SVHN?
Ground-truth labels are associated with the source (MNIST), but there are no labels for the target (SVHN).
Asymmetric Tri-training for Domain Adaptation [Saito+, submitted to ICML 2017]
Asymmetric tri-training assigns pseudo labels to the target domain.
Asymmetric Tri-training for Domain Adaptation [Saito+, submitted to ICML 2017]
1st round: train on MNIST and add pseudo labels (e.g., "eight", "nine") for easy target samples.
2nd round onward: train on MNIST plus the pseudo-labeled samples, then add more pseudo labels.
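The pseudo-labeling loop above can be sketched as follows. This is a toy illustration, not the paper's CNNs: two classifiers trained on the labeled source domain predict on each target sample, and a pseudo label is kept only when both agree with high confidence (the agreement-plus-threshold rule and the toy `net1`/`net2` stand-ins are assumptions for illustration).

```python
# Minimal sketch of pseudo-labeling for asymmetric tri-training.
# Two source-trained classifiers label target samples; labels are kept
# only where both agree and both are confident.

def agree_and_confident(p1, p2, threshold=0.9):
    """p1, p2: (label, confidence) predictions from the two source-trained nets."""
    (y1, c1), (y2, c2) = p1, p2
    if y1 == y2 and min(c1, c2) >= threshold:
        return y1
    return None

def pseudo_label(target_samples, net1, net2, threshold=0.9):
    """Return [(sample, pseudo_label)] for samples both nets agree on."""
    labeled = []
    for x in target_samples:
        y = agree_and_confident(net1(x), net2(x), threshold)
        if y is not None:
            labeled.append((x, y))
    return labeled

# Toy stand-ins for the two source-trained networks.
net1 = lambda x: ("eight", 0.95) if x > 0 else ("nine", 0.5)
net2 = lambda x: ("eight", 0.92) if x > 0 else ("three", 0.6)

labeled = pseudo_label([1.0, -1.0, 2.0], net1, net2)
```

In later rounds the pseudo-labeled pairs would be added to the training set of a third, target-specific classifier.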
End-to-end learning for environmental sound classification [Tokozume+, ICASSP 2017]
Existing methods for speech/sound recognition:
1. Feature extraction: Fourier transform (log-mel features)
2. Classification: CNN on the extracted feature map
Log-mel features are suitable for human speech, but are they suitable for environmental sounds?
End-to-end learning for environmental sound classification [Tokozume+, ICASSP 2017]
Proposed approach (EnvNet): a single CNN performs both (1) feature-map extraction and (2) classification.
End-to-end learning for environmental sound classification [Tokozume+, ICASSP 2017]
Comparison of accuracy [%] on ESC-50 [Piczak, ACM MM 2015]:
log-mel feature + CNN [Piczak, MLSP 2015]: 64.5
End-to-end CNN (ours): 64.0
End-to-end CNN & log-mel feature + CNN (ours): 71.0
EnvNet can extract discriminative features for environmental sounds.
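The core idea of the end-to-end approach can be sketched as a learned front end: instead of a fixed log-mel transform, a bank of 1-D convolution filters is applied directly to the raw waveform, followed by a nonlinearity and pooling. This is an illustrative numpy sketch, not the published EnvNet architecture; the filter bank here is random where EnvNet's would be learned.

```python
# Sketch of a learnable waveform front end (stand-in for log-mel features).
import numpy as np

def conv_frontend(waveform, filters, pool=4):
    """waveform: (T,); filters: (n_filters, k). Returns (n_filters, T'//pool)."""
    # 1-D convolution of each filter with the raw waveform.
    feats = np.stack([np.convolve(waveform, f, mode="valid") for f in filters])
    feats = np.maximum(feats, 0.0)                  # ReLU nonlinearity
    t = feats.shape[1] // pool * pool               # trim to a multiple of pool
    return feats[:, :t].reshape(len(filters), -1, pool).max(axis=2)  # max-pool

rng = np.random.default_rng(0)
wave = rng.standard_normal(64)          # stand-in for a raw audio clip
filters = rng.standard_normal((8, 9))   # in EnvNet these would be learned
fmap = conv_frontend(wave, filters)
```

The resulting feature map would then be fed to further convolutional layers for classification, so the whole pipeline is trained end to end.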
Visual Question Answering (VQA) [Saito+, ICME 2017]
A question-answering system that takes an associated image and a natural-language question.
Q: Is it going to rain soon? Ground-truth A: yes
Q: Why is there snow on one side of the stream and clear grass on the other? Ground-truth A: shade
Visual Question Answering (VQA) [Saito+, ICME 2017]
VQA as multi-class classification: extract an image feature and a question feature (e.g., "What objects are found on the bed?"), integrate them into a single vector, and classify that vector into an answer (e.g., "bed sheets, pillow"). After integration, it is a usual classification problem.
Visual Question Answering [Saito+, ICME 2017]
Current advances improve how the image and question features are integrated:
Concatenation, e.g., [Antol+, ICCV 2015]
Summation, e.g., image feature (with attention) + question feature [Xu+Saenko, ECCV 2016]
Multiplication, e.g., bilinear multiplication [Fukui+, EMNLP 2016]
This work: DualNet, which performs summation, multiplication, and concatenation.
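The three fusion operators above can be sketched in a few lines. The shapes, the toy feature vectors, and the final concatenation of all three views are illustrative assumptions, not the exact DualNet model; the point is only that the answer classifier can see all three integrations at once.

```python
# Sketch of VQA feature-fusion operators (toy dimensions).
import numpy as np

def fuse(img_feat, q_feat):
    """Combine an image feature and a question feature three ways."""
    concat = np.concatenate([img_feat, q_feat])   # concatenation [Antol+, ICCV 2015]
    summed = img_feat + q_feat                    # summation [Xu+Saenko, ECCV 2016]
    product = img_feat * q_feat                   # cf. bilinear [Fukui+, EMNLP 2016]
    # DualNet-style: give the answer classifier all three views.
    return np.concatenate([concat, summed, product])

v = np.ones(4)          # stand-in image feature
q = np.arange(4.0)      # stand-in question feature
z = fuse(v, q)          # fused vector fed to a multi-class answer classifier
```

In a real model, `z` would pass through fully connected layers ending in a softmax over the answer vocabulary.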
Visual Question Answering (VQA) [Saito+, ICME 2017]
VQA Challenge 2016 (at CVPR 2016): won 1st place on abstract images, without an attention mechanism.
Q: What fruit is yellow and brown? A: banana
Q: How many screens are there? A: 2
Q: What is the boy playing with? A: teddy bear
Q: Are there any animals swimming in the pond? A: no
Describe
Automatic Image Captioning [Ushiku+, ACMMM 2011]
Training Dataset
A small white dog wearing a flannel warmer. / A white van parked in an empty lot. / A small gray dog on a leash. / A white cat rests head on a stone. / A small white dog standing on a leash. / A black dog standing in a grassy area. / White and gray kitten lying on its side. / Silver car parked on side of road. / A woman posing on a red scooter.
Nearest captions for the input image: A small white dog wearing a flannel warmer. / A small gray dog on a leash. / A black dog standing in a grassy area.
Automatic Image Captioning [ACM MM 2012, ICCV 2015] Group of people sitting at a table with a dinner. Tourists are standing on the middle of a flat desert.
Image Captioning + Sentiment Terms [Andrew+, BMVC 2016] A confused man in a blue shirt is sitting on a bench. A man in a blue shirt and blue jeans is standing in the overlooked water. A zebra standing in a field with a tree in the dirty background.
Image Captioning + Sentiment Terms [Andrew+, BMVC 2016]
Two steps for adding a sentiment term:
1. Usual image captioning using CNN+RNN; the most probable noun is memorized.
Image Captioning + Sentiment Terms [Andrew+, BMVC 2016]
Two steps for adding a sentiment term:
1. Usual image captioning using CNN+RNN
2. The model is forced to predict a sentiment term before the memorized noun.
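A toy sketch of the two-step scheme: take the caption from step 1, locate the noun, and force a sentiment term in front of it in step 2. Here the "most probable noun" is approximated by the first word found in a small noun list; both that list and the string-level insertion are illustrative stand-ins for what the RNN decoder actually does.

```python
# Toy sketch: insert a sentiment term before the caption's noun.
NOUNS = {"man", "woman", "dog", "zebra", "bench"}   # hypothetical vocabulary

def add_sentiment(caption, sentiment):
    """Insert `sentiment` before the first noun found in `caption`."""
    words = caption.rstrip(".").split()
    out, inserted = [], False
    for w in words:
        if not inserted and w in NOUNS:   # stand-in for the memorized noun
            out.append(sentiment)
            inserted = True
        out.append(w)
    return " ".join(out) + "."

result = add_sentiment("A man in a blue shirt is sitting on a bench.", "confused")
```

In the actual model, the sentiment word is produced by the decoder itself rather than spliced in afterward.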
Beyond Caption to Narrative [Andrew+, ICIP 2016] A man is holding a box of doughnuts. Then he and a woman are standing next each other. Then she is holding a plate of food.
Beyond Caption to Narrative [Andrew+, ICIP 2016] A man is holding a box of doughnuts. he and a woman are standing next each other. she is holding a plate of food. Narrative
Beyond Caption to Narrative [Andrew+, ICIP 2016] A boat is floating on the water near a mountain. And a man riding a wave on top of a surfboard. Then he on the surfboard in the water.
Generate
Image Reconstruction [Kato+, CVPR 2014]
Traditional pipeline for image classification:
1. Extract local descriptors d_1, ..., d_N
2. Collect the descriptors
3. Calculate a global feature from p(d; θ)
4. Classify the image (e.g., "camera", "cat")
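The pipeline above can be sketched with a bag-of-visual-words instance of the descriptor distribution p(d; θ): assign each local descriptor to its nearest centroid and use the normalized histogram as the global feature. The fixed centroids and toy descriptors below are illustrative assumptions, not [Kato+]'s exact model.

```python
# Sketch of the descriptors -> global feature step (bag-of-visual-words).
import numpy as np

def global_feature(descriptors, centroids):
    """Assign each descriptor d_j to its nearest centroid; the normalized
    histogram of assignments is the global feature fed to the classifier."""
    d2 = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)                       # nearest centroid index
    hist = np.bincount(assign, minlength=len(centroids)).astype(float)
    return hist / hist.sum()

centroids = np.array([[0.0, 0.0], [1.0, 1.0]])        # toy visual words
descs = np.array([[0.1, 0.0], [0.9, 1.1], [1.0, 0.9], [0.0, 0.2]])
feat = global_feature(descs, centroids)
```

A standard classifier (e.g., a linear SVM) would then map `feat` to a label such as "camera" or "cat".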
Image Reconstruction [Kato+, CVPR 2014]
The inverse problem: reconstructing an image from a label (e.g., "pot").
Image Reconstruction [Kato+, CVPR 2014]
"Pot": the arrangement is optimized using a global location cost plus an adjacency cost.
Other examples: cat (Bombay), camera, grand piano, gramophone, headphone, pyramid, Joshua tree, wheelchair.
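The two costs named on the slide can be written down in a toy form: a global location cost pulls each patch toward its typical position for the class, and an adjacency cost keeps neighboring patches close. The quadratic penalties and the toy "lid"/"body" patches below are illustrative assumptions, not the paper's exact energies.

```python
# Toy sketch of the arrangement objective: location cost + adjacency cost.

def arrangement_cost(positions, preferred, neighbors, w_loc=1.0, w_adj=1.0):
    """positions/preferred: {patch: (x, y)}; neighbors: list of (patch_a, patch_b)."""
    # Global location cost: squared distance from each patch's preferred spot.
    loc = sum((positions[p][0] - preferred[p][0]) ** 2 +
              (positions[p][1] - preferred[p][1]) ** 2 for p in positions)
    # Adjacency cost: squared distance between neighboring patches.
    adj = sum((positions[a][0] - positions[b][0]) ** 2 +
              (positions[a][1] - positions[b][1]) ** 2 for a, b in neighbors)
    return w_loc * loc + w_adj * adj

pos = {"lid": (0, 0), "body": (0, 1)}     # candidate arrangement
pref = {"lid": (0, 0), "body": (0, 2)}    # class-typical positions
cost = arrangement_cost(pos, pref, [("lid", "body")])
```

Reconstruction would search over arrangements to minimize this total cost; the two weights trade off fidelity to typical layout against local coherence.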
Video Generation [Yamamoto+, ACMMM 2016]
Image generation is still challenging: it is only successful in controlled settings such as human faces (BEGAN [Berthelot+, 2017 Mar.]) and birds and flowers (StackGAN [Zhang+, 2016 Dec.]).
Video generation [Vondrick+, NIPS 2016] additionally requires temporal consistency, which makes it extremely challenging.
Video Generation [Yamamoto+, ACMMM 2016]
This work generates simple videos with:
C3D (3D convolutional neural network) for conditional generation from an input label
tempCAE (temporal convolutional auto-encoder) for regularizing the video to improve its naturalness
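A minimal sketch of why a temporal regularizer helps: tempCAE's role on the slide is to push generated videos toward natural temporal structure. A simple stand-in for that effect is a loss that penalizes large frame-to-frame changes; this frame-difference loss is an illustrative assumption, not the paper's auto-encoder objective.

```python
# Toy temporal-consistency loss: penalize large frame-to-frame changes.
import numpy as np

def temporal_consistency_loss(video):
    """video: (T, H, W) array of frames; mean squared frame difference."""
    diffs = np.diff(video, axis=0)
    return float((diffs ** 2).mean())

# Gradually brightening frames (natural motion) vs. flickering frames.
smooth = np.stack([np.full((2, 2), t * 0.1) for t in range(4)])
jumpy = np.stack([np.full((2, 2), (t % 2) * 1.0) for t in range(4)])
```

The flickering clip incurs a much larger penalty than the smooth one, which is the direction in which such a regularizer steers a generator.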
Video Generation [Yamamoto+, ACMMM 2016]
"Car runs to left": ours (C3D+tempCAE) vs. only C3D
"Rocket flies up": ours (C3D+tempCAE) vs. only C3D
Conclusion
MIL: Machine Intelligence Laboratory, Beyond Human Intelligence Based on Cyber-Physical Systems
This talk introduced some of our current research:
Recognize
  Basics: framework for DL, domain adaptation
  Classification: single-modality, multi-modality
Describe
  Image captioning, video captioning
Generate
  Image reconstruction, video generation