Image Captioning using Reinforcement Learning. Presentation by: Samarth Gupta

Transcription:

Image Captioning using Reinforcement Learning
Presentation by: Samarth Gupta

Summary
  Introduction
  Supervised models
  Image captioning as RL problem
  Actor-Critic architecture
  Policy Gradient architecture
  Conclusion

Introduction
  Caption: describing an image in words
  What are the applications?

Applications

What do we need?
  Datasets
    MSCOCO: ~100k images, 5 captions per image
    Flickr30k: ~30k images, 5 captions per image
    Flickr8k: ~8k images, 5 captions per image
  Evaluation
    BLEU scores: BLEU-1, BLEU-2, BLEU-3, BLEU-4
    METEOR
    CIDEr
  Model

BLEU Scores
  Generated caption: <start> I can cat <end>
  Given a ground truth caption and a generated caption for the same image, the BLEU-n score is based on the fraction of n-grams in the generated caption that also appear in the ground truth caption.
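
The matching-n-gram idea can be sketched in a few lines of Python. This is a simplified illustration, not the full BLEU metric: it computes clipped n-gram precision against a single reference and leaves out the brevity penalty and the averaging over n. The function names and the candidate caption are made up for the example; the reference caption is the one used later in the slides.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate caption against one reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # A candidate n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total

reference = "a band is playing music on stage".split()
candidate = "a band is playing on a stage".split()  # hypothetical model output

print(ngram_precision(candidate, reference, 1))  # 6 of 7 unigrams match
print(ngram_precision(candidate, reference, 2))  # 3 of 6 bigrams match
```

The clipping step keeps a caption from scoring well by repeating one correct word many times.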

Previous approaches
  2011-2013: Caption generation through object detection and language models. These models were very limited in their approach.
  2014: Encoder-decoder framework. Image captioning as a machine translation problem.
  2017: Image captioning as a reinforcement learning problem.

Image captioning as machine translation
  Translation example: "Good Afternoon!" to "Guten Tag!"
  Machine translation is implemented using an encoder-decoder architecture
  Caption example: "A band is playing music on stage"

Encoder-Decoder framework
  Encoder: a convolutional neural network (CNN)
  Decoder: a recurrent neural network (RNN)

Encoder-Decoder with visual attention [1]
  Encoder: a CNN classifier
    Any pretrained CNN can be used, e.g. VGG16 or GoogLeNet
  [1] Xu, Kelvin, et al. "Show, attend and tell: Neural image caption generation with visual attention." International conference on machine learning. 2015.

Encoder-Decoder with visual attention
  Decoder: an RNN
    At each timestep of the RNN, we predict one word
  Attention: allows the model to attend to specific features

Attention Unit
  y_i: image features
  C: context (word features)
  Soft attention
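
A minimal sketch of a soft attention unit follows, under one loud assumption: each image feature y_i is scored against the context C with a plain dot product, whereas the model in [1] uses a small learned network for the scoring step. The function names and toy vectors are illustrative only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_attention(features, context):
    """Weight image features y_i by their relevance to the context C.

    features: list of feature vectors (one per image region)
    context:  vector with the same length as each feature vector
    Returns the attention weights and the weighted sum of the features.
    """
    # Assumption: dot-product scoring stands in for a learned scoring network.
    scores = [sum(f * c for f, c in zip(y, context)) for y in features]
    weights = softmax(scores)
    dim = len(features[0])
    attended = [sum(w * y[d] for w, y in zip(weights, features)) for d in range(dim)]
    return weights, attended

features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy image regions
context = [2.0, 0.0]                              # toy context vector
weights, attended = soft_attention(features, context)
print(weights)
print(attended)
```

Because every region receives a nonzero weight and the output is a weighted average, the whole unit stays differentiable, which is what makes the attention "soft".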

Image Captioning as RL problem
  Goal: given an image I, generate a sentence S = {w_1, w_2, ..., w_T} which correctly describes the image content
  At any timestep t:
    State: image features + words generated until t
    Action: next word to generate
    Reward: can be set in different ways
  We will look into two different architectures:
    Actor-Critic
    Policy Gradient
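
The state/action framing above can be sketched as a sampling loop. Everything named here is a hypothetical stand-in: `toy_policy` returns a uniform distribution instead of a trained captioning model, and `VOCAB` is a six-word toy vocabulary. A reward would be computed on the finished caption by whichever scheme is chosen.

```python
import random

VOCAB = ["a", "band", "is", "playing", "music", "<end>"]  # toy vocabulary

def toy_policy(image_features, words_so_far):
    """Hypothetical policy: uniform probabilities over VOCAB."""
    return [1.0 / len(VOCAB)] * len(VOCAB)

def rollout(image_features, policy, max_len=10, rng=None):
    """Sample one caption, recording the (state, action) pair at each timestep."""
    if rng is None:
        rng = random.Random(0)
    words = []
    trajectory = []
    for _ in range(max_len):
        # State: image features plus the words generated so far.
        state = (image_features, tuple(words))
        probs = policy(image_features, words)
        # Action: the next word, sampled from the policy's distribution.
        action = rng.choices(VOCAB, weights=probs, k=1)[0]
        trajectory.append((state, action))
        if action == "<end>":
            break
        words.append(action)
    return words, trajectory

words, trajectory = rollout([0.1, 0.2], toy_policy)
print(" ".join(words))
print(len(trajectory), "timesteps")
```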

Policy Gradient Architecture
  Learns the policy by maximizing the expected reward
  The method suffers from high variance in its gradient estimates
  One way of reducing the variance is to increase the batch size; however, this leads to inefficient learning
  Alternative: introduce a baseline

Baseline
  Introducing a baseline reduces variance in the policy gradient algorithm
  Acts like a critic to the model
  The goal is to find a good baseline for the policy network
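
The variance claim can be checked on a toy problem. The sketch below assumes a softmax policy over two actions with made-up rewards, and computes the exact mean and variance of the single-sample policy gradient estimate for the first logit. The function name and numbers are illustrative and not taken from the slides.

```python
def grad_stats(probs, rewards, baseline):
    """Exact mean and variance of the one-sample policy gradient estimate
    of dJ/d(logit_0) for a softmax policy over a few discrete actions."""
    # For a softmax policy, d log p(a) / d logit_0 = 1[a == 0] - p_0.
    samples = [(r - baseline) * ((1.0 if a == 0 else 0.0) - probs[0])
               for a, r in enumerate(rewards)]
    mean = sum(p * g for p, g in zip(probs, samples))
    var = sum(p * (g - mean) ** 2 for p, g in zip(probs, samples))
    return mean, var

probs = [0.5, 0.5]      # toy policy over two actions
rewards = [10.0, 11.0]  # toy rewards, large relative to their difference

mean_no_b, var_no_b = grad_stats(probs, rewards, baseline=0.0)
mean_b, var_b = grad_stats(probs, rewards, baseline=10.5)

print(mean_no_b, var_no_b)  # mean -0.25, variance 27.5625
print(mean_b, var_b)        # same mean, variance 0.0
```

Subtracting the baseline leaves the expected gradient unchanged and removes the part of the variance that comes from the rewards sharing a large common offset.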

Actor-Critic Architecture
  Actor: generates a policy function
  Critic: generates a value for the given state
  The critic can be thought of as a moving baseline for the policy network
  Actor and critic are two separate models which are trained simultaneously

Actor-Critic Model [2]
  Actor: policy network, predicts the next word
  Critic: value network, evaluates the reward
  Train embedding network (rewards)
  Train policy network and value network
  Train actor-critic together as RL problem
  [2] Ren, Zhou, et al. "Deep reinforcement learning-based image captioning with embedding reward." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.

Policy Network
  Any model that generates a sentence as a sequence of words
  The encoder-decoder framework can work as a policy network
  The policy network is trained using standard supervised learning with cross entropy loss

Embedding model
  The embedding model is used to predict the similarity between an image and a sentence
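
As an illustration only, the similarity between an image and a sentence can be computed as the cosine similarity of their vectors once both are mapped into a shared embedding space. The slide does not specify the measure, so cosine similarity is an assumption here, and the two vectors below are invented placeholders for the outputs of trained embedding networks.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

image_embedding = [0.2, 0.9, 0.1]     # placeholder for an embedded image
sentence_embedding = [0.3, 0.8, 0.0]  # placeholder for an embedded caption

reward = cosine_similarity(image_embedding, sentence_embedding)
print(reward)
```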

Value Network
  The value network v^p evaluates the reward r from an observed state s_t
  The value network is trained using supervised learning with mean squared error (MSE) loss

Training
  Pretrain policy network with cross entropy loss
  Pretrain value network with mean squared error loss
  Train policy network and value network jointly using deep RL

Results

Self-Critical Sequence Training (SCST) [3]
  Built on the policy gradient method
  Utilizes its test-time inference to estimate a baseline
  Uses the evaluation metric (CIDEr) as the reward
  [3] Rennie, Steven J., et al. "Self-critical sequence training for image captioning." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.

Advantages
  No need to estimate a reward signal (as is the case in the actor-critic model)
  Utilizes the output of its own test-time inference algorithm to normalize the rewards it experiences
  Directly optimizes the evaluation metric (CIDEr score)

SCST Training
  Policy network
    Image encoder: ResNet-101 + attention
    Decoder: LSTM (1 layer, 512 units)
  Pretrain the model with supervised learning (XE loss)
  Train the model with reinforcement learning
    Reward: CIDEr score
    Baseline: test-time inference reward
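
The reward and baseline above combine into a per-caption loss, sketched here under stated assumptions: the reward difference is treated as a constant when differentiating, and the two reward values stand in for CIDEr scores that an external scorer would supply. The function name and all numbers are illustrative.

```python
def scst_loss(sample_logprobs, sample_reward, greedy_reward):
    """Self-critical surrogate loss for one sampled caption.

    sample_logprobs: log-probability the model assigned to each sampled word
    sample_reward:   reward of the sampled caption (e.g. its CIDEr score)
    greedy_reward:   reward of the test-time (greedy) caption, used as baseline
    """
    advantage = sample_reward - greedy_reward
    # Minimizing this raises the probability of samples that beat the
    # greedy caption and lowers it for samples that do worse.
    return -advantage * sum(sample_logprobs)

logprobs = [-0.5, -1.0, -0.25]  # made-up log-probabilities for three words
better = scst_loss(logprobs, sample_reward=0.9, greedy_reward=0.7)
worse = scst_loss(logprobs, sample_reward=0.5, greedy_reward=0.7)
print(better, worse)
```

Using the model's own greedy output as the baseline means no separate value network has to be trained.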

Results
  MS PowerPoint: A picture containing grass, animal
  MS PowerPoint: A close up of a brick building

Conclusion
  Three types of image captioning models
    Object detection + language model
    Encoder-decoder framework with supervised learning
    Pretrained encoder-decoder with reinforcement learning
      Actor-Critic architecture
      Policy gradient architecture
  Datasets: MSCOCO, Flickr30k, Flickr8k
  Evaluation metrics: BLEU, CIDEr, METEOR

References
  1. Xu, Kelvin, et al. "Show, attend and tell: Neural image caption generation with visual attention." International conference on machine learning. 2015.
  2. Ren, Zhou, et al. "Deep reinforcement learning-based image captioning with embedding reward." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
  3. Rennie, Steven J., et al. "Self-critical sequence training for image captioning." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.