Image Captioning using Reinforcement Learning
Presentation by: Samarth Gupta
- Introduction
- Summary
- Supervised Models
- Image captioning as an RL problem
- Actor-Critic Architecture
- Policy Gradient Architecture
- Conclusion
Introduction
- Caption: describing an image in words
- What are the applications?
Applications
What do we need?
- Datasets
  - MSCOCO: ~100k images, 5 captions/image
  - Flickr30k: ~30k images, 5 captions/image
  - Flickr8k: ~8k images, 5 captions/image
- Evaluation metrics
  - BLEU scores: BLEU-1, BLEU-2, BLEU-3, BLEU-4
  - METEOR
  - CIDEr
- Model
BLEU Scores
- Given a ground-truth caption and a generated caption for the corresponding image, the BLEU-n score is the percentage of matching n-grams between them
- Example generated caption: <start> I can cat <end>
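The clipped n-gram precision at the heart of BLEU-n can be sketched in a few lines of Python. This is a simplification: full BLEU also combines the four n-gram precisions geometrically and applies a brevity penalty for short candidates.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu_n(reference, candidate, n):
    """Modified (clipped) n-gram precision, the core of BLEU-n.

    Each candidate n-gram counts as a match at most as many times
    as it appears in the reference caption.
    """
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    matches = sum(min(count, ref[g]) for g, count in cand.items())
    total = sum(cand.values())
    return matches / total if total else 0.0
```

For example, with reference "a cat sits on the mat" and candidate "the cat sat on the mat", four of the six candidate unigrams match (clipping counts "the" only once), giving BLEU-1 = 4/6.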
Previous approaches
- 2011-2013: caption generation through object detection and language models; these models were very limited in their approach
- 2014: encoder-decoder framework; image captioning as a machine translation problem
- 2017: image captioning as a reinforcement learning problem
Image captioning as machine translation
- Machine translation is implemented using an encoder-decoder architecture (e.g. "Good Afternoon!" -> "Guten Tag!")
- Image captioning can be framed the same way: image -> "A band is playing music on stage"
Encoder-Decoder framework
- Encoder: a convolutional neural network (CNN)
- Decoder: a recurrent neural network (RNN)
Encoder-Decoder with visual attention [1]
- Encoder: a CNN classifier; any pretrained CNN works (e.g. VGG16, GoogLeNet)
[1] Xu et al., "Show, attend and tell," ICML 2015
Encoder-Decoder with visual attention
- Decoder: an RNN; at each timestep, the model predicts one word
- Attention allows the model to attend to specific image features
Attention Unit (Soft Attention)
- y_i: image features
- c: context (word features)
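Soft attention boils down to a softmax-weighted average of the region features y_i. The sketch below uses a plain dot product to score each region against the decoder state; this scoring function is an assumption for brevity (Xu et al. use a small learned MLP over the features and the hidden state).

```python
import numpy as np

def soft_attention(features, query):
    """Soft attention: a weighted average of image region features.

    features: (L, D) array, one D-dim vector y_i per image region
    query:    (D,) decoder hidden state; scored here by dot product
              (a simplification -- Xu et al. use a learned MLP)
    """
    scores = features @ query            # relevance of each region
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()          # softmax: weights sum to 1
    context = alpha @ features           # (D,) expected feature vector
    return context, alpha
```

The context vector is then fed to the RNN decoder at each timestep, so the model can "look at" different regions while emitting different words.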
Image Captioning as an RL problem
- Goal: given an image I, generate a sentence S = {w1, w2, ..., wt} which correctly describes the image content
- At any timestep t:
  - State: image features + words generated until t
  - Action: next word to generate
  - Reward: can be set in different ways
- We will look at two different architectures: Actor-Critic and Policy Gradient
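The MDP view above can be sketched as an episode rollout, where each action appends one word to the state. The `policy` callable here is a stand-in for the trained actor; the function name and interface are illustrative, not from either paper.

```python
def rollout_caption(policy, image_features, max_len=20):
    """Generate a caption as an RL episode.

    State  = (image features, words generated so far)
    Action = next word, chosen by the policy
    The episode ends at <end> or after max_len steps; the reward
    (e.g. a sentence-level metric) is computed on the finished caption.
    """
    words = ["<start>"]
    for _ in range(max_len):
        word = policy(image_features, tuple(words))  # action from current state
        words.append(word)
        if word == "<end>":
            break
    return words
```

A toy policy that emits a scripted word sequence produces, e.g., `["<start>", "a", "cat", "<end>"]`.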
Policy Gradient Architecture
- Learns a policy directly by maximizing expected reward
- The method suffers from high variance
- One way of reducing the variance is to increase the batch size; however, that leads to inefficient learning
- A better approach: introduce a baseline
Baseline
- Introducing a baseline reduces variance in the policy gradient algorithm
- The baseline acts like a critic to the model
- The goal is to find a good baseline for the policy network
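Why a baseline helps can be shown numerically on a toy two-action bandit (an illustration I am adding, not from the slides): subtracting a constant from the reward leaves the expected REINFORCE gradient unchanged but can shrink its variance dramatically.

```python
import random

def reinforce_samples(baseline, n=20000, seed=0):
    """REINFORCE gradient samples for the action-0 logit of a uniform
    two-action softmax policy (p = 0.5), with reward 1 for action 0
    and 0 for action 1.

    d(log pi)/d(logit_0) is (1 - p) when action 0 is taken, else -p;
    each sample is (r - baseline) * that gradient.
    """
    rng = random.Random(seed)
    p = 0.5
    samples = []
    for _ in range(n):
        took_0 = rng.random() < p
        r = 1.0 if took_0 else 0.0
        glogp = (1 - p) if took_0 else -p
        samples.append((r - baseline) * glogp)
    return samples
```

With baseline 0 the samples bounce between 0.5 and 0.0; with baseline 0.5 (the mean reward) every sample equals 0.25 exactly, so the variance collapses to zero while the mean is unchanged.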
Actor-Critic Architecture
- Actor: generates a policy function
- Critic: generates a value for the given state
- The critic can be thought of as a moving baseline for the policy network
- Actor and critic are two separate models which are trained simultaneously
Actor-Critic Model [2]
- Actor: policy network, predicts the next word
- Critic: value network, evaluates the reward
- Training stages: (1) train the embedding network (rewards); (2) train the policy network and value network; (3) train actor and critic together as an RL problem
[2] Ren et al., "Deep reinforcement learning-based image captioning with embedding reward," CVPR 2017
Policy Network
- Any model that generates a sentence as a sequence of words; the encoder-decoder framework can work as a policy network
- The policy network is trained using standard supervised learning with cross-entropy loss
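The cross-entropy pretraining objective is the average negative log-probability the network assigns to each ground-truth next word. A minimal NumPy sketch (the function name and the (T, V) logits layout are my assumptions for illustration):

```python
import numpy as np

def caption_xe_loss(logits, target_ids):
    """Cross-entropy for next-word prediction, averaged over timesteps.

    logits:     (T, V) unnormalized scores over a V-word vocabulary
    target_ids: length-T ground-truth word indices (teacher forcing)
    """
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(target_ids)), target_ids].mean()
```

A uniform prediction over V words costs log V per step; a confident correct prediction costs close to zero.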
Embedding Model
- The embedding model is then used to predict the similarity between an image and a sentence
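Once the image and sentence encoders are trained, the reward reduces to a similarity score in the shared embedding space. The sketch below uses cosine similarity and assumes the two embeddings are already computed; the specific similarity function and names here are illustrative, not taken from Ren et al.

```python
import numpy as np

def embedding_reward(image_emb, sentence_emb):
    """Similarity of an image and a sentence in a joint embedding space,
    used as a reward signal.

    Both inputs are assumed to be vectors already produced by the
    (learned) image and sentence encoders.
    """
    cos = image_emb @ sentence_emb / (
        np.linalg.norm(image_emb) * np.linalg.norm(sentence_emb))
    return float(cos)
```

A caption whose embedding points in the same direction as the image embedding scores near 1; an unrelated one scores near 0.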
Value Network
- The value network v_p estimates the expected reward r from an observed state s_t
- The value network is trained using supervised learning with mean squared error (MSE)
Training
1. Pretrain the policy network with cross-entropy loss
2. Pretrain the value network with mean squared error loss
3. Train the policy network and value network jointly using deep RL
Results
Self-Critical Sequence Training (SCST) [3]
- Built on the policy gradient method
- Utilizes its own test-time inference to estimate a baseline
- Uses the evaluation metric (CIDEr) to estimate the reward
[3] Rennie et al., "Self-critical sequence training for image captioning," CVPR 2017
Advantages
- No need to estimate a reward signal with a separate model (as in the actor-critic model)
- Utilizes the output of its own test-time inference algorithm to normalize the rewards it experiences
- Directly optimizes the evaluation metric (CIDEr score)
SCST Training
- Policy network: image encoder ResNet-101 + attention; decoder LSTM (1 layer, 512 units)
- Pretrain the model with supervised learning (XE loss)
- Then train the model with reinforcement learning:
  - Reward: CIDEr score
  - Baseline: test-time inference reward
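The SCST update weights the sampled caption's log-probability by how much its reward beats the reward of the model's own greedy (test-time) decode. A minimal per-example sketch of that objective, with illustrative names (the real training also averages over a batch and backpropagates through the log-probabilities):

```python
def scst_loss(sample_logprobs, sample_reward, greedy_reward):
    """Per-example SCST objective (to be minimized).

    advantage > 0: the sampled caption scored better (e.g. higher CIDEr)
    than the greedy decode, so its words are reinforced;
    advantage < 0 pushes probability mass away from them.
    """
    advantage = sample_reward - greedy_reward
    return -advantage * sum(sample_logprobs)
```

Because the baseline is the model's own greedy reward, no value network has to be learned, and captions are only rewarded for beating what the model would already produce at test time.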
Results
- Example generated captions: "a picture containing grass, animal"; "a close up of a brick building"
Conclusion
- Three types of image captioning models:
  - Object detection + language model
  - Encoder-decoder framework with supervised learning
  - Pretrained encoder-decoder with reinforcement learning (Actor-Critic and Policy Gradient architectures)
- Datasets: MSCOCO, Flickr30k, Flickr8k
- Evaluation metrics: BLEU, CIDEr, METEOR
References
1. Xu, Kelvin, et al. "Show, attend and tell: Neural image caption generation with visual attention." International Conference on Machine Learning. 2015.
2. Ren, Zhou, et al. "Deep reinforcement learning-based image captioning with embedding reward." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
3. Rennie, Steven J., et al. "Self-critical sequence training for image captioning." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.