Image Captioning using Reinforcement Learning. Presentation by: Samarth Gupta



2 Summary
- Introduction
- Supervised models
- Image captioning as an RL problem
- Actor-critic architecture
- Policy gradient architecture
- Conclusion

3 Introduction
- Caption: a description of an image in words
- What are the applications?

4 Applications

5 What do we need?
- Datasets
  - MSCOCO: ~100k images, 5 captions per image
  - Flickr30k: ~30k images, 5 captions per image
  - Flickr8k: ~8k images, 5 captions per image
- Evaluation
  - BLEU scores: BLEU-1, BLEU-2, BLEU-3, BLEU-4
  - METEOR
  - CIDEr
- Model

6 BLEU Scores
- Generated caption: <start> I can cat <end>
- Given a ground-truth caption and a generated caption for the same image, the BLEU-n score is based on n-gram precision: the percentage of n-grams in the generated caption that also appear in the ground-truth caption.
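The n-gram matching behind BLEU can be sketched in a few lines. This is a simplified illustration, not the full metric: it assumes whitespace tokenization and a single reference, and it omits the brevity penalty and the averaging over n-gram orders. The function names and the reference caption are hypothetical and do not come from the slides.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count every n-gram (as a tuple) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate caption against one reference."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # A candidate n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total

generated = "<start> I can cat <end>"
reference = "<start> I am a cat <end>"  # hypothetical ground truth, not from the slides

p1 = ngram_precision(generated, reference, 1)  # 4 of 5 unigrams match
p2 = ngram_precision(generated, reference, 2)  # 2 of 4 bigrams match
print(p1, p2)
```

Higher-order precisions drop quickly when word order is wrong, which is why BLEU-4 is a stricter test than BLEU-1.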

7 Previous approaches
- Caption generation through object detection and language models
  - These models were very limited in their approach
- 2014: Encoder-decoder framework
  - Image captioning as a machine translation problem
- 2017: Image captioning as a reinforcement learning problem

8 Image captioning as machine translation
- Machine translation is implemented using an encoder-decoder architecture
- Translation example: "Good Afternoon!" -> "Guten Tag!"
- Captioning example: image -> "A band is playing music on stage"

9 Encoder-decoder framework
- Encoder: a convolutional neural network (CNN)
- Decoder: a recurrent neural network (RNN)

10 Encoder-decoder with visual attention [1]
- Encoder: a CNN classifier
  - Any pretrained CNN can be used, e.g. VGG16 or GoogLeNet
[1] Xu, Kelvin, et al. "Show, attend and tell: Neural image caption generation with visual attention." International Conference on Machine Learning.

11 Encoder-decoder with visual attention
- Decoder: an RNN
  - At each timestep, the RNN predicts one word
- Attention: allows the model to attend to specific image features

12 Attention unit
- y_i: image features
- C: context (word features)
- Soft attention
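As a rough illustration of soft attention: every image feature vector y_i is scored against the context C, the scores are turned into weights with a softmax, and the output is the weighted average of the features. The sketch below assumes dot-product scoring, which requires features and context to have the same dimension; the cited paper scores with a small learned network instead. All names and numbers are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_attention(features, context):
    """Return the attention-weighted average of the features and the weights."""
    # Assumed scoring function: dot product between each feature vector and the context.
    scores = [sum(f * c for f, c in zip(y, context)) for y in features]
    weights = softmax(scores)
    dim = len(features[0])
    attended = [sum(w * y[d] for w, y in zip(weights, features)) for d in range(dim)]
    return attended, weights

features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy image regions y_1..y_3
context = [2.0, 0.0]                              # toy context vector C
attended, weights = soft_attention(features, context)
print(weights, attended)
```

Because every region receives a non-zero weight, the whole operation stays differentiable, which is what distinguishes soft attention from hard attention.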


15 Image captioning as an RL problem
- Goal: given an image I, generate a sentence S = {w_1, w_2, ..., w_t} that correctly describes the image content
- At any timestep t:
  - State: image features + words generated up to t
  - Action: next word to generate
  - Reward: can be set in different ways
- We will look at two architectures:
  - Actor-critic
  - Policy gradient
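The state/action/reward formulation can be sketched as a single episode loop. This is only an illustration: `rollout`, `toy_policy`, and `toy_reward` are hypothetical names, the policy is a fixed script rather than a trained network, and the reward is a placeholder for whatever sentence-level score is chosen.

```python
def rollout(image_features, policy, reward_fn, max_len=20, end_token="<end>"):
    """Generate one caption word by word and score the finished sentence."""
    words = []
    for _ in range(max_len):
        state = (image_features, tuple(words))  # state: image features + words so far
        action = policy(state)                   # action: the next word
        words.append(action)
        if action == end_token:
            break
    return words, reward_fn(words)               # reward: computed on the full caption

def toy_policy(state):
    """Hypothetical stand-in for a trained policy: replays a fixed caption."""
    _, words_so_far = state
    script = ["a", "band", "on", "stage", "<end>"]
    return script[len(words_so_far)]

def toy_reward(words):
    """Hypothetical reward: 1.0 if the caption mentions 'band', else 0.0."""
    return 1.0 if "band" in words else 0.0

caption, reward = rollout([0.1, 0.2], toy_policy, toy_reward)
print(caption, reward)
```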

16 Policy gradient architecture
- Learns a policy by maximizing the expected reward
- The method suffers from high variance in its gradient estimates
- One way to reduce the variance is to increase the batch size, but this makes learning inefficient
- Alternative: introduce a baseline

17 Baseline
- Introducing a baseline reduces variance in the policy gradient algorithm
- The baseline acts like a critic of the model
- The goal is to find a good baseline for the policy network
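Why a baseline is allowed can be seen from the standard REINFORCE gradient. The notation here is an assumption and not taken from the slides: p_θ is the policy, w^s a sampled caption, r its reward, and b any baseline that does not depend on the sampled caption.

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{w^s \sim p_\theta}\!\left[ r(w^s)\, \nabla_\theta \log p_\theta(w^s) \right]
  = \mathbb{E}_{w^s \sim p_\theta}\!\left[ \bigl(r(w^s) - b\bigr)\, \nabla_\theta \log p_\theta(w^s) \right],
\qquad\text{since}\qquad
\mathbb{E}_{w^s \sim p_\theta}\!\left[ b\, \nabla_\theta \log p_\theta(w^s) \right]
  = b\, \nabla_\theta \sum_{w^s} p_\theta(w^s)
  = b\, \nabla_\theta 1 = 0 .
```

Subtracting b leaves the expected gradient unchanged, so it can only affect the variance of the estimate. A baseline close to the typical reward shrinks the magnitude of the weighted terms, which is where the variance reduction comes from.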

18 Actor-critic architecture
- Actor: generates a policy function
- Critic: generates a value for the given state
- The critic can be thought of as a moving baseline for the policy network
- Actor and critic are two separate models that are trained simultaneously


19 Actor-critic model [2]
- Actor: policy network, predicts the next word
- Critic: value network, evaluates the reward
- Training steps:
  - Train the embedding network (rewards)
  - Train the policy network and value network
  - Train actor and critic together as an RL problem
[2] Ren, Zhou, et al. "Deep reinforcement learning-based image captioning with embedding reward." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

20 Policy network
- Any model that generates a sentence as a sequence of words
- The encoder-decoder framework can work as a policy network
- The policy network is trained using standard supervised learning with cross-entropy loss
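For a caption, the cross-entropy loss is the negative log-probability that the network assigns to each ground-truth word, summed over timesteps. A minimal sketch, assuming the network's per-step output distributions are already available as plain lists; the three-word vocabulary and the probabilities are made up for illustration.

```python
import math

def caption_cross_entropy(step_probs, target_ids):
    """Summed negative log-likelihood of the ground-truth words.

    step_probs[t] is the predicted distribution over the vocabulary at step t,
    target_ids[t] is the index of the ground-truth word at step t.
    """
    return -sum(math.log(probs[word]) for probs, word in zip(step_probs, target_ids))

# Hypothetical model outputs for a three-word caption over a three-word vocabulary.
step_probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.25, 0.25, 0.5],
]
target_ids = [0, 1, 2]

loss = caption_cross_entropy(step_probs, target_ids)
print(loss)
```

The loss is zero only if every ground-truth word is predicted with probability 1, so minimizing it pushes the policy toward reproducing the reference captions word by word.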

21 Embedding model
- The embedding model is used to predict the similarity between an image and a sentence; this similarity serves as the reward

22 Value network
- The value network v^p estimates the reward r from an observed state s_t
- The value network is trained using supervised learning with mean squared error (MSE) loss
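The MSE objective for the value network can be sketched as follows, assuming the regression target for every partial caption is the final reward of the completed sentence. The predicted values and the reward are hypothetical numbers, and `value_mse` is an illustrative name.

```python
def value_mse(predicted_values, final_reward):
    """Mean squared error between value predictions and the final sentence reward."""
    return sum((v - final_reward) ** 2 for v in predicted_values) / len(predicted_values)

# Hypothetical value-network outputs for three partial captions of one sentence.
predicted_values = [0.2, 0.5, 0.9]
final_reward = 0.8  # hypothetical reward of the completed caption

loss = value_mse(predicted_values, final_reward)
print(loss)
```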

23 Training
- Pretrain the policy network with cross-entropy loss
- Pretrain the value network with mean squared error loss
- Train the policy network and value network jointly using deep RL

24 Results


26 Self-Critical Sequence Training (SCST) [3]
- Built on the policy gradient method
- Uses the model's own test-time inference output to estimate a baseline
- Uses the evaluation metric (CIDEr) as the reward
[3] Rennie, Steven J., et al. "Self-critical sequence training for image captioning." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

27 Advantages
- No need to estimate a reward signal (as in the actor-critic model)
- Uses the output of its own test-time inference algorithm to normalize the rewards it experiences
- Directly optimizes the evaluation metric (CIDEr score)

28 SCST training
- Policy network
  - Image encoder: ResNet
  - Attention
  - Decoder: LSTM (1 layer, 512 units)
- Pretrain the model with supervised learning (XE loss)
- Train the model with reinforcement learning
  - Reward: CIDEr score
  - Baseline: test-time inference reward
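The self-critical update can be sketched as a per-caption loss: the sampled caption's log-probability is weighted by how much its reward exceeds the reward of the greedily decoded (test-time) caption. The sketch assumes both rewards, e.g. CIDEr scores, are computed elsewhere and treated as constants during differentiation; `scst_loss` and all numbers are hypothetical.

```python
def scst_loss(sample_log_probs, sample_reward, greedy_reward):
    """Self-critical loss for one sampled caption.

    sample_log_probs: log-probabilities of the sampled words under the policy.
    sample_reward:    reward of the sampled caption.
    greedy_reward:    reward of the greedy (test-time) caption, used as baseline.
    """
    advantage = sample_reward - greedy_reward
    # Minimizing this raises the probability of samples that beat the baseline
    # and lowers the probability of samples that fall below it.
    return -advantage * sum(sample_log_probs)

# Hypothetical numbers: the sampled caption scores higher than the greedy one.
sample_log_probs = [-0.5, -1.0, -0.25]
loss = scst_loss(sample_log_probs, sample_reward=1.2, greedy_reward=0.9)
print(loss)
```

When the sampled caption scores exactly as well as the greedy one, the loss and its gradient vanish, so only samples that differ from current test-time behaviour move the policy.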

29 Results
- MS PowerPoint: "A picture containing grass, animal"
- MS PowerPoint: "A close up of a brick building"


31 Conclusion
- Three types of image captioning models
  - Object detection + language model
  - Encoder-decoder framework with supervised learning
  - Pretrained encoder-decoder with reinforcement learning
    - Actor-critic architecture
    - Policy gradient architecture
- Datasets: MSCOCO, Flickr30k, Flickr8k
- Evaluation metrics: BLEU, CIDEr, METEOR

32 References
1. Xu, Kelvin, et al. "Show, attend and tell: Neural image caption generation with visual attention." International Conference on Machine Learning.
2. Ren, Zhou, et al. "Deep reinforcement learning-based image captioning with embedding reward." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
3. Rennie, Steven J., et al. "Self-critical sequence training for image captioning." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
