Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

SUPPLEMENTARY MATERIALS

1. Implementation Details

1.1. Bottom-Up Attention Model

Our bottom-up attention Faster R-CNN implementation uses an IoU threshold of 0.7 for region proposal suppression, and 0.3 for object class suppression. To select salient image regions, a class detection confidence threshold of 0.2 is used, allowing the number of regions per image k to vary with the complexity of the image, up to a maximum of 100. However, in initial experiments we find that simply selecting the top 36 features in each image works almost as well in both downstream tasks. Since Visual Genome [1] contains a relatively large number of annotations per image, the model is relatively intensive to train. Using 8 Nvidia M40 GPUs, we take around 5 days to complete 380K training iterations, although we suspect that faster training regimes could also be effective.

1.2. Captioning Model

In the captioning model, we set the number of hidden units M in each LSTM to 1,000, the number of hidden units H in the attention layer to 512, and the size of the input word embedding E to 1,000. In training, we use a simple learning rate schedule, beginning with a learning rate of 0.01 which is reduced to zero on a straight-line basis over 60K iterations, using a batch size of 100 and a momentum parameter of 0.9. Training using two Nvidia Titan X GPUs takes around 9 hours (including less than one hour for CIDEr optimization). During optimization and decoding we use a beam size of 5. When decoding, we also enforce the constraint that a single word cannot be predicted twice in a row. Note that in both our captioning and VQA models, image features are fixed and not finetuned.

1.3. VQA Model

In the VQA model, we use 300-dimension word embeddings, initialized with pretrained GloVe vectors [2], and we use hidden states of dimension 512. We train the VQA model using AdaDelta [4] and regularize with early stopping. Training the model takes on the order of 12-18 hours on a single Nvidia K40 GPU. Refer to Teney et al. [3] for further details of the VQA model implementation.

2. Additional Examples

In Figure 1 we qualitatively compare attention methodologies for image caption generation by illustrating attention weights for the ResNet baseline and our full Up-Down model on the same image. The baseline ResNet model hallucinates a toilet and therefore generates a poor-quality caption. In contrast, our Up-Down model correctly identifies the couch, despite the novel scene composition. Additional examples of generated captions can be found in Figures 2 and 3. Additional visual question answering examples can be found in Figures 4 and 5.

References

[1] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, M. Bernstein, and L. Fei-Fei. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. arXiv preprint arXiv:1602.07332, 2016.

[2] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.

[3] D. Teney, P. Anderson, X. He, and A. van den Hengel. Tips and tricks for visual question answering: Learnings from the 2017 challenge. In CVPR, 2018.

[4] M. D. Zeiler. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
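For concreteness, the sketch below restates three of the heuristics described in Section 1 in code: the salient-region selection rule of Section 1.1 (confidence threshold of 0.2 and at most 100 regions, or alternatively a fixed top 36), the straight-line learning rate decay of Section 1.2, and the constraint that no word may be predicted twice in a row during decoding. This is a minimal illustrative NumPy sketch, not the released implementation; the function and variable names are hypothetical, and per-class non-maximum suppression (IoU 0.3) is assumed to have already been applied by the Faster R-CNN detection head.

import numpy as np

CONF_THRESHOLD = 0.2   # class detection confidence threshold (Sec. 1.1)
MAX_REGIONS = 100      # upper bound on regions per image
FIXED_K = 36           # alternative: simply keep the top 36 regions

def select_regions(scores, keep_fixed_k=False):
    # scores: (num_regions,) array holding the maximum class confidence of
    # each region after per-class NMS. Returns the indices of the selected
    # regions, highest confidence first.
    order = np.argsort(-scores)
    if keep_fixed_k:
        # Fixed-size variant: the top 36 regions regardless of confidence.
        return order[:FIXED_K]
    # Variable-size variant: every region above the confidence threshold,
    # capped at MAX_REGIONS, so k adapts to image complexity.
    above = order[scores[order] > CONF_THRESHOLD]
    return above[:MAX_REGIONS]

def captioning_lr(iteration, base_lr=0.01, total_iters=60000):
    # Learning rate reduced to zero on a straight-line basis over 60K iterations.
    return base_lr * max(0.0, 1.0 - iteration / total_iters)

def block_repeated_word(logits, prev_word_id):
    # Decoding constraint: a single word cannot be predicted twice in a row.
    masked = logits.copy()
    masked[prev_word_id] = -np.inf
    return masked

For example, select_regions(np.array([0.9, 0.1, 0.5])) returns the indices of the two regions whose confidence exceeds 0.2, while captioning_lr(30000) evaluates to 0.005, halfway through the schedule.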

ResNet: A man sitting on a toilet in a bathroom. Up-Down: A man sitting on a couch in a bathroom. Figure 1. Qualitative differences between attention methodologies in caption generation. For each generated word, we visualize the attended image region, outlining the region with the maximum attention weight in red. The selected image is unusual because it depicts a bathroom containing a couch but no toilet. Nevertheless, our baseline ResNet model (top) hallucinates a toilet, presumably from language priors, and therefore generates a poor-quality caption. In contrast, our Up-Down model (bottom) clearly identifies the out-of-context couch, generating a correct caption while also providing more interpretable attention weights.

A group of people are playing a video game. A brown sheep standing in a field of grass. Two hot dogs on a tray with a drink. Figure 2. Examples of generated captions showing attended image regions. Attention is given to fine details, such as: (1) the man's hands holding the game controllers in the top image, and (2) the sheep's legs when generating the word "standing" in the middle image. Our approach can avoid the trade-off between coarse and fine levels of detail.

Two elephants and a baby elephant walking together. A close up of a sandwich with a stuffed animal. A dog laying in the grass with a frisbee. Figure 3. Further examples of generated captions showing attended image regions. The first example suggests an understanding of spatial relationships when generating the word "together". The middle image demonstrates the successful captioning of a compositionally novel scene. The bottom example is a failure case: the dog's pose is mistaken for laying rather than jumping, possibly due to poor salient region cropping that misses the dog's head and feet.

Question: What color is illuminated on the traffic light? Answer left: green. Answer right: red. Question: What is the man holding? Answer left: phone. Answer right: controller. Question: What color is his tie? Answer left: blue. Answer right: black. Question: What sport is shown? Answer left: frisbee. Answer right: skateboarding. Question: Is this the handlebar of a motorcycle? Answer left: yes. Answer right: no. Figure 4. Further examples of successful visual question answering results, showing attended image regions.

Question: What is the name of the realty company? Answer left: none. Answer right: none. Question: What is the bus number? Answer left: 2. Answer right: 23. Question: How many cones have reflective tape? Answer left: 2. Answer right: 1. Question: How many oranges are on pedestals? Answer left: 2. Answer right: 2. Figure 5. Examples of visual question answering (VQA) failure cases. Although our simple VQA model has limited reading and counting capabilities, the attention maps are often correctly focused.