Recurrent Neural Networks


CS 2750: Machine Learning Recurrent Neural Networks Prof. Adriana Kovashka University of Pittsburgh March 14, 2017

One Motivation: Descriptive Text for Images. "It was an arresting face, pointed of chin, square of jaw. Her eyes were pale green without a touch of hazel, starred with bristly black lashes and slightly tilted at the ends. Above them, her thick black brows slanted upward, cutting a startling oblique line in her magnolia-white skin, that skin so prized by Southern women and so carefully guarded with bonnets, veils and mittens against hot Georgia suns." Scarlett O'Hara, described in Gone with the Wind. Tamara Berg

Some pre-RNN good results: "This is a picture of one sky, one road and one sheep. The gray sky is over the gray road. The gray sheep is by the gray road." "Here we see one road, one sky and one bicycle. The road is near the blue sky, and near the colorful bicycle. The colorful bicycle is within the blue sky." "This is a picture of two dogs. The first dog is near the second furry dog." Kulkarni et al., CVPR 2011

Some pre-RNN bad results (missed detections, false detections, incorrect attributes): "Here we see one potted plant." "There are one road and one cat. The furry road is in the furry cat." "This is a photograph of two sheeps and one grass. The first black sheep is by the green grass, and by the second black sheep. The second black sheep is by the green grass." "This is a picture of one dog." "This is a picture of one tree, one road and one person. The rusty tree is under the red road. The colorful person is near the rusty tree, and under the red road." "This is a photograph of two horses and one grass. The first feathered horse is within the green grass, and by the second feathered horse. The second feathered horse is within the green grass." Kulkarni et al., CVPR 2011

Results with Recurrent Neural Networks. Karpathy and Fei-Fei, CVPR 2015

Recurrent Networks offer a lot of flexibility: vanilla neural networks Andrej Karpathy

Recurrent Networks offer a lot of flexibility: e.g. image captioning image -> sequence of words Andrej Karpathy

Recurrent Networks offer a lot of flexibility: e.g. sentiment classification sequence of words -> sentiment Andrej Karpathy

Recurrent Networks offer a lot of flexibility: e.g. machine translation seq of words -> seq of words Andrej Karpathy

Recurrent Networks offer a lot of flexibility: e.g. video classification on frame level Andrej Karpathy

Recurrent Neural Network. [diagram: input x feeding into a recurrent RNN block] Andrej Karpathy

Recurrent Neural Network. [diagram: RNN block with input x and output y] We usually want to output a prediction y at some time steps. Adapted from Andrej Karpathy

Recurrent Neural Network. We can process a sequence of vectors x by applying a recurrence formula at every time step: h_t = f_W(h_{t-1}, x_t), where h_t is the new state, h_{t-1} is the old state, x_t is the input vector at some time step, and f_W is some function with parameters W. Andrej Karpathy

Recurrent Neural Network. Notice: the same function and the same set of parameters are used at every time step. Andrej Karpathy

(Vanilla) Recurrent Neural Network. The state consists of a single hidden vector h: h_t = tanh(W_hh h_{t-1} + W_xh x_t), and the output is read off the hidden state as y_t = W_hy h_t. Andrej Karpathy
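To make this concrete, here is a minimal numpy sketch of one step of this recurrence; the sizes and random weight initialization are illustrative assumptions, not values from the slides.

import numpy as np

# Minimal vanilla RNN step: h_t = tanh(W_hh h_{t-1} + W_xh x_t), y_t = W_hy h_t.
H, D, V = 64, 32, 10                        # hidden, input, output sizes (toy values)
rng = np.random.default_rng(0)
W_hh = rng.normal(scale=0.01, size=(H, H))  # hidden-to-hidden weights
W_xh = rng.normal(scale=0.01, size=(H, D))  # input-to-hidden weights
W_hy = rng.normal(scale=0.01, size=(V, H))  # hidden-to-output weights

def rnn_step(h_prev, x):
    h = np.tanh(W_hh @ h_prev + W_xh @ x)   # new state from old state and input
    y = W_hy @ h                            # prediction read off the hidden state
    return h, y

h = np.zeros(H)
for x in rng.normal(size=(5, D)):           # a toy sequence of 5 input vectors
    h, y = rnn_step(h, x)                   # same function and weights at every step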

Example: character-level language model. Vocabulary: [h, e, l, o]. Example training sequence: "hello". Andrej Karpathy

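As a concrete sketch of how this training example is set up (the encoding details below are assumptions, following the slide's vocabulary): each character is fed in as a one-hot vector, and the target at each step is the next character.

import numpy as np

# Encode the character-level example: vocabulary [h, e, l, o], sequence "hello".
vocab = ['h', 'e', 'l', 'o']
char_to_ix = {c: i for i, c in enumerate(vocab)}

def one_hot(c):
    v = np.zeros(len(vocab))
    v[char_to_ix[c]] = 1.0
    return v

seq = "hello"
inputs = [one_hot(c) for c in seq[:-1]]      # h, e, l, l fed into the RNN
targets = [char_to_ix[c] for c in seq[1:]]   # e, l, l, o to be predicted next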

Recurrent Neural Networks. RNNs tie the weights at each time step and condition the neural network on all previous words. [diagram: unrolled RNN with inputs x_{t-1}, x_t, x_{t+1}, hidden states h_{t-1}, h_t, h_{t+1} connected by shared weights W, and outputs y_{t-1}, y_t, y_{t+1}] Richard Socher

Recurrent Neural Network language model. Given a list of word vectors x_1, ..., x_T, at a single time step: h_t = sigmoid(W_hh h_{t-1} + W_hx x_t) and y_hat_t = softmax(W_S h_t). We use the same set of W weights at all time steps; h_0 is some initialization vector for the hidden layer at time step 0, and x_t is the column vector for the word at time step t. Adapted from Richard Socher

Recurrent Neural Network language model. y_hat_t is a probability distribution over the vocabulary. We use the same cross-entropy loss function as in classification, but predicting words instead of classes, averaged over a dataset of size (number of words) T. Adapted from Richard Socher
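A toy numpy sketch of that per-word cross-entropy loss; the scores and target indices here are random stand-ins for actual model outputs.

import numpy as np

# Cross-entropy over T word predictions: average of -log y_hat_t[true word].
def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

rng = np.random.default_rng(0)
V, T = 4, 5                       # vocabulary size, number of words
scores = rng.normal(size=(T, V))  # stand-ins for the pre-softmax outputs W_S h_t
targets = rng.integers(0, V, T)   # index of the true next word at each step

loss = -np.mean([np.log(softmax(s)[t]) for s, t in zip(scores, targets)])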

Training RNNs is hard. We multiply by the same matrix at each time step during forward prop. [diagram: unrolled RNN with inputs x_{t-1}, x_t, x_{t+1}, hidden states h_{t-1}, h_t, h_{t+1} sharing weights W, outputs y_{t-1}, y_t, y_{t+1}] Ideally, inputs from many time steps ago can modify output y. Richard Socher

The vanishing gradient problem. Similar but simpler RNN formulation: h_t = W f(h_{t-1}) + W_hx x_t, y_hat_t = W_S f(h_t). The total error is the sum of the errors at each time step t: dE/dW = sum_{t=1}^{T} dE_t/dW. Chain rule application: dE_t/dW = sum_{k=1}^{t} (dE_t/dy_t)(dy_t/dh_t)(dh_t/dh_k)(dh_k/dW), where dh_t/dh_k is itself a product of t-k Jacobian matrices involving W. Richard Socher

Why is the vanishing gradient a problem? The error at a time step ideally can tell a previous time step, many steps away, to change during backprop. [diagram: unrolled RNN with inputs x, hidden states h, shared weights W, outputs y] Richard Socher

The vanishing gradient problem for language models. In the case of language modeling or question answering, words from time steps far away are not taken into consideration when training to predict the next word. Example: "Jane walked into the room. John walked in too. It was late in the day. Jane said hi to ___." Richard Socher
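A small numerical illustration of why the gradient vanishes: backprop through t-k steps multiplies t-k Jacobians involving the same matrix W, so if W's largest singular value is below 1 the far-away contribution decays exponentially. The setup below is a toy assumption that ignores the nonlinearity's derivative.

import numpy as np

# Toy demonstration: repeatedly multiplying by the same matrix W shrinks
# the backpropagated signal when W's spectral norm is below 1.
rng = np.random.default_rng(0)
W = rng.normal(size=(50, 50))
W *= 0.5 / np.linalg.norm(W, 2)      # force largest singular value to 0.5

grad = np.eye(50)
for steps_back in range(1, 21):
    grad = grad @ W.T                # one more time step of backprop
    if steps_back % 5 == 0:
        print(steps_back, np.linalg.norm(grad))
# The norm drops by orders of magnitude: far-away steps barely affect the update.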

Gated Recurrent Units (GRUs). More complex hidden unit computation in the recurrence! Introduced by Cho et al. 2014. Main ideas: keep around memories to capture long-distance dependencies, and allow error messages to flow at different strengths depending on the inputs. Richard Socher

Gated Recurrent Units (GRUs). A standard RNN computes the hidden layer at the next time step directly: h_t = f(W_hh h_{t-1} + W_hx x_t). A GRU first computes an update gate (another layer) based on the current input word vector and the hidden state: z_t = sigmoid(W_z x_t + U_z h_{t-1}). It computes a reset gate similarly but with different weights: r_t = sigmoid(W_r x_t + U_r h_{t-1}). Richard Socher

Gated Recurrent Units (GRUs). New memory content: h~_t = tanh(W x_t + r_t * U h_{t-1}), where * is element-wise multiplication. If the reset gate unit is ~0, this ignores previous memory and only stores the new word's information. Final memory at time step t combines current and previous time steps: h_t = z_t * h_{t-1} + (1 - z_t) * h~_t. Richard Socher

Gated Recurrent Units (GRUs). [diagram: GRU units at time steps t-1 and t, showing the input x, reset gate r, update gate z, new memory h~, and final memory h] Richard Socher

Gated Recurrent Units (GRUs). If reset is close to 0, ignore the previous hidden state: this allows the model to drop information that is irrelevant in the future. The update gate z controls how much of the past state should matter now; if z is close to 1, we can copy information in that unit through many time steps, so there is less vanishing gradient! Units with short-term dependencies often have very active reset gates. Richard Socher

Gated Recurrent Units (GRUs). Units with long-term dependencies have active update gates z. Illustration: Richard Socher, figure from wildml.com
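Putting the GRU equations together, a minimal numpy sketch of one step; the weight names, sizes, and initialization are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
H, D = 64, 32                      # hidden and input sizes (toy values)

def pair():                        # one input->hidden and one hidden->hidden matrix
    return (rng.normal(scale=0.01, size=(H, D)),
            rng.normal(scale=0.01, size=(H, H)))

Wz, Uz = pair()                    # update-gate weights
Wr, Ur = pair()                    # reset-gate weights
W, U = pair()                      # new-memory weights

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(h_prev, x):
    z = sigmoid(Wz @ x + Uz @ h_prev)             # update gate
    r = sigmoid(Wr @ x + Ur @ h_prev)             # reset gate
    h_tilde = np.tanh(W @ x + U @ (r * h_prev))   # new memory; r ~ 0 drops the past
    return z * h_prev + (1.0 - z) * h_tilde       # z ~ 1 copies the unit through time

h = gru_step(np.zeros(H), rng.normal(size=D))     # one step on a toy input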

Long Short-Term Memories (LSTMs). Proposed by Hochreiter and Schmidhuber in 1997. We can make the units even more complex, allowing each time step to modify: an input gate i_t = sigmoid(W_i x_t + U_i h_{t-1}) (does the current cell matter?), a forget gate f_t = sigmoid(W_f x_t + U_f h_{t-1}) (if ~0, forget the past), and an output gate o_t = sigmoid(W_o x_t + U_o h_{t-1}) (how much of the cell is exposed). New memory cell: c~_t = tanh(W_c x_t + U_c h_{t-1}). Final memory cell: c_t = f_t * c_{t-1} + i_t * c~_t. Final hidden state: h_t = o_t * tanh(c_t). Adapted from Richard Socher

Long Short-Term Memories (LSTMs). Intuition: memory cells can keep information intact, unless the input makes them forget it or overwrite it with new input. The cell can decide to output this information or just store it. Richard Socher, figure from wildml.com
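In the same spirit as the GRU sketch, a minimal numpy sketch of one LSTM step following the gate descriptions above; names, sizes, and initialization are again illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
H, D = 64, 32

def pair():
    return (rng.normal(scale=0.01, size=(H, D)),
            rng.normal(scale=0.01, size=(H, H)))

Wi, Ui = pair()   # input-gate weights
Wf, Uf = pair()   # forget-gate weights
Wo, Uo = pair()   # output-gate weights
Wc, Uc = pair()   # new-memory-cell weights

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(h_prev, c_prev, x):
    i = sigmoid(Wi @ x + Ui @ h_prev)         # input gate: does the current input matter?
    f = sigmoid(Wf @ x + Uf @ h_prev)         # forget gate: ~0 forgets the past cell
    o = sigmoid(Wo @ x + Uo @ h_prev)         # output gate: how much cell is exposed
    c_tilde = np.tanh(Wc @ x + Uc @ h_prev)   # new memory cell candidate
    c = f * c_prev + i * c_tilde              # final memory cell
    h = o * np.tanh(c)                        # final hidden state
    return h, c

h, c = lstm_step(np.zeros(H), np.zeros(H), rng.normal(size=D))  # one toy step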

LSTMs are currently very hip! They are the en vogue default model for most sequence labeling tasks. Very powerful, especially when stacked and made even deeper (each hidden layer is already computed by a deep internal network). Most useful if you have lots and lots of data. Richard Socher

Deep LSTMs compared to traditional systems. From Sequence to Sequence Learning by Sutskever et al. 2014.
Table 1: the performance of the LSTM on the WMT'14 English-to-French test set, test BLEU score (ntst14):
Bahdanau et al. [2]: 28.45
Baseline System [29]: 33.30
Single forward LSTM, beam size 12: 26.17
Single reversed LSTM, beam size 12: 30.59
Ensemble of 5 reversed LSTMs, beam size 1: 33.00
Ensemble of 2 reversed LSTMs, beam size 12: 33.27
Ensemble of 5 reversed LSTMs, beam size 2: 34.50
Ensemble of 5 reversed LSTMs, beam size 12: 34.81
Note that an ensemble of 5 LSTMs with a beam of size 2 is cheaper than a single LSTM with a beam of size 12.
Results with rescoring, test BLEU score (ntst14):
Baseline System [29]: 33.30
Cho et al. [5]: 34.54
Best WMT'14 result [9]: 37.0
Rescoring the baseline 1000-best with a single forward LSTM: 35.61
Rescoring the baseline 1000-best with a single reversed LSTM: 35.85
Rescoring the baseline 1000-best with an ensemble of 5 reversed LSTMs: 36.5
Oracle rescoring of the baseline 1000-best lists: 45
Richard Socher

Generating poetry with RNNs. At first the samples are poor; train more, train more, train more, and they improve. More info: http://karpathy.github.io/2015/05/21/rnn-effectiveness/ Fei-Fei Li & Andrej Karpathy & Justin Johnson, Lecture 10, 8 Feb 2016

Generating textbooks with RNNs. Trained on the LaTeX source of an open-source textbook on algebraic geometry. Fei-Fei Li & Andrej Karpathy & Justin Johnson, Lecture 10, 8 Feb 2016

Generating code with RNNs. Generated C code. Fei-Fei Li & Andrej Karpathy & Justin Johnson, Lecture 10, 8 Feb 2016

Image Captioning. Explain Images with Multimodal Recurrent Neural Networks, Mao et al.; Deep Visual-Semantic Alignments for Generating Image Descriptions, Karpathy and Fei-Fei; Show and Tell: A Neural Image Caption Generator, Vinyals et al.; Long-term Recurrent Convolutional Networks for Visual Recognition and Description, Donahue et al.; Learning a Recurrent Visual Representation for Image Caption Generation, Chen and Zitnick. Andrej Karpathy

Image Captioning. [diagram: a Convolutional Neural Network encodes the image; a Recurrent Neural Network generates the caption] Andrej Karpathy

Image Captioning. [diagram: a test image is fed through the CNN, the final classification layer is crossed out, and the <START> token is given as the first RNN input x0] Andrej Karpathy

Image Captioning. [diagram: the CNN's feature im of the test image enters the RNN through weights W_ih] before: h = tanh(W_xh * x + W_hh * h); now: h = tanh(W_xh * x + W_hh * h + W_ih * im). Andrej Karpathy

Image Captioning. At each step the output y_t is a distribution over the vocabulary; we sample from it (sample! "straw"), feed the sampled word back in as the next input, sample again ("hat"), and so on. Sampling the <END> token finishes the caption: "straw hat". Adapted from Andrej Karpathy
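The generation procedure the slides step through can be sketched as a short sampling loop. Everything below (the 4-word vocabulary, weights, embeddings, and image feature) is a toy stand-in for a trained model, not the slides' actual parameters.

import numpy as np

# Sample a caption word by word: start from <START>, draw a word from each
# softmax over y, feed it back in, and stop when <END> is drawn.
rng = np.random.default_rng(0)
H, D, I = 64, 32, 4096
vocab = ['<START>', 'straw', 'hat', '<END>']
W_xh = rng.normal(scale=0.01, size=(H, D))
W_hh = rng.normal(scale=0.01, size=(H, H))
W_ih = rng.normal(scale=0.01, size=(H, I))             # image-conditioning weights
W_hy = rng.normal(scale=0.01, size=(len(vocab), H))
embed = rng.normal(scale=0.01, size=(len(vocab), D))   # word vectors x_t
im = rng.normal(size=I)                                # CNN feature of the test image

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

h, word, caption = np.zeros(H), '<START>', []
while word != '<END>' and len(caption) < 20:
    h = np.tanh(W_xh @ embed[vocab.index(word)] + W_hh @ h + W_ih @ im)
    word = rng.choice(vocab, p=softmax(W_hy @ h))      # sample!
    if word != '<END>':
        caption.append(word)
# With trained weights, caption would read ['straw', 'hat'].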

Image Captioning. Andrej Karpathy

Video Captioning. Generate descriptions for events depicted in video clips. Example: "A monkey pulls a dog's tail and is chased by the dog." Venugopalan et al., Translating Videos to Natural Language using Deep Recurrent Neural Networks, NAACL-HLT 2015

Video Captioning. [diagram of related approaches: English Sentence -> RNN encoder -> RNN decoder -> French Sentence [Sutskever et al. NIPS'14]; Encode -> RNN decoder -> Sentence [Donahue et al. CVPR'15; Vinyals et al. CVPR'15]; Encode -> RNN decoder -> Sentence [Venugopalan et al. NAACL'15] (this work)] Key insight: generate a feature representation of the video and decode it to a sentence. Venugopalan et al., Translating Videos to Natural Language using Deep Recurrent Neural Networks, NAACL-HLT 2015

Video Captioning. From the input video, sample frames at 1/10 and forward propagate each through the CNN; the output is the fc7 features (activations before the classification layer), a 4096-dimensional feature vector per frame. Venugopalan et al., Translating Videos to Natural Language using Deep Recurrent Neural Networks, NAACL-HLT 2015

Video Captioning. The per-frame CNN features are averaged (mean across all frames) into a single video representation. arXiv: http://arxiv.org/abs/1505.00487 Venugopalan et al., Translating Videos to Natural Language using Deep Recurrent Neural Networks, NAACL-HLT 2015
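A sketch of that mean-pooled video descriptor; the per-frame features here are random stand-ins for actual fc7 activations, and the frame count is an assumption.

import numpy as np

# One fc7 vector (4096-dim) per sampled frame; the video representation is
# simply their mean across frames.
rng = np.random.default_rng(0)
num_frames = 30                                       # e.g. every tenth frame
fc7_per_frame = rng.normal(size=(num_frames, 4096))   # stand-in CNN outputs

video_feature = fc7_per_frame.mean(axis=0)            # single 4096-dim descriptor
assert video_feature.shape == (4096,)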

Video Captioning. [diagram: Input Video -> Convolutional Net -> Recurrent Net (stacked LSTMs) -> Output: "A boy is playing golf <EOS>"] Venugopalan et al., Translating Videos to Natural Language using Deep Recurrent Neural Networks, NAACL-HLT 2015

Video Captioning. Annotated video data is scarce. Key insight: use supervised pre-training on data-rich auxiliary tasks and transfer. Venugopalan et al., Translating Videos to Natural Language using Deep Recurrent Neural Networks, NAACL-HLT 2015

Video Captioning. CNN pre-training: the CNN (fc7 gives a 4096-dimensional feature vector) is the Caffe Reference Net, a variation of AlexNet [Krizhevsky et al. NIPS'12], trained on 1.2M+ images from ImageNet ILSVRC-12 [Russakovsky et al.]; this initializes the weights of our network. Venugopalan et al., Translating Videos to Natural Language using Deep Recurrent Neural Networks, NAACL-HLT 2015

Video Captioning. Image-caption training: [diagram: CNN image features feed a stacked LSTM decoder that generates "A man is scaling a cliff"] Venugopalan et al., Translating Videos to Natural Language using Deep Recurrent Neural Networks, NAACL-HLT 2015

Video Captioning. Fine-tuning: 1. video dataset, 2. mean-pooled feature, 3. lower learning rate. [diagram: CNN features feed a stacked LSTM decoder that generates "A boy is playing golf <EOS>"] Venugopalan et al., Translating Videos to Natural Language using Deep Recurrent Neural Networks, NAACL-HLT 2015

Video Captioning. Reference descriptions for two example clips. Clip 1: A man appears to be plowing a rice field with a plow being pulled by two oxen. / A man is plowing a mud field. / Domesticated livestock are helping a man plow. / A man leads a team of oxen down a muddy path. / A man is plowing with some oxen. / A man is tilling his land with an ox pulled plow. / Bulls are pulling an object. / Two oxen are plowing a field. / The farmer is tilling the soil. / A man in ploughing the field. Clip 2: A man is walking on a rope. / A man is walking across a rope. / A man is balancing on a rope. / A man is balancing on a rope at the beach. / A man walks on a tightrope at the beach. / A man is balancing on a volleyball net. / A man is walking on a rope held by poles. / A man balanced on a wire. / The man is balancing on the wire. / A man is walking on a rope. / A man is standing in the sea shore. Venugopalan et al., Translating Videos to Natural Language using Deep Recurrent Neural Networks, NAACL-HLT 2015

Video Captioning. MT metrics (BLEU, METEOR) compare the system-generated sentences against (all) ground-truth references.
Model: BLEU / METEOR
Best prior work [Thomason et al. COLING'14]: 13.68 / 23.90
Only Images (pre-training only, no fine-tuning): 12.66 / 20.96
Only Video (no pre-training): 31.19 / 26.87
Images+Video: 33.29 / 29.07
Venugopalan et al., Translating Videos to Natural Language using Deep Recurrent Neural Networks, NAACL-HLT 2015

Video Captioning. Example outputs:
FGM: A person is dancing with the person on the stage. YT: A group of men are riding the forest. I+V: A group of people are dancing. GT: Many men and women are dancing in the street.
FGM: A person is cutting a potato in the kitchen. YT: A man is slicing a tomato. I+V: A man is slicing a carrot. GT: A man is slicing carrots.
FGM: A person is walking with a person in the forest. YT: A monkey is walking. I+V: A bear is eating a tree. GT: Two bear cubs are digging into dirt and plant matter at the base of a tree.
FGM: A person is riding a horse on the stage. YT: A group of playing are playing in the ball. I+V: A basketball player is playing. GT: Dwayne Wade does a fancy layup in an all-star game.
Venugopalan et al., Translating Videos to Natural Language using Deep Recurrent Neural Networks, NAACL-HLT 2015

Video Captioning. [diagram of related approaches: English Sentence -> RNN encoder -> RNN decoder -> French Sentence [Sutskever et al. NIPS'14]; Encode -> RNN decoder -> Sentence [Donahue et al. CVPR'15; Vinyals et al. CVPR'15]; Encode -> RNN decoder -> Sentence [Venugopalan et al. NAACL'15]; RNN encoder -> Encode -> RNN decoder -> Sentence [Venugopalan et al. ICCV'15] (this work)] Venugopalan et al., Sequence to Sequence - Video to Text, ICCV 2015

Video Captioning. S2VT overview: [diagram: per-frame CNN features are read one at a time by a stacked LSTM (encoding stage); now decode it to a sentence! The same LSTM then generates "A man is talking ..." (decoding stage)] Venugopalan et al., Sequence to Sequence - Video to Text, ICCV 2015

Visual Question Answering (VQA) Task: Given an image and a natural language open-ended question, generate a natural language answer. Agrawal et al., VQA: Visual Question Answering, IJCV 2016, ICCV 2015

Visual Question Answering (VQA). [diagram: the image passes through convolution layers + non-linearity, pooling layers, and a fully-connected MLP to give a 4096-dim image embedding; the question "How many horses are in this image?" gives a 1024-dim question embedding; a neural network combines the two, and a softmax predicts over the top K answers] Agrawal et al., VQA: Visual Question Answering, IJCV 2016, ICCV 2015
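A sketch of this two-channel pipeline: mapping both embeddings to a common space and fusing them with an element-wise product follows the VQA paper's description, but all weights and sizes below are illustrative stand-ins, not the released model.

import numpy as np

# Fuse a 4096-dim image embedding with a 1024-dim question embedding and
# classify over the top K answers.
rng = np.random.default_rng(0)
K = 1000                                            # top K most frequent answers
img = rng.normal(size=4096)                         # CNN image embedding (stand-in)
q = rng.normal(size=1024)                           # LSTM question embedding (stand-in)

W_img = rng.normal(scale=0.01, size=(1024, 4096))   # project image to 1024-dim
W_out = rng.normal(scale=0.01, size=(K, 1024))      # answer classifier

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

fused = np.tanh(W_img @ img) * np.tanh(q)           # point-wise multiplication fusion
answer_probs = softmax(W_out @ fused)               # distribution over top K answers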

Visual Question Answering (VQA) Wu et al., Ask Me Anything: Free-Form Visual Question Answering Based on Knowledge From External Sources, CVPR 2016

Visual Question Answering (VQA) Agrawal et al., VQA: Visual Question Answering, IJCV 2016, ICCV 2015