Learning to Disambiguate by Asking Discriminative Questions Supplementary Material

Similar documents
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

arxiv: v1 [cs.cv] 9 Aug 2017

Example An example is provided at the start of each exercise as a guide for students.

Keyword-driven Image Captioning via Context-dependent Bilateral LSTM

Introduction. Help your students learn how to learn!

Discriminability objective for training descriptive image captions

The University of Tokyo, NVAIL Partner Yoshitaka Ushiku

The story of Snow White is a beautiful fairytale with princesses, princes and

Towards Open Set Deep Networks: Supplemental

Information Following Ankle Injury

arxiv: v2 [cs.cv] 19 Dec 2017

Object Detection. Honghui Shi IBM Research Columbia

Quantifying Radiographic Knee Osteoarthritis Severity using Deep Convolutional Neural Networks

Automatic Detection of Knee Joints and Quantification of Knee Osteoarthritis Severity using Convolutional Neural Networks

ENGLISH COMPETITION. LEVEL 3-4 (Γ - Δ Δημοτικού)

Latent Space Based Text Generation Using Attention Models

Anterior Total Hip Replacement

Supplementary material: Backtracking ScSPM Image Classifier for Weakly Supervised Top-down Saliency

A-Z Picture Cards - Medium a (short) apple, ant, alligator, astronaut, ambulance

Object Recognition: Conceptual Issues. Slides adapted from Fei-Fei Li, Rob Fergus, Antonio Torralba, and K. Grauman

Fear Ladder (Example)

Imaging Collaboration: From Pen and Ink to Artificial Intelligence June 2, 2018

Categories. Represent/store visual objects in terms of categories. What are categories? Why do we need categories?

S.No. Chapters Page No. 1. Plants Animals Air Activities on Air Water Our Body...

KANGOUROU 2009-ENGLISH LEVEL 3-4

What Do You Think? For You To Do GOALS. The men s high jump record is over 8 feet.

The Leeds Teaching Hospitals NHS Trust Total Hip Replacement A guide to your Rehabilitation

Fitness is a big part of this new program, and will be an important part of your training season.

arxiv: v2 [cs.cv] 2 Aug 2018

Osteoporosis Exercise:

SPECTRUM MEDICAL Total Hip Replacement Surgery/Posterior Approach I. A. II. Positioning - - III. Swelling: IV. Infection/Phlebitis:

DEPARTMENT OF HEALTH AND HUMAN SERVICES

Grade 3. English & Social Department. Ministry of Education. Kingdom of Saudi Arabia. House of Languages International Schools

Posterior Total Hip Replacement with Precautions. Therapy Resources

PRE-OPERATIVE VISIT FOR KNEE REPLACEMENT with Dr. LaReau

FIH Venue Specifications Commonwealth Games. (Updated Oct 2015)

Start ASL The Fun Way to Learn American Sign Language for free!

Supplementary Creativity: Generating Diverse Questions using Variational Autoencoders

Going to the dentist

DRAFT. Activities of Daily Living After Lung Surgery Self-care for safety and healing. Clamshell Precautions

Recurrent Neural Networks

Adjusting Your Activity After Heart Surgery

Hip Replacement Recovery Guide

Active Living with Arthritis Podcast #11 Doing What You Love: Gardening, Golf, and Tennis when Living with Knee Osteoarthritis

Total Knee Arthroplasty

Scientific Method Practice 7 th Grade Life Science

Sequential Predictions Recurrent Neural Networks

Caring for Your New Hip

Image Captioning using Reinforcement Learning. Presentation by: Samarth Gupta

Recommending Outfits from Personal Closet

DEEP LEARNING BASED VISION-TO-LANGUAGE APPLICATIONS: CAPTIONING OF PHOTO STREAMS, VIDEOS, AND ONLINE POSTS

Iliotibial (IT) Band Syndrome

Translating Videos to Natural Language Using Deep Recurrent Neural Networks

Attentional Masking for Pre-trained Deep Networks

OHIOHEALTH ORTHOPEDIC SURGEONS Dr. Nathaniel Long Sarah A. Domenicucci, PA-C POST OPERATIVE INSTRUCTIONS

Hip Resurfacing with Precautions. Therapy Resources. xpe045 (4/2015) AHC

Posted July 31, 2015 uwreadilab.com

PART I: KEY PROCEDURES FOR THE TRICARE DEMO

Posterior Total Hip Replacement

Preventing Falls: Steps YOU Can Take

PHYSICAL EDUCATION POLICY

Physical Therapy for Distal Femoral Replacement

Stereotyping and Bias in the Flickr30k Dataset

Child Caretaker-DWC/Q0201 (Domestic Workers)

Daily Activities for Better Bones

When Saliency Meets Sentiment: Understanding How Image Content Invokes Emotion and Sentiment

Action Recognition. Computer Vision Jia-Bin Huang, Virginia Tech. Many slides from D. Hoiem

USING OBSERVATIONS AND INFERENCES IN SCIENCE

Thales Foundation Cyprus P.O. Box 28959, CY2084 Acropolis, Nicosia, Cyprus. Level 3 4

Deep Networks and Beyond. Alan Yuille Bloomberg Distinguished Professor Depts. Cognitive Science and Computer Science Johns Hopkins University

Learners Take Action to Reduce the Risk of Asthma

arxiv: v3 [cs.ne] 6 Jun 2016

WALL PUSH UPS TABLE PUSH UPS

PE Long Term Planning Overview

Modified Parkinson Activity Scale

VILNIAUS MIESTO 3 KLASIŲ MOKINIŲ ANGLŲ KALBOS OLIMPIADOS UŽDUOTYS 2016 M. Student s name, school

arxiv: v1 [cs.si] 27 Oct 2016

Eat at least five fruits & vegetables a day.

Wellness Student POLICIES & PROCEDURES manual

Scientific Method 7th grade science

Yankee Doodle Went to Town...

Image Surveillance Assistant

VILNIAUS MIESTO 3 KLASIŲ MOKINIŲ ANGLŲ KALBOS OLIMPIADOS UŽDUOČIŲ ATSAKYMAI 2016 M. Student s name, school

Asthma Triggers. It is very important for you to find out what your child s asthma triggers are and learn ways to avoid them.

Sports Risk Assessment. By Michael Gold

Do Now: Write a detailed account of what happened in the cartoon.

Halo Brace Care. Applying the Halo

1. An apostrophe has been left out of this sentence. Where should the missing apostrophe go?

arxiv: v2 [cs.cv] 10 Apr 2017

Dolphin Gym Low High

St. Joseph Rayong School Course Outline 1st Semester P5 Curriculum - Physical Education ( )

Physical activity can occur at a range of intensities, such as the following: (2) LIGHT

Building Evaluation Scales for NLP using Item Response Theory

Exercises Following Foot/Toe Injury

Comparison of Two Approaches for Direct Food Calorie Estimation

Coventry Children s and Young People s Occupational Therapy Service INTRODUCTION

Active-Q A Physical Activity Questionnaire for Adults

FIH Venue Specifications

9 in 10 Australian young people don t move enough. Make your move Sit less Be active for life! years

Transcription:

Learning to Disambiguate by Asking Discriminative Questions Supplementary Material Yining Li 1 Chen Huang 2 Xiaoou Tang 1 Chen Change Loy 1 1 Department of Information Engineering, The Chinese University of Hong Kong 2 Robotics Institute, Carnegie Mellon University {ly015, xtang, ccloy}@ie.cuhk.edu.hk, chenh2@andrew.cmu.edu 1. VDQG Dataset Object Category. We selected 87 object categories from the annotation of Visual Genome datasets [3] to construct the VDQG dataset. Figure 1 shows the list of object category and the number of samples belonging to each object category. Question Type. Figure 2 visualizes the most frequent n- gram (n 4) sequences of questions in the VDQG dataset as well as the Visual Genome dataset. We observe that the question type distributions of these two datasets are similar to each other. A significant difference is that there is almost no why type question in VDQG dataset, which is reasonable because this type of question is hardly used to distinguish similar objects. Examples. We show some examples of the VDQG dataset in Fig. 3. 2. Implementation Details Attributes. We built an attribute set by extracting the commonly used n-gram expressions (n 3) from region descriptions available in the Visual Genome dataset. And the part-of-speech constraint has been taken into consideration to select for discriminative expressions. Table 1 shows the part-of-speech constraints we use and the most frequent attributes. Model Optimization. We implement our model using Caffe [1] and optimize the model parameters using Adam [2] algorithm. For the attribute recognition model, we use a batchsize of 50 and train for 100 epochs. For the attribute-conditioned LSTM model, we use a batchsize of 50 and train for 30 epochs, where gradient clipping is applied for stability. The parameters of CNN network has been pre-trained on ImageNet [4], and fixed during finetuning for efficiency. 1 We merge all numbers that are greater than one into one label more than one. Table 1: Part-of-speech constraint on n-gram expressions to extract attributes Part of speech Top attributes <NN> man, woman, table, shirt, person <JJ> white, black, blue, brown, green <VB> wear, stand, hold, sit, look <CD> one, more than one 1 <JJ,NN> white plate, teddy bear, young man <VB,NN> play tennis, hit ball, eat grass <IN,NN> on table, in front, on top, in background <NN,NN> tennis player, stop sign, tennis court <VB,NN,NN> play video game <IN,NN,NN> on tennis court, on train track 3. Qualitative Results Figure 4 shows some examples of discriminative question generated using our approach. The experimental result shows that our model is capable of capturing distinguishing attributes and generate discriminative questions based on the attributes. Some failure cases are shown at the last two rows in Figure 4. We observe that the failure cases are caused by different reasons. Specifically, the first two failure examples are caused by incorrect attribute recognition; the following two failure examples are caused by pairing attributes of different type of objects (e.g. pairing green of the grass and white of the people s clothes); and the last two failure examples are caused by incorrect language generation. References [1] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arxiv preprint arxiv:1408.5093, 2014. 1 [2] D. Kingma and J. Ba. Adam: A method for stochastic optimization. ICLR, 2015. 1 1

[3] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 123(1):32 73, 2017. 1 [4] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. IJCV, 115(3):211 252, 2015. 1

man shirt woman land person table floor trouser building girl male_child plate chair dog child field car bench player cat street bag horse bowl train bus umbrella rug lady motorcycle counter pillow sofa airplane board guy bicycle bed room truck food seat bear soil bird animal clock laptop curtain elephant screen box cabinet court shelf light beach boat bottle container towel blanket giraffe bathroom desk kitchen tower cow plant necktie cake tray vehicle zebra skier sink lamp banana keyboard ocean toilet computer television book basket vegetable distance 1400 1200 1000 800 600 400 200 0 Figure 1: Object category distribution of VDQG dataset. (a) VDQG (b) Visual Genome Figure 2: N -gram sequence distribution of VDQG dataset (a) and Visual Genome dataset (b).

+ What kind of bird is this? + What is the bird standing on? + What color is the bird s feet? - How many birds are there? - What is the bird doing? - When was the photo taken? + What is the man doing? + What is the man wearing on his head? + What is the man holding? - What gender is the person in the shirt? - What is the man wearing? - Who is wearing the shirt? + What is in the dog s mouth? + Where is the dog? + What is the dog doing? - What kind of animal is in the picture? - How many dogs are there? - When was this picture taken? + How many people are there? + What color is the woman s top? + What color is the table? - What is on the table? - Where is the woman? - What is on the woman s head? + What color is the man s pants? + What is the man wearing on the head? + What is in the background? - Who is on the skateboard? - What is the person doing? - What is the man wearing? + What color is the woman s umbrella? + What is the woman wearing? + What color is the woman s hair? - What is the woman holding? - Who is holding an umbrella? - What is in front of the woman? + Where are the bananas? + What is beside the bananas? + How many bananas are there? - What colors are the bananas? - What fruit is shown? - What is green? + What color is the building? + When is the picture taken? + What time of day is it? - Where is the clock? - Hat is on the top of the tower? - Where is the weather like? + What color is the skier s jacket? + How many people are in the picture? + How many skiers are there? - What is on the ground? - What is the skier doing? - What is the person wearing? + What is on the bench? + What color is the bench? + What color is the wall? - What is the bench made of? - Where is the bench? - How many people are there sitting on the bench? + What color is the bus? + How many decks does the bus have? + What is on the side of the bus? - Where is the bus? - How many buses are there in the photo? - What is behind the bus? + What is on the ground? + How many people are there? + How is the weather? - What is the color of the ground? - What is in the distance? - Where was the photo taken? Figure 3: Example of image pairs and the associated positive and negative question annotations in the proposed VDQG dataset. Positive and negative questions are written in blue and red, respectively.

Red Blue What color is the bus? Baseball Tennis What sport is being played? Ride horse Sit What is the woman doing? Daytime Night When was the picture taken? Cake Cup What is on the table? On bed On ground Where is the cat? Uniform Shirt What is the man wearing? More_than_one One How many beds are there? Play Talk What is the man doing? Leather Plastic What is the bag made of? White Black What color is the man s shirt? Backpack Jacket What is the person wearing? White Green What color is the grass? Man Woman Who is in the picture Brown Green What color is the grass? Brown Black What color is the cow? Wood Metal What is the bench made of? Stand Sit What is the man doing? Figure 4: Discriminative questions generated by our approach. Under each ambiguous pair, the first line shows the distinguishing attribute pair selected by the attribute model, and the second line shows the questions generated by the attibuteconditioned LSTM. The last two rows show some failure cases.