SPICE: Semantic Propositional Image Caption Evaluation


SPICE: Semantic Propositional Image Caption Evaluation
Presented to the COCO Consortium, Sept 2016
Peter Anderson [1], Basura Fernando [1], Mark Johnson [2] and Stephen Gould [1]
[1] Australian National University, [2] Macquarie University
ARC Centre of Excellence for Robotic Vision

Image captioning
[Example images] Sources: MS COCO Captions dataset; http://aipoly.com/

Automatic caption evaluation
Benchmark datasets require evaluation metrics that are fast to compute, accurate, and inexpensive. Good metrics can also be used to help construct better models.
The evaluation task: given a candidate caption ci and a set of m reference captions Ri = {ri1, ..., rim}, compute a score Si that represents the similarity between ci and Ri.
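To make the task concrete, here is a minimal sketch of this metric interface, with a toy unigram-overlap F1 standing in for a real metric (the function name and the baseline are illustrative assumptions, not part of any published tooling):

```python
# Sketch of the evaluation-task interface described above. The function name
# and the toy unigram-overlap F1 baseline are illustrative, not part of any
# published metric.
from typing import List

def score(candidate: str, references: List[str]) -> float:
    """Return a similarity score S_i between a candidate and its references."""
    cand = set(candidate.lower().split())
    best = 0.0
    for ref in references:
        r = set(ref.lower().split())
        overlap = len(cand & r)
        if overlap == 0:
            continue
        p, rec = overlap / len(cand), overlap / len(r)
        best = max(best, 2 * p * rec / (p + rec))  # F1 against this reference
    return best

print(score("a girl on a tennis court",
            ["a young girl standing on a tennis court"]))  # ~0.83
```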

Existing state of the art
Nearest-neighbour captions are often ranked higher than human captions.
Source: Lin, Cui; Large-scale Scene UNderstanding Workshop, CVPR 2015

Existing metrics
- BLEU: precision with a brevity penalty, geometric mean over n-grams
- METEOR: align fragments, then take the harmonic mean of precision and recall
- ROUGE-L: F-score based on the Longest Common Subsequence
- CIDEr: cosine similarity with TF-IDF weighting
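As an illustration of the first of these, here is a simplified, sentence-level BLEU-style sketch: clipped n-gram precision, geometric mean over n = 1..4, times a brevity penalty. The official BLEU is corpus-level and typically smoothed; this is a sketch of the idea, not a reference implementation.

```python
# Simplified sentence-level BLEU: clipped n-gram precision, geometric mean
# over n = 1..4, multiplied by a brevity penalty.
import math
from collections import Counter
from typing import List

def ngrams(tokens: List[str], n: int) -> Counter:
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate: str, references: List[str], max_n: int = 4) -> float:
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngrams(cand, n)
        if not cand_ngrams:
            return 0.0
        # Clip each candidate n-gram count by its max count in any reference.
        max_ref = Counter()
        for ref in refs:
            for gram, cnt in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], cnt)
        clipped = sum(min(cnt, max_ref[gram]) for gram, cnt in cand_ngrams.items())
        if clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / sum(cand_ngrams.values())))
    # Brevity penalty: penalize candidates shorter than the closest reference.
    ref_len = min((len(r) for r in refs), key=lambda L: (abs(L - len(cand)), L))
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / max_n)

print(bleu("a giraffe standing on top of a green field",
           ["a young girl standing on top of a tennis court"]))
```

The high n-gram overlap in this example ("standing on top of a ...") is exactly the false-positive failure mode discussed on the next slide.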

Motivation
False positive (high n-gram similarity): "A young girl standing on top of a tennis court." vs. "A giraffe standing on top of a green field."
False negative (low n-gram similarity): "A shiny metal pot filled with some diced veggies." vs. "The pan on the stove has chopped vegetables in it."
n-gram overlap is neither necessary nor sufficient for two sentences to mean the same thing. SPICE primarily addresses false positives.
Source: MS COCO Captions dataset

Is this a good caption?
"A young girl standing on top of a basketball court"

Is this a good caption?
"A young girl standing on top of a basketball court"
Semantic propositions:
1. There is a girl
2. The girl is young
3. The girl is standing
4. There is a court
5. The court is used for basketball
6. The girl is on the court

Key idea: scene graphs [1]
1. Input -> 2. Parse [2] -> 3. Scene graph [3] -> 4. Tuples: (girl), (court), (girl, young), (girl, standing), (court, tennis), (girl, on-top-of, court)
[1] Johnson et al., Image Retrieval Using Scene Graphs, CVPR 2015
[2] Klein & Manning, Accurate Unlexicalized Parsing, ACL 2003
[3] Schuster et al., Generating Semantically Precise Scene Graphs from Textual Descriptions for Improved Image Retrieval, EMNLP 2015
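The tuple representation is just sets of 1-, 2- and 3-element tuples over objects, attributes and relations. A hand-written illustration of the tuples above (real SPICE derives them automatically via the parsers cited):

```python
# Hand-written tuples for "A young girl standing on top of a tennis court".
# Real SPICE extracts these automatically from the parsed scene graph.
caption_tuples = {
    ("girl",),                       # object
    ("court",),                      # object
    ("girl", "young"),               # attribute
    ("girl", "standing"),            # attribute
    ("court", "tennis"),             # attribute
    ("girl", "on-top-of", "court"),  # relation
}
```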

SPICE calculation
SPICE is calculated as an F-score over tuples, with merging of synonymous nodes, and WordNet synsets used for tuple matching and merging. Given a candidate caption c, a set of reference captions S, and the mapping T from captions to tuples:
P(c, S) = |T(c) ⊗ T(S)| / |T(c)|
R(c, S) = |T(c) ⊗ T(S)| / |T(S)|
SPICE(c, S) = F1(c, S) = 2 · P(c, S) · R(c, S) / (P(c, S) + R(c, S))
where ⊗ is the binary matching operator that returns the matching tuples.
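A minimal sketch of this calculation under the simplifying assumption of exact tuple matching (the real metric matches tuples via WordNet synsets and merges synonymous nodes before scoring):

```python
# SPICE F-score over tuple sets, simplified to exact matching.
from typing import Set

def spice(cand_tuples: Set[tuple], ref_tuples: Set[tuple]) -> float:
    matched = cand_tuples & ref_tuples   # |T(c) (*) T(S)| with exact matching
    if not matched:
        return 0.0
    p = len(matched) / len(cand_tuples)  # precision P(c, S)
    r = len(matched) / len(ref_tuples)   # recall R(c, S)
    return 2 * p * r / (p + r)           # F1

cand = {("girl",), ("court",), ("girl", "young"), ("court", "basketball")}
refs = {("girl",), ("court",), ("girl", "young"), ("girl", "standing"),
        ("court", "tennis"), ("girl", "on-top-of", "court")}
print(spice(cand, refs))  # 3 matched: P = 3/4, R = 3/6, F1 = 0.6
```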

Example: good caption
[Figure-only slides]

Example: weak caption
[Figure-only slides]

Evaluation on MS COCO (C40)
Pearson ρ correlation between evaluation metrics and human judgments for the 15 competition entries plus human captions in the 2015 COCO Captioning Challenge, using 40 reference captions. [Results table]
Source: Our thanks to the COCO Consortium for performing this evaluation using MS COCO Captions C40.

Evaluation on MS COCO (C40)
SPICE picks the same top 5 as human evaluators. Absolute scores are lower with 40 reference captions than with 5 reference captions.
Source: Our thanks to the COCO Consortium for performing this evaluation using MS COCO Captions C40.

Gameability
SPICE measures how well caption models recover objects, attributes and relations; fluency is neglected (as with n-gram metrics). If fluency is a concern, include a fluency metric such as surprisal.*
* Hale, J.: A Probabilistic Earley Parser as a Psycholinguistic Model, 2001; Levy, R.: Expectation-based Syntactic Comprehension, 2008
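For intuition, a toy sketch of a surprisal-based fluency score, with a hard-coded bigram table standing in for a real language model (all names and probabilities here are made up for illustration):

```python
# Surprisal of a token is -log2 P(token | context); fluent text has low
# average surprisal. A tiny bigram table stands in for a real LM here.
import math
from typing import Dict, List, Tuple

def mean_surprisal(tokens: List[str],
                   bigram_prob: Dict[Tuple[str, str], float],
                   floor: float = 1e-6) -> float:
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        p = bigram_prob.get((prev, word), floor)  # unseen bigrams get a floor prob
        total += -math.log2(p)
    return total / max(len(tokens) - 1, 1)

probs = {("a", "girl"): 0.1, ("girl", "standing"): 0.2, ("standing", "on"): 0.3}
print(mean_surprisal(["a", "girl", "standing", "on"], probs))  # ~2.46 bits
```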

SPICE for error analysis
Breakdown of the SPICE F-score over objects, attributes and relations [figure]

Can caption models count?
Breakdown of the attribute F-score over color, number and size attributes [figure]

Summary
- SPICE measures how well caption models recover objects, attributes and relations
- SPICE captures human judgment better than CIDEr, BLEU, METEOR and ROUGE
- Tuples can be categorized for detailed error analysis
- There is scope for further improvement as better semantic parsers are developed
Next steps: using SPICE to build better caption models!

Thank you
Link: SPICE Project Page (http://panderson.me/spice)
Acknowledgement: We are grateful to the COCO Consortium for re-evaluating the 2015 Captioning Challenge entries using SPICE.
ARC Centre of Excellence for Robotic Vision