Automatic Context-Aware Image Captioning

Technical Disclosure Commons
Defensive Publications Series
May 23, 2017

Automatic Context-Aware Image Captioning
Sandro Feuz
Sebastian Millius

Recommended Citation: Feuz, Sandro and Millius, Sebastian, "Automatic Context-Aware Image Captioning", Technical Disclosure Commons (May 23, 2017), http://www.tdcommons.org/dpubs_series/532

This work is licensed under a Creative Commons Attribution 4.0 License.

AUTOMATIC CONTEXT-AWARE IMAGE CAPTIONING

ABSTRACT

A system and method are disclosed that automatically trigger system-generated captions for images. The system includes a machine learning model and an interface that proposes captions for an image that a user intends to share. The machine learning model is built by considering factors such as metadata extracted from the image, semantic information extracted from the raw image, the context of the current chat, captions from previously shared images, personal context, and user modifications to the image. Based on the results of the machine learning model, the system generates and proposes several captions to the user. The user then selects a suitable caption and shares the image; the user may also modify the caption or enter one manually. System-generated image captioning enables easy sharing of captioned images that are personalized and artistic.

BACKGROUND

Images constitute a large fraction of chat messages and are often sent along with a message (caption) describing them. For instance, a user sends an image from his honeymoon with his wife to a friend, with the message: "Alice and I enjoying our honeymoon in the Bahamas." Currently, captions for images are inserted manually in all available chat applications.

DESCRIPTION

A system and method are disclosed that trigger system-generated captions for images. The system includes a machine learning model and an interface that proposes captions for an image that a user intends to share. The method uses a data-driven, trainable machine learning approach that determines the chat context for an image and generates captions when images are sent in chat applications, as shown in FIG. 1.

When the user begins to share an image, the machine learning model acquires inputs for generating suitable captions. The inputs may include image metadata, image content, the chat context, and the personalized chat history. The system then generates captions based on the available information and presents them to the user, as illustrated in FIG. 2. The user selects the desired caption and sends the captioned image.

FIG. 1: Method to automatically generate context-aware captions for images
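As an illustration, the inputs acquired in this step could be bundled and passed to the model roughly as follows. This is a minimal sketch; the names (CaptionInputs, propose_captions, the model's generate method) are hypothetical and not part of the disclosure.

```python
# Minimal sketch of the caption-proposal flow described above.
# All names and field choices are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class CaptionInputs:
    image_pixels: bytes                      # raw image content
    image_metadata: Dict[str, str]           # e.g. timestamp, GPS coordinates, photographer
    chat_context: List[str]                  # recent messages in the current chat
    previous_captions: List[str]             # captions the user attached to earlier images
    personal_context: Dict[str, str] = field(default_factory=dict)  # e.g. calendar entries
    user_edits: List[str] = field(default_factory=list)             # e.g. applied filters, added text

def propose_captions(inputs: CaptionInputs, model, top_k: int = 3) -> List[str]:
    """Run the trained model on the assembled inputs and return the
    top-k caption candidates shown to the user as quick-select options."""
    candidates = model.generate(inputs)      # hypothetical model API
    return candidates[:top_k]
```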

FIG. 2: System-generated context-aware image captioning

The machine learning model may operate in a way similar to existing image-description models. Training the model to create personalized captions involves taking various features as inputs and generating a textual representation of the captions, or a language-independent intermediate representation from which a textual caption is created.
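A minimal sketch of such a caption generator, assuming a convolutional image encoder feeding an RNN decoder that is additionally conditioned on context features (dimensions and names are illustrative, not the authors' implementation):

```python
# Sketch of a context-conditioned caption generator in PyTorch.
import torch
import torch.nn as nn

class ContextAwareCaptioner(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512,
                 image_feat_dim=2048, context_feat_dim=128):
        super().__init__()
        # Project CNN image features and auxiliary context features
        # (metadata, chat context, personal context) into the decoder space.
        self.image_proj = nn.Linear(image_feat_dim, hidden_dim)
        self.context_proj = nn.Linear(context_feat_dim, hidden_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feats, context_feats, caption_tokens):
        # Initialise the decoder state from combined image and context features.
        h0 = torch.tanh(self.image_proj(image_feats) + self.context_proj(context_feats))
        h0 = h0.unsqueeze(0)                   # (1, batch, hidden_dim)
        c0 = torch.zeros_like(h0)
        embedded = self.embed(caption_tokens)  # (batch, seq_len, embed_dim)
        hidden, _ = self.decoder(embedded, (h0, c0))
        return self.out(hidden)                # per-token vocabulary logits
```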

Various features may serve as inputs for the machine learning model. Attributes deduced from the image and its metadata may be used, for instance the raw image (pixel values) and the image metadata: when the image was taken (timestamp), where it was taken (coordinates), and who took it. A geocoded address may be used (e.g., "Pelican Bay Hotel - Seahorse Road, Bahamas"), as can a resolved timestamp (e.g., "Super Bowl" or "First day of School"). Semantic information may be extracted from the raw image through image content analysis (e.g., "Eiffel Tower"). The context of the current chat may be used, which involves analyzing the history of the current chat. Image captions from previously shared images may also be a factor, as may personal context such as calendar entries ("Honeymoon") and social contacts with their profile pictures. User modifications, such as applied filters or added text, may be used to make style-specific proposals.

The system then outputs a textual representation of possible captions. Alternatively, the system could first create a language-independent intermediate representation and derive the textual representation from it before output.

The second part of the system automatically proposes the generated captions before the user sends an image. Whenever the user selects an image in a chat application and attempts to enter a caption (or accompanying message) for the image, the system extracts the necessary features from the chat history and the selected image and runs them through the machine learning model to propose captions. The system generates multiple caption proposals, which may be presented to the user as illustrated in FIG. 1, and the user may select one of them without typing it manually. Optionally, the user may modify the caption before sending.

The machine learning model may be trained in a supervised fashion by providing input and desired output data. Training data may be accumulated by analyzing chat logs and extracting existing captions that were entered manually by users. The model may learn natural captions from the way human users annotate the images they share, and also from the chat history.
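A hedged sketch of how such supervised training pairs might be mined from chat logs, where each manually captioned image becomes one (features, target caption) example; the message structure and field names are assumptions for illustration:

```python
# Mine (features, caption) training pairs from a chat log, as described above.
from typing import Iterable, List, Tuple

def extract_training_pairs(chat_log: Iterable[dict],
                           context_window: int = 5) -> List[Tuple[dict, str]]:
    """Each message is assumed to be a dict like
    {"text": str, "image": bytes | None, "metadata": dict}."""
    messages = list(chat_log)
    pairs = []
    for i, msg in enumerate(messages):
        if msg.get("image") is not None and msg.get("text"):
            features = {
                "image": msg["image"],
                "metadata": msg.get("metadata", {}),
                # Preceding messages provide the chat-context feature.
                "chat_context": [m["text"]
                                 for m in messages[max(0, i - context_window):i]
                                 if m.get("text")],
            }
            pairs.append((features, msg["text"]))  # the user-typed caption is the label
    return pairs
```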

The machine learning model may also use a deep recurrent neural network (RNN) with additional convolutional layers to extract semantic information from raw images. In addition, the system may include a rule-based approach for default image captioning, which combines labels and content extracted from an image with manually curated caption templates (a minimal sketch of such a fallback appears at the end of this description). For example, the system would detect that the image was taken in Paris, and the user may have a list of templates such as "A day in Paris" or "Having fun in Paris." Furthermore, the technology may be extended to provide captions for video content or animated GIFs.

For example, as shown in FIG. 2, the user selects an image with a view over a city from a restaurant. The extracted features may include the semantic location deduced from the exact location, in this case "Casino XY, Las Vegas," and the user's calendar entries overlapping with the time the photo was taken ("Stay at Casino XY," "Stag night"). Together with the raw image and the features extracted by the image content analysis system (keywords like "Gambling," "Poker," "Casino"), these are fed into the machine learning model. The model produces multiple example sentences; the top three candidates are used as quick-select options, and the user is still left with the option to enter a caption manually. Because the system takes the user's modifications to the image into account and adapts the proposed captions accordingly, more artistic proposals may be generated. In another example, the use of chat history and of the context in which an image is shared makes the proposals more natural and self-referential to the chat history: "A fun night with Peter and Sarah at Olive Garden" may be suggested when the friends Peter and Sarah were mentioned in the chat.

The disclosed system and method for image captioning enable simple and easy sharing of images. The scope, flexibility, and predictive performance of a machine-learned system go far beyond simple primitive template proposals, giving the user a wide choice of proposals and increasing usability.
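As an illustration of the rule-based fallback mentioned above, a minimal sketch that fills manually curated templates with labels extracted from the image; the label-extraction step and the template list are assumptions for illustration:

```python
# Rule-based fallback: combine extracted image labels with curated caption templates.
CAPTION_TEMPLATES = [
    "A day in {place}",
    "Having fun in {place}",
    "Good times with {people}",
]

def template_captions(labels: dict) -> list:
    """Fill curated templates with labels extracted from the image,
    e.g. {"place": "Paris", "people": "Peter and Sarah"}."""
    proposals = []
    for template in CAPTION_TEMPLATES:
        try:
            proposals.append(template.format(**labels))
        except KeyError:
            # Skip templates whose slots are not covered by the extracted labels.
            continue
    return proposals

# Example: labels produced by image content analysis
print(template_captions({"place": "Paris"}))
# -> ['A day in Paris', 'Having fun in Paris']
```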