Learning-based Composite Metrics for Improved Caption Evaluation

Naeha Sharif, Lyndon White, Mohammed Bennamoun and Syed Afaq Ali Shah
The University of Western Australia, 35 Stirling Highway, Crawley, Western Australia

Abstract

The evaluation of image caption quality is a challenging task, which requires the assessment of two main aspects of a caption: adequacy and fluency. These quality aspects can be judged using a combination of several linguistic features. However, most of the current image captioning metrics focus only on specific linguistic facets, such as the lexical or semantic ones, and fail to reach a satisfactory level of correlation with human judgements at the sentence level. We propose a learning-based framework that incorporates the scores of a set of lexical and semantic metrics as features, to capture the adequacy and fluency of captions at different linguistic levels. Our experimental results demonstrate that composite metrics draw upon the strengths of stand-alone measures to yield improved correlation and accuracy.

1 Introduction

Automatic image captioning requires an understanding of the visual aspects of images to generate human-like descriptions (Bernardi et al., 2016). The evaluation of the generated captions is crucial for the development and fine-grained analysis of image captioning systems (Vedantam et al., 2015). Automatic evaluation metrics aim to provide efficient, cost-effective and objective assessments of caption quality. Since these automatic measures serve as an alternative to manual evaluation, the major concern is that they should correlate well with human assessments. In other words, automatic metrics are expected to mimic the human judgement process by taking into account the various aspects that humans consider when they assess captions.

The evaluation of image captions can be characterized as having two major aspects: adequacy and fluency. Adequacy is how well the caption reflects the source image, and fluency is how well the caption conforms to the norms and conventions of human language (Toury, 2012). In manual evaluation, both adequacy and fluency shape the human perception of overall caption quality. Most automatic evaluation metrics try to capture these aspects of quality based on the idea that the closer a candidate description is to a professional human caption, the better its quality (Papineni et al., 2002). The output in such a case is a score (the higher, the better) reflecting the similarity.

The majority of the commonly used metrics for image captioning, such as BLEU (Papineni et al., 2002) and METEOR (Banerjee and Lavie, 2005), are based on lexical similarity. Lexical (n-gram based) measures work by rewarding n-gram overlaps between the candidate and the reference captions: they measure adequacy by counting n-gram matches and assess fluency by implicitly using the reference n-grams as a language model (Mutton et al., 2007). However, a high number of n-gram matches is not always indicative of high caption quality, nor is a low number of n-gram matches always reflective of low caption quality (Giménez and Màrquez, 2010). A recently proposed semantic metric, SPICE (Anderson et al., 2016), overcomes this deficiency of lexical measures by measuring the semantic similarity of candidate and reference captions using scene graphs. However, the major drawback of SPICE is that it ignores the fluency of the output caption.
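To make the n-gram overlap idea concrete, the sketch below computes clipped n-gram precision, the quantity that BLEU-style lexical metrics build on. The function names are ours, and the sketch omits BLEU's brevity penalty and the geometric mean over n-gram orders.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_ngram_precision(candidate, references, n=1):
    """Fraction of candidate n-grams that also appear in any reference,
    with counts clipped to the maximum reference count (as in BLEU)."""
    cand_counts = ngrams(candidate.lower().split(), n)
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref.lower().split(), n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    overlap = sum(min(c, max_ref_counts[g]) for g, c in cand_counts.items())
    total = sum(cand_counts.values())
    return overlap / total if total else 0.0

# Example: a candidate caption scored against two references.
refs = ["a man rides a brown horse", "a person riding a horse in a field"]
print(clipped_ngram_precision("a man riding a horse", refs, n=1))
```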
Integrating the assessment scores of different measures is an intuitive and reasonable way to improve current image captioning evaluation methods. Through this methodology, each metric plays the role of a judge, assessing the quality of captions in terms of lexical, grammatical or semantic accuracy. For this research, we use the scores conferred by a set of measures that are commonly used for captioning and combine them through a learning-based framework. In this work:

1. We evaluate various combinations of a chosen set of metrics and show that the proposed composite metrics correlate better with human judgements.

2. We analyse the accuracy of composite metrics in terms of differentiating between pairs of captions in reference to the ground truth captions.

2 Literature Review

The success of any captioning system depends on how well it transforms visual information into natural language. The significance of reliable automatic evaluation metrics is therefore undeniable for the fine-grained analysis and advancement of image captioning systems. While image captioning has drawn inspiration from the Machine Translation (MT) domain for encoder-decoder based captioning networks (Vinyals et al., 2015), (Xu et al., 2015), (Yao et al., 2016), (You et al., 2016), (Lu et al., 2017), it has also benefited from automatic metrics that were initially proposed to evaluate machine translations and text summaries, such as BLEU (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2014) and ROUGE (Lin, 2004). In the past few years, two metrics, CIDEr (Vedantam et al., 2015) and SPICE (Anderson et al., 2016), were developed specifically for image captioning. Compared to the previously used metrics, these two show a better correlation with human judgements. The authors in (Liu et al., 2016) proposed a linear combination of SPICE and CIDEr called SPIDEr and showed that optimizing image captioning models for SPIDEr's score can lead to better quality captions. However, SPIDEr was not evaluated for its correlation with human judgements. Recently, (Kusner et al., 2015) proposed the use of a document similarity metric, Word Mover's Distance (WMD), which uses the word2vec (Mikolov et al., 2013) embedding space to determine the distance between two texts.

The metrics used for caption evaluation can be broadly categorized as lexical and semantic measures. Lexical metrics reward n-gram matches between candidate captions and human-generated reference texts (Giménez and Màrquez, 2010), and can be further categorized as unigram-based and n-gram-based measures. Unigram-based methods, such as BLEU-1 (Papineni et al., 2002), assess only the lexical correctness of the candidate. However, in the case of METEOR or WMD, where some form of synonym matching or stemming is also involved, unigram overlaps help to evaluate both the lexical and, to some degree, the semantic aptness of the output caption. N-gram based metrics such as ROUGE and CIDEr primarily assess the lexical correctness of the caption, but also measure some amount of syntactic accuracy by capturing the word order. The lexical measures have received criticism based on the argument that n-gram overlap is neither an adequate nor a necessary indicator of caption quality (Anderson et al., 2016). To overcome this limitation, semantic metrics such as SPICE capture the sentence meaning to evaluate the candidate captions. Their performance, however, is highly dependent on successful semantic parsing. Purely syntactic measures, which capture grammatical correctness, exist and have been used in MT (Mutton et al., 2007), but not in the captioning domain. While the fluency (well-formedness) of a candidate caption can be attributed to its syntactic and lexical correctness (Fomicheva et al., 2016), adequacy (informativeness) depends on the lexical and semantic correctness (Rios et al., 2011).
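As an illustration of the WMD measure mentioned above, the snippet below computes a word-mover distance between two captions using gensim's wmdistance over pre-trained word2vec embeddings. The embedding file path is a placeholder, and this is only one possible way to obtain such a score, not necessarily the setup used in the paper.

```python
from gensim.models import KeyedVectors

# Pre-trained word2vec embeddings (path is a placeholder; any word2vec-format
# file can be used). WMD also needs gensim's optional optimal-transport dependency.
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

candidate = "a dog runs across the grass".lower().split()
reference = "a brown dog is running on a lawn".lower().split()

# Word Mover's Distance: minimum cumulative cost of moving the candidate's
# word embeddings onto the reference's word embeddings (lower = more similar).
distance = vectors.wmdistance(candidate, reference)
print(distance)
```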
We hypothesize that by combining scores from different metrics, which have different strengths in measuring adequacy and fluency, a composite metric of overall higher quality is created (Sec. 5). Machine learning offers a systematic approach to integrating the scores of stand-alone metrics. In MT evaluation, various successful learning paradigms have been proposed (Bojar et al., 2016), (Bojar et al., 2017), and the existing learning-based metrics can be categorized as binary functions, which classify the candidate translation as good or bad (Kulesza and Shieber, 2004), (Guzmán et al., 2015), or continuous functions, which score the quality of a translation on an absolute scale (Song and Cohn, 2011), (Albrecht and Hwa, 2008). Our research is conceptually similar to the work in (Kulesza and Shieber, 2004), which induces a human-likeness criterion. However, our approach differs in terms of the learning algorithm as well as the features used. Moreover, the focus of this work is to assess various combinations of metrics (that capture the caption quality at different linguistic levels) in terms of their correlation with human judgements at the sentence level.

3 Proposed Approach

In our approach, we use the scores conferred by a set of existing metrics as input to a multi-layer feed-forward neural network. We adopt a training criterion based on a simple question: is the caption machine- or human-generated? Our trained classifier sets a boundary between good- and bad-quality captions, classifying them as human- or machine-produced. Furthermore, we obtain a continuous output score by using the class probability, which can be considered a measure of how believable it is that the candidate caption is human-generated. Framing our learning problem as a classification task allows us to create binary training data, using human-generated captions and machine-generated captions as positive and negative training examples respectively.

Our proposed framework, shown in Figure 1, first extracts a set of numeric features using the candidate C and the reference sentences S. The extracted feature vector is then fed as input to our multi-layer neural network. Each entry of the feature vector corresponds to the score generated by one of four measures: METEOR, CIDEr, WMD and SPICE (the WMD distance score is converted to a similarity via a negative exponential, so that it can be used as a feature). We chose these measures because they show a relatively better correlation with human judgements than the other measures commonly used for captioning (Kilickaya et al., 2016). Our composite metrics are named Eval_MS, Eval_CS, Eval_MCS, Eval_WCS, Eval_MWS and Eval_MWCS; the subscript letters in each name correspond to the first letter of each individual metric. For example, Eval_MS corresponds to the combination of METEOR and SPICE.

[Figure 1: Overall framework of the proposed composite metrics.]

Figure 2 shows the linguistic aspects captured by the stand-alone and the composite metrics (the stand-alone metrics marked with an asterisk in Figure 2 are the ones used as features in this work). SPICE is based on sentence meanings, thus it evaluates the semantics. CIDEr covers the syntactic and lexical aspects, whereas METEOR and WMD assess the lexical and semantic components. The learning-based metrics mostly fall in the region formed by the overlap of all three major linguistic facets, leading to a better evaluation. We train our metrics to maximise the classification accuracy on the training dataset. Since we are primarily interested in maximizing the correlation with human judgements, we perform early stopping based on Kendall's τ (rank correlation) with the validation set.

[Figure 2: Various automatic measures (stand-alone and combined) and their respective linguistic levels. See Sec. 3 for more details.]
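The following is a minimal sketch of the scorer described above, under our own assumptions about layer sizes, optimiser and hyper-parameters (which the paper does not fix here): the four metric scores form the feature vector, a small feed-forward network is trained to separate human-written from machine-generated captions, the sigmoid class probability serves as the composite score, and early stopping monitors Kendall's τ on the validation set.

```python
import numpy as np
import torch
import torch.nn as nn
from scipy.stats import kendalltau

class CompositeMetric(nn.Module):
    """Feed-forward scorer over stand-alone metric scores.
    One input row = [METEOR, CIDEr, WMD-similarity, SPICE] for a caption."""
    def __init__(self, n_features=4, hidden=32):  # hidden size is an assumption
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, x):
        # Sigmoid of the logit = probability that the caption is human-written,
        # used directly as the composite quality score.
        return torch.sigmoid(self.net(x)).squeeze(-1)

def wmd_similarity(wmd_distance):
    """Negative-exponential conversion of a WMD distance into a similarity feature."""
    return float(np.exp(-wmd_distance))

def train(model, X_train, y_train, X_val, human_val, epochs=500, patience=20):
    """X_*: (N, 4) arrays of metric scores; y_train: 1 = human, 0 = machine.
    human_val: human quality judgements for the validation captions.
    Early stopping keeps the weights with the best Kendall's tau on X_val."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    bce = nn.BCELoss()
    X_train = torch.tensor(X_train, dtype=torch.float32)
    y_train = torch.tensor(y_train, dtype=torch.float32)
    X_val = torch.tensor(X_val, dtype=torch.float32)
    best_tau, best_state, stale = -1.0, None, 0
    for _ in range(epochs):
        model.train()
        opt.zero_grad()
        loss = bce(model(X_train), y_train)  # full-batch training, for simplicity
        loss.backward()
        opt.step()
        model.eval()
        with torch.no_grad():
            scores = model(X_val).numpy()
        tau, _ = kendalltau(scores, human_val)
        if tau > best_tau:
            best_tau = tau
            best_state = {k: v.detach().clone() for k, v in model.state_dict().items()}
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break
    if best_state is not None:
        model.load_state_dict(best_state)
    return best_tau
```

At test time, the composite score for a caption is simply the model's output probability on its four-dimensional vector of stand-alone metric scores.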

4 Experimental Setup

To train our composite metrics, we source data from the Flickr30k dataset (Plummer et al., 2015) and three image captioning models, namely: (1) show and tell (Vinyals et al., 2015), (2) show, attend and tell (soft attention) (Xu et al., 2015), and (3) adaptive attention (Lu et al., 2017). The Flickr30k dataset consists of 31,783 photos acquired from Flickr, each paired with 5 captions obtained through Amazon Mechanical Turk (AMT) (Turk, 2012). For each image in Flickr30k, we randomly select three of the human-generated captions as positive training examples, and three machine-generated captions (one from each image captioning model) as negative training examples. We combined the Microsoft COCO (Chen et al., 2015) training and validation sets (containing 123,287 images in total, each paired with 5 or more captions) to train the image captioning models using their official codes. These image captioning models achieved state-of-the-art performance when they were published.

In order to obtain reference captions for each training example, we again use the human-written descriptions of Flickr30k. For each negative training example (machine-generated caption), we randomly choose 4 out of the 5 human-written captions originally associated with each image. For each positive training example (human-generated caption), we use the 5 human-written captions associated with each image, selecting one of these as the human candidate caption (positive example) and the remaining 4 as references. In Figure 3, a possible pairing scenario is shown for further clarification.

[Figure 3: An example of a candidate and reference pairing used in the training set: (a) image, (b) human- and machine-generated captions for the image, and (c) candidate and reference pairings for the image.]

For our validation set, we source data from Flickr8k (Young et al., 2014). This dataset contains 5,822 captions assessed by three expert judges on a scale of 1 (the caption is unrelated to the image) to 4 (the caption describes the image without any errors). From our training set, we remove the captions of images which overlap with the captions in the validation and test sets (discussed in Sec. 5), leaving us with a total of 132,984 non-overlapping captions for the training of the composite metrics.

5 Results and Discussion

5.1 Correlation

The most desirable characteristic of an automatic evaluation metric is a strong correlation with human scores (Zhang and Vogel, 2010). A stronger correlation with human judgements indicates that a metric captures the information that humans use to assess a candidate caption. To evaluate the sentence-level correlation of our composite metrics with human judgements, we source data from a dataset collected by the authors in (Aditya et al., 2017). We use 6,993 manually evaluated human- and machine-generated captions from this set, which were scored by AMT workers for correctness on a scale of 1 (low relevance to the image) to 5 (high relevance to the image). Each caption in the dataset is accompanied by a single judgement.

In Table 1, we report the Kendall's τ correlation coefficient for the proposed composite metrics and other commonly used caption evaluation metrics.

Table 1: Kendall's correlation coefficient of automatic evaluation metrics and proposed composite metrics against human quality judgements. All correlations are significant at p < 0.001.

Individual Metrics | Kendall τ | Composite Metrics | Kendall τ
BLEU               |           | Eval_MS           |
ROUGE-L            |           | Eval_CS           |
METEOR             |           | Eval_MCS          |
CIDEr              |           | Eval_WCS          |
SPICE              |           | Eval_MWS          |
WMD                |           | Eval_MWCS         |

It can be observed from Table 1 that the composite metrics outperform the stand-alone metrics in terms of sentence-level correlation. The combination of METEOR and SPICE (Eval_MS) and of METEOR, CIDEr and SPICE (Eval_MCS) showed the most promising results. The success of these composite metrics can be attributed to the individual strengths of METEOR, CIDEr and SPICE.
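Given per-caption scores from any metric and the corresponding human judgements, the sentence-level correlation reported in Table 1 reduces to a single call per metric. A toy illustration (with made-up values) is shown below.

```python
from scipy.stats import kendalltau

# Toy example: per-caption scores from one metric and 1-5 human judgements
# for the same captions (values are illustrative only).
metric_scores = [0.31, 0.72, 0.55, 0.18, 0.90]
human_judgements = [2, 4, 3, 1, 5]

tau, p_value = kendalltau(metric_scores, human_judgements)
print(f"Kendall tau = {tau:.3f} (p = {p_value:.3g})")
```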
METEOR is a strong lexical measure based on unigram matching, which uses additional linguistic knowledge for word matching, such as the morphological variation in words via stemming and dictionary-based look-up for synonyms and paraphrases (Banerjee and Lavie, 2005). CIDEr uses higher-order n-grams to account for fluency and down-weighs the commonly occurring (less informative) n-grams by performing Term Frequency-Inverse Document Frequency (TF-IDF) weighting for each n-gram in the dataset (Vedantam et al., 2015). SPICE, on the other hand, is a strong indicator of the semantic correctness of a caption. Together, these metrics assess the lexical, semantic and syntactic information.

The composite metrics which included WMD in the combination achieved a lower performance compared to the ones in which WMD was not included. One possible reason is that WMD heavily penalizes shorter candidate captions when the number of words in the output and the reference captions is not equal (Kusner et al., 2015). This penalty might not be consistently useful, as it is possible for a shorter candidate caption to be both fluent and adequate. WMD is therefore better suited to measuring document distance.

5.2 Accuracy

We follow the framework introduced in (Vedantam et al., 2015) to analyse the ability of a metric to discriminate between pairs of captions with reference to the ground-truth caption. A metric is considered accurate if it assigns a higher score to the caption preferred by humans. For this experiment, we use PASCAL-50s (Vedantam et al., 2015), which contains human judgements for 4,000 triplets of descriptions (one reference caption with two candidate captions). Based on the pairing, the triplets are grouped into four categories (comprising 1,000 triplets each): Human-Human Correct (HC), Human-Human Incorrect (HI), Human-Machine (HM), and Machine-Machine (MM). We follow the original approach of (Vedantam et al., 2015) and use 5 reference captions per candidate to assess the accuracy of the metrics, reported in Table 2.

Table 2: Comparative accuracy results (in percentage) on four kinds of pairs tested on PASCAL-50s.

Metrics    | HC | HI | HM | MM | AVG
BLEU       |    |    |    |    |
ROUGE-L    |    |    |    |    |
METEOR     |    |    |    |    |
CIDEr      |    |    |    |    |
SPICE      |    |    |    |    |
WMD        |    |    |    |    |
Eval_MS    |    |    |    |    |
Eval_CS    |    |    |    |    |
Eval_MCS   |    |    |    |    |
Eval_WCS   |    |    |    |    |
Eval_MWS   |    |    |    |    |
Eval_MWCS  |    |    |    |    |
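A minimal sketch of this pairwise-accuracy protocol is given below; `score_fn` stands in for whichever metric is being evaluated, and the triplet structure is our own simplified representation of the PASCAL-50s data.

```python
def pairwise_accuracy(triplets, score_fn):
    """triplets: list of (references, candidate_a, candidate_b, human_prefers_a),
    where references is a list of reference captions and human_prefers_a is a bool.
    score_fn(candidate, references) -> float, higher = better caption.
    Returns the fraction of triplets where the metric agrees with the humans."""
    correct = 0
    for references, cand_a, cand_b, human_prefers_a in triplets:
        metric_prefers_a = score_fn(cand_a, references) > score_fn(cand_b, references)
        correct += int(metric_prefers_a == human_prefers_a)
    return correct / len(triplets)

# Toy check with a trivial "metric" that counts reference word overlap.
def overlap_score(candidate, references):
    cand = set(candidate.lower().split())
    return max(len(cand & set(r.lower().split())) for r in references)

toy = [(["a man rides a horse"], "a man riding a horse", "a plate of food", True)]
print(pairwise_accuracy(toy, overlap_score))  # 1.0
```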

Table 2 shows that, on average, the composite measures produce better accuracy than the individual metrics. Amongst the four categories, HC is the hardest, in which all metrics show their worst performance. Differentiating between two good-quality (human-generated) correct captions is challenging, as it involves a fine-grained analysis of the two candidates. Eval_MS achieves the highest accuracy in the HC category, which suggests that as captioning systems continue to improve, this combination of lexical and semantic metrics will continue to perform well. Moreover, human-generated captions are usually fluent; therefore, a combination of strong indicators of adequacy, such as SPICE and METEOR, is the most suitable for this task. Eval_MCS shows the highest accuracy in differentiating between machine captions, which is another important category, as one of the main goals of automatic evaluation is to distinguish between two machine algorithms. Amongst the composite metrics, Eval_MS is again the best at distinguishing human captions (good quality) from machine captions (bad quality), which was our basic training criterion.

6 Conclusion and Future Works

In this paper, we propose a learning-based approach to combine various metrics to improve caption evaluation. Our experimental results show that metrics operating along different linguistic dimensions can be successfully combined through a learning-based framework, and that they outperform the existing metrics for caption evaluation in terms of correlation and accuracy, with Eval_MS and Eval_MCS giving the best overall performance. Our study reveals that the proposed approach is promising and has a lot of potential to be used for evaluation in the captioning domain. In the future, we plan to integrate features (components) of metrics, instead of their scores, for better performance. We also intend to use syntactic measures, which to the best of our knowledge have not yet been used for caption evaluation (except in an indirect way, through the n-gram measures that capture word order), and to study how they can improve correlation at the sentence level. The majority of the metrics for captioning focus more on adequacy than on fluency. This aspect also needs further attention, and a combination of metrics/features that can specifically assess the fluency of captions needs to be devised.

Acknowledgements

The authors are grateful to Nvidia for providing a Titan-Xp GPU, which was used for the experiments. This research is supported by the Australian Research Council, ARC DP.

References

Somak Aditya, Yezhou Yang, Chitta Baral, Yiannis Aloimonos, and Cornelia Fermüller. 2017. Image understanding using vision and reasoning through scene description graph. Computer Vision and Image Understanding.

Joshua S. Albrecht and Rebecca Hwa. 2008. Regression for machine translation evaluation at the sentence level. Machine Translation, 22(1-2).

Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016. SPICE: Semantic propositional image caption evaluation. In European Conference on Computer Vision. Springer.

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization.

Raffaella Bernardi, Ruket Cakici, Desmond Elliott, Aykut Erdem, Erkut Erdem, Nazli Ikizler-Cinbis, Frank Keller, Adrian Muscat, Barbara Plank, et al. 2016. Automatic description generation from images: A survey of models, datasets, and evaluation measures. Journal of Artificial Intelligence Research (JAIR), 55.

Ondřej Bojar, Yvette Graham, Amir Kamran, and Miloš Stanojević. 2016. Results of the WMT16 metrics shared task. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers.

Ondřej Bojar, Jindřich Helcl, Tom Kocmi, Jindřich Libovický, and Tomáš Musil. 2017. Results of the WMT17 neural MT training task. In Proceedings of the Second Conference on Machine Translation.

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C. Lawrence Zitnick. 2015. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint.

Michael Denkowski and Alon Lavie. 2014. Meteor Universal: Language specific translation evaluation for any target language. In Proceedings of the Ninth Workshop on Statistical Machine Translation.

Marina Fomicheva, Núria Bel, Lucia Specia, Iria da Cunha, and Anton Malinovskiy. 2016. CobaltF: A fluent metric for MT evaluation. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers.

Jesús Giménez and Lluís Màrquez. 2010. Linguistic measures for automatic machine translation evaluation. Machine Translation, 24(3-4).

Francisco Guzmán, Shafiq Joty, Lluís Màrquez, and Preslav Nakov. 2015. Pairwise neural machine translation evaluation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers).

Mert Kilickaya, Aykut Erdem, Nazli Ikizler-Cinbis, and Erkut Erdem. 2016. Re-evaluating automatic metrics for image captioning. arXiv preprint.

Alex Kulesza and Stuart M. Shieber. 2004. A learning approach to improving sentence-level MT evaluation. In Proceedings of the 10th International Conference on Theoretical and Methodological Issues in Machine Translation.

Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. 2015. From word embeddings to document distances. In International Conference on Machine Learning.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out.

Siqi Liu, Zhenhai Zhu, Ning Ye, Sergio Guadarrama, and Kevin Murphy. 2016. Improved image captioning via policy gradient optimization of SPIDEr. arXiv preprint.

Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. 2017. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems.

Andrew Mutton, Mark Dras, Stephen Wan, and Robert Dale. 2007. GLEU: Automatic evaluation of sentence-level fluency. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.

Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. 2015. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In 2015 IEEE International Conference on Computer Vision (ICCV).

Miguel Rios, Wilker Aziz, and Lucia Specia. 2011. TINE: A metric to assess MT adequacy. In Proceedings of the Sixth Workshop on Statistical Machine Translation. Association for Computational Linguistics.

Xingyi Song and Trevor Cohn. 2011. Regression and ranking based optimisation for sentence level machine translation evaluation. In Proceedings of the Sixth Workshop on Statistical Machine Translation. Association for Computational Linguistics.

Gideon Toury. 2012. Descriptive Translation Studies and Beyond: Revised Edition, volume 100. John Benjamins Publishing.

Amazon Mechanical Turk. 2012. Amazon Mechanical Turk. Retrieved August 17, 2012.

Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning.

Ting Yao, Yingwei Pan, Yehao Li, Zhaofan Qiu, and Tao Mei. 2016. Boosting image captioning with attributes. OpenReview, 2(5):8.

Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. 2016. Image captioning with semantic attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2.

Ying Zhang and Stephan Vogel. 2010. Significance tests of automatic machine translation evaluation metrics. Machine Translation, 24(1).
