. Semi-automatic WordNet Linking using Word Embeddings. Kevin Patel, Diptesh Kanojia and Pushpak Bhattacharyya Presented by: Ritesh Panjwani

Similar documents
Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings. Md Musa Leibniz University of Hannover

Word Sense Disambiguation Using Sense Signatures Automatically Acquired from a Second Language

Word2Vec and Doc2Vec in Unsupervised Sentiment Analysis of Clinical Discharge Summaries

How preferred are preferred terms?

Development of Indian Sign Language Dictionary using Synthetic Animations

Exploiting Task-Oriented Resources to Learn Word Embeddings for Clinical Abbreviation Expansion

Exploiting Task-Oriented Resources to Learn Word Embeddings for Clinical Abbreviation Expansion

FINAL REPORT Measuring Semantic Relatedness using a Medical Taxonomy. Siddharth Patwardhan. August 2003

Exploiting Task-Oriented Resources to Learn Word Embeddings for Clinical Abbreviation Expansion

Deep Learning based Information Extraction Framework on Chinese Electronic Health Records

PDF hosted at the Radboud Repository of the Radboud University Nijmegen

Sentiment Analysis of Reviews: Should we analyze writer intentions or reader perceptions?

Annotation and Retrieval System Using Confabulation Model for ImageCLEF2011 Photo Annotation

Workshop/Hackathon for the Wordnet Bahasa

Asthma Surveillance Using Social Media Data

1. INTRODUCTION. Vision based Multi-feature HGR Algorithms for HCI using ISL Page 1

Wiktionary RDF. Transformation of Japanese Wiktionary to Linked Open Data SIG-SWO

1 Introduction. Easy and intuitive usage. Searching - not only by text, but by another, sign language specific criteria

Processing MWEs: Neurocognitive Bases of Verbal MWEs and Lexical Cohesiveness within MWEs

Handwritten Kannada Vowels and English Character Recognition System

Discovering and Understanding Self-harm Images in Social Media. Neil O Hare, MFSec Bucharest, Romania, June 6 th, 2017

Smaller, faster, deeper: University of Edinburgh MT submittion to WMT 2017

Character-based Embedding Models and Reranking Strategies for Understanding Natural Language Meal Descriptions

Corpus Construction and Semantic Analysis of Indonesian Image Description

Using a grammar implementation to teach writing skills

Latin Language course

An Avatar-Based Weather Forecast Sign Language System for the Hearing-Impaired

Generalizing Dependency Features for Opinion Mining

Erasmus MC at CLEF ehealth 2016: Concept Recognition and Coding in French Texts

These Terms Synonym Term Manual Used Relationship

Image-Based Food Calorie Estimation Using Knowledge on Food Categories, Ingredients and Cooking Directions

Feature Engineering for Depression Detection in Social Media

Semantic Structure of the Indian Sign Language

Extracting geographic locations from the literature for virus phylogeography using supervised and distant supervision methods

Identifying Cognitive Distortion by Convolutional Neural Network based Text Classification

ALTHOUGH there is agreement that facial expressions

Approximating hierarchy-based similarity for WordNet nominal synsets using Topic Signatures

A Web Tool for Building Parallel Corpora of Spoken and Sign Languages

Recommendation of CGM novels considering serendipity.

Signing for the Deaf using Virtual Humans. Ian Marshall Mike Lincoln J.A. Bangham S.J.Cox (UEA) M. Tutt M.Wells (TeleVirtual, Norwich)

A Study of Abbreviations in Clinical Notes Hua Xu MS, MA 1, Peter D. Stetson, MD, MA 1, 2, Carol Friedman Ph.D. 1

Beam Search Strategies for Neural Machine Translation

Back to the Future: Knowledge Light Case Base Cookery

Combining unsupervised and supervised methods for PP attachment disambiguation

Clinical Event Detection with Hybrid Neural Architecture

Childhood memories (signed by Avril Langard-Tang) Worksheet

ALTHOUGH there is agreement that facial expressions

Exploiting Patent Information for the Evaluation of Machine Translation

KNOWLEDGE-BASED METHOD FOR DETERMINING THE MEANING OF AMBIGUOUS BIOMEDICAL TERMS USING INFORMATION CONTENT MEASURES OF SIMILARITY

This is the accepted version of this article. To be published as : This is the author version published as:

Machine Translation Contd

Using Information From the Target Language to Improve Crosslingual Text Classification

Natural Logic Inference for Emotion Detection

Chittron: An Automatic Bangla Image Captioning System

IDENTIFYING STRESS BASED ON COMMUNICATIONS IN SOCIAL NETWORKS

Exploiting Ordinality in Predicting Star Reviews

Insights into Analogy Completion from the Biomedical Domain

Cross-Linguistic Discovery of Semantic Regularity. Ben Wing CSC 620

Schema-Driven Relationship Extraction from Unstructured Text

Semi-Automatic Construction of Thyroid Cancer Intervention Corpus from Biomedical Abstracts

Less is More: Culling the Training Set to Improve Robustness of Deep Neural Networks

Learning the Fine-Grained Information Status of Discourse Entities

Not all NLP is Created Equal:

Binge drinking increases risk of dementia

Traits & Trait Taxonomies

Phrase-level Quality Estimation for Machine Translation

Drugs and the Teen Brain

Common Sense Assistant for Writing Stories that Teach Social Skills

Automatic coding of death certificates to ICD-10 terminology

Sentence Formation in NLP Engine on the Basis of Indian Sign Language using Hand Gestures

English and Persian Apposition Markers in Written Discourse: A Case of Iranian EFL learners

Hindi WordNet Data and Associated Software License Agreement

NYU Refining Event Extraction (Old-Fashion Traditional IE) Through Cross-document Inference

Medical Knowledge Attention Enhanced Neural Model. for Named Entity Recognition in Chinese EMR

Applying a Culture Dependent Emotion Triggers Database for Text Valence and Emotion Classification

Fall 2004 IN THIS ISSUE:

A Simple Pipeline Application for Identifying and Negating SNOMED CT in Free Text

SPICE: Semantic Propositional Image Caption Evaluation

Shu Kong. Department of Computer Science, UC Irvine

DOMAIN BOUNDED ENGLISH TO INDIAN SIGN LANGUAGE TRANSLATION MODEL

Thesaurus-based value maps as an instrument for psychological research

Looking for Subjectivity in Medical Discharge Summaries The Obesity NLP i2b2 Challenge (2008)

S URVEY ON EMOTION CLASSIFICATION USING SEMANTIC WEB

Convolutional Neural Networks for Text Classification

Overview of the Patent Translation Task at the NTCIR-7 Workshop

An Intelligent Writing Assistant Module for Narrative Clinical Records based on Named Entity Recognition and Similarity Computation

GIANT: Geo-Informative Attributes for Location Recognition and Exploration

arxiv: v2 [cs.lg] 3 Jun 2016

Kai-Wei Chang UCLA. What It Takes to Control Societal Bias in Natural Language Processing. References:

Inferring Clinical Correlations from EEG Reports with Deep Neural Learning

Emotional Aware Clustering on Micro-blogging Sources

Automatic Identification & Classification of Surgical Margin Status from Pathology Reports Following Prostate Cancer Surgery

Model answers. Childhood memories (signed by Avril Langard-Tang) Introduction:

Outline. Teager Energy and Modulation Features for Speech Applications. Dept. of ECE Technical Univ. of Crete

Textual Emotion Processing From Event Analysis

IRIT at e-risk. 1 Introduction

a day (or two) with lexicographers

Characterizing Diseases from Unstructured Text: A Vocabulary Driven Word2vec Approach

Transcription:

Semi-automatic WordNet Linking using Word Embeddings Kevin Patel, Diptesh Kanojia and Pushpak Bhattacharyya Presented by: Ritesh Panjwani January 11, 2018 Kevin Patel WordNet Linking via Embeddings 1/22

Outline 1 2 3 4 5 6 7 Kevin Patel WordNet Linking via Embeddings 2/22

Wordnet Lexical resource Groups words into sets of synonyms called Synsets Records relations among these synsets Linked Wordnet Synsets with same meaning, but belonging to wordnets of different languages are linked EuroWordNet Vossen and Letteren (1997) and IndoWordNet Bhattacharyya (2010) Used for Machine Translation Hovy (1998), Cross Lingual Information Retrieval Gonzalo et al (1998), etc Challenge in linking Wordnets Linking done manually Tools such as Joshi et al (2012b) to assist humans Kevin Patel WordNet Linking via Embeddings 3/22

Background Princeton WordNet (Miller et al, 1990) or the English WordNet was the first wordnet EuroWordNet (Vossen and Letteren, 1997) : linked wordnet comprising of wordnets for European languages Each wordnet separately captures a language-specific information Wordnets uses Princeton WordNet as an Inter-Lingual-Index Enables one to go from concepts in one language to similar concepts in any other language IndoWordNet Bhattacharyya (2010) is a linked wordnet comprising of wordnets for 18 Indian languages Created using the expansion approach using Hindi WordNet as a pivot Partially linked to English WordNet Kevin Patel WordNet Linking via Embeddings 4/22

Joshi et al (2012a) developed a heuristic based measure where they use bilingual dictionaries to link two wordnets Combine scores using various heuristics and generate a list of potential candidates for linked synsets Singh et al (2016) discuss a method to improve the current status of Hindi-English linkage and present a generic methodology Their method is beneficial for culture-specific synsets, or for non-existing concepts Cost and time inefficient; requires a lot of manual effort on the part of a lexicographer Our intention: reduce effort on the part of lexicographers Kevin Patel WordNet Linking via Embeddings 5/22

Given wordnets of two different languages E and F with sets of synsets {s 1 E, s2 E,, sm E } and {s1 F, s2 F,, sn F } respectively, find mappings of the form < s i E, sj F > which are semantically correct Hindi Synsets English Synsets 1: {ह थ, हस त, कर} 1: {hand, paw} 2: {tax, revenue enhancement} Kevin Patel WordNet Linking via Embeddings 6/22

Motivation Adapted from Mikolov et al (2013a) Kevin Patel WordNet Linking via Embeddings 7/22

Algorithm: Notations Let E and F be two languages Let E and F be the number of synsets in wordnets of E and F respectively Let s i E and sj F be the ith and j th synsets of E and F respectively, s i E = {e1 α, e 2 α,, e mi α } s j F = {f1 β, f2 β,, fnj β } e p α and f q β are words in vocabulary of E and F respectively for 1 p m i and 1 q n j, and 1 i E and 1 j F Let v e p α be the word vector corresponding to e p α Kevin Patel WordNet Linking via Embeddings 8/22

Algorithm: Training Estimate v s i E Similarly, as Given links of the form error Err is minimized v s i E = 1 m i v s j F m i p=0 = 1 n j n j s i E, sj F q=0 v e p α (1) v f q β, we learn W such that the (2) Err = Wv s i v E s j 2 (3) F Kevin Patel WordNet Linking via Embeddings 9/22

Algorithm: Prediction To find a mapping for a new synset s k E, one needs to Calculate v = Wv s k E Find v s l F such that v s l F v is maximized Create link s k E, sl F Kevin Patel WordNet Linking via Embeddings 10/22

Datasets Linking Hindi WordNet to English WordNet English Vectors: Pretrained vectors from Google s word2vec tool Mikolov et al (2013b), trained on News dataset (around 100 billion tokens) Hindi Vectors: Trained using word2vec on Bojar corpus Bojar et al (2014) (around 365 million tokens) Linked data: Created at CFILT, IITB Of the form hindi_synset_id, english_synset_id, link_type, where link_type {DIRECT, HYPERNYMY, etc} Focus on only DIRECT links 6863 such links available Kevin Patel WordNet Linking via Embeddings 11/22

Distribution of links Class Count Noun 4757 Adjective 1283 Verb 680 Adverb 143 Distribution of available links among various classes Kevin Patel WordNet Linking via Embeddings 12/22

metric Accuracy@n: One of the top n predictions can be correct Kevin Patel WordNet Linking via Embeddings 13/22

Results: Overall Acc@1 Acc@3 Acc@5 Acc@8 Acc@10 Overall 029 045 052 058 060 Results for the overall setting: Dimension of English embeddings=300, Dimensions of Hindi embeddings=300 Kevin Patel WordNet Linking via Embeddings 14/22

Results: Per word class I Word Class Acc@1 Acc@3 Acc@5 Acc@8 Acc@10 Noun 035 053 060 065 067 Adjective 026 044 050 057 060 Verb 015 025 029 033 037 Adverb 028 051 059 070 073 Results for the setting: Dimension of English Vectors=300, Dimensions of Hindi Vectors=300 Kevin Patel WordNet Linking via Embeddings 15/22

Results: Per word class II Word Class Acc@1 Acc@3 Acc@5 Acc@8 Acc@10 Noun 035 051 058 064 066 Adjective 012 020 024 030 032 Verb 017 027 032 036 039 Adverb 038 052 065 076 080 Results for the setting: Dimension of English Vectors=300, Dimensions of Hindi Vectors=1200 Kevin Patel WordNet Linking via Embeddings 16/22

Possible reasons for poor performance Kevin Patel WordNet Linking via Embeddings 17/22

Possible reasons for poor performance Something is fundamentally missing in word vectors Probably presence of only co-occurence information, and lack of other information such as word ordering, argument frames( for verbs), etc Kevin Patel WordNet Linking via Embeddings 17/22

Possible reasons for poor performance Something is fundamentally missing in word vectors Probably presence of only co-occurence information, and lack of other information such as word ordering, argument frames( for verbs), etc The approach to create synset vectors is not optimal Kevin Patel WordNet Linking via Embeddings 17/22

Possible reasons for poor performance Something is fundamentally missing in word vectors Probably presence of only co-occurence information, and lack of other information such as word ordering, argument frames( for verbs), etc The approach to create synset vectors is not optimal The linear transformation approach is not optimal Kevin Patel WordNet Linking via Embeddings 17/22

Possible reasons for poor performance Something is fundamentally missing in word vectors Probably presence of only co-occurence information, and lack of other information such as word ordering, argument frames( for verbs), etc The approach to create synset vectors is not optimal The linear transformation approach is not optimal Synset members are often phrases instead of words How to create phrase vectors? Kevin Patel WordNet Linking via Embeddings 17/22

Possible reasons for poor performance Something is fundamentally missing in word vectors Probably presence of only co-occurence information, and lack of other information such as word ordering, argument frames( for verbs), etc The approach to create synset vectors is not optimal The linear transformation approach is not optimal Synset members are often phrases instead of words How to create phrase vectors? Currently, a word has only one vector That is a one of the reason for ambiguity Perhaps for each word, multiple vectors (one vector per sense) is the way to go Kevin Patel WordNet Linking via Embeddings 17/22

Described an approach to link wordnets Creates synset embeddings using word embeddings, followed by learning transformation from source to target language synsets Our approach achieves accuracy@10 of approximately 60% and 70% of all synsets and noun synsets, respectively Discussed reasons for poor performance on classes such as verbs Plan to integrate it in tools such as Joshi et al (2012a) Kevin Patel WordNet Linking via Embeddings 18/22

References Bhattacharyya, P (2010) Indowordnet In Lexical Resources Engineering Conference 2010 (LREC 2010) Bojar, O, Diatka, V, Rychlý, P, Straňák, P, Suchomel, V, Tamchyna, A, and Zeman, D (2014) HindMonoCorp 05 Gonzalo, J, Verdejo, F, Chugur, I, and Cigarran, J (1998) Indexing with wordnet synsets can improve text retrieval arxiv preprint cmp-lg/9808002 Hovy, E (1998) Combining and standardizing large-scale, practical ontologies for machine translation and other uses In Proceedings of the 1st International Conference on Language Resources and (LREC), pages 535 542 Joshi, S, Chatterjee, A, Karra, A K, and Bhattacharyya, P U (2012a) Eating your own cooking: automatically linking wordnet synsets of two languages Kevin Patel WordNet Linking via Embeddings 19/22

References Joshi, S, Chatterjee, A, Karra, K A, and Bhattacharyya, P (2012b) Eating your own cooking: Automatically linking wordnet synsets of two languages In Proceedings of COLING 2012: Demonstration Papers, pages 239 246 The COLING 2012 Organizing Committee Mikolov, T, Le, Q V, and Sutskever, I (2013a) Exploiting similarities among languages for machine translation CoRR, abs/13094168 Mikolov, T, Sutskever, I, Chen, K, Corrado, G S, and Dean, J (2013b) Distributed representations of words and phrases and their compositionality In Burges, C, Bottou, L, Welling, M, Ghahramani, Z, and Weinberger, K, editors, Advances in Neural Information Processing Systems 26, pages 3111 3119 Curran Associates, Inc Kevin Patel WordNet Linking via Embeddings 20/22

References Miller, G A, Beckwith, R, Fellbaum, C, Gross, D, and Miller, K J (1990) to wordnet: An on-line lexical database International journal of lexicography, 3(4):235 244 Singh, M, Shukla, R, Jha, J, Kashyap, L, Kanojia, D, and Bhattacharyya, P (2016) Mapping it differently: A solution to the linking challenges In Eighth Global Wordnet Conference GWC 2016 Vossen, P and Letteren, C C (1997) Eurowordnet: a multilingual database for information retrieval In In: Proceedings of the DELOS workshop on Cross-language Information Retrieval, pages 5 7 Kevin Patel WordNet Linking via Embeddings 21/22

Thank You Questions? For more details, write to: kevinpatel@cseiitbacin Kevin Patel WordNet Linking via Embeddings 22/22