Microblog Retrieval for Disaster Relief: How To Create Ground Truths? Ribhav Soni and Sukomal Pal


Outline
1. Overview of this work
2. Introduction
3. Background: FMT16
4. Experiments
5. Discussion

Overview
The focus of this work is the creation of gold-standard data for retrieving helpful tweets during disasters. We show that the gold standard prepared in the FIRE 2016 Microblog Track (FMT16) missed many relevant tweets, and we demonstrate that a machine learning model can help retrieve the remaining ones.

Introduction
The FIRE 2016 Microblog Track (FMT16) produced a benchmark collection of ground-truth data for microblog retrieval in a disaster scenario. However, based on our experiments, we argue that the ground-truth annotation exercise missed up to 80% of the relevant tweets. First, we manually and exhaustively labeled a small random subset of the data for each of the seven topics used in FMT16, and found that about 80% of the relevant tweets were missing from the gold standard. We then trained an SVM model on a subset of the data and used it to retrieve the 100 tweets with the highest confidence scores of the trained model; more than 50% of the relevant tweets among these had been missed in the gold standard.

Background: FMT16
We used the dataset from FMT16, a collection of about 50,000 tweets posted during the 2015 Nepal earthquake. The task was to retrieve tweets relevant to each of seven information needs, expressed as topics in TREC format. The gold-standard preparation involved three phases:
1. Three annotators independently searched for relevant tweets using intuitive keywords, after all tweets were indexed using Indri.
2. All tweets identified by at least one of the three annotators in Phase 1 were taken, and their relevance was finalized by mutual discussion.
3. Standard pooling: the top 30 results from each run were taken and judged for relevance.

The seven topics used in FMT16:
FMT1: What resources were available
FMT2: What resources were required
FMT3: What medical resources were available
FMT4: What medical resources were required
FMT5: What were the requirements / availability of resources at specific locations
FMT6: What were the activities of various NGOs / Government organizations
FMT7: What infrastructure damage and restoration were being reported
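The pooling step in Phase 3 can be sketched in a few lines: for each topic, take the top-k results from every submitted run and judge the union. This is a minimal illustration; the run contents below are made-up placeholders, not FMT16 data.

```python
# Standard pooling: union of the top-k ranked tweet IDs from each run.
def pool(runs, depth=30):
    """runs: list of ranked lists of tweet IDs for one topic.
    Returns the set of IDs to be judged for relevance."""
    pooled = set()
    for ranked in runs:
        pooled.update(ranked[:depth])
    return pooled

# Toy example with two runs and a pool depth of 3.
run_a = ["t1", "t2", "t3", "t4"]
run_b = ["t3", "t5", "t1", "t6"]
print(sorted(pool([run_a, run_b], depth=3)))  # ['t1', 't2', 't3', 't5']
```

With only 15 runs and a depth of 30, such a pool can easily miss relevant tweets that no system ranked highly, which is one of the failure modes discussed later.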

EXPERIMENTS

1. Exhaustive labeling on a small, random subset
700 tweets were randomly taken from the original dataset, and relevance to each of the seven topics was judged for all 700 tweets.

Result: about 5 times as many tweets from the sample were found to be actually relevant as were marked relevant in the gold standard.

Number of relevant tweets identified in the 700-tweet sample:

Topic                FMT16 Gold Standard   Our Manual Labeling
FMT1                 7                     43
FMT2                 4                     12
FMT3                 5                     10
FMT4                 1                     4
FMT5                 4                     9
FMT6                 5                     53
FMT7                 3                     28
At least one topic   22                    105

2. Bootstrapping, to estimate the total number of relevant tweets
We generated 1,000 samples of 700 tweets each, drawn with replacement. The average number of relevant tweets across all samples, divided by the sample size, was taken as an estimate of the fraction of relevant tweets in the entire collection.

Result: 15.02%, or about 7,500 tweets out of 50,000, were estimated to be relevant. This is about 5 times the figure in the FMT16 gold standard, representing a loss of about 6,000 useful tweets.

                              FMT16 Gold Standard   Our estimate by bootstrapping
Fraction of relevant tweets   3.13%                 15.02%
Number of relevant tweets     1,565                 7,520
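The bootstrap estimate above can be reproduced with a short script. This is a sketch using only the standard library; the labels below are constructed to match the paper's 105-relevant-out-of-700 sample, not real data.

```python
import random

def bootstrap_relevant_fraction(labels, n_samples=1000, seed=0):
    """labels: 1 if a tweet was judged relevant to at least one topic, else 0.
    Draws n_samples resamples with replacement (same size as the original
    sample) and returns the mean fraction of relevant tweets."""
    rng = random.Random(seed)
    n = len(labels)
    total = 0.0
    for _ in range(n_samples):
        resample = [labels[rng.randrange(n)] for _ in range(n)]
        total += sum(resample) / n
    return total / n_samples

# 105 relevant tweets in a 700-tweet sample, as in our manual labeling.
labels = [1] * 105 + [0] * 595
est = bootstrap_relevant_fraction(labels)
print(round(est, 3))  # close to 105/700 = 0.15
```

Multiplying the estimated fraction by the 50,000-tweet collection size gives the ~7,500-tweet estimate reported above.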

3. Machine Learning for automatic filtering of tweets
We trained seven binary SVM classifiers, one for each of the seven topics, using a bag-of-words model with unigram TF-IDF values as features. Since the classification task was highly skewed towards non-relevant tweets, we used undersampling (i.e., we kept only as many negative examples as we had positive ones). Positive examples came from the FMT16 gold standard data in addition to our manual labeling of 700 tweets. We thus had, on average, about 650 tweets available for each topic, with an equal number of relevant and non-relevant tweets.

Results:

Classifier for topic   Precision   Recall   F1 score
FMT1                   92.56       92.83    92.67
FMT2                   93.45       92.81    93.09
FMT3                   96.35       93.99    95.14
FMT4                   93.06       90.68    91.74
FMT5                   92.95       88.47    90.57
FMT6                   90.88       89.06    89.91
FMT7                   91.89       90.49    91.13

4. Retrieving the most relevant tweets in the entire collection
The trained classifiers were used to predict the 100 most relevant tweets for each topic in the entire collection (i.e., those on which the classifier had the highest prediction scores). The predicted sets of 100 tweets were manually analyzed to see how many were actually relevant.

Result: on average, about 79 of the top 100 predicted tweets per topic were actually relevant, but only 47% of them (on average) were also identified in the FMT16 gold standard.

Number of actually relevant tweets among the top 100 predicted most relevant tweets:

Topic   Actually relevant   Also marked in Gold Standard
FMT1    80                  43
FMT2    73                  48
FMT3    92                  57
FMT4    62                  33
FMT5    65                  22
FMT6    84                  23
FMT7    94                  32
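Selecting the 100 highest-confidence tweets per topic amounts to ranking the whole collection by the classifier's decision score and taking the top of the list. A minimal sketch (the scores and IDs below are illustrative; in practice the scores would come from something like the SVM's decision function over all 50,000 tweets):

```python
def top_k_by_score(scores, ids, k=100):
    """scores: per-tweet classifier confidence scores, aligned with ids.
    Returns the k tweet IDs with the highest scores, best first."""
    ranked = sorted(zip(scores, ids), key=lambda pair: pair[0], reverse=True)
    return [tweet_id for _, tweet_id in ranked[:k]]

# Toy example: pick the 2 highest-scoring of 4 tweets.
print(top_k_by_score([0.2, 1.5, -0.3, 0.9], ["a", "b", "c", "d"], k=2))  # ['b', 'd']
```

The resulting top-100 lists are what we handed to annotators for manual relevance checking.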

Some relevant tweets that were missed in the FMT16 Gold Standard (by topic):

FMT1: Earthquake Relief Distribution: Distributed Relief materials to the earthquake victims of Tukcha-1 (Pandy-Rai... http://t.co/0vlgheff4p
FMT2: RT @worldtoiletday: Nepal earthquake: Urgent need for water, #sanitation and food: http://t.co/uob6hq81py #NepalEarthquake @UNICEF @UN_Water
FMT3: RT Bloodbanks #Nepal Hospital and Research Centre 4476225 Norvic Hospital 4258554 #NepalEarthquake #MNTL #India
FMT4: RT @FocusNewsIndia: #NepalEarthquake #Nepal PM Sushil Koirala requests for urgent blood donation for victims rescued from #earthquake htt
FMT5: Tomorrow, We are moving to Hansapur VDC of Gorkha District to provide relief materials to the earthquake... http://t.co/gyzit3eyip
FMT6: #ArtofLiving Nepal Centre providing shelter to 100's of ppl. Volunteers providing food & water #NepalEarthquakeRelief http://t.co/15rmabe2vo
FMT7: RT @PDChina: The rubble of Hanumndhoka Durbar Square, a @UNESCO world #heritage site, was badly damaged by earthquake in Kathmandu http://t

Discussion
The gold-standard creation exercise in FMT16 missed many relevant tweets, despite its three-phase approach. Reasons include:
- Many relevant tweets don't contain the expected keywords at all, so they were missed in the keyword-search-based Phases 1 and 2.
- Pooling (Phase 3) also failed to find all relevant tweets because both the number of participating systems and the depth of the pool were small (15 runs, with a pool depth of 30 tweets).

We also showed that machine learning can be employed to retrieve relevant tweets from unseen data with reasonable accuracy. It can further be used to shortlist useful tweets for manual verification in a subsequent stage, depending on the annotators' available time.

QUESTIONS?

Thank you