Microblog Retrieval for Disaster Relief: How To Create Ground Truths?
Ribhav Soni and Sukomal Pal
Outline
1. Overview of this work
2. Introduction
3. Background: FMT16
4. Experiments
5. Discussion
Overview
The focus of this work is the creation of gold standard data for retrieving helpful tweets during disasters. We show that the gold standard data prepared in the FIRE 2016 Microblog Track (FMT16) missed many relevant tweets, and we demonstrate that a machine learning model can help retrieve the remaining relevant tweets.
Introduction
The FIRE 2016 Microblog Track (FMT16) led to the creation of a benchmark collection of ground truth data for microblog retrieval in a disaster scenario. However, based on our experiments, we argue that the ground truth annotation exercise missed up to 80% of the relevant tweets. First, we manually and exhaustively labeled a small, random subset of the data for each of the seven topics used in FMT16, and found that about 80% of the relevant tweets were missing from the gold standard. We then trained an SVM model on a subset of the data and used it to retrieve the 100 tweets with the highest confidence scores from the trained model. More than 50% of the relevant tweets among these were missing from the gold standard.
Background: FMT16
We used the dataset from FMT16, a collection of about 50,000 tweets posted during the Nepal earthquake in 2015. The task was to retrieve tweets relevant to each of seven information needs, expressed as topics in TREC format.

The gold standard preparation involved three phases:
1. Three annotators independently searched for relevant tweets using intuitive keywords, after all tweets were indexed using Indri.
2. All tweets identified by at least one of the three annotators in Phase 1 were taken and their relevance finalized by mutual discussion.
3. Standard pooling, taking the top 30 results from each run and deciding on their relevance.

The seven topics used in FMT16:
FMT1: What resources were available
FMT2: What resources were required
FMT3: What medical resources were available
FMT4: What medical resources were required
FMT5: What were the requirements / availability of resources at specific locations
FMT6: What were the activities of various NGOs / Government organizations
FMT7: What infrastructure damage and restoration were being reported
EXPERIMENTS
1. Exhaustive labeling of a small, random subset
700 tweets were randomly drawn from the original dataset, and relevance to each of the seven topics was judged for all 700 tweets.

Result: we identified about 5 times as many relevant tweets in the sample as were marked in the gold standard.

Number of relevant tweets identified from the 700-tweet sample:
Topic                 FMT16 Gold Standard   Our Manual Labeling
FMT1                  7                     43
FMT2                  4                     12
FMT3                  5                     10
FMT4                  1                     4
FMT5                  4                     9
FMT6                  5                     53
FMT7                  3                     28
At least one topic    22                    105
2. Bootstrapping to estimate the total number of relevant tweets
We generated 1,000 samples of 700 tweets each, drawn with replacement. The average number of relevant tweets across all samples, divided by the sample size, was taken as an estimate of the fraction of relevant tweets in the entire collection.

Result: 15.02%, or about 7,500 tweets out of 50,000, were estimated to be relevant. This is about 5 times the figure in the FMT16 gold standard and represents a loss of about 6,000 useful tweets.

                              FMT16 Gold Standard   Our estimate by bootstrapping
Fraction of relevant tweets   3.13%                 15.02%
Number of relevant tweets     1,565                 7,520
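A minimal sketch of the bootstrap estimate described above, assuming the 700 manually labeled tweets are available as a list of 0/1 relevance flags (the function and variable names are illustrative, not from the original code):

```python
import random

def bootstrap_relevant_fraction(labels, n_samples=1000, seed=0):
    """Estimate the fraction of relevant tweets by resampling the
    manually labeled subset with replacement."""
    rng = random.Random(seed)
    n = len(labels)  # sample size, 700 in this work
    fractions = []
    for _ in range(n_samples):
        resample = [rng.choice(labels) for _ in range(n)]
        fractions.append(sum(resample) / n)
    return sum(fractions) / n_samples

# labels[i] = 1 if tweet i of the 700 was judged relevant to at least one topic
# est = bootstrap_relevant_fraction(labels)
# est * 50000 then estimates the number of relevant tweets in the whole collection
```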
3. Machine learning for automatic filtering of tweets
We trained seven binary SVM classifiers, one for each of the seven topics, using a bag-of-words model with unigram TF-IDF values as features. Since the classification task was highly skewed towards non-relevant tweets, we used undersampling (i.e., used only as many negative examples as we had positive ones). Positive examples were available from the FMT16 gold standard data, in addition to our manual labeling of 700 tweets. We thus had, on average, about 650 tweets per topic, with an equal number of relevant and non-relevant tweets.

Results:
Classifier for topic   Precision   Recall   F1 score
FMT1                   92.56       92.83    92.67
FMT2                   93.45       92.81    93.09
FMT3                   96.35       93.99    95.14
FMT4                   93.06       90.68    91.74
FMT5                   92.95       88.47    90.57
FMT6                   90.88       89.06    89.91
FMT7                   91.89       90.49    91.13
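A sketch of one per-topic classifier under the setup described above (unigram TF-IDF features, linear SVM, undersampling of the non-relevant class). The scikit-learn pipeline and the helper name are our illustration, not the authors' exact code; tweets are assumed to be plain strings:

```python
import random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

def train_topic_classifier(pos_tweets, neg_tweets, seed=0):
    """Train a binary SVM for one topic on a balanced (undersampled) training set."""
    rng = random.Random(seed)
    # Undersample: keep only as many negatives as there are positives.
    neg_sample = rng.sample(neg_tweets, len(pos_tweets))
    texts = pos_tweets + neg_sample
    labels = [1] * len(pos_tweets) + [0] * len(neg_sample)
    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 1)),  # bag-of-words, unigram TF-IDF
        LinearSVC(),
    )
    model.fit(texts, labels)
    return model
```

One such classifier would be trained independently for each of the seven topics.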
4. Retrieving the most relevant tweets in the entire collection
The trained classifiers were used to predict the 100 most relevant tweets for each topic in the entire collection (i.e., the tweets on which the classifier had the highest prediction scores). The predicted sets of 100 tweets were manually analyzed to see how many of the predicted tweets were actually relevant.

Result: on average, for each topic, about 79 of the top 100 predicted tweets were actually relevant, but only 47% of those (on average) were also identified in the FMT16 gold standard.

Number of actually relevant tweets among the top 100 predicted most relevant tweets:
Topic   Actually relevant   Also marked in Gold Standard
FMT1    80                  43
FMT2    73                  48
FMT3    92                  57
FMT4    62                  33
FMT5    65                  22
FMT6    84                  23
FMT7    94                  32
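A sketch of this ranking step, assuming `model` is a trained pipeline as in the previous sketch and `all_tweets` is the full collection of roughly 50,000 tweet strings (all names are illustrative):

```python
def top_k_predicted_relevant(model, all_tweets, k=100):
    """Rank every tweet by the SVM decision score and return the k highest-scoring ones."""
    scores = model.decision_function(all_tweets)
    ranked = sorted(zip(all_tweets, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# top100 = top_k_predicted_relevant(model, all_tweets)
# These 100 tweets per topic were then checked manually for actual relevance.
```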
Some relevant tweets that were missed in the FMT16 Gold Standard
Relevant to topic   Tweet text
FMT1   Earthquake Relief Distribution: Distributed Relief materials to the earthquake victims of Tukcha-1 (Pandy-Rai... http://t.co/0vlgheff4p
FMT2   RT @worldtoiletday: Nepal earthquake: Urgent need for water, #sanitation and food: http://t.co/uob6hq81py #NepalEarthquake @UNICEF @UN_Water
FMT3   RT Bloodbanks #Nepal Hospital and Research Centre 4476225 Norvic Hospital 4258554 #NepalEarthquake #MNTL #India
FMT4   RT @FocusNewsIndia: #NepalEarthquake #Nepal PM Sushil Koirala requests for urgent blood donation for victims rescued from #earthquake htt
FMT5   Tomorrow, We are moving to Hansapur VDC of Gorkha District to provide relief materials to the earthquake... http://t.co/gyzit3eyip
FMT6   #ArtofLiving Nepal Centre providing shelter to 100's of ppl. Volunteers providing food & water #NepalEarthquakeRelief http://t.co/15rmabe2vo
FMT7   RT @PDChina: The rubble of Hanumndhoka Durbar Square, a @UNESCO world #heritage site, was badly damaged by earthquake in Kathmandu http://t
Discussion
The gold standard creation exercise in FMT16 missed many relevant tweets, despite a three-phase approach. Some reasons why this happened:
1. Many relevant tweets don't contain the expected keywords at all, so they were missed in the keyword-search-based Phases 1 and 2.
2. Pooling (Phase 3) also failed to find all relevant tweets because the number of participating systems, as well as the depth of the pool, was small (15 runs and a pool depth of 30 tweets).
We also showed that machine learning can be employed to retrieve relevant tweets from unseen data with reasonable accuracy. It can also be used to shortlist useful tweets for manual verification in the next stage, depending on the available annotators' time.
QUESTIONS?
Thank you