Chapter IR:VIII
VIII. Evaluation

  Laboratory Experiments
  Logging
  Effectiveness Measures
  Efficiency Measures
  Training and Testing

HAGEN/POTTHAST/STEIN 2018

Retrieval Tasks

Ad hoc retrieval: one question, one result set.
  - Amenable to laboratory environments
  - Canonical measurement of retrieval performance
  - Reproducibility and scalability

Interactive retrieval / task-based retrieval: the user has a goal or a task that requires many queries and exploration, refining the information need along the way.
  - Depends not only on result quality, but also on human factors, context, user interface and experience, and the search engine's supporting facilities.
  - Performance is difficult to measure.

Experimental Setup

A laboratory experiment for ad hoc retrieval requires three items:

1. A document collection (corpus)
   A representative sample of documents from the domain of interest. The method by which documents are sampled from the population determines a corpus's validity. Statistical representativeness may be difficult to achieve, e.g., in case of the web; in that case, the larger a corpus can be built for a given domain, the better.

2. A set of information needs (topics)
   Formalized, written descriptions of users' tasks, goals, or gaps in knowledge; alternatively, descriptions of desired search results. Often accompanied by the specific queries the users (would) have used to search for relevant documents.

3. A set of relevance judgments (ground truth)
   Pairs of topics and documents, where each document has been manually assessed with respect to its relevance to its associated topic. Ideally, the judgments are obtained from the same users who supplied the topics, but in practice judgments are collected from third parties. Judgments may be given in binary form or on a Likert scale.

Every search engine has parameters. Parameter optimization must use an experimental setup (training) different from that used for evaluation (test). Performance deltas between training and test hint at overfitting.
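As a minimal sketch (all class and field names here are illustrative, not part of the lecture), such a test collection and its training/test split of topics might be represented as follows:

    from dataclasses import dataclass, field

    @dataclass
    class Topic:
        number: str                  # e.g., "794"
        title: str                   # short, query-like title
        description: str = ""
        narrative: str = ""

    @dataclass
    class TestCollection:
        documents: dict[str, str]    # doc_id -> document text
        topics: dict[str, Topic]     # topic number -> Topic
        # (topic number, doc_id) -> relevance grade (binary or Likert)
        judgments: dict[tuple[str, str], int] = field(default_factory=dict)

        def split(self, training_topics: set[str]):
            """Separate topics into a training and a test portion, so that
            parameter tuning and evaluation use disjoint topic sets."""
            train = {t: v for t, v in self.topics.items() if t in training_topics}
            test = {t: v for t, v in self.topics.items() if t not in training_topics}
            return train, test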

Remarks:
  - This setup is sometimes referred to as an experiment under the Cranfield paradigm, in reference to Cleverdon's series of projects at Cranfield University in the 1960s, which first used this evaluation methodology. [codalism.com 1] [codalism.com 2]
  - In linguistics, a corpus (plural: corpora) or text corpus is a large and structured set of texts, used for statistical analysis and hypothesis testing, checking occurrences, or validating linguistic rules within a specific language territory. [Wikipedia] The term has been adopted in various other branches of human language technologies.

Public Resources

For ad hoc retrieval, the Text Retrieval Conference (TREC) has organized evaluation tracks since 1992, inviting scientists to compete. Key document collections used:

    Collection   Documents    Size      Words/Doc.  Topics  Words/Query  Jdgmts./Query
    CACM         3,204        2.2 MB    64          64      13.0         16
    AP           242,918      0.7 GB    474         100     4.3          220
    GOV2         25 million   426.0 GB  1,073       150     3.1          180
    ClueWeb09    1 billion    25.0 TB   304         200     2-6          100-200

    CACM:      Communications of the ACM, 1958-1979 (only titles and abstracts)
    AP:        Associated Press newswire documents, 1988-1990
    GOV2:      Crawl of .gov domains, early 2004
    ClueWeb09: Web crawl from 2009

Reusing an existing experimental setup when developing a web search engine allows for direct comparisons with previously evaluated approaches.

Remarks:
  - TREC is organized by the United States National Institute of Standards and Technology (NIST). It has been key in popularizing the laboratory evaluation of search engines, organizing evaluation tracks on many different retrieval-related tasks every year: trec.nist.gov. Ad hoc retrieval has been studied in the ad hoc tracks, the terabyte tracks, and the web tracks.
  - Several similar initiatives have formed, namely CLEF, NTCIR, and FIRE.

Topic Descriptions

    <top>
    <num> Number: 794
    <title> pet therapy
    <desc> Description:
    How are pets or animals used in therapy for humans and what are the benefits?
    <narr> Narrative:
    Relevant documents must include details of how pet or animal-assisted therapy
    is or has been used. Relevant details include information about pet therapy
    programs, descriptions of the circumstances in which pet therapy is used, the
    benefits of this type of therapy, the degree of success of this therapy, and
    any laws or regulations governing it.
    </top>

Remarks:
  - This format has been used for many TREC tracks; it is an invalid form of XML in which the tags within <top> are not closed and their semantics are explicitly repeated in writing. <title>s are typically used as queries.
  - More recent tracks resort to well-formed XML:

        <topic number="265" type="faceted">
          <query>f5 tornado</query>
          <description>what were the ten worst tornadoes in the USA?</description>
          <subtopic number="1" type="inf">what were the ten worst tornadoes in the USA?</subtopic>
          <subtopic number="2" type="inf">where is tornado alley?</subtopic>
          <subtopic number="3" type="inf">what damage can an F5 tornado do?</subtopic>
          <subtopic number="4" type="inf">find information on tornado shelters.</subtopic>
          <subtopic number="5" type="nav">what wind speed defines an F5 tornado?</subtopic>
        </topic>

        <topic number="266" type="single">
          <query>symptoms of heart attack</query>
          <description>what are the symptoms of a heart attack in both men and women?</description>
        </topic>

  - At TREC, usually 50 new topics are provided every year. The topics from previous years can be used for training.
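Assuming the well-formed variant above is stored with all <topic> elements wrapped in a single root element (the file name and wrapper are assumptions of this sketch, not part of the lecture), the topics can be parsed with the Python standard library:

    import xml.etree.ElementTree as ET

    def parse_topics(path):
        """Parse well-formed TREC-style <topic> elements into plain dicts.
        Assumes the file wraps all <topic> elements in one root element."""
        topics = []
        for topic in ET.parse(path).getroot().iter("topic"):
            topics.append({
                "number": topic.get("number"),
                "type": topic.get("type"),
                "query": topic.findtext("query"),
                "description": topic.findtext("description"),
                "subtopics": [s.text for s in topic.findall("subtopic")],
            })
        return topics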

Relevance Judgments

(Example for topic 794, "pet therapy": the slides show retrieved documents alongside the topic description; one is judged irrelevant, another relevant.)

Relevance Judgments

A relevance judgment is a manual assessment of whether a document returned by a search engine for a given query is relevant to a given topic.

Assessment depth
  Assessing every document's relevance for every topic quickly becomes infeasible with growing collection size and growing numbers of topics and (variants of) search engines to be evaluated. Partial assessments are made based on a sampling strategy called pooling.

Assessment scale
  Typically, binary assessments are made, judging documents as relevant or irrelevant. Less often, degrees of relevance are distinguished (e.g., relevant and highly relevant).

Assessor selection and instruction
  Ideally, the persons who supplied the topics also assess relevance, presuming they genuinely perceived the underlying information need. In practice, this usually cannot be achieved. Hence, a topic's description must be sufficiently exhaustive to serve as an instruction.

Assessor agreement
  Humans are unreliable judges; their judgments depend on many outside influences. One cannot expect objective truths from just one assessment per document for a given topic. Multiple assessments yield more reliable judgments, and assessor agreement can be measured.
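In TREC-style campaigns, the collected judgments are commonly distributed as plain-text "qrels" files with one judgment per line: a topic number, an (unused) iteration field, a document ID, and a relevance grade. A minimal reading sketch under that assumption (function name illustrative):

    def read_qrels(path):
        """Read relevance judgments in the common 'topic iteration doc_id relevance'
        format into a dict mapping (topic, doc_id) -> relevance grade."""
        qrels = {}
        with open(path) as f:
            for line in f:
                topic, _iteration, doc_id, relevance = line.split()
                qrels[(topic, doc_id)] = int(relevance)
        return qrels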

Remarks:
  - At TREC, assessors are recruited from retired NIST staff.

Assessment Sampling Strategy: Pooling

Pooling is a sampling strategy for documents retrieved by the to-be-evaluated search engines for a given set of topics. For each topic:

1. Collect the top k results returned by each search engine (variant).
2. Merge the results, omitting duplicates, to obtain a pool of documents.
3. Present the pool of documents in random order to the assessors.

Observations:
  - Presuming a certain correlation between the search engines' results on a topic, only documents considered relevant by at least one search engine need to be analyzed.
  - New retrieval algorithms that are evaluated later may retrieve unjudged documents, requiring new judgments, probably from different assessors.
  - All documents ranked below the threshold k are deemed irrelevant by default, regardless of the truth.
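A minimal sketch of this procedure under the stated assumptions (run results given as ranked lists of document IDs per topic; all names are illustrative):

    import random

    def build_pools(runs, k=100, seed=0):
        """Build judgment pools from ranked runs.
        runs: dict mapping run name -> (dict mapping topic -> ranked list of doc_ids).
        Returns a dict mapping topic -> pooled doc_ids in random order."""
        rng = random.Random(seed)
        topics = {topic for run in runs.values() for topic in run}
        pools = {}
        for topic in topics:
            pool = set()
            for run in runs.values():
                pool.update(run.get(topic, [])[:k])  # top-k results of each system
            pool = list(pool)
            rng.shuffle(pool)                        # hide system and rank information from assessors
            pools[topic] = pool
        return pools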

Measuring Annotator Agreement

Annotator agreement is the degree to which annotators agree with each other across a series of decisions. It is measured in cases where ambiguous or subjective decisions have to be made; relevance is such a subjective notion. If little or no agreement can be measured, the scale may be inappropriate, or the annotators may need to be retrained.

Measuring Annotator Agreement: Kappa Statistics

Given the judgments of two annotators on a given topic, a kappa statistic measures their agreement as follows:

    κ = (p_o - p_e) / (1 - p_e),

where p_o denotes the proportion of agreement observed, and p_e the expected proportion of agreement by chance.

Properties:
  - 1 - p_e denotes the agreement attainable above chance.
  - p_o - p_e denotes the agreement attained above chance.
  - κ = 1 indicates perfect agreement.
  - κ = 0 indicates random agreement.
  - κ < 0 indicates disagreement worse than chance.

Measuring Annotator Agreement: Kappa Statistics (continued)

Suppose A and B are two annotators asked to make n relevance judgments. Then a simple kappa statistic can be computed from their contingency table:

                B: yes   B: no   Total
    A: yes        a        b       c
    A: no         d        e       f
    Total         g        h       n

    p_o = (a + e) / n
    p_e = P(yes)² + P(no)²,   where   P(yes) = (c + g) / (2n),   P(no) = (f + h) / (2n)

Measuring Annotator Agreement: Kappa Statistics (example)

Suppose A and B are two annotators asked to make n = 400 relevance judgments:

                B: yes   B: no   Total
    A: yes       300      20      320
    A: no         10      70       80
    Total        310      90      400

    p_o = (300 + 70) / 400
    p_e = P(yes)² + P(no)²,   where   P(yes) = (320 + 310) / (2 · 400),   P(no) = (80 + 90) / (2 · 400)
    κ ≈ 0.776
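This computation can be reproduced with a few lines of code (function name and argument order are illustrative; the cell counts are taken from the table above):

    def kappa_2x2(a, b, d, e):
        """Kappa for two annotators and two categories, given the cell counts
        a = both yes, b = A yes/B no, d = A no/B yes, e = both no."""
        c, f = a + b, d + e              # row totals (annotator A)
        g, h = a + d, b + e              # column totals (annotator B)
        n = c + f                        # total number of judgments
        p_o = (a + e) / n                                    # observed agreement
        p_yes, p_no = (c + g) / (2 * n), (f + h) / (2 * n)   # pooled category probabilities
        p_e = p_yes ** 2 + p_no ** 2                         # chance agreement
        return (p_o - p_e) / (1 - p_e)

    print(round(kappa_2x2(300, 20, 10, 70), 3))  # prints 0.776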

Remarks:
  - Well-known kappa statistics include Cohen's κ, Scott's π, and Fleiss' κ. Scott's π is the one exemplified here. Fleiss' κ is a generalization of Scott's π to arbitrary numbers of annotators and categories; it also does not presume that all cases have been annotated by the same group of people.
  - Presuming that raters A and B work independently, the probability P(yes)² (respectively P(no)²) denotes the probability of both voting yes (respectively no) by chance. Another way of computing p_e is to sum, over the categories, the product of the rater-specific probabilities of each rater voting for that category.
  - Some assign the following interpretations to measured κ values (disputed):

        κ            Agreement
        < 0          poor
        0.01-0.20    slight
        0.21-0.40    fair
        0.41-0.60    moderate
        0.61-0.80    substantial
        0.81-1.00    almost perfect

    [Wikipedia]
  - Within TREC evaluations, typically a substantial agreement (κ ∈ [0.67, 0.8]) is achieved. [Manning 2008]