Machine Learning and Event Detection for the Public Good

Similar documents
Machine Learning for Population Health and Disease Surveillance

A Pre-Syndromic Surveillance Approach for Early Detection of Novel and Rare Disease Outbreaks

Event and Pattern Detection for Urban Systems Daniel B. Neill H.J. Heinz III College Carnegie Mellon University

Visual and Decision Informatics (CVDI)

Abstract. But what happens when we apply GIS techniques to health data?

A Common-Sense Framework for Assessing Information-Based Counterterrorist Programs

A Comparison of Collaborative Filtering Methods for Medication Reconciliation

STARK COUNTY INFLUENZA SNAPSHOT, WEEK 06 Week ending February 11, 2012, with updates through 02/20/2012.

Infrastructure and Methods to Support Real Time Biosurveillance. Kenneth D. Mandl, MD, MPH Children s Hospital Boston Harvard Medical School

Novel Machine Learning Methods for Public Health and Disease Surveillance

Syndromic Surveillance in Public Health Practice

Syndromic Surveillance: Early CBRN Attack Detection by Computerized Medical Record Surveillance in Grey Bruce

Social Determinants of Health

CDR Kimberly Elenberg, Office of Food Defense and Emergency Response, Food Safety Inspection Service, United States Department of Agriculture

Influenza : What is going on? How can Community Health Centers help their patients?

Using Predictive Analytics to Save Lives

TABLE OF CONTENTS. Peterborough County-City Health Unit Pandemic Influenza Plan Section 1: Introduction

For questions, or to receive this report weekly by , send requests to either or

Basic Models for Mapping Prescription Drug Data

INFLUENZA FACTS AND RESOURCES

PANDEMIC INFLUENZA PLAN

STARK COUNTY INFLUENZA SNAPSHOT, WEEK 10 Week ending March 10, 2012, with updates through 03/19/2012.

2013 NEHA AEC July 9 11, One Health and All-Hazards: the New Environmental Health. Want this on you?

VULNERABILITY AND EXPOSURE TO CRIME: APPLYING RISK TERRAIN MODELING

Agent-Based Models. Maksudul Alam, Wei Wang

CRIMINAL JUSTICE (CJ)

Lessons from the 2009 Swine Flu Pandemic, Avian Flu, and their Contribution to the Conquest of Induced and Natural Pandemics

GOVERNMENT OF ALBERTA. Alberta s Plan for Pandemic Influenza

Public Health Resources: Core Capacities to Address the Threat of Communicable Diseases

An Overview of Syndromic Surveillance

B - Planning for a Turkey Flu Pandemic

2009 / 2010 H1N1 FAQs

Ohio Department of Health Seasonal Influenza Activity Summary MMWR Week 3 January 13-19, 2013

THE STATE UNIVERSITY OF NEW YORK. Assembly Standing Committees on Health, Labor, Education, Higher Education, and the Subcommittee on Workplace Safety

A. No. There are no current reports of avian influenza (bird flu) in birds in the U.S.

Thank You to Our Partners. Law Enforcement Fire Services Emergency Management Public Health Behavioral/Mental Health

Summary of National Action Plan for Pandemic Influenza and New Infectious Diseases

Improving surveillance and early detection of outbreaks. Frank M. Aarestrup

What do epidemiologists expect with containment, mitigation, business-as-usual strategies for swine-origin human influenza A?

Ohio Department of Health Seasonal Influenza Activity Summary MMWR Week 7 February 10-16, 2013

Challenges for U.S. Attorneys Offices (USAO) in Opioid Cases

Influenza Pandemic: Overview of Ops Response. Ministry of Health SINGAPORE

Pandemic and Data- Future Preparedness

Criminal Justice (CJUS)

Hacettepe University Department of Computer Science & Engineering


Spatial and Temporal Data Fusion for Biosurveillance

STARK COUNTY INFLUENZA SNAPSHOT, WEEK 15 Week ending 18 April, With updates through 04/26/2009.

A conversation with Michael Osterholm on July 30, 2013 about pandemics

Peterborough County-City Health Unit Pandemic Influenza Plan Section 1: Background

U.S. Human Cases of Swine Flu Infection (As of April 29, 2009, 11:00 AM ET)

Devon Community Resilience. Influenza Pandemics. Richard Clarke Emergency Preparedness Manager Public Health England South West Centre

Ohio Department of Health Seasonal Influenza Activity Summary MMWR Week 14 April 3 9, 2011

Enhanced Surveillance Methods and Applications: A Local Perspective

Local Government Pandemic Influenza Planning. Mac McClendon, Chief / Office of Public Health Preparedness Emergency Management Coordinator

Conflict of Interest and Disclosures. Research funding from GSK, Biofire

Influenza Antiviral Planning Department of Veterans Affairs

Guideline for Antiviral Drugs

Ohio Department of Health Seasonal Influenza Activity Summary MMWR Week 14 April 1-7, 2012

DRAFT WGE WGE WGE WGE WGE WGE WGE WGE WGE WGE WGE WGE WGE WGE GETREADYNOWGE GETREADYNOWGE GETREADYNOWGE GETREADYNOWGE.

Research on Social Psychology Based on Network Big Data

Statistical modeling for prospective surveillance: paradigm, approach, and methods

USING SOCIAL MEDIA TO STUDY PUBLIC HEALTH MICHAEL PAUL JOHNS HOPKINS UNIVERSITY MAY 29, 2014

FREQUENTLY ASKED QUESTIONS SWINE FLU

Guideline for the Surveillance of Pandemic Influenza (From Phase 4 Onwards)

Pandemic Influenza A Matter of Time

Harold Rogers Update Melissa McPheeters, PhD, MPH

Global Health Security and Domestic Preparedness:

Chapter 1. Introduction

Yugo Shobugawa, MD, PhD med.niigata-u.ac.jp

County of Los Angeles Department of Health Services Public Health

Obama declares H1N1 national emergency

Texting4Health Conference February 29, InSTEDD Proprietary Level I

8. Public Health Measures

Tracking Disease Outbreaks using Geotargeted Social Media and Big Data

Innovation. SAGA Project. June 2010 Saga Prefectural Government, Japan

Presentation to The National Association of Sentencing Commissions Annual Conference August 28, 2017

How many students at St. Francis Preparatory School in New York City have become ill or been confirmed with swine flu?

Business continuity planning for pandemic influenza

Module 1 : Influenza - what is it and how do you get it?

Introduction to Deep Reinforcement Learning and Control

Asthma Surveillance Using Social Media Data

Overview of Influenza Surveillance in Korea

FORECASTING THE DEMAND OF INFLUENZA VACCINES AND SOLVING TRANSPORTATION PROBLEM USING LINEAR PROGRAMMING

ABSTRACT I. INTRODUCTION. Mohd Thousif Ahemad TSKC Faculty Nagarjuna Govt. College(A) Nalgonda, Telangana, India

IS THE UK WELL PREPARED FOR A REPEAT OF THE 1918 INFLUENZA PANDEMIC?

Westchester County Community Health Electronic Syndromic Surveillance (CHESS)

Telehealth Data for Syndromic Surveillance

Emerging Infectious Disease Threats. Margaret A. Hamburg M.D. Foreign Secretary, U.S. National Academy of Medicine

Detection of Informative Disjunctive Patterns in Support of Clinical

Detecting and monitoring foodborne illness outbreaks: Twitter communications and the 2015 U.S. Salmonella outbreak linked to imported cucumbers

Facing the Opioid Epidemic (FOE): Assessing and Responding to Prescription and Illicit Opioid Use and Misuse in 5 New England States

Emergency Preparedness at General Mills

SARS Outbreak Study 2

SWINE FLU: FROM CONTAINMENT TO TREATMENT

Module 4: Emergencies: Prevention, Preparedness, Response and Recovery

Influenza: The Threat of a Pandemic

Elizabeth Hinson ID Homework #2

Statement of research interest

Transcription:

Machine Learning and Event Detection for the Public Good Daniel B. Neill, Ph.D. H.J. Heinz III College Carnegie Mellon University E-mail: neill@cs.cmu.edu We gratefully acknowledge funding support from the National Science Foundation, grants IIS-0916345, IIS-0911032, and IIS-0953330.

Daniel B. Neill (neill@cs.cmu.edu) Assistant Professor of Information Systems, Heinz College Courtesy Assistant Professor of Machine Learning and Robotics, SCS My research has two main goals: to develop new machine learning methods for automatic detection of events and other patterns in massive datasets, and to apply these methods to improve the quality of public health, safety, and security. Customs monitoring: detecting patterns of illicit container shipments Biosurveillance: early detection of emerging outbreaks of disease Law enforcement: detection and prediction of crime hot-spots Our methods could have detected the May 2000 Walkerton E. coli outbreak two days earlier than the first public health response. We are able to accurately predict emerging clusters of violent crime 1-3 weeks in advance by detecting clusters of more minor leading indicator crimes. 2010 Carnegie Mellon University

Daniel B. Neill (neill@cs.cmu.edu) Assistant Professor of Information Systems, Heinz College Courtesy Assistant Professor of Machine Learning and Robotics, SCS My research has two main goals: to develop new machine learning methods for automatic detection of events and other patterns in massive datasets, and to apply these methods to improve the quality of public health, safety, and security. Customs monitoring: detecting patterns of illicit container shipments Biosurveillance: early detection of emerging outbreaks of disease Law enforcement: detection and prediction of crime hot-spots Our methods are currently in use for deployed biosurveillance systems in Ottawa and Grey-Bruce, Ontario; several other projects are underway. We collaborate directly with the Chicago Police Department, and our CrimeScan software is already in day-to-day operational use for predictive policing. 2010 Carnegie Mellon University

Why study machine learning? Critical importance of addressing global policy problems: disease pandemics, crime, terrorism, poverty, environment Increasing size and complexity of available data, thanks to the rapid growth of new and transformative technologies. Much more computing power, and scalable data analysis methods, enable us to extract actionable information from all of this data. Machine learning techniques have become increasingly essential for policy analysis, and for the development of new, practical information technologies that can be directly applied for the public good (e.g. public health, safety, and security)

Some definitions Machine Learning (ML) is the study of systems that improve their performance with experience (typically by learning from data). Artificial Intelligence (AI) is the science of automating complex behaviors such as learning, problem solving, and decision making. Data Mining (DM) is the process of extracting useful information from massive quantities of complex data. I would argue that these are not three distinct fields of study! While each has a slightly different emphasis, there is a tremendous amount of overlap in the problems they are trying to solve and the techniques used to solve them. Many of the techniques we will learn are statistical in nature, but are very different from classical statistics. ML/AI/DM systems and methods: Scale up to large, complex data Learn and improve from experience Perceive and change the environment Interact with humans or other agents Explain inferences and decisions Discover new and useful patterns

How is ML relevant for policy? ML provides a powerful set of tools for intelligent problem-solving. Building sophisticated models that combine data and prior knowledge to enable intelligent decisions. Automating tasks such as prediction and detection to reduce human effort. Scaling up to large, complex problems by focusing user attention on relevant aspects. Using ML to analyze data and guide policy decisions. Proposing policy initiatives to reduce the amount and impact of violent crime in urban areas. Predicting the adoption rate of new technology in developing countries. Analyzing which factors influence congressional votes or court decisions Using ML in information systems to improve public services Health care: diagnosis, drug prescribing Law enforcement: crime forecasting Public health: epidemic detection/response Urban planning: optimizing facility location Homeland security: detecting terrorism Analyzing impacts of ML technology adoption on society Internet search and e-commerce Data mining (security vs. privacy) Automated drug discovery Industrial and companion robots Ethical and legal issues

Advertisement: MLP@CMU We are working to build a comprehensive curriculum in machine learning and policy (MLP) here at CMU. Goals of the MLP initiative: increase collaboration between ML and PP researchers, train new researchers with deep knowledge of both areas, and encourage a widely shared focus on using ML to benefit the public good. Here are some of the many ways you can get involved: Joint Ph.D. Program in Machine Learning and Public Policy (MLD & Heinz) Ph.D. in Information Systems + M.S. in Machine Learning Large Scale Data Analysis for Policy: introduction to ML for PPM students. Research Seminar in Machine Learning & Policy: for ML/Heinz Ph.D. students. Special Topics in Machine Learning and Policy: Event and Pattern Detection, ML for Developing World, Harnessing the Wisdom of Crowds Workshop on Machine Learning and Policy Research & Education Research Labs: Event and Pattern Detection Lab, Auton Laboratory, ilab Center for Science and Technology in Human Rights, many others

LSDA course description This course will focus on applying large scale data analysis methods from the closely related fields of machine learning, data mining, and artificial intelligence to develop tools for intelligent problem solving in real-world policy applications. We will emphasize tools that can scale up to real-world problems with huge amounts of high-dimensional and multivariate data. Mountain of policy data Huge, unstructured, hard to interpret or use for decisions 1. Translate policy questions into ML paradigms. 2. Choose and apply appropriate methods. 3. Interpret, evaluate, and use results. Actionable knowledge of policy domain Predict & explain unknown values Model structures, relations Detect relevant patterns Use for decision-making, policy prescriptions, improved services

LSDA course syllabus Introduction to Large Scale Data Analysis Incorporates methods from machine learning, data mining, artificial intelligence. Goals, problem paradigms, and software tools (e.g. Weka) Module I (Prediction) Classification and regression (making, explaining predictions) Rule-based, case-based, and model-based learning. Module II (Modeling) Representation and heuristic search Clustering (modeling group structure of data) Bayesian networks (modeling probabilistic relationships) Module III (Detection) Anomaly Detection (detecting outliers, novelties, etc.) Pattern Detection (e.g. event surveillance, anomalous patterns) Applications to biosurveillance, crime prevention, etc. Guest mini-lectures from the Event and Pattern Detection Lab.

Common ML paradigms: prediction In prediction, we are interested in explaining a specific attribute of the data in terms of the other attributes. Classification: predict a discrete value What disease does this patient have, given his symptoms? Regression: estimate a numeric value How is a country s literacy rate affected by various social programs? Two main goals of prediction Guessing unknown values for specific instances (e.g. diagnosing a given patient) Explaining predictions of both known and unknown instances (providing relevant examples, a set of decision rules, or class-specific models). Example 1: What socio-economic factors lead to increased prevalence of diarrheal illness in a developing country? Example 2: Developing a system to diagnose a patient s risk of diabetes and related complications, for improved medical decision-making.

Common ML paradigms: modeling In modeling, we are interested in describing the underlying relationships between many attributes and many entities. Our goal is to produce models of the entire data (not just specific attributes or examples) that accurately reflect underlying complexity, yet are simple, understandable by humans, and usable for decision-making. Relations between entities Identifying link, group, and network structures Partitioning or clustering data into subgroups Relations between variables Identifying significant positive and negative correlations Visualizing dependence structure between multiple variables Example 1: Can we visualize the dependencies between various diet-related risk factors and health outcomes? Example 2: Can we better explain consumer purchasing behavior by identifying subgroups and incorporating social network ties?

Common ML paradigms: detection In detection, we are interested in identifying relevant patterns in massive, complex datasets. Main goal: focus the user s attention on a potentially relevant subset of the data. a) Automatically detect relevant individual records, or groups of records. b) Characterize and explain the pattern (type of pattern, H 0 and H 1 models, etc.) Some common detection tasks Detecting anomalous records or groups Discovering novelties (e.g. new drugs) Detecting clusters in space or time Removing noise or errors in data Detecting specific patterns (e.g. fraud) c) Present the pattern to the user. Detecting emerging events which may require rapid responses. Example 1: Detect emerging outbreaks of disease using electronic public health data from hospitals and pharmacies. Example 2: How common are patterns of fraudulent behavior on various e-commerce sites, and how can we deal with online fraud?

What is disease surveillance? The systematic collection and analysis of data for the purpose of detecting outbreaks of disease in people, plants, or animals. Primary goal: timely and accurate detection and characterization of an outbreak. Is there an outbreak? If so, what type of outbreak, and where/who is affected? End goal: enable public health to make rapid and informed decisions to prevent and control outbreaks. Treatment Vaccination Health advisories Cleanup 2011 Carnegie Mellon University Quarantines Travel restrictions

Why worry about disease outbreaks? Bioterrorist attacks are a very real, and scary, possibility 100 kg anthrax, released over D.C., could kill 1-3 million and hospitalize millions more. Emerging infectious diseases Conservative estimate of 2-7 million deaths from pandemic avian influenza. Better response to common outbreaks (seasonal flu, GI) 14

Benefits of early detection Reduces cost to society, both in lives and in dollars! Pre-symptomatic treatment, 1% mortality rate Post-symptomatic treatment, 40% mortality rate Without treatment, 95% mortality rate incubation stage 1 stage 2 Day 0 Day 4 Day 10 Exposure to inhalational anthrax Flu-like symptoms: headache, cough, fever Acute respiratory distress, high fever, shock, death DARPA estimate: a two-day gain in detection time and public health response could reduce fatalities by a factor of six. 15

Benefits of early detection Reduces cost to society, both in lives and in dollars! Pre-symptomatic treatment, 1% mortality rate Post-symptomatic treatment, 40% mortality rate Without treatment, 95% mortality rate incubation stage 1 stage 2 Day 0 Day 4 Day 10 Exposure to inhalational anthrax Flu-like symptoms: headache, cough, fever Acute respiratory distress, high fever, shock, death Improvements of even an hour over current detection capabilities could reduce economic impact of a bioterrorist anthrax attack by hundreds of millions of dollars. 16

Early detection is hard Start of symptoms Lag time Definitive diagnosis incubation stage 1 stage 2 Day 0 Day 4 Day 10 Buys OTC drugs Skips work/school Uses Google, Facebook, Twitter Visits doctor/hospital/ed 17

Syndromic surveillance Start of symptoms Definitive diagnosis incubation stage 1 stage 2 Day 0 Day 4 Day 10 Buys OTC drugs? Cough medication sales in affected area Days after attack 18

Syndromic surveillance Start of symptoms Definitive diagnosis We can achieve very early detection of outbreaks incubation by gathering stage syndromic 1 data, stage and 2identifying emerging spatial clusters of symptoms. Day 0 Day 4 Day 10 Buys OTC drugs? Cough medication sales in affected area Days after attack 19

A recent potential outbreak Spike in sales of pediatric electrolytes near Columbus, Ohio

Under the hood: how does it work? Finding emerging spatial clusters in a health data stream. This increase could be due to an outbreak, or due to chance. Daily over-the-counter sales of cough/cold medication, for each of over 20,000 zip codes nationwide. Time series of counts for each zip code (at least 3 months of historical data). Which increases are significant? Our solution 1. Infer the expected count for each zip code for each recent day. 2. Find regions where the recent counts are higher than expected. We want to be able to detect outbreaks whether they affect a small or large region, and whether they emerge quickly or gradually. Solution: the space-time scan statistic.

The space-time scan statistic (Kulldorff, 2001; Neill & Moore, 2005) To detect and localize events, we can search for space-time regions where the number of cases is higher than expected. Imagine moving a window around the scan area, allowing the window size, shape, and temporal duration to vary. 22

The space-time scan statistic (Kulldorff, 2001; Neill & Moore, 2005) To detect and localize events, we can search for space-time regions where the number of cases is higher than expected. Imagine moving a window around the scan area, allowing the window size, shape, and temporal duration to vary. 23

The space-time scan statistic (Kulldorff, 2001; Neill & Moore, 2005) To detect and localize events, we can search for space-time regions where the number of cases is higher than expected. Imagine moving a window around the scan area, allowing the window size, shape, and temporal duration to vary. 24

The space-time scan statistic (Kulldorff, 2001; Neill & Moore, 2005) To detect and localize events, we can search for space-time regions where the number of cases is higher than expected. Imagine moving a window around the scan area, allowing the window size, shape, and temporal duration to vary. For each of these regions, we examine the aggregated time series, and compare actual to expected counts. Time series of past counts Actual counts of last 3 days Expected counts of last 3 days 25

The space-time scan statistic (Kulldorff, 2001; Neill & Moore, 2005) Not significant (p =.098) 2nd highest score = 8.4 We find the highest-scoring space-time regions, where the score of a region is computed by the likelihood ratio statistic. Maximum region score = 9.8 Significant! (p =.013) F( S) Pr(Data H 1( S)) Pr(Data H 0) Alternative hypothesis: outbreak in region S Null hypothesis: no outbreak These are the most likely clusters but how can we tell whether they are significant? Answer: compare to the maximum region scores of simulated datasets under H 0. F 1 * = 2.4 F 2 * = 9.1 F 999 * = 7.0

The space-time scan statistic (Kulldorff, 2001; Neill & Moore, 2005) Not significant (p =.098) 2nd highest score = 8.4 Recent advances in analytical methods for event detection enable us to: Maximum region score = 9.8 Significant! (p =.013) Integrate information from multiple streams Distinguish between multiple event types Scale up to many locations and streams Search over irregularly-shaped clusters Consider graph and non-spatial constraints These are the most likely clusters but how can we tell whether they are significant? Answer: compare to the maximum region scores of simulated datasets under H 0. F 1 * = 2.4 F 2 * = 9.1 F 999 * = 7.0

A sampling of current projects Integrating Learning and Detection Incorporate user feedback, distinguish relevant from irrelevant anomalies Automatic Contact Tracing Use cell phone location and proximity data to detect outbreaks and identify where and who is affected. Population Health Surveillance Move beyond outbreak detection, to monitor chronic disease, injury, crime, violence, drug abuse, patient care, etc.

Interested? More details on my web page: http://www.cs.cmu.edu/~neill Or e-mail me at: neill@cs.cmu.edu