Novel Machine Learning Methods for Public Health and Disease Surveillance

Similar documents
A Pre-Syndromic Surveillance Approach for Early Detection of Novel and Rare Disease Outbreaks

Event and Pattern Detection for Urban Systems Daniel B. Neill H.J. Heinz III College Carnegie Mellon University

Machine Learning for Population Health and Disease Surveillance

Machine Learning and Event Detection for the Public Good

Highlights. NEW YORK CITY DEPARTMENT OF HEALTH AND MENTAL HYGIENE Influenza Surveillance Report Week ending January 28, 2017 (Week 4)

An Overview of Syndromic Surveillance

Influenza : What is going on? How can Community Health Centers help their patients?

Jonathan D. Sugimoto, PhD Lecture Website:

Detecting Anomalous Patterns of Care Using Health Insurance Claims

New York City Mental Health Syndromic Surveillance System

STARK COUNTY INFLUENZA SNAPSHOT, WEEK 15 Week ending 18 April, With updates through 04/26/2009.

Human Cases of Swine Influenza in California, Kansas, New York City, Ohio, Texas, and Mexico Key Points April 26, 2009

Bureau of Emergency Medical Services New York State Department of Health

Influenza Surveillance Report Saint Louis County Department of Public Health Week Ending 05/20/2018 Week 20

Folks: The attached information is just in from DOH. The highlights:

Asthma Surveillance Using Social Media Data

UNDERSTANDING THE CORRECT ANSWERS immunize.ca

For questions, or to receive this report weekly by , send requests to either or

Cold & Flu Information

General Questions and Answers on 2009 H1N1 Influenza Vaccine Safety

Westchester County Community Health Electronic Syndromic Surveillance (CHESS)

Tri-County Opioid Safety Coalition Data Brief March 2018 Clackamas, Multnomah, and Washington Counties

August 26, 2009 Florida Flu Information Line

Ontario Novel H1N1 Influenza A Virus Epidemiologic Summary June 4, 2009 As of 8:30am, June 4, 2009

FUNNEL: Automatic Mining of Spatially Coevolving Epidemics

Enhanced Surveillance Methods and Applications: A Local Perspective

Storm Warnings Planning. Outline. Biosurveillance Using EpiCenter in New Jersey, Biosurveillance Using EpiCenter in New Jersey 1

COMMITMENT TO COMMUNITY

NC DETECT Stroke Advisory Council November 1, 2018

Swine Influenza (H1N1) precautions being taken in Europe No U.S. military travel advisories issued yet

Sleep History Questionnaire

Influenza Surveillance Report Saint Louis County Department of Public Health Week Ending 01/20/2019 Week 3

STARK COUNTY INFLUENZA SNAPSHOT, WEEK 10 Week ending March 10, 2012, with updates through 03/19/2012.

December 1-December 7, 2013 (MMWR Week 49)

September 28-October 4, 2014 (MMWR Week 40) Highlights

Telehealth Data for Syndromic Surveillance

How many students at St. Francis Preparatory School in New York City have become ill or been confirmed with swine flu?

Tiredness/Fatigue Mild Moderate to severe, especially at onset of symptoms Head and Body Aches and Pains

Florida Department of Health - Polk County Weekly Morbidity Report - Confirmed and Probable cases * Week #9 (through March 3, 2018)

*521634* Sleep History Questionnaire. Name of primary care doctor:

Using and Improving EARS for Local Public Health Biosurveillance

STARK COUNTY INFLUENZA SNAPSHOT, WEEK 06 Week ending February 11, 2012, with updates through 02/20/2012.

Pandemic Influenza: Global and Philippine Situation

C o u g h s n e e z i n g h o t f l a s r u n n y n o

Situation update pandemic (H1N1) 2009

TABLE OF CONTENTS. Peterborough County-City Health Unit Pandemic Influenza Plan Section 1: Introduction

SURVEILLANCE, MONITORING ABSENTEEISM

STATISTICS & PROBABILITY

Data at a Glance: February 8 14, 2015 (Week 6)

A Guide for Parents. Protect your child. What parents should know. Flu Information The Flu:

2009 H1N1 (Pandemic) virus IPMA September 30, 2009 Anthony A Marfin

Guidance for Influenza in Long-Term Care Facilities

Enhanced State Surveillance of Opioid Involved Morbidity and Mortality (CE ) Opioid Surveillance in Missouri- A New Chapter

Visual and Decision Informatics (CVDI)

Buy The Complete Version of This Book at Booklocker.com:

December 8-December 14, 2013 (MMWR Week 50)

CDR Kimberly Elenberg, Office of Food Defense and Emergency Response, Food Safety Inspection Service, United States Department of Agriculture

Pandemic Influenza A Matter of Time

Human infection with pandemic (H1N1) 2009 virus: updated interim WHO guidance on global surveillance

Making the Best Use of Textual ED Data for Syndromic Surveillance

Influenza Fact Sheet

How you can help support the Beat Flu campaign

Information collected from influenza surveillance allows public health authorities to:

County-Wide Pandemic Influenza Preparedness & Response Plan

Swine Flu; Symptoms, Precautions & Treatments

Ohio Department of Health Seasonal Influenza Activity Summary MMWR Week 14 April 3 9, 2011

NEW PATIENT APPOINTMENT UROLOGY

Data at a Glance: Dec 28, 2014 Jan 3rd, 2015 (Week 53)

Pandemic H1N1 2009: The Public Health Perspective. Massachusetts Department of Public Health November, 2009

Swine Influenza (Flu) Notification Utah Public Health 4/30/2009

Infectious Diseases Florida Department of Health, HIV/AIDS Section/ Division of Disease Control and Health Protection Tallahassee, Florida

MARYLAND DEPARTMENT OF HEALTH AND MENTAL HYGIENE John M. Colmers, Secretary

This is a repository copy of Measuring the effect of public health campaigns on Twitter: the case of World Autism Awareness Day.

But, North Carolina must be ready.

Training Your Caregiver: Flu Prevention and Treatment for Disabled and the Elderly

Help protect your child. At-a-glance guide to childhood vaccines.

Help protect your child. At-a-glance guide to childhood vaccines.

Syndromic Surveillance in Public Health Practice

2009 / 2010 H1N1 FAQs

Dr. Cristina Gutierrez, Laboratory Director, CARPHA SARI/ARI SURVEILLANCE IN CARPHA MEMBER STATES

Influenza Surveillance Report Saint Louis County Department of Public Health Week Ending 12/30/2018 Week 52

Peterborough County-City Health Unit Pandemic Influenza Plan Section 1: Background

Coach on Call. Thank you for your interest in Deciding to Get the Flu Vaccine. I hope you find this tip sheet helpful.

October 5-October 11, 2014 (MMWR Week 41)

Influenza B viruses are not divided into subtypes, but can be further broken down into different strains.

QHSE Campaign- Health

Utilizing syndromic surveillance data from ambulatory care settings in NYC

2014/2015 Ottawa County Influenza Surveillance Summary

THE STATE UNIVERSITY OF NEW YORK. Assembly Standing Committees on Health, Labor, Education, Higher Education, and the Subcommittee on Workplace Safety

INFLUENZA (FLU) Cleaning to Prevent the Flu

Georgia Weekly Influenza Report

U.S. Human Cases of Swine Flu Infection (As of April 29, 2009, 11:00 AM ET)

COUNSELING CARDS FOR IMMUNIZATIONS

Georgia Weekly Influenza Report

Modelling Spatially Correlated Survival Data for Individuals with Multiple Cancers

VCU CENTER FOR SLEEP MEDICINE NEW PATIENT QUESTIONNAIRE

P2 P7 SCN 1-13a HWB 1-15a, 2-15a HWB 1-16a, 2-16a HWB 1-17a, 2-17a Unit of Study Unit 6 Micro-organisms Estimated Teaching Time 50 minutes

What is Swine Flu (800)

What is the Flu? The Flu is also called Influenza (In-flu-en-za) It is caused by an infection of the. Nose Throat And lungs

Transcription:

Novel Machine Learning Methods for Public Health and Disease Surveillance Daniel B. Neill, Ph.D. Carnegie Mellon University (H.J. Heinz III College) New York University (Center for Urban Science & Progress) E-mail: neill@cs.cmu.edu We gratefully acknowledge funding support from the National Science Foundation, grant IIS-0953330, and MacArthur Foundation.

What can Machine Learning do for public health surveillance? Earlier, more accurate detection of emerging patterns in space and time. Better estimation of what we expect to see, and better detection of deviations from expected (more power, fewer false positives). Integrating multiple data streams; incorporating prior information. Tracking and source-tracing outbreak spread.

What can Machine Learning do for public health surveillance? Today s talk will focus on two specific examples of how ML expands the scope and scale of data sources that can be used for surveillance: Structuring unstructured data, e.g., free text from ED chief complaints. Incorporating complex data sources such as online social media.

Early outbreak detection (syndromic) Outbreak detection Spatial time series data from spatial locations s i (e.g. zip codes) Time series of counts c i,mt for each zip code s i for each data stream d m. d 1 = respiratory ED d 2 = constitutional ED d 3 = OTC cough/cold d 4 = OTC anti-fever (etc.) Three main goals of syndromic surveillance Detect any emerging events Pinpoint the affected subset of locations and time duration Characterize the event by identifying the affected subpopulation 4

Early outbreak detection (syndromic) Outbreak detection d 1 = respiratory ED d 2 = constitutional ED Spatial time series data from spatial locations s i (e.g. zip codes) Time series of counts c i,mt for each zip code s i for each data stream d m. d 3 = OTC cough/cold d 4 = OTC anti-fever (etc.) Recent spatial and subset scanning approaches can accurately and efficiently find the most anomalous clusters of disease, by maximizing a likelihood ratio statistic over subsets. Pr(Data H 1(D,S,P,W)) F(D,S,P,W) = Pr(Data H 0) Compare hypotheses: H 1 (D, S, P, W) D = subset of streams S = subset of locations P = subpopulation W = time duration vs. H 0 : no events occurring 5

Pre-syndromic surveillance Date/time Hosp. Age Complaint Jan 1 08:00 A 19-24 runny nose Jan 1 08:15 B 10-14 fever, chills Jan 1 08:16 A 0-1 broken arm Jan 2 08:20 C 65+ vomited 3x Jan 2 08:22 A 45-64 high temp Free-text ED chief complaint data from hospitals in New York City, North Carolina, and Allegheny County, Pennsylvania. Key challenge: A syndrome cannot be created to identify every possible cluster of potential public health significance. A method is needed to identify relevant clusters of disease cases without pre-classification into syndromes. Use case proposed by NC and NYC health depts. Solution requirements developed through a public health consultancy at the International Society for Disease Surveillance.

Where do existing methods fail? The typical syndromic surveillance approach can effectively detect emerging outbreaks with commonly seen, general patterns of symptoms (e.g. ILI). If we were monitoring these particular symptoms, it would only take a few such cases to realize that an outbreak is occurring! What happens when something new and scary comes along? - More specific symptoms ( coughing up blood ) - Previously unseen symptoms ( nose falls off ) Mapping specific chief complaints to a broader symptom category can dilute the outbreak signal, delaying or preventing detection.

Where do existing methods fail? The typical syndromic surveillance approach can effectively detect emerging outbreaks with commonly seen, general patterns of symptoms (e.g. ILI). What happens when something new and scary comes along? - More specific symptoms ( coughing up blood ) - Previously unseen symptoms ( nose falls off ) Our solution is to combine text-based (topic modeling) and event detection (multidimensional scan) approaches, to detect emerging patterns of keywords. If we were monitoring these particular symptoms, it would only take a few such cases to realize that an outbreak is occurring! Mapping specific chief complaints to a broader symptom category can dilute the outbreak signal, delaying or preventing detection.

The semantic scan statistic Date/time Hosp. Age Complaint Jan 1 08:00 A 19-24 runny nose Jan 1 08:15 B 10-14 fever, chills Jan 1 08:16 A 0-1 broken arm Jan 2 08:20 C 65+ vomited 3x Jan 2 08:22 A 45-64 high temp Bayesian inference using LDA model Topics Topic prior β Φ 1 Φ K α θ 1 θ N w ij Case prior Distribution over topics per case Observed words Classify cases to topics φ 1 : vomiting, nausea, diarrhea, φ 2 : dizzy, lightheaded, weak, φ 3 : cough, throat, sore, Now we can do a multidimensional scan, using the learned topics instead of pre-specified syndromes! Time series of hourly counts for each hospital and age group, for each topic φ j.

For each hour of data: For each combination S of: Hospital Time duration Age range Topic Multidimensional scanning Count: C(S) = # of cases in that time interval matching on hospital, age, topic. Baseline: B(S) = expected count (28-day moving average). Score: F(S) = C log (C/B) + B C, if C > B, and 0 otherwise (using the expectation-based Poisson log-likelihood ratio statistic) We return cases corresponding to each top-scoring subset S.

Simulation results (3 yrs. of data from 13 Allegheny County, PA hospitals) Semantic scan detected simulated novel outbreaks more than twice as quickly as the standard syndrome-based method: 5.3 days vs. 10.9 days to detect at 1 false positive per month. green nose possible color greenish nasal Simulated novel outbreak: green nose Top words from detected topic

NC DOH evaluation results We compared the top 500 clusters found by semantic scan and a keyword-based scan on data provided by the NC DOH in a blinded evaluation, with DOH labeling each cluster as relevant or not relevant. Semantic scan: for 10 true clusters, had to report 12; for 30 true clusters, had to report 54. Keyword scan: for 10 true clusters, had to report 21; for 30 true clusters, had to report 83.

NYC DOHMH dataset New York City s Department of Health and Mental Hygiene provided us with 5 years of data (2010-2014) consisting of ~20M chief complaint cases from 50 hospitals in NYC. For each case, we have data on the patient s chief complaint (free text), date and time of arrival, age group, gender, and discharge ICD-9 code. Substantial pre-processing of the chief complaint field was necessary because of size and messiness of data (typos, abbreviations, etc.). Standardized using the Emergency Medical Text Processor (EMTP) developed by Debbie Travers and colleagues at UNC. Spell checker for typo correction. If ICD-9 code in chief complaint field, convert to corresponding text.

Events identified by semantic scan The progression of detected clusters after Hurricane Sandy impacted NYC highlights the variety of strains placed on hospital emergency departments following a natural disaster: Acute cases: falls, SOB, leg Injuries Mental health disturbances: depression, anxiety Burden on medical infrastructure: methadone, dialysis Many other events of public health interest were identified: Accidents Motor vehicle Ferry School bus Elevator Contagious Diseases Meningitis Scabies Ringworm Other Drug overdoses Smoke inhalation Carbon monoxide poisoning Crime related, e.g., pepper spray attacks

Example of a detected cluster Arrival Date Arrival Time Hospital ID Chief Complaint Patient Sex Patient Age 11/28/2014 7:52:00 HOSP5 EVAUATION, DRANK COFFEE WITH CRUS M 45-49 11/28/2014 7:53:00 HOSP5 DRANK TAINTED COFFEE M 65-69 11/28/2014 7:57:00 HOSP5 DRANK TAINTED COFFEE F 20-24 11/28/2014 7:59:00 HOSP5 INGESTED TAINTED COFFEE M 35-39 11/28/2014 8:01:00 HOSP5 DRANK TAINTED COFFEE M 45-49 11/28/2014 8:03:00 HOSP5 DRANK TAINTED COFFEE M 40-44 11/28/2014 8:04:00 HOSP5 DRANK TAINTED COFFEE M 30-34 11/28/2014 8:06:00 HOSP5 DRANK TAINTED COFFEE M 35-39 11/28/2014 8:09:00 HOSP5 INGESTED TAINTED COFFEE M 25-29 This detected cluster represents 9 patients complaining of ingesting tainted coffee, and demonstrates Semantic Scan s ability to detect rare and novel events.

Pre-syndromic surveillance is a safety net that can supplement existing ED syndromic surveillance systems.by alerting public health to unusual or newly emerging threats. Our recently proposed semantic scan can accurately and automatically discover presyndromic case clusters corresponding to novel outbreaks and other patterns of interest.

What can Machine Learning do for public health surveillance? Today s talk will focus on two specific examples of how ML expands the scope and scale of data sources that can be used for surveillance: Structuring unstructured data, e.g., free text from ED chief complaints. Incorporating complex data sources such as online social media.

Event Detection from Social Media Protest in Mexico, 7/14/2012 2012 Washington D.C. Traffic Tweet Map for 2011 VA Earthquake Social media is a real-time sensor of large-scale population behavior, and can be used for early detection of emerging events. but it is very complex, noisy, and subject to biases. We have developed a new event detection methodology: Non-Parametric Heterogeneous Graph Scan (NPHGS) Applied to: rare disease outbreak detection (hantavirus in Chile).

Technical challenge: Integration of multiple heterogeneous information sources!

#VIRUSHANTA mentioned 100 times RT @SeremiSaludM: Se confirmó primer caso de hantavirus en el Maule y con consecuencia fatal. Se trata de un joven de 25 años de Pencahue re-tweeted 50 times Technical challenge: Integration of multiple heterogeneous information sources! Keyword virus mentioned 500 times http://t.co/5lkd0czdmf mentioned 10 times Influential User SeremiSaludM (1497 followers) posted 8 tweets

#VIRUSHANTA mentioned 100 times?? Technical challenge: Integration of multiple heterogeneous information sources! RT @SeremiSaludM: Se confirmó primer caso de hantavirus en el Maule y con consecuencia fatal. Se trata de un joven de 25 años de Pencahue re-tweeted 50 times?????? Keyword virus mentioned 500 times http://t.co/5lkd0czdmf mentioned 10 times? Influential User SeremiSaludM (1497 followers) posted 8 tweets

Twitter Heterogeneous Network

Step 1: Evaluate each node in the Twitter graph Each node (user, tweet, location, etc.) reports a value measuring how anomalous it is for the current hour or day. Object Type User Tweet City, State, Country Term Link Hashtag Features # tweets, # retweets, # followers, #followees, #mentioned_by, #replied_by, diffusion graph depth, diffusion graph size Klout, sentiment, replied_by_graph_size, reply_graph_size, retweet_graph_size, retweet_graph_depth # tweets, # active users # tweets # tweets # tweets Features empirical calibration Individual p-value for each feature min Minimum empirical p- value for each node empirical calibration Overall p-value for each node

Step 2: Find the most anomalous subgraphs This step allows us to find groups of nodes (users, keywords, tweets, hashtags, etc.), that are most anomalous when considered collectively.

Using gold standard data from 17 hantavirus outbreaks in Chile, we demonstrated that this approach outperforms existing state-of-the-art methods with respect to timeliness of detection, detection power, and accuracy. Photo credits: Claudio Co-Bar, Martin Garrido

Detected Hantavirus outbreak, 10 Jan 2013 Locations Users Keywords Hashtags Links Videos Temuco and Villarrica, Chile First news report: 11 Jan 2013 Photo credits: Claudio Co-Bar, Martin Garrido

Word cloud from tweets

Conclusions The continually increasing size, variety, and complexity of data available for population health and disease surveillance necessitate development of new detection methods to make use of these data. We have developed new machine learning approaches for extracting emerging patterns from free text data, and for analyzing the heterogeneous network structure of online social media. These methods enable us to detect rare or previously unseen outbreaks that would be missed by typical syndromic surveillance approaches.

References D.B. Neill. Fast subset scan for spatial pattern detection. Journal of the Royal Statistical Society (Series B: Statistical Methodology) 74(2): 337-360, 2012. D.B. Neill, E. McFowland III, and H. Zheng. Fast subset scan for multivariate event detection. Statistics in Medicine 32: 2185-2208, 2013. E. McFowland III, S. Speakman, and D.B. Neill. Fast generalized subset scan for anomalous pattern detection. Journal of Machine Learning Research, 14: 1533-1561, 2013. F. Chen and D.B. Neill. Non-parametric scan statistics for event detection and forecasting in heterogeneous social media graphs. Proc. 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2014. S. Speakman, S. Somanchi, E. McFowland III, D.B. Neill. Penalized fast subset scanning. Journal of Computational and Graphical Statistics 25: 382-404, 2016. D.B. Neill and W. Herlands. Machine learning for drug overdose surveillance. Journal of Technology in Human Services, 2018, in press. T. Kumar and D.B. Neill. Fast multidimensional subset scan for event detection and characterization. Submitted for publication. A. Maurya, D.B. Neill, et al., Semantic scan: detecting subtle, spatially localized events in text streams. Submitted for publication. 30

Thanks for listening! More details on our web site: http://epdlab.heinz.cmu.edu Or e-mail me at: neill@cs.cmu.edu 31