Data Driven Methods for Disease Forecasting

Similar documents
Data-Driven Methods for Modeling and Predicting Multivariate Time Series using Surrogates

FUNNEL: Automatic Mining of Spatially Coevolving Epidemics

CHALLENGES IN INFLUENZA FORECASTING AND OPPORTUNITIES FOR SOCIAL MEDIA MICHAEL J. PAUL, MARK DREDZE, DAVID BRONIATOWSKI

Temporal Topic Modeling to Assess Associations between News Trends and Infectious Disease Outbreaks

Visual and Decision Informatics (CVDI)

SourceSeer: Forecasting Rare Disease Outbreaks Using Multiple Data Sources

Forecasting Influenza Levels using Real-Time Social Media Streams

Searching for flu. Detecting influenza epidemics using search engine query data. Monday, February 18, 13

EpiDash A GI Case Study

Statistical modeling for prospective surveillance: paradigm, approach, and methods

Influenza forecast optimization when using different surveillance data types and geographic scale

Guess Who s Not Coming to Dinner? Evaluating Online Restaurant Reservations for Disease Surveillance

Towards Real Time Epidemic Vigilance through Online Social Networks

Flu Gone Viral: Syndromic Surveillance of Flu on Twitter using Temporal Topic Models

Project title: Using Non-Local Connectivity Information to Identify Nascent Disease Outbreaks

A systematic review of studies on forecasting the dynamics of influenza outbreaks

Characterizing Diseases from Unstructured Text: A Vocabulary Driven Word2vec Approach

Counteracting structural errors in ensemble forecast of influenza outbreaks

Google Flu Trends Still Appears Sick: An Evaluation of the Flu Season

Automated Social Network Epidemic Data Collector

Ongoing Influenza Activity Inference with Real-time Digital Surveillance Data

Exploiting Implicit Item Relationships for Recommender Systems

Abstract. But what happens when we apply GIS techniques to health data?

EpiCaster: An Integrated Web Application For Situation Assessment and Forecasting of Global Epidemics

Global infectious disease surveillance through automated multi lingual georeferencing of Internet media reports

Regional Level Influenza Study with Geo-Tagged Twitter Data

Towards Real-Time Measurement of Public Epidemic Awareness

Predicting Spatio Temporal Propagation of Seasonal Influenza Using Variational Gaussian Process Regression

Tom Hilbert. Influenza Detection from Twitter

Ohio Department of Health Seasonal Influenza Activity Summary MMWR Week 14 April 3 9, 2011

Ohio Department of Health Seasonal Influenza Activity Summary MMWR Week 14 April 1-7, 2012

Centers for Disease Control and Prevention U.S. INFLUENZA SEASON SUMMARY*

H1N1 Flu Virus (Human Swine Influenza) Information Resources (last updated May 6th, 2009, 5 pm)

Cloud- based Electronic Health Records for Real- time, Region- specific

arxiv: v2 [cs.lg] 3 Jun 2016

Ohio Department of Health Seasonal Influenza Activity Summary MMWR Week 3 January 13-19, 2013

20. The purpose of this document is to examine the pre-pandemic efforts and the response to the new influenza A (H1N1) virus since April 2009.

Ohio Department of Health Seasonal Influenza Activity Summary MMWR Week 7 February 10-16, 2013

Advances in epidemic forecasting & implications for vaccination

Jonathan D. Sugimoto, PhD Lecture Website:

SEPTIC SHOCK PREDICTION FOR PATIENTS WITH MISSING DATA. Joyce C Ho, Cheng Lee, Joydeep Ghosh University of Texas at Austin

4.3.9 Pandemic Disease

#Swineflu: Twitter Predicts Swine Flu Outbreak in 2009

Influenza Surveillance in the United St ates

Social Media based Simulation Models for Understanding Disease Dynamics

A Comparison of Collaborative Filtering Methods for Medication Reconciliation

Effects of Sampling on Twitter Trend Detection

List of Figures. 3.1 Endsley s Model of Situational Awareness Endsley s Model Applied to Biosurveillance... 64

Asthma Surveillance Using Social Media Data

Weekly Influenza Surveillance Summary

Influenza, Board of Health Monthly Meeting February 14, 2018 Jenifer Leaf Jaeger, MD, MPH Interim Medical Director

SourceSeer: Forecasting Rare Disease Outbreaks Using Multiple Data Sources

Spatial and Temporal Data Fusion for Biosurveillance

Tracking Disease Outbreaks using Geotargeted Social Media and Big Data

Data Integration and Predictive Analysis System for Disease Prophylaxis

Animal Disease Event Recognition and Classification

Weekly Influenza Surveillance Summary

H1N1 Flu Virus (Human Swine Influenza) Information Resources (last updated May 7th, 2009, 12pm)

2009 H1N1 Influenza Pandemic Defense Health Board Briefing

USING SOCIAL MEDIA TO STUDY PUBLIC HEALTH MICHAEL PAUL JOHNS HOPKINS UNIVERSITY MAY 29, 2014

Analysing Twitter and web queries for flu trend prediction

A non-homogeneous Markov spatial temporal model for dengue occurrence

On the Combination of Collaborative and Item-based Filtering

Machine Learning for Population Health and Disease Surveillance

Influenza Report Week 44

School Dismissal Monitoring Revised Protocol Overview July 30, 2009

Regional Update Pandemic (H1N1) 2009 (March 15, h GMT; 12 h EST)

Using Social Media to Understand Cyber Attack Behavior

Spatio-temporal modeling of weekly malaria incidence in children under 5 for early epidemic detection in Mozambique

Towards Integrated Syndromic Surveillance in Europe?

Influenza Report Week 45

Syndromic Surveillance of Flu on Twitter Using Weakly Supervised Temporal Topic Models

MAIMUNA S. MAJUMDER

Predicting Breast Cancer Survivability Rates

City, University of London Institutional Repository

Texting4Health Conference February 29, InSTEDD Proprietary Level I

Global Challenges of Pandemic and Avian Influenza. 19 December 2006 Keiji Fukuda Global influenza Programme

Forecasting the Influenza Season using Wikipedia

Influenza Report Week 47

Seasonal Influenza Communication Update

H1N1 Planning, Response and Lessons to Date

Characterizing Influenza surveillance systems performance: application of a Bayesian hierarchical statistical model to Hong Kong surveillance data

Religious Festivals and Influenza. Hong Kong (SAR) China. *Correspondence to:

The Spatio-Temporal Spread of Influenza in Virginia,

Regional Update Pandemic (H1N1) 2009 (July 6, h GMT; 12 h EST)

Centers for Disease Control and Prevention (CDC) FY 2009 Budget Request Summary

Regional Update Pandemic (H1N1) 2009 (July 19, h GMT; 12 h EST)

Pandemic (H1N1) (August 28, h GMT; 12 h EST) Update on the qualitative indicators

Twitter Knowledge Inference for Latin American biosurveillance

National Environmental Assessment Reporting System (NEARS)

Regional Update Pandemic (H1N1) 2009

Centers for Disease Control and Prevention U.S. INFLUENZA SEASON SUMMARY*

Data at a Glance: February 8 14, 2015 (Week 6)

Influenza Report Week 03

Statistical Methods for Improving Biosurveillance System Performance

Information collected from influenza surveillance allows public health authorities to:

Weekly Influenza Surveillance Summary

Hacettepe University Department of Computer Science & Engineering

Transcription:

Data Driven Methods for Disease Forecasting Prithwish Chakraborty 1,2, Pejman Khadivi 1,2, Bryan Lewis 3, Aravindan Mahendiran 1,2, Jiangzhuo Chen 3, Patrick Butler 1,2, Elaine O. Nsoesie 3,4,5, Sumiko R. Mekaru 4,5, John S. Brownstein 4,5, Madhav V. Marathe 3, Naren Ramakrishnan 1,2 1 Dept. of Computer Science, Virginia Tech, USA 2 Discovery Analytics Center, Virginia Tech, USA 3 NDSSL, Virginia Bioinformatics Institute, USA 4 Children s Hopsital Informatics Program, Boston Children s Hopsital, USA 5 Dept. of Pediatrics, Harvard Medical School, USA. October 27, 2014 1 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

1 Introduction Motivation: Data driven epidemiology Data driven Epidemiology: Problems Main Goals 2 Methods Data Sources Custom User Keywords Matrix Factorization using Nearest Neighborhood Model level vs Data level fusion 3 Instability Analysis 4 Ablation Test 5 Conclusion Extending to other sources: Opentable Summary 2 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Traditional Approaches: Computational Epidemiology Computatational models (ode, etc.) Population level vs Network level Effectiveness depends on Good Surveillance data. 3 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Traditional Approaches: Computational Epidemiology Computatational models (ode, etc.) Population level vs Network level Effectiveness depends on Good Surveillance data. Surveillance often delayed Surveillance often updated over time 3 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Epidemiology in data driven world Surrogate information can be found in social medium Physical indicators can also have causal effects on diseases. Can complement traidtionl surveillance Provide real-time estimates Provide robust estimates of already published data 4 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Example Problems (see www.dac.cs.vt.edu) Predicting Hantavirus outbreaks from news articles Chikungunya Spread detection Influenza like Illness (ILI) forecasting. Saurav Ghosh et al. Forecasting Rare Disease Outbreaks with Spatio-temporal Topic Models. In: NIPS 2013 workshop on Topic Models. 2013 5 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Problem Overview Near-horizon forecast of ILI case counts at country level 6 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Problem Overview Near-horizon forecast of ILI case counts at country level Predicting weekly Influenza-like-illness (ILI) case counts for 15 Latin American countries Investigating different open source data-streams as possible surrogate indicators of ILI Prithwish Chakraborty et al. Forecasting a Moving Target: Ensemble Models for ILI Case Count Predictions. In: Proceedings of the 2014 SIAM International Conference on Data Mining, Philadelphia, Pennsylvania, USA, April 24-26, 2014. 2014, pp. 262 270 6 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Main Goals 1 Real-time prospective study - most studies till this paper were retrospective. 7 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Main Goals 1 Real-time prospective study - most studies till this paper were retrospective. 2 Integrates both social and physical indicators 7 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Main Goals 1 Real-time prospective study - most studies till this paper were retrospective. 2 Integrates both social and physical indicators 3 Data level fusion vs Model level fusion? 7 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Main Goals 1 Real-time prospective study - most studies till this paper were retrospective. 2 Integrates both social and physical indicators 3 Data level fusion vs Model level fusion? 4 Accounting for uncertainties in the official surveillance estimates 7 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Main Goals 1 Real-time prospective study - most studies till this paper were retrospective. 2 Integrates both social and physical indicators 3 Data level fusion vs Model level fusion? 4 Accounting for uncertainties in the official surveillance estimates 5 Investigate importance of different sources - Ablation test 7 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

1 Introduction Motivation: Data driven epidemiology Data driven Epidemiology: Problems Main Goals 2 Methods Data Sources Custom User Keywords Matrix Factorization using Nearest Neighborhood Model level vs Data level fusion 3 Instability Analysis 4 Ablation Test 5 Conclusion Extending to other sources: Opentable Summary 8 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Key Ingredients Better Data - extract information from external indicators. Better Models - handle non-linearity. Handle Real world noise 9 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Introduction Methods Instability Analysis Ablation Test Conclusion Overall Framework Healthmap6Data 7P6MB6Historical P:56MBI6per6week Twitter6Data 5LLGB6Historical PL6GBI6per6week Weather6Data P6GB6historical P86MBI6week Google6Trends 6LL6MB6historical 86MBI6week OpenTable6Res6Data PP6MB6historical P766KBI6week Google6Flu6Trends 46MB6historical PLL6KBI6week Data6Enrichment Healthmap6 Data P4L6MB6hist: b6mbi6week Twitter6Data PTB6hist: OL6GBI6week Healthmap6Data66:666POMB6 Weather6Data666666:66665L6MB Twitter6Data666666666:666676GB66 Filtering6for6 Flu6Related6Content Time6series6Surrogate Extraction Healthmap6Data666:666PL6KB Weather6Data6666666:666P56KB Twitter6Data666666666:666PL6KB 6 ILI6Prediction 10 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting References

Data Sources Non-physical indicators 11 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Data Sources Non-physical indicators 1 Google Flu Trends - uses unpublished set of keywords 11 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Data Sources Non-physical indicators 1 Google Flu Trends - uses unpublished set of keywords 2 Custom User Keywords 1 Google Search Trends 2 Healthmap News Feed 3 Twitter Feed 11 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Data Sources Non-physical indicators 1 Google Flu Trends - uses unpublished set of keywords 2 Custom User Keywords 1 Google Search Trends 2 Healthmap News Feed 3 Twitter Feed Physical indicators Misc. Indicators 1 Opentable reservations 11 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Google Flu Trends 12 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Finding Custom user keyword dictionary A multiple step process : Started with a seed set of keywords from experts. Seed set contains words in Spanish, Portuguese, and English. example : gripe (flu in Spanish) 13 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Finding Custom user keyword dictionary A multiple step process : Started with a seed set of keywords from experts. Seed set contains words in Spanish, Portuguese, and English. example : gripe (flu in Spanish) Pseudo-query expansion Crawled top 20 web-sites for each seed word. Crawled expert web-sites e.g. CDC. Crawled few other hand-picked sites. Top 500 frequently occurring words selected. 13 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Finding Custom user keyword dictionary A multiple step process : Started with a seed set of keywords from experts. Seed set contains words in Spanish, Portuguese, and English. example : gripe (flu in Spanish) Pseudo-query expansion Crawled top 20 web-sites for each seed word. Crawled expert web-sites e.g. CDC. Crawled few other hand-picked sites. Top 500 frequently occurring words selected. Time series correlation analysis Used Google Correlate to find words with search history correlated with ILI incidence curve. Interesting words such as ginger and Acemuk found. 13 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Finding Custom user keyword dictionary A multiple step process : Started with a seed set of keywords from experts. Seed set contains words in Spanish, Portuguese, and English. example : gripe (flu in Spanish) Pseudo-query expansion Crawled top 20 web-sites for each seed word. Crawled expert web-sites e.g. CDC. Crawled few other hand-picked sites. Top 500 frequently occurring words selected. Time series correlation analysis Used Google Correlate to find words with search history correlated with ILI incidence curve. Interesting words such as ginger and Acemuk found. Final filtering : 114 words 13 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Finding Custom user keyword dictionary (contd..) 14 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

GFT vs other non-physical indicators using custom keyword set Google Search Trends (GST) Healthmap Twitter 15 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Physical Indicators Meteorological data for every lat-long, worldwide, every 8 hours Humidity, Temperature, Rainfall Analyzing grid cells covering PAHO sites. 16 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Introduction Methods Instability Analysis Ablation Test Conclusion System framework once again!! Healthmap6Data 7P6MB6Historical P:56MBI6per6week Twitter6Data 5LLGB6Historical PL6GBI6per6week Weather6Data P6GB6historical P86MBI6week Google6Trends 6LL6MB6historical 86MBI6week OpenTable6Res6Data PP6MB6historical P766KBI6week Google6Flu6Trends 46MB6historical PLL6KBI6week Data6Enrichment Healthmap6 Data P4L6MB6hist: b6mbi6week Twitter6Data PTB6hist: OL6GBI6week Healthmap6Data66:666POMB6 Weather6Data666666:66665L6MB Twitter6Data666666666:666676GB66 Filtering6for6 Flu6Related6Content Time6series6Surrogate Extraction Healthmap6Data666:666PL6KB Weather6Data6666666:666P56KB Twitter6Data666666666:666PL6KB 6 ILI6Prediction 17 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting References

Preliminaries To find predictive modelf Variable Setup f : P t = f (P, X ) V t P t β α, X t β α, P t+1 β α, X t+1 β α,..., P t α, X t α L t P t Parameters α : the lookahead window length β : the lookback window length 18 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Matrix Factorization (MF) Can find latent factors in the dataset. 19 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Matrix Factorization (MF) Can find latent factors in the dataset. Model M i,j = b i,j + Ui T b i,j = M + b j F j 19 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Matrix Factorization (MF) Can find latent factors in the dataset. Model M i,j = b i,j + Ui T b i,j = M + b j Fitting F j b, F, U = argmin( m 1 i=1 (M i,n M i,n ) 2 +λ 1 ( n bj 2 + m 1 U i 2 + n F j 2 )) j=1 i=1 j=1 (1) 19 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Nearest Neighbor model (NN) Impose non-linearity. 20 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Nearest Neighbor model (NN) Impose non-linearity. N (i) = {k : V k is one of the top K nearest neighbors of V i } 20 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Nearest Neighbor model (NN) Impose non-linearity. N (i) = {k : V k is one of the top K nearest neighbors of V i } Fitting P T = ( k N (T ) θ k L k,t α )/ K θ k (2) k=1 20 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Matrix Factorization using Nearest Neighborhood (MFN) Inspired from Koren et al. s work in Recommender systems. 21 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Matrix Factorization using Nearest Neighborhood (MFN) Inspired from Koren et al. s work in Recommender systems. M i,j = b i,j + Ui T F j +F j N (i) 1 2 k N(i) (M (3) i,k b i,k )x k 21 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Matrix Factorization using Nearest Neighborhood (MFN) Inspired from Koren et al. s work in Recommender systems. M i,j = b i,j + Ui T F j +F j N (i) 1 2 k N(i) (M (3) i,k b i,k )x k Fitting b, F, U, x = argmin( m 1 i=1 (M i,n M i,n ) 2 +λ 2 ( n bj 2 + m 1 U i 2 + n F j 2 + x k 2 )) j=1 i=1 j=1 k (4) Yehuda Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In: Proceedings of KDD 08. 2008, pp. 426 434 21 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Accuracy comparison Quality Metric A = 4 N p t e t=t s ( ) P t ˆP t 1 max(p t, ˆP t, 10) (5) 22 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Accuracy comparison On average, MFN has better performance over MF and NN In Mexico, MF has the best accuracy - possibly because the 2013 ILI season in Mexico was late by a few weeks than in usual. 23 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Model level fusion Output from models combined based on historical accuracy. 24 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Model level fusion Output from models combined based on historical accuracy. Model [ C M t = 1 P t... C P t P t ] (6) 24 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Model level fusion Output from models combined based on historical accuracy. Model [ C M t = 1 P t... C P t P t ] (6) Fitting C M i,j = µ i + C b j + C Ui T + C F j C N (i) 1 2 k C N(i) ( (7) C M i,k µ i + C b k ) C x k C F j 24 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Data level fusion Feature vector is a tuple over all data set features. Use MFN to fit the value X t = T t, W t 25 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Accuracy comparison 26 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Accuracy comparison On average, model level fusion produces better accuracy than data level fusion. Interesting deviations like Chile and El Salvador indicates that a possible strategy could be a mix of data level and model fusion - however complexity of training will increase manifold. 26 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

1 Introduction Motivation: Data driven epidemiology Data driven Epidemiology: Problems Main Goals 2 Methods Data Sources Custom User Keywords Matrix Factorization using Nearest Neighborhood Model level vs Data level fusion 3 Instability Analysis 4 Ablation Test 5 Conclusion Extending to other sources: Opentable Summary 27 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Uncertainty in official estimates Can take up to several months to stabilize. Average relative error of PAHO count values with respect to stable values. (a) Comparison between Argentina and Colombia (b) Comparison between different seasons for Argentina. 28 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Correcting uncertainty Recognize high, low and mid-season months for countries. Variable setup P A S = Correction Model { (1, P (1) ˆṖ (m) i i, Ṗi, N (1) i = a 0 + a 1 m + a 2 P (m) i } ),..., (m, P (m) i, Ṗi, N (m) i ),... + a 3 N (m) i (8) 29 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Correcting uncertainty Recognize high, low and mid-season months for countries. Variable setup P A S = Correction Model { (1, P (1) ˆṖ (m) i i, Ṗi, N (1) i = a 0 + a 1 m + a 2 P (m) i } ),..., (m, P (m) i, Ṗi, N (m) i ),... + a 3 N (m) i (8) 29 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

1 Introduction Motivation: Data driven epidemiology Data driven Epidemiology: Problems Main Goals 2 Methods Data Sources Custom User Keywords Matrix Factorization using Nearest Neighborhood Model level vs Data level fusion 3 Instability Analysis 4 Ablation Test 5 Conclusion Extending to other sources: Opentable Summary 30 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Investigating importance of each source : Ablation Test 31 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Investigating importance of each source : Ablation Test Greater drop in accuracy = Source more important Physical indicators are in general more important Still there is value in supplementing physical indicators with non-physical indicators. 31 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Final look at real time predictions Weekly predictions sent out for 15 Latin American countries Predictions publicly available at http://embers.cs.vt. edu/embers/alerts/ visualizer_isi 32 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Conclusion: How to extend to other sources Data about number of unreserved tables at restaurants in Mexico 33 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Summary MFN performs better than MF, NN on average over individual sources for predicting ILI case counts. In average there is a small advantage in combining models over different sources than to combine data. Employing information about number of samples used and how far from the actual date the estimate is being updated by the reporting agency, we have been able to improve our overall accuracy by a quality score of 0.05. Generally physical indicators offer more advantage over non-physical indicators. However for some situations Healthmap and Twitter feed have been found to outperform physical indicators. Experiments with Opentable reservation data shows that there is some perceptible signal embedded w.r.t to ILI case counts. 34 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Future Work Reconcile these phenomenological models with true epidemiological models. Explore inter-country characteristics of ILI profiles. 35 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Acknowledgements Supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center (DoI/NBC) contract number D12PC000337 and by the Defense Threat Reduction agency (DTRA) via the CNIMS Contract HDTRA1-11-D-0016-0001. The US Government is authorized to reproduce and distribute reprints of this work for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, DTRA, or the US Government. 36 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Thanks! 37 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Thanks! Any questions? 37 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

References Prithwish Chakraborty et al. Forecasting a Moving Target: Ensemble Models for ILI Case Count Predictions. In: Proceedings of the 2014 SIAM International Conference on Data Mining, Philadelphia, Pennsylvania, USA, April 24-26, 2014. 2014, pp. 262 270. Saurav Ghosh et al. Forecasting Rare Disease Outbreaks with Spatio-temporal Topic Models. In: NIPS 2013 workshop on Topic Models. 2013. Yehuda Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In: Proceedings of KDD 08. 2008, pp. 426 434. 38 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Appendix: Physical Indicators Collection Framework 1 / 2 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting

Appendix: Accuracy of different methods for different countries 2 / 2 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting