Data Driven Methods for Disease Forecasting Prithwish Chakraborty 1,2, Pejman Khadivi 1,2, Bryan Lewis 3, Aravindan Mahendiran 1,2, Jiangzhuo Chen 3, Patrick Butler 1,2, Elaine O. Nsoesie 3,4,5, Sumiko R. Mekaru 4,5, John S. Brownstein 4,5, Madhav V. Marathe 3, Naren Ramakrishnan 1,2 1 Dept. of Computer Science, Virginia Tech, USA 2 Discovery Analytics Center, Virginia Tech, USA 3 NDSSL, Virginia Bioinformatics Institute, USA 4 Children s Hopsital Informatics Program, Boston Children s Hopsital, USA 5 Dept. of Pediatrics, Harvard Medical School, USA. October 27, 2014 1 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
1 Introduction Motivation: Data driven epidemiology Data driven Epidemiology: Problems Main Goals 2 Methods Data Sources Custom User Keywords Matrix Factorization using Nearest Neighborhood Model level vs Data level fusion 3 Instability Analysis 4 Ablation Test 5 Conclusion Extending to other sources: Opentable Summary 2 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Traditional Approaches: Computational Epidemiology Computatational models (ode, etc.) Population level vs Network level Effectiveness depends on Good Surveillance data. 3 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Traditional Approaches: Computational Epidemiology Computatational models (ode, etc.) Population level vs Network level Effectiveness depends on Good Surveillance data. Surveillance often delayed Surveillance often updated over time 3 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Epidemiology in data driven world Surrogate information can be found in social medium Physical indicators can also have causal effects on diseases. Can complement traidtionl surveillance Provide real-time estimates Provide robust estimates of already published data 4 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Example Problems (see www.dac.cs.vt.edu) Predicting Hantavirus outbreaks from news articles Chikungunya Spread detection Influenza like Illness (ILI) forecasting. Saurav Ghosh et al. Forecasting Rare Disease Outbreaks with Spatio-temporal Topic Models. In: NIPS 2013 workshop on Topic Models. 2013 5 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Problem Overview Near-horizon forecast of ILI case counts at country level 6 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Problem Overview Near-horizon forecast of ILI case counts at country level Predicting weekly Influenza-like-illness (ILI) case counts for 15 Latin American countries Investigating different open source data-streams as possible surrogate indicators of ILI Prithwish Chakraborty et al. Forecasting a Moving Target: Ensemble Models for ILI Case Count Predictions. In: Proceedings of the 2014 SIAM International Conference on Data Mining, Philadelphia, Pennsylvania, USA, April 24-26, 2014. 2014, pp. 262 270 6 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Main Goals 1 Real-time prospective study - most studies till this paper were retrospective. 7 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Main Goals 1 Real-time prospective study - most studies till this paper were retrospective. 2 Integrates both social and physical indicators 7 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Main Goals 1 Real-time prospective study - most studies till this paper were retrospective. 2 Integrates both social and physical indicators 3 Data level fusion vs Model level fusion? 7 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Main Goals 1 Real-time prospective study - most studies till this paper were retrospective. 2 Integrates both social and physical indicators 3 Data level fusion vs Model level fusion? 4 Accounting for uncertainties in the official surveillance estimates 7 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Main Goals 1 Real-time prospective study - most studies till this paper were retrospective. 2 Integrates both social and physical indicators 3 Data level fusion vs Model level fusion? 4 Accounting for uncertainties in the official surveillance estimates 5 Investigate importance of different sources - Ablation test 7 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
1 Introduction Motivation: Data driven epidemiology Data driven Epidemiology: Problems Main Goals 2 Methods Data Sources Custom User Keywords Matrix Factorization using Nearest Neighborhood Model level vs Data level fusion 3 Instability Analysis 4 Ablation Test 5 Conclusion Extending to other sources: Opentable Summary 8 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Key Ingredients Better Data - extract information from external indicators. Better Models - handle non-linearity. Handle Real world noise 9 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Introduction Methods Instability Analysis Ablation Test Conclusion Overall Framework Healthmap6Data 7P6MB6Historical P:56MBI6per6week Twitter6Data 5LLGB6Historical PL6GBI6per6week Weather6Data P6GB6historical P86MBI6week Google6Trends 6LL6MB6historical 86MBI6week OpenTable6Res6Data PP6MB6historical P766KBI6week Google6Flu6Trends 46MB6historical PLL6KBI6week Data6Enrichment Healthmap6 Data P4L6MB6hist: b6mbi6week Twitter6Data PTB6hist: OL6GBI6week Healthmap6Data66:666POMB6 Weather6Data666666:66665L6MB Twitter6Data666666666:666676GB66 Filtering6for6 Flu6Related6Content Time6series6Surrogate Extraction Healthmap6Data666:666PL6KB Weather6Data6666666:666P56KB Twitter6Data666666666:666PL6KB 6 ILI6Prediction 10 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting References
Data Sources Non-physical indicators 11 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Data Sources Non-physical indicators 1 Google Flu Trends - uses unpublished set of keywords 11 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Data Sources Non-physical indicators 1 Google Flu Trends - uses unpublished set of keywords 2 Custom User Keywords 1 Google Search Trends 2 Healthmap News Feed 3 Twitter Feed 11 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Data Sources Non-physical indicators 1 Google Flu Trends - uses unpublished set of keywords 2 Custom User Keywords 1 Google Search Trends 2 Healthmap News Feed 3 Twitter Feed Physical indicators Misc. Indicators 1 Opentable reservations 11 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Google Flu Trends 12 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Finding Custom user keyword dictionary A multiple step process : Started with a seed set of keywords from experts. Seed set contains words in Spanish, Portuguese, and English. example : gripe (flu in Spanish) 13 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Finding Custom user keyword dictionary A multiple step process : Started with a seed set of keywords from experts. Seed set contains words in Spanish, Portuguese, and English. example : gripe (flu in Spanish) Pseudo-query expansion Crawled top 20 web-sites for each seed word. Crawled expert web-sites e.g. CDC. Crawled few other hand-picked sites. Top 500 frequently occurring words selected. 13 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Finding Custom user keyword dictionary A multiple step process : Started with a seed set of keywords from experts. Seed set contains words in Spanish, Portuguese, and English. example : gripe (flu in Spanish) Pseudo-query expansion Crawled top 20 web-sites for each seed word. Crawled expert web-sites e.g. CDC. Crawled few other hand-picked sites. Top 500 frequently occurring words selected. Time series correlation analysis Used Google Correlate to find words with search history correlated with ILI incidence curve. Interesting words such as ginger and Acemuk found. 13 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Finding Custom user keyword dictionary A multiple step process : Started with a seed set of keywords from experts. Seed set contains words in Spanish, Portuguese, and English. example : gripe (flu in Spanish) Pseudo-query expansion Crawled top 20 web-sites for each seed word. Crawled expert web-sites e.g. CDC. Crawled few other hand-picked sites. Top 500 frequently occurring words selected. Time series correlation analysis Used Google Correlate to find words with search history correlated with ILI incidence curve. Interesting words such as ginger and Acemuk found. Final filtering : 114 words 13 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Finding Custom user keyword dictionary (contd..) 14 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
GFT vs other non-physical indicators using custom keyword set Google Search Trends (GST) Healthmap Twitter 15 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Physical Indicators Meteorological data for every lat-long, worldwide, every 8 hours Humidity, Temperature, Rainfall Analyzing grid cells covering PAHO sites. 16 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Introduction Methods Instability Analysis Ablation Test Conclusion System framework once again!! Healthmap6Data 7P6MB6Historical P:56MBI6per6week Twitter6Data 5LLGB6Historical PL6GBI6per6week Weather6Data P6GB6historical P86MBI6week Google6Trends 6LL6MB6historical 86MBI6week OpenTable6Res6Data PP6MB6historical P766KBI6week Google6Flu6Trends 46MB6historical PLL6KBI6week Data6Enrichment Healthmap6 Data P4L6MB6hist: b6mbi6week Twitter6Data PTB6hist: OL6GBI6week Healthmap6Data66:666POMB6 Weather6Data666666:66665L6MB Twitter6Data666666666:666676GB66 Filtering6for6 Flu6Related6Content Time6series6Surrogate Extraction Healthmap6Data666:666PL6KB Weather6Data6666666:666P56KB Twitter6Data666666666:666PL6KB 6 ILI6Prediction 17 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting References
Preliminaries To find predictive modelf Variable Setup f : P t = f (P, X ) V t P t β α, X t β α, P t+1 β α, X t+1 β α,..., P t α, X t α L t P t Parameters α : the lookahead window length β : the lookback window length 18 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Matrix Factorization (MF) Can find latent factors in the dataset. 19 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Matrix Factorization (MF) Can find latent factors in the dataset. Model M i,j = b i,j + Ui T b i,j = M + b j F j 19 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Matrix Factorization (MF) Can find latent factors in the dataset. Model M i,j = b i,j + Ui T b i,j = M + b j Fitting F j b, F, U = argmin( m 1 i=1 (M i,n M i,n ) 2 +λ 1 ( n bj 2 + m 1 U i 2 + n F j 2 )) j=1 i=1 j=1 (1) 19 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Nearest Neighbor model (NN) Impose non-linearity. 20 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Nearest Neighbor model (NN) Impose non-linearity. N (i) = {k : V k is one of the top K nearest neighbors of V i } 20 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Nearest Neighbor model (NN) Impose non-linearity. N (i) = {k : V k is one of the top K nearest neighbors of V i } Fitting P T = ( k N (T ) θ k L k,t α )/ K θ k (2) k=1 20 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Matrix Factorization using Nearest Neighborhood (MFN) Inspired from Koren et al. s work in Recommender systems. 21 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Matrix Factorization using Nearest Neighborhood (MFN) Inspired from Koren et al. s work in Recommender systems. M i,j = b i,j + Ui T F j +F j N (i) 1 2 k N(i) (M (3) i,k b i,k )x k 21 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Matrix Factorization using Nearest Neighborhood (MFN) Inspired from Koren et al. s work in Recommender systems. M i,j = b i,j + Ui T F j +F j N (i) 1 2 k N(i) (M (3) i,k b i,k )x k Fitting b, F, U, x = argmin( m 1 i=1 (M i,n M i,n ) 2 +λ 2 ( n bj 2 + m 1 U i 2 + n F j 2 + x k 2 )) j=1 i=1 j=1 k (4) Yehuda Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In: Proceedings of KDD 08. 2008, pp. 426 434 21 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Accuracy comparison Quality Metric A = 4 N p t e t=t s ( ) P t ˆP t 1 max(p t, ˆP t, 10) (5) 22 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Accuracy comparison On average, MFN has better performance over MF and NN In Mexico, MF has the best accuracy - possibly because the 2013 ILI season in Mexico was late by a few weeks than in usual. 23 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Model level fusion Output from models combined based on historical accuracy. 24 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Model level fusion Output from models combined based on historical accuracy. Model [ C M t = 1 P t... C P t P t ] (6) 24 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Model level fusion Output from models combined based on historical accuracy. Model [ C M t = 1 P t... C P t P t ] (6) Fitting C M i,j = µ i + C b j + C Ui T + C F j C N (i) 1 2 k C N(i) ( (7) C M i,k µ i + C b k ) C x k C F j 24 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Data level fusion Feature vector is a tuple over all data set features. Use MFN to fit the value X t = T t, W t 25 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Accuracy comparison 26 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Accuracy comparison On average, model level fusion produces better accuracy than data level fusion. Interesting deviations like Chile and El Salvador indicates that a possible strategy could be a mix of data level and model fusion - however complexity of training will increase manifold. 26 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
1 Introduction Motivation: Data driven epidemiology Data driven Epidemiology: Problems Main Goals 2 Methods Data Sources Custom User Keywords Matrix Factorization using Nearest Neighborhood Model level vs Data level fusion 3 Instability Analysis 4 Ablation Test 5 Conclusion Extending to other sources: Opentable Summary 27 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Uncertainty in official estimates Can take up to several months to stabilize. Average relative error of PAHO count values with respect to stable values. (a) Comparison between Argentina and Colombia (b) Comparison between different seasons for Argentina. 28 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Correcting uncertainty Recognize high, low and mid-season months for countries. Variable setup P A S = Correction Model { (1, P (1) ˆṖ (m) i i, Ṗi, N (1) i = a 0 + a 1 m + a 2 P (m) i } ),..., (m, P (m) i, Ṗi, N (m) i ),... + a 3 N (m) i (8) 29 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Correcting uncertainty Recognize high, low and mid-season months for countries. Variable setup P A S = Correction Model { (1, P (1) ˆṖ (m) i i, Ṗi, N (1) i = a 0 + a 1 m + a 2 P (m) i } ),..., (m, P (m) i, Ṗi, N (m) i ),... + a 3 N (m) i (8) 29 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
1 Introduction Motivation: Data driven epidemiology Data driven Epidemiology: Problems Main Goals 2 Methods Data Sources Custom User Keywords Matrix Factorization using Nearest Neighborhood Model level vs Data level fusion 3 Instability Analysis 4 Ablation Test 5 Conclusion Extending to other sources: Opentable Summary 30 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Investigating importance of each source : Ablation Test 31 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Investigating importance of each source : Ablation Test Greater drop in accuracy = Source more important Physical indicators are in general more important Still there is value in supplementing physical indicators with non-physical indicators. 31 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Final look at real time predictions Weekly predictions sent out for 15 Latin American countries Predictions publicly available at http://embers.cs.vt. edu/embers/alerts/ visualizer_isi 32 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Conclusion: How to extend to other sources Data about number of unreserved tables at restaurants in Mexico 33 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Summary MFN performs better than MF, NN on average over individual sources for predicting ILI case counts. In average there is a small advantage in combining models over different sources than to combine data. Employing information about number of samples used and how far from the actual date the estimate is being updated by the reporting agency, we have been able to improve our overall accuracy by a quality score of 0.05. Generally physical indicators offer more advantage over non-physical indicators. However for some situations Healthmap and Twitter feed have been found to outperform physical indicators. Experiments with Opentable reservation data shows that there is some perceptible signal embedded w.r.t to ILI case counts. 34 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Future Work Reconcile these phenomenological models with true epidemiological models. Explore inter-country characteristics of ILI profiles. 35 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Acknowledgements Supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center (DoI/NBC) contract number D12PC000337 and by the Defense Threat Reduction agency (DTRA) via the CNIMS Contract HDTRA1-11-D-0016-0001. The US Government is authorized to reproduce and distribute reprints of this work for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, DTRA, or the US Government. 36 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Thanks! 37 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Thanks! Any questions? 37 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
References Prithwish Chakraborty et al. Forecasting a Moving Target: Ensemble Models for ILI Case Count Predictions. In: Proceedings of the 2014 SIAM International Conference on Data Mining, Philadelphia, Pennsylvania, USA, April 24-26, 2014. 2014, pp. 262 270. Saurav Ghosh et al. Forecasting Rare Disease Outbreaks with Spatio-temporal Topic Models. In: NIPS 2013 workshop on Topic Models. 2013. Yehuda Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In: Proceedings of KDD 08. 2008, pp. 426 434. 38 / 38 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Appendix: Physical Indicators Collection Framework 1 / 2 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting
Appendix: Accuracy of different methods for different countries 2 / 2 Prithwish Chakraborty (prithwi@vt.edu) Data Driven Disease Forecasting