Data-Driven Methods for Modeling and Predicting Multivariate Time Series using Surrogates


Data-Driven Methods for Modeling and Predicting Multivariate Time Series using Surrogates

Prithwish Chakraborty

Dissertation submitted to the Faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of

Doctor of Philosophy in Computer Science

Narendran Ramakrishnan, Chair
Madhav Marathe
Chang-Tien Lu
Ravi Tandon
John S. Brownstein

April 28, 2016
Arlington, VA

Keywords: Multivariate Time Series, Surrogates, Generalized Linear Models, Bayesian Sequential Analysis, Computational Epidemiology

Copyright © 2015, Prithwish Chakraborty

Data-Driven Methods for Modeling and Predicting Multivariate Time Series using Surrogates

Prithwish Chakraborty

(ABSTRACT)

Modeling and predicting multivariate time series data has been of prime interest to researchers for many decades. Traditionally, time series prediction models have focused on finding attributes that have consistent correlations with the target variable(s). However, diverse surrogate signals, such as news data and Twitter chatter, are increasingly available and can provide real-time information, albeit with inconsistent correlations. Intelligent use of such sources can lead to early and real-time warning systems such as Google Flu Trends. Furthermore, the target variables of interest, such as public health surveillance data, can be noisy. Thus, models built for such data sources should be flexible as well as adaptable to changing correlation patterns. In this thesis we explore various methods of using surrogates to generate more reliable and timely forecasts for noisy target signals. We primarily investigate three key components of the forecasting problem, viz. (i) short-term forecasting, where surrogates can be employed in a now-casting framework; (ii) long-term forecasting, where surrogates act as forcing parameters to model system dynamics; and (iii) robust drift models that detect and exploit changepoints in the surrogate-target relationship to produce robust forecasts. We explore various physical and social surrogate sources to study these sub-problems, primarily to generate real-time forecasts for endemic diseases. On the modeling side, we employed matrix factorization and generalized linear models to detect short-term trends, and explored various Bayesian sequential analysis methods to model long-term effects. Our research indicates that, in general, a combination of surrogates can lead to more robust models. Interestingly, our findings also indicate that under specific scenarios particular surrogates can decrease overall forecasting accuracy, providing an argument for the use of "good data" over "big data".

This work is supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center (DoI/NBC) contract number D12PC. The US Government is authorized to reproduce and distribute reprints of this work for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the US Government.

Data-Driven Methods for Modeling and Predicting Multivariate Time Series using Surrogates

Prithwish Chakraborty

(GENERAL AUDIENCE ABSTRACT)

In the context of public health, modeling and early forecasting of infectious diseases is of prime importance. Such efforts help agencies devise interventions and implement effective counter-measures. However, disease surveillance is an involved process in which agencies estimate the intensity of diseases in the public domain using various networks. The process involves several levels of data cleaning and aggregation, and as such the resultant surveillance data is inherently noisy (requiring several revisions to stabilize) and delayed. Real-time forecasting of such diseases thus necessitates stable and robust methods that can provide accurate public health information in a time-critical manner. This work focuses on data-driven modeling and forecasting of time series, especially of infectious diseases, for a number of regions of the world including Latin America and the United States of America. With the increasing popularity of social media, real-time societal information can be extracted from media such as Twitter and news. This work addresses this critical area: a number of models are presented to systematically integrate and compare the usefulness of such real-time information from both physical indicators (such as temperature) and non-physical indicators (such as Twitter) towards robust disease forecasting. Specifically, this work focuses on three critical areas: (a) short-term forecasting of disease case counts, to obtain better estimates of the current on-ground scenario; (b) long-term forecasting of disease season characteristics, to help public health agencies plan and implement interventions; and finally (c) concept drift detection and adaptation, to account for the ever-evolving relationship between societal surrogates and public health surveillance and to lend robustness to the disease forecasting models. This work shows that such indicators can be useful for reliable estimation of disease characteristics, even when the ground truth itself is unreliable, and provides insights into how such indicators can be integrated as part of public surveillance. This work has used principles from diverse fields spanning Bayesian statistics, machine learning, information theory, and public health to analyze and characterize such diseases.

Acknowledgments

I extend my sincere thanks and gratitude to my advisor, Dr. Naren Ramakrishnan, for his continued encouragement and guidance throughout my work. His feedback, insights, and inputs have contributed immensely to the final form of this work. He has been my mentor and my guide. I have always found in him a patient listener who rendered clarity to my thoughts, and I have always come out of our discussions with renewed vigor and focus. It has been my utmost privilege to work with him for all these years.

I would also like to thank my entire committee. I sincerely thank Dr. Madhav Marathe and Dr. John Brownstein for their unique perspectives on public health, without which this work wouldn't have been complete. I have especially enjoyed my meetings with Dr. Madhav Marathe and our collaborations, which have helped me gain a broader understanding of the field of computational epidemiology. I cannot thank Dr. Ravi Tandon enough for his inputs and insights, which ultimately materialized in the form of concept drift, a crucial component of this work. Finally, Dr. C.T. Lu has always been welcoming and encouraging, and I thank him for his crucial feedback and inputs on this work. I consider myself fortunate to have received the guidance of such an esteemed and kind group of people.

I extend my heartfelt thanks and gratitude to Dr. Bryan Lewis, NDSSL at Virginia Tech, for all his encouragement, guidance, and the countless hours working with me on this work. I have been lucky to have him as my mentor. I would also like to thank the Discovery Analytics Center at Virginia Tech, which has been my home-away-from-home for these past few years. I have found mentors like Tozammel Hossain and Patrick Butler, who have immensely shaped my early PhD years. All my lab members have been crucial, and I will miss my time with all of them. They have been my friends, my colleagues, and more often than not my support group throughout this process. I wish all of you the best for your future. I have also been fortunate to work with a varied group of collaborators from NDSSL, HealthMap, and YeLab, as well as agencies such as IARPA and CDC, which has made my PhD a great experience that I will cherish forever.

I would also like to express my gratitude to my wonderful friends - Deba Pratim Saha, Gourab Ghosh Roy, Saurav Ghosh, Sathappan Muthiah, Arijit Chattopadhyay, Sayantan Guha, and Abhishek Mukherjee, to name a few - with whom I have shared unique moments throughout this time. Thanks for being around and being there for me whenever I needed you all.

Thanking my family is perhaps not enough. My mother, Mrs. Devyani Chakraborty, and my brother, Mr. Prasenjit Chakraborty, have been my closest friends and confidants. This work, and I, owe everything to you. My late father, Mr. Prasanta Kr. Chakraborty, would have been happy to see me where I am today. My sister-in-law, Mrs. Amrita Dhole Chakraborty, and my cousins: I thank you for being the best family I could hope for and for always being there for me.

Contents

1 Background and Motivation
  1.1 Flu Surveillance Effects
  1.2 Motivation towards using surrogates

Part I: Short-term Forecasting using Surrogates

2 Forecasting a Moving Target: Ensemble Models for ILI Case Count Predictions
  2.1 Related Work
  2.2 Problem Formulation
    2.2.1 Methods
  2.3 Ensemble Approaches
    2.3.1 Data level fusion
    2.3.2 Model level fusion
  2.4 Forecasting a Moving Target
  2.5 Experimental Setup
    2.5.1 Reference Data
    2.5.2 Evaluation criteria
    2.5.3 Surrogate data sources
  2.6 Results
  2.7 Discussion

3 Dynamic Poisson Autoregression for Influenza-Like-Illness Case Count Prediction
  3.1 Summary
  3.2 Model Similarity
  3.3 Forecasting Results
  3.4 Seasonal Analysis
  3.5 Discussion

Part II: Long-term Forecasting using Surrogates

4 Curve-matching from a library of curves

5 Data Assimilation methods for long-term forecasting
  5.1 Data Assimilation
  5.2 Data Assimilation Models in disease forecasting
  5.3 Data Assimilation using surrogate Sources
  5.4 Experimental Results and Performance Summary
  5.5 Discussion

Part III: Detecting and Adapting to Concept Drift

6 Hierarchical Quickest Change Detection via Surrogates
  6.1 HQCD: Hierarchical Quickest Change Detection
    6.1.1 Quickest Change Detection (QCD)
    6.1.2 Changepoint detection in Hierarchical Data
  6.2 HQCD for Count Data via Surrogates
    6.2.1 Hierarchical Model for Count Data
    6.2.2 Changepoint Posterior Estimation
  6.3 Experiments
    6.3.1 Synthetic Data
    6.3.2 Real-life case study
  6.4 Discussion

7 Concept Drift Adaptation for Google Flu Trends
  7.1 Background
  7.2 Robust Models via Concept Drift Adaptation
  7.3 Experimental evaluation and comparing Surrogate Sources
  7.4 Discussion

8 Conclusion
  8.1 Importance of Open Source Indicators for Public Health
  8.2 Guidelines for using surrogates for Health Surveillance
  8.3 Future Work

A Data Assimilation: detailed performance

B Sequential Bayesian Inference
  B.1 SMC² algorithm traces
  B.2 SMC² priors

C HQCD: Additional Experimental Results

List of Figures

1.1 Epidemic Pyramid: Depicts the process of how disease exposure in the general population goes through several stages of surveillance and gets reported as confirmed cases. Adapted and redrawn from "The public health officer", Antimicrobial Resistance Learning Site for Veterinary Students, amrls.cvm.msu.edu/integrated/principles/meet-the-public-health-officer
1.2 Christmas Effect in USA: The number of people seeking care drops during the Christmas holidays. However, the number of ILI-related visits does not vary from non-Christmas times, leading to an inflated percent ILI in the general population
1.3 ILI surveillance drop towards the end of the ILI season in the CDC ILINet system. The inflection point can be seen at week 33. Reduced surveillance may render reports from later parts less accurate
1.4 ILI surveillance instability: percentage relative error of updates w.r.t. the final value as a function of update horizon for PAHO ILI reports for several Latin American countries. Stability varies from one country to another
2.1 Our ILI data pipeline, depicting six different data sources used to forecast ILI case counts
2.2 Average relative error of PAHO count values with respect to stable values. (a) Comparison between Argentina and Colombia. (b) Comparison between different seasons for Argentina
2.3 Average relative error of PAHO count values before and after correction for different countries
2.4 Accuracy of different methods for each country
3.1 The distance matrix obtained from our learned DPARX model (bottom), associated with the ground-truth ILI case count series (top) on the AR dataset. Strong seasonality is automatically inferred in the matrix. Each element in the matrix is the Euclidean distance between a pair of learned models at two corresponding time points after training. In the top figure, the x axis is the week index and the y axis is the number of ILI cases. In the bottom figure, both axes index time points; the starting time point (index 0) for the distance matrix is week 15 of the ILI case count series
3.2 Model distance matrices for the US dataset, derived from the fully connected similarity graph, the 3-nearest-neighbor similarity graph, and the seasonal 3-nearest-neighbor similarity graph, from left to right
3.3 Comparison of seasonal characteristics for Mexico using different algorithms for one-step-ahead prediction. Blue vertical dashed lines indicate the actual start and end of the season
4.1 Filtering a library of curves based on season size and season shape
4.2 Example of seasonal forecasts for ILI using curve-matching methods
4.3 Performance measures for ILI seasonal characteristics using curve-matching
5.1 Performance summary for (a) ILI and (b) CHIKV seasonal forecasts using weather as a surrogate source under a data assimilation framework
5.2 Comparison of forecasting accuracy for date metrics using surrogates
5.3 Comparison of forecasting accuracy for value metrics using surrogates
5.4 Comparison of forecasting accuracy for Start Date using different surrogate sources
5.5 Comparison of forecasting accuracy for End Date using different surrogate sources
5.6 Comparison of forecasting accuracy for Peak Date using different surrogate sources
5.7 Comparison of forecasting accuracy for Peak Value using different surrogate sources
5.8 Comparison of forecasting accuracy for Season Value using different surrogate sources
6.1 Illustration of Quickest Change Detection (QCD): the blue line represents the actual changepoint at time Γ = t₄. (a) Declaring a change at γ₁ leads to a false alarm, whereas (b) declaring the change at γ₂ leads to detection delay. QCD can strike a tradeoff between false alarm and detection delay
6.2 Generative process for HQCD. As an example, consider civil unrest protests. In the framework, different protest types (such as education- and housing-related protests) form the targets, denoted by the Sᵢ. The total number of protests is denoted by the top-most variable E. Finally, the set of surrogates, such as counts of Twitter keywords, stock price data, weather data, network usage data, etc., is denoted by the Kⱼ
6.3 Histogram fit of (a) a surrogate source (Twitter keyword counts) and (b) a target source (number of protests of different categories), for various temporal windows, under i.i.d. assumptions. These assumptions lead to a satisfactory distribution fit, at a batch level, for both sources. The top-most row corresponds to the period before the Brazilian Spring, the second row to the period during it, and the third to the period after it. The last row shows the fit for the entire period. These temporal fits are indicative of significant changes in distribution along the Brazilian Spring timeline, for both targets and surrogates
6.4 Computation time for one complete run of changepoint detection (in minutes) on a 1.6 GHz quad-core Intel i5 processor with 8 GB RAM: Gibbs sampling [8] vs HQCD vs HQCD without surrogates. Gibbs sampling computation times are unsuitable for online detection
6.5 Comparison of HQCD against the state-of-the-art on simulated target sources. The x-axis represents time and the y-axis the actual value. Solid blue lines mark the true changepoint, solid green the changepoints detected by HQCD, and brown those of HQCD without surrogates. Dashed red, magenta, purple, and gold lines refer to changepoints detected by RuLSIF, WGLRT, BOCPD, and GLRT, respectively. HQCD shows better detection for most targets, with low overall detection delay and few false alarms
6.6 False alarm vs delay trade-off for different methods. HQCD shows the best trade-off
6.7 Comparison of detected changepoints for the sum-of-targets (all protests). HQCD detections are shown in solid green, while those from the state-of-the-art methods, i.e., RuLSIF (red), WGLRT (magenta), BOCPD (purple), and GLRT (gold), are shown with dashed lines. The HQCD detection is the closest to the traditional start date of mass protests in the three countries studied
6.8 (Brazilian Spring) Heatmap of changepoint influences of targets on targets (a) and of surrogates on targets (b). Darker (lighter) shades indicate higher (lower) changepoint influence. (a) shows the presence of strong off-diagonal elements, indicating strong cross-target changepoint information. (b) shows a mixture of uninformative and informative surrogates
7.1 Evidence of concept drift. In Google Flu Trends data for Argentina (left), the corresponding 52-week rolling mean (right) exhibits a saddle point, indicating a possible mean-shift drift in GFT for Argentina
7.2 Concept drift adaptation framework. The framework ingests target sources, such as CDC ILI case count data, and surrogate sources, such as GFT, and detects changepoints via the Concept Drift Detector stage. Drift probabilities are then passed to the Drift Adaptation stage, where robust predictions are generated using resampling-based methods
7.3 Drift adaptation for Mexico using GFT
7.4 Drift adaptation for Mexico using GST
7.5 Drift adaptation for Mexico using HealthMap
7.6 Drift adaptation for Mexico using weather sources
7.7 Drift adaptation for Mexico using all sources
7.8 Correlation of surrogate sources with disease incidence. Counts of influenza-related keywords from (a) HealthMap and (b) GST, compared against influenza case counts for Argentina as available from PAHO. HealthMap keywords capture the start of the season more accurately, while GST keywords exhibit a sub-optimal but consistent correlation with PAHO counts
C.1 Comparison of detected changepoints at the target sources (protest types). HQCD detections are shown in solid green, while those from the state-of-the-art methods, i.e., RuLSIF (red), WGLRT (magenta), BOCPD (purple), and GLRT (gold), are shown with dashed lines

List of Tables

2.1 Comparing forecasting accuracy of models using individual sources. Scores in this and other tables are normalized to [0,4] so that 4 is the most accurate
2.2 Comparison of prediction accuracy while combining all data sources and using MFN regression
2.3 Comparison of prediction accuracy while using model level fusion on MFN regressors and employing PAHO stabilization
2.4 Discovering the importance of sources in model level fusion on MFN regressors by ablating one source at a time
2.5 ILI case count prediction accuracy for Mexico using OpenTable data as a single source, and by combining it with all other sources using model level fusion on uncorrected ILI case count data
3.1 Prediction accuracies for competing algorithms with different forecast steps over different countries using the GFT input source. GFT data is not available for other countries
3.2 Prediction accuracies for competing algorithms with different forecast steps over different countries using the weather data source
3.3 Prediction accuracies for competing algorithms with different forecast steps over different countries using the GST data source
3.4 Prediction accuracies for competing algorithms with different forecast steps over different countries using the HealthMap data source
5.1 Forecasting performance of seasonal characteristics using data assimilation methods
6.1 Comparison of state-of-the-art methods vs Hierarchical Quickest Change Detection
6.2 (Synthetic data) Comparing the true changepoint (Γ) for targets against the changepoints (γ) detected by HQCD and state-of-the-art methods, with false alarms (FA) and additive detection delay (ADD). Each row represents a target; the best detected changepoint is shown in bold, and false alarms are shown in red
7.1 Comparison of surrogate sources pre- and post-drift adaptation
A.1 Performance of data assimilation methods using different surrogate sources w.r.t. seasonal characteristics
C.1 (Protest uprisings) Comparison of HQCD vs state-of-the-art with respect to detected changepoints

Chapter 1
Background and Motivation

The problem of multivariate time series forecasting has been studied extensively for several decades and has found use in diverse fields such as economics and statistics [9]. Some of the more popular methods in this sphere are Autoregressive (AR) models, Autoregressive Moving Average (ARMA) models, and Vector Autoregressive (VAR) models for linear problems. For nonlinear problems, popular methods include kernel regression and Gaussian processes. However, the traditional approaches have focused on admitting only coherent time series and/or independent time series that exhibit a consistent causal relation with the target of interest. In recent years, big data in the form of diverse real-time sources such as social media and news has become readily available. These data sources are in general noisy, and their relationships with any target source, such as the search patterns of users, can change over time. However, if used intelligently, such sources can aid in accurately modeling complex target sources, such as the number of influenza case counts for a country, in near real-time. This thesis focuses on such noisy surrogates. We explore the problem of flu forecasting in Section 1.1 to identify the key advantages of using surrogates, and motivate our methods in Section 1.2.

1.1 Flu Surveillance Effects

Accurate and timely influenza (flu) forecasting has gained significant traction in recent times. If done well, such forecasting can aid in deploying effective public health measures. Unlike other statistical or machine learning problems, however, flu forecasting brings unique challenges and considerations stemming from the nature of the surveillance apparatus and the end-utility of forecasts. Flu surveillance is an inherently complex process, and identifying the quirks of this process can lead to a better understanding of the possible problems facing a forecasting model.

Figure 1.1: Epidemic Pyramid (stages, from the base up: Exposures in general population; Person becomes ill; Person seeks care; Specimen obtained; Surveillance estimations; Final reports to Health Agencies). Depicts the process of how disease exposure in the general population goes through several stages of surveillance and gets reported as confirmed cases. Adapted and redrawn from "The public health officer", Antimicrobial Resistance Learning Site for Veterinary Students, amrls.cvm.msu.edu/integrated/principles/meet-the-public-health-officer

Figure 1.2: Christmas Effect in USA: The number of people seeking care drops during the Christmas holidays. However, the number of ILI-related visits does not vary from non-Christmas times, leading to an inflated percent ILI in the general population.

Influenza-like Illness (ILI), tracked by many agencies such as CDC, PAHO, and WHO [10, 44, 64], is a category designed to capture severe respiratory disease, like influenza (flu), but it also includes many other less severe respiratory illnesses due to their similar presentation. Surveillance methods often vary between agencies. Even within a single agency, there may be different networks (such as outpatient-based and lab-sample-based) tracking ILI/flu. While outpatient reporting networks such as ILINet aim to measure exact case counts for the regions under consideration, lab surveillance networks such as WHO NREVSS (used by PAHO) seek to confirm and identify the specific strain. In the absence of a clinic-based surveillance system, lab-based systems can provide estimates at a per-X-population level; however, estimating the actual number of influenza cases from these systems is challenging [10].

Furthermore, surveillance reports are often non-representative of actual ILI incidence. Figure 1.1 shows a representative epidemic pyramid, which depicts the surveillance system. The entire process is inherently associated with possible reporting errors, starting from patients seeking care through the final determination of confirmed cases via laboratory tests. Surveillance networks are also affected by cultural phenomena such as holiday periods, when the behavior of people visiting hospitals changes relative to other weeks. Figure 1.2 depicts the Christmas effect observed during the holidays, when people seek care from physicians only in emergency situations, leading to inflated ILI percentages. Such effects may render the surveillance reports non-representative of on-ground scenarios.

In addition to these effects, surveillance systems are also affected by other systematic artifacts. Surveillance reporting has been known to taper off or stop altogether during the post-peak part of the season. For example, as is evident from Figure 1.3, the number of providers reporting to the US CDC ILINet surveillance system tapers off towards the end of the ILI season (for the US, calendar week 40 corresponds to the first ILI season week [10]). Specifically, the inflection point of the average curve occurs at season week 33.

Figure 1.3: ILI surveillance drop towards the end of the ILI season in the CDC ILINet system. The inflection point can be seen at week 33. Reduced surveillance may render reports from later parts less accurate.

Figure 1.4: ILI surveillance instability: percentage relative error of updates w.r.t. the final value as a function of update horizon for PAHO ILI reports for several Latin American countries. Stability varies from one country to another.

Such effects can possibly be attributed to resource re-allocation due to reduced interest in post-peak activities. A combination of such effects ultimately causes surveillance data to be delayed from real-time. Even when the reports are published, they can be candidates for revision/updating for several weeks after initial publication. The lag between initial publication and final revision can be as small as 2 weeks (e.g., for CDC ILINet data) or can fluctuate wildly. For example, PAHO reports for some Latin American countries such as Argentina, Colombia, and Mexico can take more than 10 weeks to settle. On the other hand, PAHO reports stabilize within 5 weeks for countries such as Chile, Costa Rica, and Peru (see Figure 1.4). The reason for such discrepancies has to do with the maturity of the surveillance apparatus and the level of coordination underlying public health reporting.
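For concreteness, stability curves like those in Figure 1.4 can be computed directly from a log of report revisions. The following is a minimal sketch, assuming a pandas DataFrame of revisions with illustrative column names ('week', 'horizon', 'value'); it is not the exact procedure used to produce the figure.

```python
import pandas as pd

def update_stability(revisions: pd.DataFrame) -> pd.Series:
    """Mean percentage relative error of each update w.r.t. the final value,
    as a function of the update horizon (weeks since first publication)."""
    # Treat the last available revision of each week as its stable value.
    final = revisions.sort_values("horizon").groupby("week")["value"].last()
    stable = revisions["week"].map(final)
    err = 100 * (revisions["value"] - stable).abs() / stable
    # Average across weeks at each update horizon -> one stability curve.
    return err.groupby(revisions["horizon"]).mean()
```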

1.2 Motivation towards using surrogates

The flu surveillance effects described above can be thought of as a representative scenario for a large class of problems dealing with real-time surveillance, where the on-ground scenario is difficult to ascertain. Most work on forecasting does not account for such instability. In essence, these problems require forecasting a moving target. Real-time surrogates, as outlined above, can be useful in such scenarios to augment the surveillance mechanism with information from the general population. Motivated by the problem of flu forecasting, this thesis thus outlines three key problems:

- Short-term forecasting, using surrogates to augment delayed surveillance reports and provide real-time information on on-ground scenarios.
- Long-term forecasting, using surrogates as forcing parameters to determine long-term characteristics with increased accuracy.
- Identifying and adapting to concept drift, to detect the changing relationships of surrogates and increase the robustness of short- and long-term forecasts that use them.

Part I

Short-term Forecasting using Surrogates

The first problem of this thesis is aimed at short-term forecasting of often delayed and unstable target sources, such as Influenza-Like-Illness (ILI) case counts as reported by surveillance agencies such as CDC [10] and PAHO [44]. In [12], we compared a range of surrogates, encompassing physical sources such as humidity and temperature and social sources such as Twitter and news, under a matrix factorization framework for ILI prediction in 15 Latin American countries. We found that no single source is best suited to model ILI for all countries; however, physical sources were in general the most informative. Furthermore, combining the sources generally led to better forecasting accuracy. We present these considerations in Chapter 2. We next focused on increasing the forecasting horizon and used regularized generalized linear models to capture the dynamic trends of ILI data in [62]. Our experiments indicate that we can reliably forecast up to 4 weeks in advance, for a range of countries including the USA and several Latin American countries, using our proposed methods. We highlight the important aspects of our findings in Chapter 3.

Chapter 2
Forecasting a Moving Target: Ensemble Models for ILI Case Count Predictions

Traditionally, epidemiological forecasts of common illnesses, such as the flu, rely heavily on surveillance reports published by health organizations. However, as discussed in Chapter 1, traditional surveillance reports are often published with a considerable delay, and thus recent research has focused on mining social signals from search engine query volumes [67, 24] and social media chatter [27, 34, 39, 15, 56]. One of the pioneering works in this space is that of Ginsberg et al. [24], where ILI case counts are predicted from the volume of search engine queries. This work inspired significant follow-on work, e.g., [67], where Yuan et al. used search query data from Baidu (a popular search engine in China) to detect influenza outbreaks. More real-time ILI detection systems [34] have been proposed by modeling Twitter streams. Apart from such social media sources, there has also been considerable research on exploiting physical indicators such as climate data. The primary advantage of such data sources is that their effects are much more causal and less noisy. Shaman et al. [57, 49, 51] explored this area in detail and found absolute humidity to be a good indicator of influenza outbreaks. While the aforementioned efforts have made important strides, there are important areas that have been relatively less studied. First, only a few efforts have focused on combining multiple data sources [29, 27] to aid in forecasting. In particular, to the best of our knowledge, there has been no work that investigates the combination of social and physical indicators to forecast ILI incidence. Second, and more importantly, official estimates as reported by health organizations (e.g., WHO, PAHO) are often lagged by several weeks, and even when reported they are typically revised for several weeks before the case counts are finalized. Real-time prediction systems must be designed to handle the forecasting of such a moving target.

Finally, most existing work has been retrospective and not set in the context of a formal data mining validation framework. To overcome these deficiencies, we propose a novel approach to ILI case count forecasting. Our contributions are:

- Our approach integrates both social and physical indicators and thus leverages the selective superiorities of both types of feature sets. We systematize such integration using a novel matrix factorization-based regression approach with neighborhood embedding, thus helping account for non-linear relationships between the surrogates and the official ILI estimates.
- We investigate the efficacy of combining diverse sources at two levels, the data fusion level and the model level, and discuss the relative (de)merits.
- We propose different ways of handling uncertainties in the official estimates and factor these uncertainties into our prediction models.
- Finally, we present a detailed and prospective analysis of our proposed methods by comparing predictions from a near-horizon real-time prediction system to official estimates of ILI case counts in 15 countries of Latin America.

2.1 Related Work

Related work naturally falls into the categories of social media analytics, physical indicators, and event dynamics modeling, described next.

Social media analytics: Most relevant work using social media analytics focuses on Twitter, specifically by tracking a dictionary of ILI-related keywords in the data stream. Such investigations have often focused on the importance of diversity in keyword lists, e.g., [39, 15]. In [39], Kanhabua and Nejdl used clustering methods to determine important topics in Twitter data, constructed time series for matched keywords, and used the Jaccard coefficient to characterize the temporal diversity of tweets. They noted that such temporal diversity may be correlated with real-world ILI outbreaks. In [15], the authors studied the dynamics between the change in circulated tweets and the H1N1 virus. Inspired by these works, we curated a custom ILI-related keyword dictionary, described in detail in Section 2.5.3.

Physical indicators for detecting ILI incidence levels: Tamerius et al. [57] investigated the existence of seasonal cycles of influenza epidemics in different climate regions, considering climatic information from 78 globally distributed sites. Using logistic regression, they found that strong correlations exist between influenza epidemics and weather conditions, especially when conditions are cold-dry or humid-rainy. Similarly exciting results were reported by Shaman et al. in [49, 51], where they discovered absolute humidity to be a key indicator of flu. To uncover these relationships, they used non-linear regressors such as Kalman filters, and this was a key inspiration for us in finding a uniform model for the varied data sources, as explained in Section 2.2.1.

Figure 2.1: Our ILI data pipeline, depicting the six different data sources used in this chapter to forecast ILI case counts.

Event dynamics modeling: Denecke et al. [27] proposed an event-based approach for early prediction of ILI threats. Their method (M-Eco) considers multiple resources such as Twitter, TV reports, online news articles, and blogs, and uses clustering to identify signals for event detection. Network dynamics solutions have also been used [3] to study the behavior of an epidemic in a society.

2.2 Problem Formulation

In this section, we formally introduce the problem. Let $\mathcal{P} = \{P_1, P_2, \ldots, P_T\}$ denote the known total weekly ILI case counts for the country under consideration, where $P_t$ denotes the case count for time point $t$ and $T$ denotes the time point up to which the ILI case count is known. Corresponding to the ILI case count data, let us denote the available surrogate information for the same country by $\mathcal{X} = \{X_1, X_2, \ldots, X_{T_1}\}$, where $T_1$ is the time point up to which the surrogate information is available and $X_t$ denotes the surrogate attributes for time point $t$. The problem we desire to solve is to find a predictive model $f$ for the case count data, as presented formally in Equation 2.1:

$$P_t = f(\mathcal{P}, \mathcal{X}) \tag{2.1}$$

In this chapter, in order to better understand the importance of different sources, we assume that the ILI activities in different countries are independent of each other.

2.2.1 Methods

Focusing on the methods, we employ non-linear temporal regressions over the surrogate attributes to forecast the case counts using three models: (a) Matrix Factorization Based Regression (MF), (b) Nearest Neighbor Based Regression (NN), and (c) Matrix Factorization Regression using Nearest Neighbor Embedding (MFN). For each of the methods, we define two parameters, $\alpha$ and $\beta$: $\alpha$ is the lookahead window length, denoting the distance of the prediction time point from $T$; $\beta$ is the lookback window length, denoting the number of time points to look back in order to find the regression relation between the case counts and the surrogate data. We define regression vectors $V_t$ and labels $L_t$, $t = 1, \ldots, T$, as:

$$V_t \equiv \langle P_{t-\beta-\alpha}, X_{t-\beta-\alpha}, P_{t+1-\beta-\alpha}, X_{t+1-\beta-\alpha}, \ldots, P_{t-\alpha}, X_{t-\alpha} \rangle, \qquad L_t \equiv P_t$$

The regression vector for predicting the case count at time point $T'$ (with $T + \alpha \geq T' > T$) is given by Equation 2.2:

$$V_{T'} \equiv \langle P_{T'-\beta-\alpha}, X_{T'-\beta-\alpha}, P_{T'+1-\beta-\alpha}, X_{T'+1-\beta-\alpha}, \ldots, P_{T'-\alpha}, X_{T'-\alpha} \rangle \tag{2.2}$$

Under these definitions, we describe the models below.
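Before describing the individual models, the windowing scheme above can be made concrete with a short sketch. The following assumes NumPy arrays for $\mathcal{P}$ and $\mathcal{X}$ and uses illustrative names; it simply materializes the $(V_t, L_t)$ pairs defined above.

```python
import numpy as np

def regression_vectors(P, X, alpha, beta):
    """Build (V_t, L_t) training pairs per the definitions above.

    P : (T,) array of weekly ILI counts; X : (T, d) array of surrogate
    attributes; alpha = lookahead, beta = lookback. Returns a matrix V
    with one row per usable time point and the label vector L.
    """
    P, X = np.asarray(P, float), np.asarray(X, float)
    V, L = [], []
    for t in range(alpha + beta, len(P)):
        # The window covers time points t-beta-alpha .. t-alpha inclusive.
        idx = range(t - beta - alpha, t - alpha + 1)
        V.append(np.concatenate([np.r_[P[i], X[i]] for i in idx]))
        L.append(P[t])
    return np.array(V), np.array(L)
```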

Matrix Factorization Based Regression (MF): Matrix factorization is a well-accepted technique in the recommender systems literature for predicting user preferences from incomplete user ratings/information. Typically [7], a user-preference matrix is factored into a user-factor matrix and a factor-preference matrix. However, such factorizations are incognizant of any temporal continuity. To enforce temporal continuity when predicting for the time point $T'$ ($T + \alpha \geq T' > T$), we use the regression vectors and labels defined earlier to construct an $m \times n$ prediction matrix $M$, as given in Equation 2.3:

$$M = \begin{bmatrix} V_{\alpha+\beta+1} & L_{\alpha+\beta+1} \\ \vdots & \vdots \\ V_T & L_T \\ V_{T'} & L_{T'} \end{bmatrix} \tag{2.3}$$

The prediction matrix is factorized into an $f \times m$ factor-feature matrix $U$ and an $f \times n$ factor-prediction matrix $F$ as:

$$\hat{M}_{i,j} = b_{i,j} + U_i^T F_j$$

Here, $b_{i,j}$ is the baseline estimate given by:

$$b_{i,j} = \bar{M} + b_j \tag{2.4}$$

where $\bar{M}$ represents the all-element average and $b_j$ represents the column-wise deviation from the average; $b_j$ is generally a free parameter, i.e., it is fitted as part of the optimization problem. The $U$ and $F$ matrices are estimated by minimizing the error function:

$$b^*, F^*, U^* = \operatorname*{argmin}\Big( \sum_{i=1}^{m-1} \big(M_{i,n} - \hat{M}_{i,n}\big)^2 + \lambda_1 \Big( \sum_{j=1}^{n} b_j^2 + \sum_{i=1}^{m} \lVert U_i \rVert^2 + \sum_{j=1}^{n} \lVert F_j \rVert^2 \Big) \Big) \tag{2.5}$$

where $\lambda_1$ is a regularization parameter. An important design criterion in the error function of Equation 2.5 is that we only compute the error between the predicted and actual label values, i.e., over the $n$-th column of the prediction matrix $M$. The rationale behind this choice is that, unlike traditional recommender systems, we are only concerned with the label column and can sacrifice reconstruction accuracy for the other columns. The lookback window $\beta$, the factor size $f$, and the regularization parameter $\lambda_1$ are estimated using cross-validation, and the final prediction for time point $T'$ is given by:

$$\hat{P}_{T'} = b_{m,n} + U_m^T F_n$$
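As an illustration of the MF regressor, the sketch below fits the biased factorization $\hat{M}_{i,j} = \bar{M} + b_j + U_i^T F_j$ by stochastic gradient descent. For simplicity it minimizes the reconstruction error over all observed cells, whereas Equation 2.5 restricts the data term to the label column; the hyperparameter values are illustrative, not those used in the thesis.

```python
import numpy as np

def mf_regression(M, mask, f=5, lam=0.1, lr=0.01, iters=200, seed=0):
    """Factorize the prediction matrix M (Eq. 2.3) and impute missing cells.

    mask[i, j] = True where M[i, j] is observed; the label cell of the
    final row (the unknown L_{T'}) is left unobserved.
    """
    rng = np.random.default_rng(seed)
    m, n = M.shape
    U = 0.1 * rng.standard_normal((m, f))
    F = 0.1 * rng.standard_normal((n, f))
    b = np.zeros(n)                      # per-column bias b_j
    mean = M[mask].mean()                # all-element average (M-bar)
    rows, cols = np.nonzero(mask)
    for _ in range(iters):
        for i, j in zip(rows, cols):
            err = M[i, j] - (mean + b[j] + U[i] @ F[j])
            b[j] += lr * (err - lam * b[j])
            U[i], F[j] = (U[i] + lr * (err * F[j] - lam * U[i]),
                          F[j] + lr * (err * U[i] - lam * F[j]))
    # Completed matrix; the (last row, last column) cell is the forecast.
    return mean + b + U @ F.T
```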

Nearest Neighbor Based Regression (NN): For our second class of models, viz. nearest neighbor models, we define a training set $\Gamma_{NN} = \{V_t, L_t\}$, where $V_t$ represents the regression attributes and $L_t$ denotes the corresponding label. Also, let us define the set $N(i) = \{k : V_k \text{ is one of the top } K \text{ nearest neighbors of } V_i\}$, where $K$ indicates the maximum number of nearest neighbors considered. The predicted count $\hat{P}_{T'}$ for the time point $T'$ is given as:

$$\hat{P}_{T'} = \Big( \sum_{k \in N(T')} \theta_k L_k \Big) \Big/ \sum_{k=1}^{K} \theta_k \tag{2.6}$$

Here, $\theta_k$ indicates the weight assigned to the $k$-th nearest neighbor. Typically, the inverse Euclidean distances to $V_{T'}$ are chosen as the weights.

Matrix Factorization Based Regression using Nearest Neighbor Embedding (MFN): It has been shown in [28] that matrix factorization with nearest neighbor constraints can outperform both the classical matrix factorization approach and traditional nearest neighbor approaches to recommender systems. Drawing inspiration from this result, we modify the method to suit the temporal nature of our problem, in ways similar to those described above. We again define a similar prediction matrix $M$ (see Equation 2.3). Following [28], we define the matrix decomposition rule as:

$$\hat{M}_{i,j} = b_{i,j} + U_i^T F_j + |N(i)|^{-\frac{1}{2}}\, F_j^T \sum_{k \in N(i)} (M_{i,k} - b_{i,k})\, x_k \tag{2.7}$$

The key differences between Equation 2.7 and the rule proposed in [28] are that we do not have any term for implicit feedback and, further, that only the top $K$ neighbors found through Euclidean distance are used. The model is fitted using Equation 2.8:

$$b^*, F^*, U^*, x^* = \operatorname*{argmin}\Big( \sum_{i=1}^{m-1} \big(M_{i,n} - \hat{M}_{i,n}\big)^2 + \lambda_2 \Big( \sum_{j=1}^{n} b_j^2 + \sum_{i=1}^{m} \lVert U_i \rVert^2 + \sum_{j=1}^{n} \lVert F_j \rVert^2 + \sum_{k} \lVert x_k \rVert^2 \Big) \Big) \tag{2.8}$$
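The neighbor-weighted predictor of Equation 2.6, which also underlies the neighborhood term of MFN, is straightforward to sketch; the small additive constant guarding against zero distances is an added assumption.

```python
import numpy as np

def nn_forecast(V_train, L_train, v_query, K=5):
    """Nearest neighbor forecast per Eq. 2.6 with inverse-distance weights."""
    d = np.linalg.norm(V_train - v_query, axis=1)
    nbrs = np.argsort(d)[:K]             # indices of the K closest V_t
    theta = 1.0 / (d[nbrs] + 1e-8)       # inverse Euclidean distances
    return float(theta @ L_train[nbrs] / theta.sum())
```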

2.3 Ensemble Approaches

In the last section, we described different strategies to correlate a specific source with the ILI case counts of a specific country and predict future ILI counts. In practice, we desire to work with a multitude of data sources, and there are two broad ways to accomplish this objective: (a) data level fusion, where a single regressor is constructed from the different data sources to the ILI case count, and (b) model level fusion, where we build one regressor per data source and subsequently combine the predictions from the models. In this section, we describe these fusion methods; experimental results with both are presented in Section 2.6.

2.3.1 Data level fusion

Here, we express the feature vector $X$ as a tuple over all the different data sources and then proceed with any one of the regression methods outlined in Section 2.2.1. For example, when combining the Twitter and weather data sources (see Figure 2.1), the feature vector is given by $X_t = \langle T_t, W_t \rangle$, where $T_t$ and $W_t$ denote attributes derived from Twitter and weather, respectively.

2.3.2 Model level fusion

In this approach, the models are combined using matrix factorization regression with nearest neighbor embedding, by comparing the prediction estimates from each model with the actual estimate (since the ground truth can change as well) and the average ILI case count for the month for the particular country (to help organize a baseline). Let us denote the average ILI case count for a particular calendar month $I$ for a given country by:

$$\mu_I = \sum_{t \in I} P_t \,\Big/\, |\{t \in I\}|$$

Considering $C$ different sources, and hence $C$ different models, let us denote the prediction for the $t$-th time point from the $c$-th model by ${}^{c}P_t$. Using these definitions, we can now describe the fusion model. Essentially, the model is similar to the one described in Section 2.2.1; the differences lie in the way we construct the feature vectors. Similar to Equation 2.3, we construct an $m \times n$ prediction matrix for fusion, denoted ${}_{C}M$, whose $t$-th row is given by Equation 2.9:

$${}_{C}M_t = \big[\, {}^{1}P_t \ \cdots \ {}^{C}P_t \ \ P_t \,\big] \tag{2.9}$$

Then, similar to Equation 2.7, we factor this matrix into latent factors ${}_{C}U$, ${}_{C}F$, ${}_{C}b$ as given by Equation 2.10:

$${}_{C}\hat{M}_{i,j} = \mu_i + {}_{C}b_j + {}_{C}U_i^T\, {}_{C}F_j + |{}_{C}N(i)|^{-\frac{1}{2}}\, {}_{C}F_j^T \sum_{k \in {}_{C}N(i)} \big({}_{C}M_{i,k} - (\mu_i + {}_{C}b_k)\big)\, {}_{C}x_k \tag{2.10}$$

so that the final prediction for the time point $T'$ is given by $\hat{P}_{T'} = {}_{C}\hat{M}_{T',n}$. The fitting function is given by Equation 2.11:

$${}_{C}b^*, {}_{C}F^*, {}_{C}U^*, {}_{C}x^* = \operatorname*{argmin}\Big( \sum_{i=1}^{m-1} \big({}_{C}M_{i,n} - {}_{C}\hat{M}_{i,n}\big)^2 + \lambda_3 \Big( \sum_{j=1}^{n} {}_{C}b_j^2 + \sum_{i=1}^{m} \lVert {}_{C}U_i \rVert^2 + \sum_{j=1}^{n} \lVert {}_{C}F_j \rVert^2 + \sum_{k} \lVert {}_{C}x_k \rVert^2 \Big) \Big) \tag{2.11}$$
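The assembly of the model level fusion matrix of Equation 2.9, together with the monthly baselines $\mu_I$ that enter Equation 2.10 as row offsets, might look as follows; the function and argument names are illustrative.

```python
import numpy as np

def fusion_matrix(per_model_preds, P, month_means, months):
    """Assemble the model level fusion inputs of Eqs. 2.9-2.10.

    per_model_preds : (T, C) array, column c holding model c's predictions.
    P : (T,) observed counts; months : (T,) calendar month per time point;
    month_means : dict mapping month -> historical mean count (mu_I).
    """
    # Each row stacks [pred_1 ... pred_C, P_t], as in Eq. 2.9.
    M = np.column_stack([np.asarray(per_model_preds, float),
                         np.asarray(P, float)])
    mu = np.array([month_means[m] for m in months])  # per-row baseline mu_I
    return M, mu
```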

As before, the free parameters are estimated through cross-validation.

2.4 Forecasting a Moving Target

One of the key challenges in creating a prospective ILI case count predictor is that the official estimates are often delayed and, furthermore, even when published, the estimates are revised over a number of weeks before they finally become stable. In this chapter, we concentrate on 15 Latin American countries, as described in Section 2.5, and consider the official ILI estimates from the Pan American Health Organization (PAHO). We can thus categorize the PAHO count values downloaded in any week into three different types: (a) the unknown PAHO counts, represented by $P_t$; (b) the known and stable PAHO counts, denoted by $\dot{P}_t$; and (c) the known and unstable PAHO counts, denoted by $\tilde{P}_t$. While we desire to predict $P_t$, the uncertainty associated with $\tilde{P}_t$ introduces errors into the predictions. In this section, we study the effects of such unstable data and propose three different models to adjust these unstable values to more accurate ones.

Figure 2.2a plots the relative error of an unstable PAHO data series w.r.t. its final estimate, as a function of time. It can be seen that different countries have different stability characteristics: for some countries, PAHO count values stabilize very slowly, whereas for others they stabilize faster (especially as the number of updates for a week increases). The stability behavior of PAHO count values was also found to depend on the time of year, as shown in Figure 2.2b. To plot this curve for Argentina, we categorized any week with fewer than 100 cases as belonging to a low season and any week with more than 300 cases as a high season, with the remaining values as mid season (the thresholds differed between countries).

At the same time, the official PAHO updates provide an indication of the number of samples used to generate the case count estimate. Preliminary experiments show that this size is correlated with the accuracy of the ILI case counts: in general, larger statistical population sizes result in smaller relative errors for the ILI case count. Thus, using both the number of samples and the lag in uploading the week's data, we can use machine learning techniques to revise the officially published PAHO estimates. Preliminary results show that we encounter different stability patterns for different seasons and different countries; therefore, any PAHO count adjustment method should be customized for seasons and countries separately. Let us assume that $\dot{\mathcal{P}}$ is the set of stable PAHO counts for a specific country, and that the sequence of updates for each stable PAHO count value is available. In other words, for $\dot{P}_i$ we have the following set:

$$\dot{P}_i = \left\{ P_i^{(1)}, P_i^{(2)}, \ldots, P_i^{(m)}, \ldots \right\} \tag{2.12}$$

where $P_i^{(m)}$ is the value of $P_i$ after $m$ weeks of updates.

Figure 2.2: Average relative error of PAHO count values with respect to stable values. (a) Comparison between Argentina and Colombia. (b) Comparison between different seasons for Argentina.

After recognizing the high-, low-, and mid-season months for the country, we can categorize each $\dot{P}_i$ as belonging to one of these categories. Then, for category $S$, an adjustment dataset $\mathcal{P}_A^S$ is constructed, defined as follows:

$$\mathcal{P}_A^S = \left\{ \big(1, P_i^{(1)}, \dot{P}_i, N_i^{(1)}\big), \ldots, \big(m, P_i^{(m)}, \dot{P}_i, N_i^{(m)}\big), \ldots \right\} \tag{2.13}$$

Each member of $\mathcal{P}_A^S$ is a tuple with four entries: the first entry denotes the time slot the sample belongs to; the second entry is the actual unstable value of $P_i$; the third entry is the related stable value; and finally, $N_i^{(m)}$ is the size of the statistical population for that week. In the next step, a linear regression algorithm is used to adjust the unstable PAHO values. In order to adjust PAHO values in the $m$-th time slot of season $S$, we use the $\mathcal{P}_A^S$ set to learn the coefficients $a_0$, $a_1$, $a_2$, and $a_3$ in the following equation:

$$\hat{P}_i^{(m)} = a_0 + a_1 m + a_2 P_i^{(m)} + a_3 N_i^{(m)} \tag{2.14}$$

where $\hat{P}_i^{(m)}$ is the adjusted PAHO count value for the $m$-th time slot. Experimental results show that this adjustment method results in more accurate known PAHO values. The average relative errors of the published unstable PAHO values before and after correction for each country are shown in Figure 2.3. While in a few cases we do not see any improvement, for countries such as Argentina and Paraguay we see significant improvements. Finally, similar to Equation 2.14, in addition to $P_i^{(m)}$, one can use only the time difference ($m$) or the size of the population ($N_i^{(m)}$) to correct unstable PAHO values. The effects of these corrections on the overall accuracy of predictions are explored in Section 2.6.

2.5 Experimental Setup

2.5.1 Reference Data

In this chapter, we focus on 15 Latin American countries, viz. Argentina, Bolivia, Costa Rica, Colombia, Chile, Ecuador, El Salvador, Guatemala, French Guiana, Honduras, Mexico, Nicaragua, Paraguay, Panama, and Peru. We collected weekly ILI counts from the official Pan American Health Organization (PAHO) website (viz/ed_flu.asp) every day from January 2013 to August 2013. The estimates downloaded each day for each country contain data from January 2010 to the latest available week on the day of collection. This dataset is stored in a database we refer to as the Temporal Data Repository (TDR). The TDR is also timestamped so that, for any given day, we can readily retrieve the ILI case counts that were downloaded on that day.
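The adjustment of Equation 2.14 is an ordinary least squares fit per country and season category; a minimal sketch, with illustrative names:

```python
import numpy as np

def fit_adjustment(horizon, unstable, stable, n_samples):
    """Fit a0..a3 of Eq. 2.14 by least squares over the tuples in P_A^S.

    horizon, unstable, stable, n_samples : 1-D arrays holding, per tuple,
    the update time slot m, the unstable value, the stable value, and the
    statistical population size.
    """
    A = np.column_stack([np.ones(len(horizon)),
                         horizon, unstable, n_samples])
    coef, *_ = np.linalg.lstsq(A, stable, rcond=None)
    return coef  # [a0, a1, a2, a3]

def adjust(coef, m, p_m, n_m):
    """Adjusted count for an unstable report at update slot m."""
    return coef @ np.array([1.0, m, p_m, n_m])
```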

Figure 2.3: Average relative error of PAHO count values before and after correction for different countries.

This is important, as historic data may be updated by PAHO even a number of weeks after the first upload. For the purpose of experimental validation, we used the data for the period January 2010 to December 2012 as the static training set. We considered the Wednesday of each week as the reference day within the week. For each Wednesday from January 2013 to July 2013, we used the latest available PAHO data in the TDR for that day and predicted 2 weeks ahead of the last week for which PAHO data was available. These predictions were then evaluated against the final ILI case counts as downloaded on September 1, 2013, and we report the performance of our algorithms in Section 2.6.

2.5.2 Evaluation criteria

We evaluate the prediction accuracy of the different algorithms using a modified version of the percentage relative error:

$$A = \frac{4}{N_p} \sum_{t=t_s}^{t_e} \left( 1 - \frac{\lvert P_t - \hat{P}_t \rvert}{\max(P_t, \hat{P}_t, 10)} \right) \tag{2.15}$$

where $t_s$ and $t_e$ indicate the starting and ending time points for which predictions were generated, and $N_p$ indicates the number of time points over the same period (i.e., $N_p = t_e - t_s + 1$). Note that the measure is scaled to take values in $[0, 4]$, and the denominator is designed to not over-penalize small deviations from the true ILI case count (e.g., when the true case count is 0 and the predicted count is 1). It is to be noted that the accuracy metric so defined is non-convex and, in general, multi-modal.
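Equation 2.15 translates directly into code; a minimal sketch:

```python
import numpy as np

def accuracy(P, P_hat):
    """Accuracy of Eq. 2.15, scaled to [0, 4]; the max(.., 10) floor keeps
    small absolute errors on near-zero counts from dominating."""
    P, P_hat = np.asarray(P, float), np.asarray(P_hat, float)
    denom = np.maximum.reduce([P, P_hat, np.full_like(P, 10.0)])
    return float(4 * np.mean(1 - np.abs(P - P_hat) / denom))
```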

2.5.3 Surrogate data sources

Before describing our data sources in detail, we describe our overall methodology for organizing a flu-related dictionary (for tracking in multiple media such as news, tweets, and search queries).

Dictionary creation. The keywords relating to ILI were organized from a seed set of words and expanded using a combination of time series correlation analysis and pseudo-query expansion. The seed set of keywords (e.g., gripe, Spanish for "flu") was constructed in Spanish, Portuguese, and English using feedback from our in-house subject matter experts.

Pseudo-query expansion. Using the seed set, we crawled the top 20 websites (according to Google Search) associated with each word in this set. We also crawled expert sites, such as the official CDC website and the equivalent websites of the countries under consideration, detailing the causes, symptoms, and treatment of influenza, as well as a few hand-picked websites such as channel/flu_treatments. We filtered the words from these sites using standard language-processing techniques such as stopword removal and Porter stemming. The filtered set of keywords was then ranked according to the absolute frequency of occurrence, and the top 500 words for Spanish and English were selected. For example, words such as enfermedad ("disease") and pandemia ("pandemic") were obtained from this step.

Time series correlation analysis. Next, we used Google Correlate (now a part of Google Trends) to identify the keywords most correlated with the ILI case count time series for each country. Once again, these words were found to be a mix of English and Spanish. As an added step in this process, we also compared time-shifted ILI counts: left-shifted to capture the words searched leading up to the actual flu infection, and right-shifted to capture the words commonly searched during the tail of the infection. This entire exercise provided us with some interesting terms, like ginger, which has been used as a natural herbal remedy in the eastern world. We also found popular flu medications such as Acemuk and Oseltamivir (the latter also sold under the trade name Tamiflu) as highly correlated search queries, particularly for Argentina.
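The filtering-and-ranking step of the pseudo-query expansion can be sketched as follows, assuming NLTK for stopwords and the Porter stemmer (note that the Porter stemmer is English-oriented; applying it uniformly here is a simplification, and the corpus download step is assumed to have been done):

```python
from collections import Counter
from nltk.corpus import stopwords   # assumes nltk stopwords data is installed
from nltk.stem import PorterStemmer

def rank_keywords(documents, lang="spanish", top_k=500):
    """Stopword removal, stemming, then ranking by absolute frequency
    across the crawled pages."""
    stop = set(stopwords.words(lang))
    stem = PorterStemmer().stem
    counts = Counter(
        stem(w)
        for doc in documents
        for w in doc.lower().split()
        if w.isalpha() and w not in stop
    )
    return [w for w, _ in counts.most_common(top_k)]
```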

Final filtering. The set of terms obtained from query expansion and correlation analysis was then pruned by hand to obtain a vocabulary of 151 words. We then performed a final correlation check and retained a final set of 114 words.

Google Flu Trends (F): Google Flu Trends (GFT) is a tool based on [24] and provided by Google.org, which gives weekly, up-to-date ILI case count estimates using search query volumes. Of the countries under consideration, GFT provides weekly estimates for only 6, viz. Argentina, Bolivia, Chile, Mexico, Peru, and Paraguay. These estimates are typically at a different scale than the ILI case counts provided by PAHO and therefore need to be scaled accordingly. We collected this data weekly, on Mondays, from January 2013 to August 2013. (The data downloaded on a particular day contains the entire time series from 2004 to the corresponding week.)

Google Search Trends (S): Google Search Trends is another tool provided by Google. Using this tool, we can download an estimate of search query volume as a percentage over its own temporal history, filtered geographically. We downloaded the search query volume time series for the 114 keywords described earlier and converted the percentage measures to absolute values using a static dataset we downloaded in October 2012, when Google Search Trends still provided absolute query volumes (see the rescaling sketch below).

Twitter (T): Twitter data was collected from Datasift.com and geotagged using an in-house geocoder. We lemmatized the tweet contents and used language detection and POS tagging to help differentiate relevant from irrelevant uses of our keywords (e.g., the Spanish word gripe, meaning flu, is part of our flu keyword list, as opposed to the undesired and unrelated English word gripe). The resulting analysis yields weekly occurrence counts of our dictionary in tweets.

HealthMap (H): Similar to Twitter, we also collect flu-related news stories using HealthMap (healthmap.org), an online global disease alert system capturing outbreak data from over 50,000 electronic sources. Using this service, we receive flu-related news as a daily feed, which is similarly enriched and filtered to obtain a multivariate time series over lemmatized versions of the keywords. While Twitter is more suitable for ascertaining general public response, the HealthMap data provides more detailed information but may capture trends at a slower rate. Thus, each of these sources offers utility in capturing different surrogate signals: Twitter offers leading but noisy indicators, whereas HealthMap provides a slightly delayed but more reliable indicator.
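The conversion of GST percentage series to absolute volumes, anchored on the static October 2012 snapshot, reduces to a single rescaling; the names below are illustrative, assuming one week for which both the percentage and the absolute volume are known:

```python
def to_absolute(pct_series, ref_pct, ref_abs):
    """Rescale a Search Trends percentage series to absolute query volumes
    using one reference week (ref_pct percent == ref_abs queries)."""
    scale = ref_abs / ref_pct      # queries per percentage point
    return [p * scale for p in pct_series]
```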

OpenTable (O): We also use data on trends in restaurant table reservations, initially studied in [41] as a potential early indicator for outbreak surveillance, as another surrogate for ILI detection. This novel data stream is based on the postulate that a higher-than-average number of restaurants with table availability in a region can serve as an indicator of an event of interest, such as an increase in flu cases. Table availability was monitored using OpenTable, an online restaurant reservation site with 28,000 restaurants at the time of this writing. Daily searches were performed, starting from September 2012, for a table for two persons at lunch and dinner: between 12:30-3pm, and between 6-10:30pm. Data was collected for Mexico by city (Cancun, Mexico City, Puebla, Monterrey, and Guadalajara) and for the entire country. The daily proportion of restaurants with available tables (a proportion is used due to changes in the number of restaurants in the system) was aggregated into a weekly time series.

Weather (W): All of the previously described data sources can be termed non-physical indicators, which can serve as indirect indicators of the state of the population with respect to flu by exposing different population characteristics. Meteorological data, on the other hand, can be considered a more direct, physical driver of influenza transmission [65]. It has been shown in [49, 51, 57] that absolute humidity can be directly used to predict the onset of influenza epidemics. Here, we collect several other meteorological indicators, such as temperature and rainfall, in addition to humidity, from the Global Data Assimilation System (GDAS). We accessed this data in GRIB format at a resolution of 1 degree lat/long. However, looking at all lat/long grid points for a country can often lead to noisy data, so we filtered the downloaded data and used the indicators only around the surveillance centers. We then aggregated this data using weekly averages to obtain a resultant time series for each country. We collected this data weekly from January 2013 to August 2013.
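The spatial filtering and weekly aggregation described for the weather source can be sketched as follows, assuming a flat DataFrame of grid-point records with illustrative column names; the distance test and radius are simplifications of whatever selection was actually used around the surveillance centers:

```python
import numpy as np
import pandas as pd

def weekly_weather(records, centers, radius_deg=1.0):
    """Aggregate gridded GDAS-style records to a weekly series, keeping only
    grid points within `radius_deg` of a surveillance center (lat, lon).

    records : DataFrame with columns ['date', 'lat', 'lon', 'temp',
    'humidity', 'rain']; centers : iterable of (lat, lon) pairs.
    """
    keep = np.zeros(len(records), dtype=bool)
    for clat, clon in centers:
        keep |= (np.hypot(records["lat"] - clat,
                          records["lon"] - clon) <= radius_deg).to_numpy()
    near = records[keep].set_index(
        pd.to_datetime(records.loc[keep, "date"]))
    # Weekly mean of each indicator over the retained grid points.
    return near[["temp", "humidity", "rain"]].resample("W").mean()
```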

Results
In this section, we present an exhaustive set of experiments evaluating our algorithms over 6 months of predictions from Jan 2013 to August 2013. The final and stable estimates of ILI case counts are considered to be the estimates downloaded from PAHO on Oct 1, 2013. All models considered here were used to forecast 2 weeks beyond the latest available PAHO ILI estimates. Key findings are presented in Table 2.1; we analyze some important observations from this table next.

Figure 2.4: Accuracy of different methods for each country.

Can we beat Google Flu Trends with our custom dictionary? The key difference between Google Flu Trends (which can be considered a base rate) and Google Search Trends is that the former uses a closed dictionary whereas we constructed the dictionary used with GST. As can be seen in Table 2.1, for the majority of the common countries (countries for which data from both GST and GFT is present), regressors running on GST consistently outperform those running on GFT (with Mexico and Peru being the exceptions). Thus we posit that the GST model devised here is a sufficiently close approximation to GFT, with the added advantages of having access to raw-level data and being available for more countries than GFT (among the 15 countries we consider, only 6 are present in the GFT database).

Which is the optimal regression model? From Table 2.1, we can also analyze the three different regressors proposed earlier with respect to overall accuracy. With respect to each individual source, we can see that matrix factorization with nearest neighbor embedding (MFN) performs the best on average over the countries.

Table 2.1: Comparing forecasting accuracy of models using individual sources. Scores in this and other tables are normalized to [0, 4] so that 4 is the most accurate. Rows: regression models (MF, NN, MFN) crossed with sources (W, H, T, F, S); columns: the 15 countries (AR, BO, CL, CR, CO, EC, GF, GT, HN, MX, NI, PA, PY, PE, SV) and the overall average. GFT-based entries are N/A for countries outside the GFT database.

Table 2.2: Comparison of prediction accuracy while combining all data sources and using MFN regression, at the model level and the data level of fusion.

Table 2.3: Comparison of prediction accuracy while using model level fusion on MFN regressors and employing PAHO stabilization. Correction methods: none, weeks ahead, number of samples, and combined.

Table 2.4: Discovering the importance of sources in model level fusion on MFN regressors by ablating one source at a time (all sources, then each of w/o W, w/o H, w/o T, w/o S, w/o F).

For some countries, such as Panama, when using only GST, MFN performs more poorly than vanilla MF; nevertheless, the average accuracy over all countries for any given data source is best when using MFN.

Table 2.5: ILI case count prediction accuracy for Mexico using OpenTable data as a single source, and by combining it with all other sources using model level fusion on uncorrected ILI case count data. Methods: MF, NN, MFN, and model fusion; settings: lunch, dinner, and lunch & dinner.

Which is the best strategy to combine multiple data sources? As shown in Table 2.2, overall, model level fusion works better than data level fusion. For 8 of the 15 countries, model level fusion works appreciably better than data level fusion, while the reverse trend is seen for 4 other countries. This showcases the importance of considering both kinds of fusion depending on the country of interest.

How effective are we at forecasting a moving PAHO target? As shown in Table 2.3, our corrected estimates using both the number of samples and the weeks ahead from the upload date are generally better. It is instructive to note that while our correction strategy increases the overall accuracy only by a score of approximately 0.05 over all the countries, for some countries such as Mexico and Argentina (for which the data update is typically noisy) we obtain a substantial improvement in scores. This suggests that the correction strategy may be selectively applied when forecasting for certain countries.

How do physical vs. social indicators fare against each other? From Table 2.1, we see that the data source with the best single accuracy happens to be the physical indicator source, i.e., weather data. However, Table 2.4 conveys a mixed story. Here we conduct an ablation test, wherein we remove one data source at a time from our model level MFN fusion framework and contrast accuracies. While removing the weather data degrades the accuracy score the most, removing the social indicators also degrades the score to varying degrees. Thus we posit that it is important to consider both the physical and social indicators to get a refined signal about the prevalent ILI incidence in the population.

How relevant is restaurant reservation data to forecasting ILI? All the results thus far do not consider the OpenTable reservation data, since this source is available only for Mexico (among the countries studied here). We considered table availability for different time ranges and compared performance using our MFN model. As Table 2.5 demonstrates, we obtain the best performance when considering both lunch and dinner reservation data. Nevertheless, we have observed that including this source as part of the ensemble decreases the overall accuracy by 0.01 on the uncorrected ILI case count data. Thus it is our opinion that although the reservation data could exhibit some signals about prevalent ILI conditions, it likely is also a surrogate for non-health conditions (e.g., social unrest) which must be factored out to make the data source more useful.

Finally, we present Figure 2.4 where we compare, for each country, the accuracies of prediction

from the best individual source with those from both data level and model level fusion of the different sources, and the model level fusion of MF regressors applied on the corrected PAHO estimates rather than the raw ones. As can be seen, we progressively increase our accuracies, with the corrected PAHO estimates providing the final increase in predictive power to our model level fusion framework.

2.7 Discussion
In this chapter, we have aimed to generate short-term ILI forecasts over a range of Latin American countries using a gamut of options pertaining to data sources, fusion possibilities, and corrections to track a moving target. Our results demonstrate that there are significant opportunities to improve forecasting performance and that there is selective superiority among data sources that can be leveraged. However, the presented method works best for near-horizon forecasts, with a significant drop in accuracy (see Chapter 3) for longer-range forecasts. Thus we will next explore methods to increase the forecasting horizon while adhering to the principle of using multiple sources to generate such forecasts.

Chapter 3
Dynamic Poisson Autoregression for Influenza-Like-Illness Case Count Prediction

In Chapter 2, we presented our initial efforts at forecasting influenza-like-illness (ILI) case counts. Seasonal influenza regularly affects the global population and improvements in forecasting capability can directly translate into tangible measures of public health. The methods presented in that chapter successfully incorporated surrogate sources of information to produce real-time forecasts. However, the reliable forecasting horizon was limited, both by model complexity and by an inability to maintain coherence between successive forecasts. In this chapter, we aim to relax such limitations and increase the forecasting horizon without increasing the computational complexity of the model.

Traditionally, epidemiologists aim to predict several characteristics of ILI from surveillance reports. Such characteristics of interest can be broadly classified into: (a) seasonal characteristics and (b) short-term characteristics. Seasonal characteristics are concerned with the overall shape of ILI counts for the particular season (see Part II for more details). Such methods are generally trained by assigning greater importance to statistics of the ILI curve such as the peak value and the peak size. Conversely, short-term characteristics are concerned with accurately predicting the next few data points in absolute value rather than aiming for an overall fit for the season. In this chapter we are motivated by the second problem, i.e., the short-term forecasting challenge (but we also evaluate our methods w.r.t. seasonal characteristics).

As discussed earlier, among the several challenges in ILI case count forecasting, one of the most important facts is that surveillance reports are often delayed by a number of weeks, and therefore estimating the current on-ground scenario is a crucial problem. The case count estimates for a given week can be delayed anywhere from 1 week to 4 weeks, depending on the quality of the surveillance apparatus in a given country. Thus in this chapter we aim to provide reliable short-term forecasts from the last available

surveillance data such that we can estimate the on-ground case counts and increase our forecasting horizon to at least 4 weeks.

In traditional epidemiology, several models, such as SEIR and SIRS [3], have been proposed to model the temporal profile of infectious diseases. In modern computational epidemiology, more advanced methods have been used. One currently popular method is to fit prediction models by matching observational data against a large library of simulated curves [5, 40, 58]. The curve simulations are generated by using different epidemiological parameters and assumptions. Sometimes network-based models are used to generate the curves [3]. Partially observed influenza counts for a particular year can then be matched to a library of curves to produce the best set of predictions [40]. Closely related to such curve matching methods are filtering-based methods that dynamically fit epidemic models onto observed data by letting the base epidemic parameters vary over time. Yang et al. [66] provide an excellent survey of filtering-based methods used for influenza forecasting and also present a comparative analysis of such methods.

Concurrently, there has been a lot of interest in using indicator data sources to predict seasonal influenza. In [24], Ginsberg et al. presented a method of estimating weekly influenza counts based on search query volumes (Google Flu Trends). Following this seminal work, researchers have investigated a wide variety of data sources such as Wikipedia [25], Twitter [12, 34, 46], and online restaurant reservations [41]. Weather has been found to be a significant indicator of seasonal influenza [49, 50, 51, 57]. In [12], different indicator sources are contrasted to understand their relative influence on short-term forecasting quality.

As rich and varied as the above approaches are, most approaches in the literature aim to use the same model to predict for the entire influenza season. This is not entirely desirable as in-season ILI characteristics may vary significantly from the out-of-season characteristics (see Section 3.1.3). While researchers appreciate the need for dynamic models (e.g., [12]), constraints on temporal consistency are never explicitly imposed in current models. Thus in this chapter we aim to propose a general purpose time series prediction model allowing external factors from indicator sources to produce robust short-term forecasts in a consistent manner.

A popular model for analyzing time series data is the autoregressive exogenous (ARX) model [4, 36]. The ARX model has also been adopted by Paul et al. [46] to predict ILI case counts by using Twitter and Google Flu Trends (GFT) as the indicator sources. However, the underlying static autoregressive model may not be suitable for flu trend forecasting, as the activity of the disease and the human living environment evolve over time. Ohlsson et al. [42] have designed a more flexible ARX model for time-varying systems based on model segmentation. It allows the weights of the autoregressive model to be temporally piecewise constant. In this chapter, we further relax this requirement. We build separate models for each time point, but we constrain the models to share common characteristics. To capture such characteristics, we build a graph over models at different time points and embed the prior knowledge on model similarity in terms of the structure of the graph; a minimal sketch of the resulting graph-regularized objective appears below.
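To make the formulation concrete, the following sketch evaluates a graph-regularized dynamic ARX objective of the kind described above: one weight vector per time point, a per-time-point autoregressive loss, and a penalty pulling the models at graph-connected time points toward each other. The variable names, the squared loss (standing in for the Poisson likelihood used by DPARX), and the fixed regularization weight eta are illustrative assumptions, not the exact objective of [62].

```python
import numpy as np

def dynamic_arx_objective(W, X, y, edges, eta=1.0):
    """Graph-regularized dynamic ARX loss (squared-loss sketch).

    W     : (T, d) array, one ARX weight vector w_t per time point
    X     : (T, d) array, lagged target + exogenous features per time point
    y     : (T,) array, observed counts
    edges : list of (t, s) pairs in the model-similarity graph
    eta   : regularization weight controlling model variation
    """
    # Per-time-point autoregressive fit term.
    fit = np.sum((y - np.einsum("td,td->t", X, W)) ** 2)
    # Similarity term: connected time points should have similar models.
    smooth = sum(np.sum((W[t] - W[s]) ** 2) for t, s in edges)
    return fit + eta * smooth

# A fully connected similarity graph over T time points:
T, d = 30, 5
edges = [(t, s) for t in range(T) for s in range(t + 1, T)]
rng = np.random.default_rng(0)
W, X = rng.normal(size=(T, d)), rng.normal(size=(T, d))
y = rng.poisson(10, size=T).astype(float)
print(dynamic_arx_objective(W, X, y, edges, eta=1.0))
```

Because the objective is convex in W and the variables decompose into per-time-point blocks, a block coordinate descent pass over the individual weight vectors, as described next, is a natural solver.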

We then formulate the dynamic ARX model learning problem as a convex optimization problem, whose objective balances the autoregressive loss and the model similarity regularization induced by the graph structure. In this optimization problem, the variables have a natural block structure. Thus we apply a block coordinate descent method to solve this problem. We further extend our dynamic ARX modeling to the Poisson regression model for a better fit to the count data [4, 14], as is relevant for ILI case count forecasting.

We perform extensive experimental studies to evaluate the effectiveness of the proposed model and the corresponding learning algorithm. We use various real world datasets in the experiments, including different types of indicator data sources from 15 countries around the world. Our experimental studies illustrate that the dynamic modeling of the linear Poisson autoregressive model captures well the underlying progression of disease counts. Further, our results also show that our proposed method outperforms state-of-the-art ILI case count forecast methods. Our main contributions are summarized as follows:

We propose a new dynamic ARX model for the task of ILI case count forecasting. This approach incorporates a linear Poisson regression model with non-negativity constraints into an ARX model, ideal for case count modeling.

Prior domain knowledge can be encoded as structural relationships among different time points in a graph, which is embedded into the objective as a regularization term while still ensuring that the optimization problem is convex.

We evaluate the proposed method using various real world datasets, including different types of indicator data sources from the USA and 14 Latin American countries.

3.1 Summary
We present a brief summary of our findings here. For a more detailed treatise we ask the reader to refer to [62]. We developed two dynamic generalized linear models, viz. the Dynamic Autoregressive model (DARX) and the Dynamic Poisson Autoregressive model (DPARX), and compared their forecasting performance for various sources against a number of state-of-the-art algorithms. We highlight some of our interesting findings here.

3.1.1 Model Similarity
First, we conduct experiments to investigate the model similarities posited by our proposed algorithm. In this experiment, we calculate the distance between all pairs of models learned by DPARX during a period of time on the AR dataset. We present the distance matrix associated with the ground truth ILI case count series in Figure 3.1.

We see that the distance matrix has a strong seasonal pattern, which is consistent with the pattern of the ILI case count series. At the beginning of each flu season, the model is significantly different from the models at other time points. This result demonstrates that ILI case counts have a strong periodic pattern and that the dynamic modeling approach successfully captures this pattern. It also validates the necessity of conducting this level of modeling for flu forecasting.

In the next experiment, we run our proposed DPARX method on the US dataset under three different model similarity graphs: the fully connected graph, the 3-nearest neighbor graph and the seasonal 3-nearest neighbor graph. We then calculate the three corresponding distance matrices of the learned models, which are shown in Figure 3.2. The patterns in the three distance matrices are very similar. However, the distances between the pairs of models are smaller for the fully connected similarity graph. Without strong prior knowledge, the fully connected similarity graph is preferred, as during different seasons the target signal may still be very different. In the following experiments, we will use the fully connected similarity graph for the regularization term.

3.1.2 Forecasting Results
In the ILI case count forecast experiments, we use the data record from all 15 countries. All the case count data are associated with several data sources similar to the ones in Chapter 2. We start with 50 given time points and test the prediction result on the remaining time points. We run all the competing methods in an online manner: the models are re-trained and updated after the arrival of values at every additional time point (a sketch of this rolling evaluation loop follows below). For the DARX and DPARX models, we use the same parameter settings: p = 1, b = 15 for the GFT and Weather data sources, as these data sources have relatively small dimension; p = 1, b = 4 for the GST and HealthMap data sources, as these data sources have relatively high dimension. The ARX model does not provide numerically stable results for high dimensional data. Thus we present its results on the GFT and Weather data sources with p = 1, b = 15. Likewise, the training of the SARX model is very time consuming, especially for high dimensional data. We thus only present its results using the GFT data source with the same setting (p = 1 and b = 15). The remaining parameter in our model is the regularization parameter that controls the variation of the model. We fix it as η = 1 for the DARX model and η = 5 for the DPARX model during all experiments. For the MFN algorithm, we follow the same procedure and parameter settings as in [12].

We present the results of short-term ILI case count forecasting for different countries with both 1-step forecasts and multi-step forecasts with step sizes of 2, 3, and 4. The prediction accuracies on the GFT, Weather, GST, and HealthMap data sources are presented in Tables 3.1, 3.2, 3.3, and 3.4, correspondingly. The experiments show that our models yield better prediction accuracy, especially for multi-step forecasting.
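The online protocol referenced above can be summarized by a small rolling-origin loop: at each week the model is refit on everything observed so far and scored on the next few weeks. The fit/predict interface below is a placeholder standing in for DARX/DPARX, and the 50-point warm start mirrors the setup described above.

```python
import numpy as np

def rolling_forecast(fit, predict, X, y, warm_start=50, horizon=4):
    """Online evaluation: refit at every time point, forecast `horizon` steps.

    fit(X, y) -> model; predict(model, X_future) -> (horizon,) forecasts.
    Returns an array of shape (n_evals, horizon) of absolute errors.
    """
    errors = []
    for t in range(warm_start, len(y) - horizon):
        model = fit(X[:t], y[:t])                 # retrain on all data seen so far
        y_hat = predict(model, X[t:t + horizon])  # 1- to horizon-step forecasts
        errors.append(np.abs(y[t:t + horizon] - y_hat))
    return np.array(errors)
```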

Multi-step forecasting is a much harder task than 1-step forecasting; the dynamic modeling of ARX provides more flexibility in handling the uncertainty associated with the target signal.

Table 3.1: Prediction accuracies for competing algorithms (ARX, MFN, SARX, DARX, DPARX) with forecast steps 1 through 4 over different countries (AR, BO, CL, MX, PE, PY, US) using the GFT input source. GFT data is not available for the other countries.

3.1.3 Seasonal Analysis
In this chapter, we have not trained the models to predict the seasonal metrics. However, we can construct ILI prediction curves for each step-ahead, i.e., the 1-step ILI prediction curve, the 2-step ILI prediction curve and so on. From these prediction curves we can then calculate the season characteristics and compare them against those calculated from the observed PAHO (or CDC) ILI counts. We compare the predicted and observed seasonal characteristics for the last ILI year in our set for each country. As our experimental results show [62], the proposed algorithms work well for a number of countries. In general, DPARX performs better in terms of the overall prediction characteristics. This is consistent with our results for near-term forecasts. For seasonal characteristics, Weather and GFT seem to be the most important sources for prediction. We also present the predicted and real curves for Mexico for the ILI season 2013 in Figure 3.3, based on 1-step ahead predictions. Except for the GST and HealthMap data under some of the state-of-the-art methods, all the curves match up closely to the observed ILI curve.

Discussion
In this chapter, we presented a practical short-term ILI case count forecasting method using multiple digital data sources. One of the main contributions of the proposed model is that the underlying autoregressive model is allowed to change over time. In order to control the variation of the model, we built a model similarity graph to indicate the relationship between each pair of models at two different time points and embedded the prior knowledge in the structure of the graph. The experiments demonstrate that our proposed algorithm provides consistently better forecasting results than state-of-the-art time series models used for short-term ILI case count forecasting. We also observed that the dynamic model successfully captures the seasonal pattern of flu activity. Finally, while these techniques were applied to the relatively specialized field of ILI case count forecasting, the methods presented are generic enough that they may be adapted to other similar count prediction problems.

Figure 3.1: The distance matrix obtained from our learned DPARX model (bottom figure), associated with the ground truth ILI case count series (top figure) on the AR dataset. We can observe the strong seasonality automatically inferred in the matrix. Each element in the matrix is the Euclidean distance between a pair of the learned models at two corresponding time points after training. For the top figure, the x axis is the index of the weeks; the y axis is the number of ILI cases. For the bottom figure, both x and y axes are the index of the time points. Note that the starting time point (index 0) for the distance matrix is week 15 of the ILI case count series.

Figure 3.2: Model distance matrices for the US dataset. The three matrices are derived from the fully connected similarity graph, the 3-nearest neighbor similarity graph and the seasonal 3-nearest neighbor similarity graph, from left to right correspondingly.

Figure 3.3: Comparison of seasonal characteristics for Mexico using different algorithms for one-step ahead prediction. Blue vertical dashed lines indicate the actual start and end of the season. ILI season considered: 2013.

Table 3.2: Prediction accuracies for competing algorithms (ARX, MFN, DARX, DPARX) with different forecast steps over different countries using the weather data source.

Table 3.3: Prediction accuracies for competing algorithms (MFN, DARX, DPARX) with different forecast steps over different countries using the GST data source.

Table 3.4: Prediction accuracies for competing algorithms (MFN, DARX, DPARX) with different forecast steps over different countries using the HealthMap data source.

Part II
Long-term Forecasting using Surrogates

We discussed several facets of short-term forecasting, especially with respect to ILI, in Part I. Concomitant with short-term forecasting, which provides real-time insights about the current on-ground scenario, the long-term characteristics of targets are often of prime interest. Considering the example of epidemic diseases, surveillance agencies are interested in identifying seasonal characteristics such as the following:

1. Start week: Within a particular ILI year (which may not be a calendar year; e.g., in the USA, the ILI year spans from Epi Week 40 to Epi Week 39 [11]), the start week is the week from which ILI is said to be in season. We define the start week for an ILI year to be the first week where the ILI counts for 3 consecutive past weeks (including itself) are greater than a pre-defined threshold.

2. Peak week: Within a particular ILI year, the peak week is the week for which the ILI count is highest for that ILI year.

3. Peak size: The peak size is the ILI count observed on the peak week.

4. End week: Within a particular ILI year, the end week is the first week after the peak week such that the ILI counts for 3 consecutive past weeks (including itself) are lower than a pre-defined threshold. The end week signifies the end of the ILI season and is thus of interest to epidemiologists.

5. Season size: The season size is used as a proxy for the size of the epidemic. It is calculated by summing up the total ILI count from the start to the end week.

A short sketch computing these five characteristics from a weekly count series appears at the end of this introduction. In traditional epidemiology, several models, such as SEIR and SIRS [3], have been proposed to model the temporal profile of infectious diseases. In modern computational epidemiology, more advanced methods have been used. One currently popular method is to fit prediction models by matching observational data against a large library of simulated curves [5, 40, 58]. The curve simulations are generated by using different epidemiological parameters and assumptions. Sometimes network-based models are used to generate the curves [3]. Partially observed influenza counts for a particular year can then be matched to a library of curves to produce the best set of predictions [40]. Closely related to such curve matching methods are filtering-based methods that dynamically fit epidemic models onto observed data by letting the base epidemic parameters vary over time. Yang et al. [66] provide an excellent survey of filtering-based methods used for influenza forecasting and also present a comparative analysis of such methods. We present our efforts at disease forecasting using curve matching methods in Chapter 4 and subsequently present our data assimilation based models for easier integration of surrogates in Chapter 5.
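The sketch below implements the five definitions above directly on a weekly count series; the threshold value and the 1-indexed week numbering are illustrative choices.

```python
import numpy as np

def season_characteristics(counts, threshold):
    """Compute start/peak/end week, peak size, season size (weeks 1-indexed).

    counts    : weekly ILI counts for one ILI year
    threshold : pre-defined in-season threshold on weekly counts
    """
    counts = np.asarray(counts, dtype=float)
    above, below = counts > threshold, counts < threshold
    peak = int(np.argmax(counts))  # peak week (0-based index)
    start = end = None
    for w in range(2, len(counts)):
        # Start: first week whose 3 trailing weeks (inclusive) all exceed threshold.
        if start is None and above[w - 2:w + 1].all():
            start = w
        # End: first week after the peak whose 3 trailing weeks all fall below it.
        if start is not None and w > peak and below[w - 2:w + 1].all():
            end = w
            break
    season_size = (counts[start:end + 1].sum()
                   if start is not None and end is not None else None)
    return {
        "start_week": None if start is None else start + 1,
        "peak_week": peak + 1,
        "peak_size": counts[peak],
        "end_week": None if end is None else end + 1,
        "season_size": season_size,
    }
```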

Chapter 4
Curve-matching from a library of curves

One of the simplest and more intuitive strategies for long-term forecasting is based on matching against a library of curves [3, 5]. Typically, a library of curves can be generated using various parameter choices of compartmental models such as SEIR. Curves can also be generated from agent-based models that are informed through a combination of diverse sources such as census and road-network data. Curves from such a library can then be matched against specific epidemic surveillance data to predict the seasonal curves. Seasonal characteristics can then be identified from these detected curves using the definitions outlined earlier. Some of the considerations in this process are as follows:

Figure 4.1: Filtering a library of curves based on season size and season shape.

1. Appending short-term forecasts to surveillance data: Surveillance reports are typically delayed. As presented in Part I, we can use surrogates to generate robust predictions for the short term. In general, these predictions are robust, i.e., stable with respect to

surveillance updates. Short-term forecasts can also provide measures of uncertainty about current surveillance. We append these predictions to the last-available surveillance data so that the partial time series to match against the curves is longer and hence more accurate. This is especially useful during the initial part of the season, where only a few data points are available from the surveillance reports to match against the library of curves.

2. Filtering the library of curves: Typically, the library may contain a wide variety of curves corresponding to various kinds of diseases, identified through the various epidemiological parameters used to simulate the curves. Many of these curves can be unsuitable for matching against the disease of interest (such as ILI). Moreover, admitting these curves for matching may lead to increased false detections. As such, we filter the curves using historical trends of the disease of interest by the following factors (a sketch of the resulting filter-and-match procedure follows below):

- Filter curves by the average season size of the disease.
- Filter curves by the average peak-to-season size ratio. Effectively, this strategy filters according to the shape of the epidemic curve.

Figure 4.1 shows examples of curves that were filtered out from such a library while matching against ILI data for Latin America.
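The sketch below illustrates the filter-and-match idea referenced above: candidate curves are kept only if their season size and peak-to-season ratio fall near historical values for the disease, and the survivors are ranked by distance to the observed partial season (with short-term forecasts already appended). The tolerance bands and the Euclidean matching score are illustrative assumptions rather than the exact criteria used in our system.

```python
import numpy as np

def match_partial_season(partial, library, hist_size, hist_ratio,
                         tol=0.5, top_k=5):
    """Rank simulated curves against an observed partial season.

    partial    : observed weekly counts so far (short-term forecasts appended)
    library    : list of full simulated seasonal curves
    hist_size  : historical average season size for the disease
    hist_ratio : historical average peak-to-season-size ratio
    tol        : relative tolerance band for both filters
    """
    scored = []
    for curve in library:
        curve = np.asarray(curve, dtype=float)
        if curve.sum() <= 0:
            continue
        size, ratio = curve.sum(), curve.max() / curve.sum()
        # Filter by season size and by curve shape (peak-to-season ratio).
        if abs(size - hist_size) > tol * hist_size:
            continue
        if abs(ratio - hist_ratio) > tol * hist_ratio:
            continue
        n = len(partial)
        scored.append((np.linalg.norm(curve[:n] - partial), curve))
    scored.sort(key=lambda pair: pair[0])
    return [curve for _, curve in scored[:top_k]]  # best candidate seasons
```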

Performance Highlights
We used the aforementioned curve-matching strategy to predict ILI seasonal characteristics for 15 Latin American countries. Figure 4.2 gives an example of such forecasts and Figure 4.3 outlines the results for several countries against the aforementioned metrics as reported by IARPA. As can be seen, the framework works well for a few metrics and for a few countries, such as Ecuador for total RSV counts. However, our performance was poor for several other metrics. Furthermore, we found curve matching models to be inconsistent with respect to the week of the season when the forecasts were generated. Also, this method admits the use of surrogates only for increasing the length of the time series to match against and fails to use them for determining more interesting facets such as the disease transmission rate.

Figure 4.2: Example of seasonal forecasts for ILI using curve-matching methods.

Figure 4.3: Performance measures for ILI seasonal characteristics using curve-matching.

Chapter 5
Data Assimilation methods for long-term forecasting

Chapter 4 outlined our efforts at seasonal forecasting using curve-matching models. As identified in that chapter, such methods make sub-optimal use of surrogate information when determining seasonal characteristics. Motivated by the efforts of Shaman et al. [51], we developed data assimilation models where surrogates are used to force the disease parameters, and seasonal characteristics are found by optimizing over the most probable seasonal curves. We present our efforts in some detail in the following sections, first describing some of the relevant data assimilation models in Section 5.1, and then presenting our disease forecasting models using data assimilation methods in Sections 5.2 and 5.3.

5.1 Data Assimilation
Originally proposed in the 1960s [26], the Kalman filter (KF) has rapidly gained a reputation in a myriad of applications [17] that feature estimation and forecasting. Nowadays the original KF has evolved and given rise to an entire class of dynamic estimation/forecasting algorithms that recursively estimate and forecast: they optimize the estimates by assimilating noisy measurements (observations), and forecast using a presumed process model.

We first consider linear process models. Such systems can be expressed as a pair of linear stochastic process and measurement equations as shown below:

x_{k+1} = A x_k + B u_k + w_k,   w_k ~ N(0, Q)
z_k = H x_k + v_k,               v_k ~ N(0, R)        (5.1)

where x ∈ R^n is the state vector, z ∈ R^m is the measurement vector, A ∈ R^{n×n} is called the process matrix, B is the matrix that relates the optional control input u to the state, and H ∈ R^{m×n} is the measurement matrix. The process noise w_k and measurement noise v_k

are assumed to be mutually independent, zero-mean, normally distributed random variables with noise covariances Q and R, respectively. The classic Kalman filter was developed to estimate the hidden states as well as forecast the observed targets of such linear processes, and the relevant equations can be outlined in the following two groups:

Estimation:
K_k = P^f_k H^T (H P^f_k H^T + R)^{-1}
x^a_k = x^f_k + K_k (z_k - H x^f_k)
P^a_k = (I - K_k H) P^f_k        (5.2)

Forecast:
x^f_{k+1} = A x^a_k + B u_k
P^f_{k+1} = A P^a_k A^T + Q        (5.3)

where x^a_k is the optimal state estimate given the measurement vector z_k ∈ R^m, P^a_k is the analysis state error covariance, and R is the measurement noise covariance. Equation 5.2 assimilates measurements into the estimate via the Kalman gain matrix K, which weighs the impact from the measurement versus that from the prediction. Larger R or smaller Q increases the weight of the prediction, while smaller R or larger Q increases the weight of the measurement. x^f_{k+1} ∈ R^n is the forecast state, P^f_{k+1} ∈ R^{n×n} is the forecast state error covariance, and Q is the process noise covariance. Equation 5.3 forecasts x and P for time step k + 1.
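As a concrete reference point, a minimal NumPy sketch of one cycle of the linear recursion in equations (5.2) and (5.3) follows; the matrix names mirror the text, and the calling code must supply consistently shaped arrays.

```python
import numpy as np

def kalman_step(x_f, P_f, z, A, B, u, H, Q, R):
    """One estimation + forecast cycle of the linear Kalman filter (eqs. 5.2-5.3)."""
    # Estimation (eq. 5.2): assimilate measurement z via the Kalman gain.
    K = P_f @ H.T @ np.linalg.inv(H @ P_f @ H.T + R)
    x_a = x_f + K @ (z - H @ x_f)
    P_a = (np.eye(len(x_f)) - K @ H) @ P_f
    # Forecast (eq. 5.3): propagate the analysis through the process model.
    x_next = A @ x_a + B @ u
    P_next = A @ P_a @ A.T + Q
    return x_a, P_a, x_next, P_next
```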

In reality, the process and measurement models are often nonlinear. A nonlinear system can be modeled by nonlinear stochastic equations:

x_{k+1} = a(x_k, u_k) + w_k,   w_k ~ N(0, Q)
z_k = h(x_k) + v_k,            v_k ~ N(0, R)        (5.4)

One of the more popular solutions to such nonlinear systems is the extended Kalman filter (EKF), which is essentially a Kalman filter modified to linearize the estimation about the current mean and covariance [63]. The EKF equations are similar to the KF equations, except that the EKF needs to compute the Jacobian matrices at each time step:

A = ∂a(x)/∂x |_{x̂},   H = ∂h(x)/∂x |_{x̂}.        (5.5)

The EKF came to prominence in aerospace and robotics applications where the state space is small; however, in more complex systems with a high-dimensional state space, such as those in weather and disease prediction, it falls short due to the intractable computational burden associated with the Jacobian matrices, as well as maintaining and evolving a separate covariance matrix at each time step.

The ensemble Kalman filter (EnKF) [21] was thus developed to alleviate this computational complexity. It is related to the particle filter (PF) [20] in the sense that each ensemble member can be considered a particle aimed at estimating the relevant probability distributions using Monte Carlo procedures. However, contrary to the PF, the EnKF assumes Gaussian distributed noise characteristics and is thus computationally more efficient than the PF, albeit with stricter assumptions about the underlying process. The essential steps of the EnKF are: 1) maintaining an ensemble of state estimates instead of a single estimate, 2) simply advancing each member of the ensemble and, 3) calculating the mean and error covariance matrix directly from this ensemble. Assuming that we have an ensemble of q state estimates with random sample errors, the EnKF steps can be expressed via the following equations:

K̂_k = E^f_{x_k} (E^f_{z_k})^T (E^f_{z_k} (E^f_{z_k})^T + R)^{-1}
x^{a_i}_k = x^{f_i}_k + K̂_k (z_k + v^i_k - z^{f_i}_k),   i = 1, 2, ..., q
x̄^a_k = (1/q) Σ_{i=1}^{q} x^{a_i}_k        (5.6)

x^{f_i}_{k+1} = a(x^{a_i}_k, u_k) + w^i_k,   i = 1, 2, ..., q
z^{f_i}_{k+1} = h(x^{f_i}_{k+1}),            i = 1, 2, ..., q
E^f_{x_{k+1}} = (1/√(q-1)) [x^{f_1}_{k+1} - x̄^f_{k+1}, ..., x^{f_q}_{k+1} - x̄^f_{k+1}]
E^f_{z_{k+1}} = (1/√(q-1)) [z^{f_1}_{k+1} - z̄^f_{k+1}, ..., z^{f_q}_{k+1} - z̄^f_{k+1}]        (5.7)

where x̄^f_{k+1} = (1/q) Σ_{i=1}^{q} x^{f_i}_{k+1} and z̄^f_{k+1} = (1/q) Σ_{i=1}^{q} z^{f_i}_{k+1}

58 Data Assimilation Models in disease forecasting We described some of the more practical and popular data assimilation models in Section 5.1. In this section, we present our disease specific data assimilation model which we used to generate seasonal forecasts for ILI and subsequently expanded towards CHIKV forecasts. To build such data assimilation models, we need to specify disease spread processes whose parameters are learned through the data assimilation algorithm of choice (see [51]). For our purpose, we chose dynamic data-driven SIRS model. This dynamic model is inspired from the Shaman et al. [51] and aims to use a Bayesian Filter to continuously assimilate observed data sources into the model characteristics and generate an ensemble of models. A key distinguishing feature of our work is aimed at the diversity of syndromic surveillance sources used. The spread of the ensemble predictions also reveals the underlying probability distribution of various seasonal characteristics such as start week and peak week. The model used for ILI can be formally described as follows. Let us denote the observed ILI percentage for the region of interest (including national level data) by y t. We choose as a candidate the well defined SIRS model where S t and I t denote the number of people in Susceptible and Infectious compartment, at time t. Let us also denote the new infections moving into the I bucket at time t by newi t which can be directly computed from I t. Let us denote the population size by N, the mean infectious period by D, the mean resistance period by L, and the basic reproductive rate at time t by R 0,t. Then the basic SIRS equation at time t can be given as where β(t) = R 0,t /D. ds t = N St It β(t)itst dt L N di t = β(t)is It dt N α D + α (5.9) Let us denote a hidden layer of variables x t that connect the SIRS model with the observed ILI percentages. The hidden variable set can be thought of as an n-tuple x t, as x t = (S t, I t, R 0, D, L, f, r) The equations governing the Bayesian filter can be given as: y t = f newi t + N (0, r) x t = g(x t x t 1 ) (5.10) where g denotes the dynamic model transition from time t 1 to t. g can be a general purpose transition function. For our purpose, we perturb S and I via the SIRS equation and the remaining state parameters using a random walk model within specified bounds. We studied a number of data assimilation models as presented in

59 45 Section 5.1 and selected EnKF filters to allow for greater flexibility in modeling and with a stated goal of comparing different sources towards their relevant importance in disease forecasting. We used an EnKF with ensembles to estimate the disease parameters. The distribution of the ensembles provide the posterior distribution over the SIRS parameters and can be used to directly infer the parameters. 5.3 Data Assimilation Using surrogate Sources The method so described above can be thought of as a general purpose algorithm where we can introduce information about different sources by modifying equation Earlier research [51] has shown that surrogate sources such as absolute humidity can be used to locally modify disease parameters and generate more robust forecasts. However, such methods have mainly focused on allowing a single surrogate and/or using custom state transition equations which are not easily generalizable to other sources. We focused on extending such methods to more generic sources and study the relative importance of such sources towards longterm forecasting. We have used a number of surrogate sources such as Weather, Google Flu Trends, Google Search Trends, HealthMap, and Twitter chatter. For the sake of simplicity, we explain our model using Google Flu Trends (GFT) as the illustrative source. Additional data sources can be incorporated following similar equations. As discussed in Part I, surrogate sources were found to encode disease transmission information but also exhibiting significant noise. However, from our experiments we found that although absolute surrogate counts are noisy, their rolling covariance can be used to inform a sudden increase/decrease of disease incidence in the population. Thus surrogate information was used to modify the transition equation for other latent variables such as R 0 as: R 0,t = R 0,t 1 + N (0, cov(gf T t 1, GF T t )) (5.11) Following Chakraborty et al. [12] we intend to analyze a myriad of data sources to train a more precise model with lower uncertainty bounds. 5.4 Experimental Results and Performance Summary We used our data assimilation model to generate forecasts for ILI and CHIKV, for various regions of the world. While ILI is an human-to-human transmitted infectious diseases, CHIKV is a vector driven disease and hence forecasting models for CHIKV needs to cognizant about the same. For both diseases, weather attributes such as Temperature and Humidity could be argued to be an important transmission modulator. We applied data assimilation methods as outlined in Section 5.3 using weather as a surrogate source. These forecasts

60 46 (a) ILI (b) CHIKV Figure 5.1: Performance summary for (a) ILI and (b) CHIKV seasonal forecasts using Weather as a surrogate source under data assimilation framework

61 47 were generated continuously for CHIKV in the Americas and for ILI in the US. As can be seen, data assimilation methods were able to more accurately forecast several seasonal characteristics for ILI compared to CHIKV. CHIKV, being a newly introduced disease in the Americas were characterized by more noise and our results also indicate the possible importance of modeling the vectors (mosquitoes) in addition to surrogate sources which may improve our forecasting performance. Table 5.1: Forecasting performance of seasonal characteristics using data assimilation methods Metric BO CL MX PE start date end date peak date peak val season val Similar to our efforts in short-term forecasting we compared the importance of each individual surrogate source towards long-term forecasts. We applied our data assimilation model as outlined in Section 5.3 to ILI incidence for the season over four Latin American countries viz. Bolivia, Chile, Mexico and Peru. We chose these countries as Google Flu Trends was available for these countries as well as these countries exhibits different modes of seasonality in the Latin Americas. We generated seasonal forecasts using data present at weeks 4 8 of the flu season for each of these countries. Table 5.1, summarizes the performance summary of our forecasts. The complete performance summary for these forecasts could be see in Appendix A. Figure 5.2 and Figure 5.2 plots the distribution of forecasting accuracy for dates (deviation in days) and values (quality score), respectively. As can be seen, HealthMap sources performs the best for both categories, indicating that the news media captures long-term signals about the season. The combination of all sources performs with similar accuracy as HealthMap, indicating that the competing sources could be potentially used, especially to improve accuracies against local variations. We analyze the forecasting performances furthermore by analyzing the change in forecasting accuracy over the number of season weeks used to generate the forecasts in Figures 5.4, 5.5, 5.6, 5.7 and 5.8. As can be seen, a combination of all sources shows most consistent performance over the season weeks compared to a single source. Furthermore, forecasting accuracy over value metrics (such as peak value and season value) benefits more from observation of a number of seasonal weeks compared to dates. Our results indicate that the shape of the disease curve can be forecasted with better accuracy compared to the actual size when only a few data points are observable for the season. Furthermore, the temporal accuracy plots indicate that surrogates sources such as HealthMap and GST contributes more heavily in the initial part of the disease season compared to the later part.

62 48 11 start_date 11 end_date 11 peak_date Score 7 Score 7 Score Weather gft gst hmap merged twitter source 3 Weather gft gst hmap merged twitter source 3 Weather gft gst hmap merged twitter source Figure 5.2: Comparison of forecasting accuracy for Date metrics using surrogates 11 peak_val 11 season_val Score 7 Score Weather gft gst hmap merged twitter source 3 Weather gft gst hmap merged twitter source Figure 5.3: Comparison of forecasting accuracy for Value metrics using surrogates

63 Country: BO source twitter hmap gst Country: CL source twitter hmap gst 8.6 gft Weather merged 3.05 gft Weather merged curr_week curr_week Country: MX Country: PE source source twitter twitter hmap hmap gst gst gft gft Weather Weather 11.2 merged 5.1 merged curr_week curr_week Figure 5.4: Comparison of forecasting accuracy for Start Date using different surrogate sources Country: BO source twitter hmap gst gft Weather merged source twitter hmap gst gft Weather merged Country: CL curr_week curr_week Country: MX Country: PE source twitter 34.2 hmap gst gft 34.0 Weather merged curr_week 10 source twitter 9 hmap gst gft 8 Weather merged curr_week Figure 5.5: Comparison of forecasting accuracy for End Date using different surrogate sources

64 Country: BO Country: CL source twitter hmap gst gft Weather merged source twitter 6 hmap 1.0 gst 4 gft 0.5 Weather merged curr_week curr_week Country: MX Country: PE source source twitter twitter hmap hmap 3.8 gst gst 27.5 gft gft Weather Weather merged 3.6 merged curr_week curr_week Figure 5.6: Comparison of forecasting accuracy for Peak Date using different surrogate sources source twitter hmap gst Country: BO Country: CL source twitter hmap gst gft Weather merged 3.45 gft Weather merged curr_week curr_week Country: MX Country: PE source twitter hmap gst gft Weather merged curr_week 2.5 source twitter hmap 2.0 gst gft Weather merged curr_week Figure 5.7: Comparison of forecasting accuracy for Peak Value using different surrogate sources

65 source twitter hmap gst gft Weather merged Country: BO Country: CL source twitter hmap gst gft Weather merged curr_week curr_week Country: MX Country: PE 2.8 source twitter hmap gst gft Weather merged curr_week 2.0 source twitter hmap gst 1.8 gft Weather merged curr_week Figure 5.8: Comparison of forecasting accuracy for Season Value using different surrogate sources 5.5 Discussion We have presented our work on long-term forecasts using both data assimilation methods and curve matching process. Our results indicate that data assimilation methods are in general more flexible and robust towards long-term forecasts. Surrogate sources such as HealthMap are important factors for such forecasts, especially during the initial part of the season. Our future research will focus on systematically including other infectious diseases with the framework and towards sparse selection of surrogates for more robust forecasting.

66 Part III Detecting and Adapting to Concept Drift 52

67 Part I and Part II outlined our efforts at short-term and long-term forecasting using surrogates. However, surrogates are typically noisy and relationships to targets may be dynamic in nature. The changes in surrogate-target relationships can be significant, which if undetected may subsequently render any model developed on these surrogates ineffective. This motivates the third problem of this thesis where we first try to identify such major changes under the concept of changepoints. For this, we developed a hierarchical changepoint detection framework which can inform the changepoints in targets using information from the surrogate layers in Chapter 6. We also propose the use of such changepoints towards adaptive target forecasting in Chapter 7. 53

68 Chapter 6 Hierarchical Quickest Change Detection via Surrogates With the increasing availability of digital data sources, there is a concomitant interest in using such sources to understand and detect events of interest, reliably and rapidly. For instance, protest uprisings in unstable countries can be better analyzed by considering a variety of sources such as economic indicators (e.g. inflation, food prices) and social media indicators (e.g. Twitter and news activity). Concurrently, detecting the onset of such events with minimal delay is of critical importance. For instance, detecting a disease outbreak [45] in real time can help in triggering preventive measures to control the outbreak. Similarly, early alerts about possible protest uprisings can help in designing traffic diversions and enhanced security to ensure peaceful protests. Motivated by similar real-life scenarios where significant events can be argued to be observable in social sphere, we propose Hierarchical Quickest Change Detection (HQCD), for online change detection across multiple sources, viz. target and surrogates. Typically, targets are sources of imminent interest (such as disease outbreaks or civil unrest); whereas surrogates (such as counts of the word protesta in Twitter) by themselves are not of significant interest. Thus, HQCD is aimed towards continuously utilizing both categories, but more focused on early (or quickest) detection of significant changes across the target sources. Traditional event (or change) detection approaches are not suitable for such problems. These are either a) offline approaches [43, 60, 52, 8] using the entire data retrospectively - thus not applicable to real-time scenarios, or b) online detection approaches [53, 54, 30, 31, 1, 35] with primary focus on the target source of interest and do not utilize other correlated sources. Table 6.1 shows a comparison of HQCD and several state-of-the-art methods in terms of the desirable attributes. The main contributions of the work presented in this chapter are: HQCD formalizes a hierarchical structure which in addition to the observed set of target 54

69 55 Table 6.1: Comparison of state-of-the-art methods vs Hierarchical Quickest Change Detection Desirable Sequential Window- Bayesian Relative Hierarchical HQCD Properties GLRT Limited Online Density- Bayesian (This [53] GLRT CPD ratio Analysis of Paper) [54] [30] [1] Estimation Change [31] (RuLSIF) Point [35] Problems [8] Online Hierarchical Bounded False Alarm Rate / Detection delay Handles Non-IID data sources (i.e., S i s), incorporates additional surrogates, denoted by K j s, and encodes propagation of change from surrogate to target sources. HQCD presents a specialized change detection metric that guarantees a maximum level of false alarm rate while reducing the detection delay in quickest detection framework. In addition, HQCD yields a natural methodology for analyzing the causality of change in a particular target source through a sequence of change propagation in other sources. HQCD presents a specialized sequential Monte Carlo based change detection framework that along with specialized change detection metrics enables hierarchical data to be analyzed in online fashion. We extensively test HQCD on both synthetic and real world data. We compare against state-of-the-art methods and illustrate the robustness of our methods and the usefulness of surrogates. Moreover, we analyzed target-surrogate relationships and uncover important propagation patterns that led to such uprisings. 6.1 HQCD Hierarchical Quickest Change Detection We first provide a brief overview of classical QCD problem and then present the HQCD framework Quickest Change Detection (QCD) Let us consider a data source S changing over time and following different stochastic processes before and after an unknown time Γ (changepoint). The task of QCD is to produce an estimate ˆΓ = γ in an online setting (i.e., at time t, only S 1,..., S t is available). Figure 6.1 illustrates the two fundamental performance metrics related to this problem. In the figure,

70 False Alarm True Change Point Delayed Detection 56 Γ = t 4 is the actual time-point when the changepoint happened. An early estimate such as γ 1 = t 1 in the figure leads to a false alarm, where another estimate, such as γ 2 = t 6 leads to an additive delay of γ 2 Γ = t 6 t 4. The goal of QCD is to design an online detection strategy which minimizes the expected additive detection delay (EADD) while not exceeding a maximum pre-specified probability of false alarm (PFA). QCD has been studied in various contexts. Some of the foremost methods have considered i.i.d. distributions with known (or unknown) parameters before and after unknown changepoints [59]. Some of the more popular methods have used CUSUM (cumulative sum of likelihood) based tests while more general approaches are adapted in GLRT (generalized likelihood ratio test) based methods [19]. γ 1 Γ γ 2 t 0 t 1 t 2 t 3 t 4 t 5 t 6 t 7 t 8 t 9 t 10 Detection Delay = γ 2 - Γ Figure 6.1: Illustration of Quickest Change Detection (QCD): blue colored line represents the actual changepoint at time Γ = t 4. (a) declaring a change at γ 1 leads to a false alarm, whereas (b) declaring the change at γ 2 leads to detection delay. QCD can strike a tradeoff between false alarm and detection delay Changepoint detection in Hierarchical Data We next present our approach to generalize QCD to a hierarchical setting. We first describe a generic hierarchical model and then propose the QCD statistics for such models in Section For computational feasibility, we present a bounded approximate of the same and our multilevel changepoint algorithm in Section Generic Hierarchical Model Let us consider S (T ), a set of I correlated temporal sequences {S (T ) 1, S (T ) 2,... S (T ) I } where, S (T ) i represents the i th target data sequence S (T ) i = [s i (1), s i (2),..., s i (T )] for i = 1,..., I,

71 57 Sum of Target sources E E (Target sources) S1 S1 S2 S2 S3 S3... SI SI (Surrogate sources) K1 K2 K1 K2 K3... K3 KJ KJ Figure 6.2: Generative process for HQCD. As an example consider civil unrest protests. In the framework, different protest types (such as Education- and Housing-related protests) form the targets denoted by Si s. The total number of protests will be denoted by the topmost variable E. Finally, the set of surrogates, such as counts of Twitter keywords, stock price data, weather data, network usage data etc. are denoted by Kj s. collected up and until somep time T. The cumulative sum of the target sources Si s at time t is given by E(t), i.e., E(t) = Ii=1 si (t). Concurrent to target sources, we also observe a set of (T ) (T ) (T ) (T ) J surrogate sources, K (T ) = {K1, K2,..., KJ }, where Kj = [kj (1), kj (2),..., kj (T )], for j = 1,..., J, which may either have a causal or effectual relationship with the target source set S (T ) (see Figure 6.2). We assume that targets and surrogates follow a stochastic Markov process as follows: (T ) (T ) (T ) (T ) P (S (T ), K (T ) ) =P (S1,..., SI, K1,..., KJ ) ( J ) T I Y Y φk Y S φ = Pt j (Kj (t)) Pt i Si (t) S (t 1), K (t 1). t=1 j=1 i=1 S The binary variables φk j, φi {0, 1} capture the notion of significant changes in events through changes in distribution of the generative process as follows: if the surrogate source Kj undergoes a change in distribution at some time t, then, φk j changes from 0 to 1. In other 0 1 words, Pt (Kj ) (respectively Pt (Kj )) denotes the pre-change (post-change) distribution of the jth surrogate source. Similarly, if the target source Si undergoes a change in distribution at

72 58 some time t, then φ S i changes from 0 to 1. In other words, Pt 0 (S i ) (respectively Pt 1 (S i )) denotes the pre-change (post-change) conditional distribution of the jth target data source. We denote Γ Kj (respectively Γ Si ) as the random variable denoting the time at which φ K j (respectively, φ S i ) changes from 0 to 1. Finally, we write Γ K = (Γ K1,..., Γ KJ ), and Γ S = (Γ S1,..., Γ SI ) as the collective sets of changepoints in the surrogate and target sources, respectively. Finally, denote Γ E as the changepoint random variable for the top layer, E, which represents the sum total of all target sources. From QCD to HQCD We extend the concepts of QCD presented in Section to multilevel setting by formalizing the problem as the earliest detection of the set of all (J + I + 1) changepoints, i.e., Γ = { Γ K, Γ S, Γ E } having observed the target and surrogate sources i.e. ( S(T ), K (T )). Let γ = { γ K, γ S, γ E } be the (J + I + 1) vector of decision variables for the changepoints. To measure detection performance, we define the following two novel performance criteria: Multi-Level Probability-of-False-Alarm (ML-PFA): ML-PFA( γ) = P ( γ Γ ), (6.1) where for any two N length vectors a b, the notation implies a i b i, for i = 1,..., N. For instance, consider the example of I = 1 target, and J = 1 surrogate. Then Γ = (Γ K1, Γ S1 ) and γ = (γ K1, γ S1 ), and the probability of multi-level false alarm is given by ML-PFA(γ) = P(γ K1 Γ K1, γ S1 Γ S1 ). This definition of ML-PFA declares a false alarm only if all the (J + I + 1) change decision variables are smaller than the true changepoints. Expected Additive Detection Delay (EADD): EADD(γ) = E ( γ Γ ) J 1 = E( γ Kj Γ Kj ) + j=1 } {{ } Surrogate layer delay I E( γ Si Γ Si ) i=1 } {{ } Target layer delay + E γ E Γ E }{{} Top layer delay (6.2) Given the observations, i.e., all target and surrogate sources ( S (T ), K (T ) ) till time T governed by unknown changepoints Γ, we aim to make an optimal decision γ about these changepoints under the following criterion γ (α) = arg min γ EADD(γ) s.t. ML-PFA(γ) α. (6.3) In other words, γ (α) is the optimal change decision vector which minimizes the EADD while guaranteeing that the ML-PFA is no more than a tolerable threshold α. We note that the above optimal test is challenging to implement for real-world data sets due to following issues: a) it requires the knowledge of pre- and post- change distributions (for all

We note that the above optimal test is challenging to implement for real-world data sets due to the following issues: (a) it requires knowledge of the pre- and post-change distributions (for all sources) and of the distribution of the changepoint random vector $\Gamma$; (b) unlike single-source QCD, finding the optimal $\gamma^*(\alpha)$ requires a multi-dimensional search over multiple sources, making it computationally expensive; and (c) it does not discriminate between false alarms across different sources. For instance, declaring a false alarm at a target source (such as a premature declaration of the onset of protests or disease outbreaks) must be penalized more than declaring a false alarm at a surrogate source (such as incorrectly declaring a rise in Twitter activity).

Bounded approximation of HQCD

We can circumvent problem (b) of the original definition of ML-PFA, as given in equation 6.1, by upper-bounding it in Theorem 6.1.

Theorem 6.1 (Modified-PFA). Let $\gamma = \{\gamma_S, \gamma_K, \gamma_E\}$ be a set of estimates of the true changepoints for the targets, surrogates, and sum-of-targets, respectively. Then, under the condition of greater importance being given to accurate target-layer detections, the ML-PFA (see equation 6.1) is upper-bounded by the Modified-PFA, where:

$\text{Modified-PFA}(\gamma) \triangleq I \max_i P(\gamma_{S_i} \leq \Gamma_{S_i}) + \min_j P(\gamma_{K_j} \leq \Gamma_{K_j}) + P(\gamma_E \leq \Gamma_E). \quad (6.4)$

Proof. We can prove the upper bound of ML-PFA with the following reductions:

$\text{ML-PFA}(\gamma) = P(\gamma \preceq \Gamma) = P(\gamma_S \preceq \Gamma_S, \gamma_K \preceq \Gamma_K, \gamma_E \leq \Gamma_E)$
$\overset{(a)}{\leq} P(\gamma_S \preceq \Gamma_S) + P(\gamma_K \preceq \Gamma_K) + P(\gamma_E \leq \Gamma_E)$
$\overset{(b)}{\leq} \sum_{i=1}^{I} P(\gamma_{S_i} \leq \Gamma_{S_i}) + P(\gamma_K \preceq \Gamma_K) + P(\gamma_E \leq \Gamma_E)$
$\leq I \max_i P(\gamma_{S_i} \leq \Gamma_{S_i}) + P(\gamma_K \preceq \Gamma_K) + P(\gamma_E \leq \Gamma_E)$
$\overset{(c)}{\leq} I \max_i P(\gamma_{S_i} \leq \Gamma_{S_i}) + \min_j P(\gamma_{K_j} \leq \Gamma_{K_j}) + P(\gamma_E \leq \Gamma_E), \quad (6.5)$

where (a) and (b) follow from the union bound on probability, and (c) follows from the fact that the joint probability of a set of events is less than the probability of any one event, i.e., $P(\gamma_K \preceq \Gamma_K) \leq P(\gamma_{K_j} \leq \Gamma_{K_j})$ for any $j = 1, \ldots, J$, and then taking the minimum over all $j$. The resulting upper bound in (6.5) forms the basis of the modified multi-level PFA given in equation 6.4.

The Modified-PFA expression leads to intuitive interpretations as follows:

(i) as false alarms at targets can have a higher impact, it is desirable to keep the worst-case PFA across them the smallest, or equivalently, $\max_i P(\gamma_{S_i} \leq \Gamma_{S_i})$ should be minimized; (ii) false alarms at surrogates are not as important, and we can declare a false alarm only if all of the surrogate-level detections are unreliable, or equivalently, $\min_j P(\gamma_{K_j} \leq \Gamma_{K_j})$ needs to be minimized; (iii) notably, the above modification leads to a low-complexity change detection approach across multiple sources via locally optimal detection strategies, avoiding a multi-dimensional search.

Based on the Modified-PFA, we next present a compact test suite to declare changes at prespecified levels of maximum PFA, as given in Theorem 6.2, and thereby incorporate the specificity issue pointed out in problem (c) of the original formulation of the PFA.

Theorem 6.2 (Multi-level Change Detection). Let $\Gamma_{S_i}$ be the true changepoint random variable for the $i$th target source $S_i$. Let $\Gamma_{K_j}$ and $\Gamma_E$ represent the same for the $j$th surrogate and the sum-of-targets, respectively. Let the data observed till time $T$ be $D^{(T)} \triangleq (S^{(T)}, K^{(T)})$, and let $P(\Gamma \mid D^{(T)})$ denote the estimate of the conditional distribution (see Section 6.2.2). Then, if $\alpha_i$, $\beta_j$, $\lambda$ represent the PFA thresholds for $S_i$, $K_j$, $E$, the changepoint tests can be given as:

$\gamma_{S_i}(\alpha_i) = \inf\left\{ n : \text{TS}_{S_i}(D^{(T)}) \geq \frac{\alpha_i}{1+\alpha_i} \right\}, \quad i = 1, \ldots, I \quad (6.6a)$
$\gamma_{K_j}(\beta_j) = \inf\left\{ n : \text{TS}_{K_j}(D^{(T)}) \geq \frac{\beta_j}{1+\beta_j} \right\}, \quad j = 1, \ldots, J \quad (6.6b)$
$\gamma_E(\lambda) = \inf\left\{ n : \text{TS}_E(D^{(T)}) \geq \frac{\lambda}{1+\lambda} \right\}, \quad (6.6c)$

where $\text{TS}_X(D^{(T)}) = P(\Gamma_X \leq n \mid D^{(T)})$ is the test statistic (TS) for a source $X$.

Proof. In quickest change detection, our goal at time $T$ is to decide whether a change should be declared at some $n \leq T$ for a particular data source. To this end, we can use the following change detection test:

$\gamma_{S_i}(\alpha_i) = \inf\left\{ n : \log \frac{P(\Gamma_{S_i} \leq n \mid D^{(T)})}{P(\Gamma_{S_i} > n \mid D^{(T)})} \geq \log(\alpha_i) \right\},$

which is equivalent to the following test:

$\gamma_{S_i}(\alpha_i) = \inf\left\{ n : P(\Gamma_{S_i} \leq n \mid D^{(T)}) \geq \frac{\alpha_i}{1+\alpha_i} \right\}. \quad (6.7)$

Intuitively, the above test declares the change for the $i$th target source $S_i$ at the smallest time $n$ for which the test statistic (i.e., the posterior probability of the changepoint random variable being less than $n$) exceeds a threshold.
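As a concrete illustration of the threshold rule of equation 6.7, the sketch below applies it to a hypothetical posterior curve; in practice the posterior itself would come from the estimation procedure of Section 6.2.2:

```python
import numpy as np

def declare_change(posterior_cdf: np.ndarray, alpha: float) -> int | None:
    """Bayesian threshold test of equation 6.7 (illustrative sketch).

    posterior_cdf[n] is assumed to hold P(Gamma <= n | D^(T)), the test
    statistic TS of Theorem 6.2, for n = 0, 1, ..., T-1.
    Returns the smallest n at which the statistic crosses alpha/(1+alpha),
    or None if no change is declared by time T.
    """
    threshold = alpha / (1.0 + alpha)
    crossed = np.nonzero(posterior_cdf >= threshold)[0]
    return int(crossed[0]) if crossed.size else None

# Toy posterior that sharpens after an (assumed) change around n = 30.
post = 1.0 / (1.0 + np.exp(-(np.arange(60) - 30) / 2.0))
print(declare_change(post, alpha=9.0))   # threshold 0.9 -> declares at n = 35
```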

The probability of false alarm for this test can be bounded in terms of the threshold $\alpha_i$ as:

$P(\gamma_{S_i} \leq \Gamma_{S_i}) = \sum_{D^{(T)}} \sum_{n} P(D^{(T)}, \gamma_{S_i} = n)\, P(\Gamma_{S_i} > n \mid D^{(T)}, \gamma_{S_i} = n) \overset{(d)}{\leq} \left( \frac{1}{1+\alpha_i} \right) \underbrace{\sum_{D^{(T)}} \sum_{n} P(D^{(T)}, \gamma_{S_i} = n)}_{=1} = \frac{1}{1+\alpha_i}, \quad (6.8)$

where (d) follows from the fact that, given the observed data and the event $\gamma_{S_i} = n$ (i.e., the change is declared at $n$), it follows from equation 6.7 that $P(\Gamma_{S_i} > n \mid D^{(T)}, \gamma_{S_i} = n) \leq 1/(1+\alpha_i)$. Denoting the test statistic for a data source $X$ as $\text{TS}_X(D^{(T)}) = P(\Gamma_X \leq n \mid D^{(T)})$, the tests for the surrogate sources and the sum-of-targets follow identically with thresholds $\beta_j$ and $\lambda$, yielding the multi-level change detection tests of equations 6.6a–6.6c.

From Theorem 6.2, we can infer the following boundedness property of the Modified-PFA, as expressed in the following lemma.

Lemma 6.3. If we define $\alpha = \min_i(\alpha_i)$ and $\beta = \max_j(\beta_j)$, then the Modified-PFA in equation 6.4 can be bounded as:

$\text{Modified-PFA}(\gamma) \leq \frac{I}{1+\alpha} + \frac{1}{1+\beta} + \frac{1}{1+\lambda}. \quad (6.10)$

6.2 HQCD for Count Data via Surrogates

In this section we discuss the HQCD framework for count data sources that may be observed in real life. For example, we can analyze the number of protests towards early detection of protest uprisings via surrogate sources. Protests can happen in civil society for various reasons, such as protests against fare hikes or protests demanding more job opportunities.

Algorithm 1: HQCD Multi-level Change Point Detection
Input: At time T, target and surrogate sources D^(T) = (S^(T), K^(T))
Parameters: PFA thresholds for targets (α), surrogates (β), and sum of targets (λ)
Output: Changepoint decisions γ_S, γ_K, γ_E at each timepoint T
1   for each T do
2       Update the joint posterior P(Γ_K, Γ_S, Γ_E | D^(T))
        // target change detection
3       for i ← 1 to I do
4           Compute the target marginal P(Γ_{S_i} | D^(T))
5           Find γ_{S_i}(α) using equation 6.6a
6       γ_S ← {γ_{S_1}(α), ..., γ_{S_I}(α)}
        // surrogate change detection
7       for j ← 1 to J do
8           Compute the surrogate marginal P(Γ_{K_j} | D^(T))
9           Find γ_{K_j}(β) using equation 6.6b
10      γ_K ← {γ_{K_1}(β), ..., γ_{K_J}(β)}
        // sum-of-targets change detection
11      Compute the sum-of-targets marginal P(Γ_E | D^(T))
12      Find γ_E(λ) using equation 6.6c
13      Return decisions γ_S, γ_K, γ_E at T

Such protests, especially major changes in protest base levels, are potentially interlinked. However, explaining such interactions is a non-trivial process. Prior work [48] found several social sources, especially Twitter chatter, to capture protest-related information. We apply HQCD to find significant changes in protests concurrent with changes in Twitter chatter, such that accurately detecting changes in the targets is of primary importance, in contrast to the chatter, which can be influenced by a range of factors, including protests. In general, HQCD can be applied to similar events, such as disease outbreaks, to find significant changes in targets using information from noisy surrogates.

6.2.1 Hierarchical Model for Count Data

In general, HQCD can be applied to any count data sources. However, the exact specification may depend on the application. For example, considering protest uprisings, we first note that surrogate sources such as Twitter are in general noisy and involve a complex interplay of several factors, one of which could be protest uprisings. Furthermore, for protest uprisings, we are more concerned with using the surrogates (Twitter chatter) to help declare changes at the target level (protest counts) than with accurately identifying the changes in the surrogates. Thus, without loss of generality, we model the surrogates as i.i.d. distributed variables. Figure 6.3 evaluates the i.i.d. assumptions for both protest counts and Twitter chatter. Our results indicate that Log-normal is a reasonable fit for Twitter chatter.

Surrogate Sources: Formally, we assume that the $j$th surrogate source $K_j$ is generated i.i.d. from a distribution $f_K$ w.r.t. the associated changepoint $\Gamma_{K_j}$ as:

$k_j(t) \overset{\text{i.i.d.}}{\sim} \begin{cases} f_K(\phi_0^{K_j}) & t \leq \Gamma_{K_j} \\ f_K(\phi_1^{K_j}) & t > \Gamma_{K_j} \end{cases} \quad (6.11)$

where $\phi_0^{K_j}$ and $\phi_1^{K_j}$ are the pre- and post-change parameters. Following our earlier discussion, we select $f_K$ to be Log-normal (with location and scale parameters $\phi^{K_j} = \{c^{K_j}, d^{K_j}\}$) for Twitter counts.

[Figure 6.3: histograms with Log-normal fits of (a) Twitter keyword counts and (b) protest counts over the pre-change, change, post-change, and full-data windows]

Figure 6.3: Histogram fit of (a) a surrogate source (Twitter keyword counts) and (b) a target source (number of protests of different categories), for various temporal windows, under i.i.d. assumptions. These assumptions lead to a satisfactory distribution fit, at a batch level, for both sources. The top-most row corresponds to the period before the Brazilian Spring, the second row to the period during, and the third to the period after; the last row shows the fit for the entire period. These temporal fits are indicative of significant changes in distribution along the Brazilian Spring timeline, for both targets and surrogates.

Target Sources: Target sources can in general depend on both the past values of the targets and the surrogates. Here, we restrict the target source process to be a first-order Markov process. Under this assumption, we formalize the $i$th target source $S_i$ to follow a Markov process $f_t^S$ w.r.t. its changepoint $\Gamma_{S_i}$ as:

$s_i(t) \sim \begin{cases} f_t^S(\phi_0^{S_i}(t)) & t \leq \Gamma_{S_i} \\ f_t^S(\phi_1^{S_i}(t)) & t > \Gamma_{S_i} \end{cases} \quad (6.12)$

where $\phi_0^{S_i}$ and $\phi_1^{S_i}$ are the pre- and post-change parameters of the process. A Poisson process with dynamic rate parameters has been shown [8] to be effective for specifying hierarchical count data w.r.t. changepoints. Here, we model the rate parameters as a nested autoregressive process [22, 8] given as:

$\phi_{0/1}^{S_i}(t) = \phi_{0/1}^{S_i}(t-1) + A_{0/1}^{i}(t) \begin{pmatrix} S(t-1) \\ K(t-1) \end{pmatrix} + \mathcal{N}(0, \sigma_S), \qquad A_{0/1}^{i}(t) = A_{0/1}^{i}(t-1) + \mathcal{N}(0, \Sigma_{A^i}). \quad (6.13)$

Here, $\phi_{0/1}^{S}(t)$ captures the latent rate and $\sigma_S$ denotes the error variance. $A_{0/1}^{i}(t)$ captures the variation due to the observed values of the target and surrogate sources.

Changepoint Priors: Following our prior discussion, surrogate changepoints can be assumed to have an uninformative prior, and we model $\Gamma_{K_j}$ via a memoryless arrival distribution (a static probability of observing a change given that it has not occurred earlier):

$\Gamma_{K_j} \sim \text{Geom}(\rho_{K_j}), \qquad P(\Gamma_{K_j} = t \mid \Gamma_{K_j} \geq t) = \rho_{K_j}. \quad (6.14)$

Conversely, target changepoints can be influenced by surrogate changepoints, as their generative process is dependent on the surrogates. Specifically, whenever we observe a changepoint in the surrogates, we assume the base rate of a changepoint for a target to increase for a certain period of time. Formally, target changepoint priors are assumed to follow a dynamic process:

$\Gamma_{S_i} \sim \text{Geom}(\rho_{S_i}(t)), \qquad \rho_{S_i}(t) = \rho_{S_i} + \sum_j \mathbb{I}(\Gamma_{K_j} < t)\, \mu_j^1 e^{-\mu_j^2 (t - \Gamma_{K_j})}, \quad (6.15)$

where $\mathbb{I}$ is the indicator function and $\rho_{S_i}$ represents the nominal base rate for the changepoint. As can be seen, a change in the $j$th surrogate source is modeled as an exponentially decaying impulse of amplitude $\mu_j^1$. The summation of targets, $E(t)$, is known deterministically given the $S_i(t)$. Moreover, given $S_i(t-1)$, $E(t)$ can be considered a summation of independent Poisson processes following dynamics similar to equation 6.13, which we omit due to limited space. Similarly, the dependence of $\Gamma_E$ on the surrogate changepoints can be modeled in a manner similar to equation 6.15.
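To make the model concrete, the following is a minimal simulation sketch of equations 6.11–6.15. It makes simplifying assumptions not in the original specification: the random-walk weight matrix $A(t)$ of equation 6.13 is frozen to fixed scalar weights, and all numeric parameter values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(7)
T, I, J = 200, 2, 3

# --- Surrogates (equation 6.11): i.i.d. log-normal, parameters switch at Gamma_Kj.
rho_K = 0.02                                   # geometric hazard (equation 6.14)
Gamma_K = rng.geometric(rho_K, size=J)
K = np.empty((T, J))
for j in range(J):
    pre, post = (1.0, 0.5), (2.0, 0.5)         # (location, scale) before / after change
    for t in range(T):
        c, d = pre if t <= Gamma_K[j] else post
        K[t, j] = rng.lognormal(mean=c, sigma=d)

# --- Targets (equation 6.12): Poisson counts whose latent rate drifts with
# lagged targets and surrogates (simplified form of equation 6.13).
rho_S, mu1, mu2 = 0.01, 0.05, 0.1              # hazard dynamics of equation 6.15
w_S, w_K, sigma_S = 0.01, 0.02, 0.05           # frozen stand-ins for A(t)
S = np.zeros((T, I))
phi = np.full(I, 2.0)                          # latent rate phi^{S_i}(t)
Gamma_S = np.full(I, -1)
for t in range(1, T):
    boost = sum(mu1 * np.exp(-mu2 * (t - g)) for g in Gamma_K if g < t)
    for i in range(I):
        if Gamma_S[i] < 0 and rng.random() < rho_S + boost:
            Gamma_S[i] = t                     # changepoint arrives (equation 6.15)
        jump = 3.0 if 0 <= Gamma_S[i] <= t else 0.0
        phi[i] += w_S * S[t-1].mean() + w_K * K[t-1].mean() + rng.normal(0, sigma_S)
        S[t, i] = rng.poisson(max(phi[i] + jump, 0.1))

E = S.sum(axis=1)                              # sum-of-targets layer
print("Gamma_K:", Gamma_K, "Gamma_S:", Gamma_S)
```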

6.2.2 Changepoint Posterior Estimation

Algorithm 1 involves posterior estimation of the changepoints given the data at a particular time point. Earlier work has focused mainly on offline methods such as Gibbs sampling [8]. Online posterior estimation for such problems has been studied extensively in the context of Sequential Bayesian Inference [9], e.g., Kalman filters [26, 55, 2] (Gaussian transitions) and particle filters [18, 47, 20]. Recently, Chopin et al. [16] proposed a robust particle filter, SMC², which is ideally suited for fitting the parameters of the non-linear hierarchical model described in Section 6.2.1. In this section we formulate a Sequential Bayesian algorithm that makes HQCD tractable under real-world constraints (see Figure 6.4).

[Figure 6.4: bar chart of computation times for Gibbs sampling, HQCD, and HQCD without surrogates on the simulated, Brazil, Venezuela, and Uruguay datasets]

Figure 6.4: Computation time for one complete run of changepoint detection (in minutes) on a 1.6 GHz quad-core Intel i5 processor with 8 GB RAM: Gibbs sampling [8] vs. HQCD vs. HQCD without surrogates. Gibbs sampling computation times are unsuitable for online detection.

To find the posterior $P(\Gamma_S, \Gamma_K, \Gamma_E \mid D^{(T)})$ at any time $T$ using SMC², we first cast the model parameters and variables into the following three categories:

Observations ($y_T$): In the context of SMC², these are the parameters that correspond to observed variables at each time point $T$. For HQCD we can model $y_T$ as:

$y_T = \{S(T), K(T)\} \quad (6.16)$

Hidden States ($x_T$): SMC² estimates the observations based on interactions with hidden states, which are dynamic, unobserved, and sufficient to describe $y_T$ at $T$. For HQCD, we can express $x_T$ as follows:

$x_T = \{\Gamma_S, \Gamma_K, \Gamma_E, \phi_{0/1}^S(T-1), \phi_{0/1}^K, \rho_K(T), \rho_S(T), \bar{A}_{0/1}, S(T-1), K(T-1)\} \quad (6.17)$

Static Parameters ($\theta$): Finally, SMC² also accommodates static parameters that do not change over time, such as the base probabilities of changepoint $\rho_S$ and the noise matrix $\Sigma_A$ in HQCD. We can express $\theta$ as:

$\theta = \{\sigma_S, \Sigma_A, \rho_S, \mu^1, \mu^2\} \quad (6.18)$

For a given set of such parameters, SMC² works by first generating $N_\theta$ samples of $\theta$ using the prior distribution $P(\theta)$. For each of these samples of $\theta$, SMC² draws $N_x$ samples of $x_0$ from its prior $P(x_0 \mid \theta)$. Following standard practice, we use conjugate distributions [9] for the priors.

Algorithm 2: HQCD Changepoint Posterior Estimation via SMC²
Input: At time T, y_T as given in equation 6.16
Parameters: Prior distributions P(θ) and P(x_0 | θ); hyperparameters for P(θ) and P(x_0 | θ)
Output: Joint posterior P(Γ_K, Γ_S, Γ_E | D^(T))
1   Define x_T as given in equation 6.17
2   Define θ as given in equation 6.18
    // initialization
3   Sample N_θ values θ_q using P(θ)
4   Sample N_x values x_{0,q,r} using P(x_0 | θ_q)
5   Update weights w(0)  // see Appendix
    // online learning
6   for each T do
        // state updates
7       for each q ≤ N_θ do
8           for each r ≤ N_x do
9               Update states: x_{T,q,r} from x_{T-1,q,r}
10              Compute importance weights w_{q,r}(T)
11              Compute the observation probability P(y_T | y_{T-1}, θ_q)
                // incorporate the observation at time T
12              Update the importance weight w_{q,r}(T) ← w_{q,r}(T) · P(y_T | y_{T-1}, θ_q)
        // test for premature convergence
13      Test degeneracy conditions using the effective sample size
14      if degeneracy then
            // Markov kernel jumps
15          Update x_{T,q,r} by applying a Markov kernel K_T
            // recompute weights
16          Exchange x_{T,q,r} and set w_{q,r} ← 1
        // find joints
17      Return the updated P(Γ_S, Γ_K, Γ_E | D^(T)) using equation 6.19
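The core of the online loop (lines 10–16 of Algorithm 2) is a standard importance-weight update followed by an effective-sample-size degeneracy test. A simplified, single-layer sketch is given below; particle indexing over (θ, x) pairs and the Markov-kernel rejuvenation step are elided, and the likelihoods are hypothetical:

```python
import numpy as np

def update_weights(w: np.ndarray, obs_lik: np.ndarray, ess_frac: float = 0.5):
    """One weight-update step in the spirit of Algorithm 2 (lines 10-16).

    w       : current importance weights, one per particle (hypothetical).
    obs_lik : per-particle likelihood P(y_T | y_{T-1}, theta_q) of the new
              observation under each particle's parameters.
    Returns the updated weights and a degeneracy flag indicating whether a
    resample / Markov-kernel rejuvenation step should be triggered.
    """
    w = w * obs_lik                       # incorporate the observation at time T
    w = w / w.sum()                       # normalize
    ess = 1.0 / np.sum(w ** 2)            # effective sample size
    degenerate = ess < ess_frac * w.size  # premature-convergence test (line 13)
    return w, degenerate

rng = np.random.default_rng(1)
w = np.full(100, 1 / 100)
w, degen = update_weights(w, obs_lik=rng.gamma(2.0, size=100))
print(degen)
```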

[Table 6.2: (Synthetic data) comparison of the true changepoints (Γ) for targets S1–S5 against the changepoints (γ) detected by GLRT, WGLRT, BOCPD, RuLSIF, HQCD, and HQCD without surrogates, reporting false alarms (FA) and additive detection delay (ADD); each row represents a target, the best detected changepoint is shown in bold, and false alarms are shown in red.]

At each time point $T$, the samples are perturbed using the model equations given in Section 6.2.1 and associated with weights $w$ to estimate the joint posteriors as:

$P(\theta, x_T \mid y_T) = \sum_{q=1}^{N_\theta} \sum_{r=1}^{N_x} w_{q,r}\, \delta(\theta, x_T), \qquad P(\Gamma_S, \Gamma_K, \Gamma_E \mid D^{(T)}) \approx \sum_{q=1}^{N_\theta} \sum_{r=1}^{N_x} w_{q,r}\, \delta(\Gamma_S, \Gamma_K, \Gamma_E), \quad (6.19)$

where $\delta$ is the Kronecker delta function. Algorithm 2 outlines the steps involved in this process. For more details on SMC², see the Appendix.

6.3 Experiments

We present experimental results for both synthetic and real-world datasets, and compare HQCD against several state-of-the-art online change detection methods (see Table 6.1), specifically GLRT [54], W-GLRT [31], BOCPD [1], and RuLSIF [35]. To further analyze the effect of surrogates in detecting changepoints, we also compare against HQCD without surrogates, where $K(t-1)$ is dropped from equation 6.13 and $\rho_{S_i}(t)$ is made static (i.e., independent of changepoints from surrogates) in equation 6.15.

6.3.1 Synthetic Data

In this section, we validate against synthetic datasets with known changepoint parameters. For this, we pick 5 targets ($I = 5$) and 10 surrogates ($J = 10$). The surrogates were generated from i.i.d. Log-normal distributions (see equation 6.11), while the targets were generated using a Poisson process (see equation 6.12). The changepoints for the surrogates were sampled from the geometric prior (see equation 6.14), while the associated changepoints for the target sources were simulated via equation 6.15.

[Figure 6.5: time series plots of the five simulated target sources with true and detected changepoints overlaid]

Figure 6.5: Comparison of HQCD against the state-of-the-art on simulated target sources. The x-axis represents time and the y-axis the actual value. Solid blue lines mark the true changepoints, solid green the changepoints detected by HQCD, and brown those of HQCD without surrogates. Dashed red, magenta, purple, and gold lines mark the changepoints detected by RuLSIF, WGLRT, BOCPD, and GLRT, respectively. HQCD shows better detection for most targets, with low overall detection delay and few false alarms.

Comparisons with the state-of-the-art

As the true changepoints are known for the synthetic dataset, we can compare HQCD against the state-of-the-art methods on the detected changepoints, as shown in Figure 6.5. Table 6.2 presents the results in terms of false alarms (FA) and additive detection delay (ADD). From the table, we can see that HQCD is able to detect the changepoints with fewer false alarms. HQCD also has the lowest delay among all methods for all targets except Target-1, for which HQCD without surrogates achieved a better delay, indicating that the surrogates are not informative for this target source.

Usefulness of Surrogates

Our comparisons with the state-of-the-art show the significant improvements achieved by HQCD, both in terms of FA and ADD, and showcase the importance of systematically admitting surrogate information to attain quicker change detection with low false alarms. We compare HQCD with surrogates against HQCD without surrogates (Table 6.2) and find that admitting surrogates significantly improves the average delay (2.5 compared to 4.2). We also plot the average false alarm rate against the detection delay in Figure 6.6 and find that the HQCD results are, in general, the ones with the best trade-off between FA and ADD.

[Figure 6.6: scatter of false alarm rate vs. detection delay for HQCD, HQCD without surrogates, BOCPD, RuLSIF, W-GLRT, and GLRT]

Figure 6.6: False alarm vs. delay trade-off for the different methods. HQCD shows the best trade-off.

6.3.2 Real-life case study

In real-life scenarios, the true changepoint is typically unknown. One representative example can be seen w.r.t. the onset of major civil-unrest-related protests and uprisings. We present an analysis of three major uprisings: (i) in Brazil around mid-2013 (often termed the Brazilian Spring), (ii) in Venezuela around early 2014, and (iii) in Uruguay around late 2013. We first describe the data collection procedure and follow up with a comparative analysis of the detected changepoints.

Weekly counts of civil unrest events from Nov to Dec were obtained as part of a database of discrete unrest events (the Gold Standard Report, GSR) prepared by human analysts by parsing news articles for civil unrest content. Among other annotations, the GSR also classifies each event into one of 6 possible event types based on the reason (the "why") behind the protest. Each of these event types, namely a) Employment and Wages, b) Housing, c) Energy and Resources, d) Other Government, e) Other Economic, and f) Other, bears certain societal importance. We treat the weekly counts of each of these event types as the target sources ($S$) and the sum total of all protests for a week as the sum-of-targets ($E$). We also collected geo-fenced tweets for each country over the same time period. We used a human-annotated dictionary of 962 keywords/phrases that contains several identifiers of protest in the languages spoken in the countries of interest (similar to Ramakrishnan et al. [48]). As most of these keywords can have similar trends, we cluster them using k-means into 30 clusters (i.e., we have $J = 30$ surrogates). To account for scaling effects while preserving temporal coherence, each keyword time series was normalized to zero mean and unit variance.
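A minimal sketch of this surrogate-construction step follows, using synthetic counts in place of the actual keyword series. Note that the thesis does not specify how each cluster is reduced to a single surrogate series, so averaging the normalized members is an assumption made here for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)

# Hypothetical stand-in for the 962 protest keyword series (rows: keywords,
# columns: weeks); in the study these come from geo-fenced Twitter counts.
n_keywords, n_weeks = 962, 110
X = rng.poisson(20, size=(n_keywords, n_weeks)).astype(float)

# Normalize each keyword series to zero mean and unit variance so that
# clustering groups by temporal shape rather than raw volume.
X = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

# Cluster into J = 30 surrogate series; each surrogate is taken as the mean
# of its cluster's normalized keyword series (an illustrative choice).
labels = KMeans(n_clusters=30, n_init=10, random_state=0).fit_predict(X)
surrogates = np.vstack([X[labels == j].mean(axis=0) for j in range(30)])
print(surrogates.shape)   # (30, 110)
```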

[Figure 6.7: weekly event-count series with detected changepoints overlaid; panels (a) Brazil Total Protests, (b) Venezuela Total Protests, (c) Uruguay Total Protests]

Figure 6.7: Comparison of detected changepoints at the sum-of-targets (all protests). HQCD detections are shown in solid green, while those from the state-of-the-art methods, i.e., RuLSIF (red), WGLRT (magenta), BOCPD (purple), and GLRT (gold), are shown with dashed lines. The HQCD detection is the closest to the traditional start date of mass protests in the three countries studied.

Changepoints across layers

We show the changepoints detected by HQCD (bold green) and the state-of-the-art methods (dashed lines) for the sum of all protests in Figure 6.7 (see Figure C.1 in Appendix C for the individual protest types). We can observe that HQCD, which uses the surrogate information sources and exploits the hierarchical structure, finds indicators of changes that are visually better as well as more aligned with the dates of major events (see the demo at github.io/hqcd_supplementary). In contrast, the state-of-the-art methods can be argued to show a significantly high false alarm rate. For such real-world data sources the notion of a true changepoint is difficult to ascertain; we can instead consider, for example, the onset of the Brazilian Spring protests as an underlying changepoint at the sum-of-targets and interpret notions of false alarm against it. Table C.1 tabulates these inferences for the targets as well as the sum-of-targets. Although a true changepoint is unknown, we note that for HQCD the expected additive detection delay (EADD) can be estimated according to equation 6.2 (from $P(\Gamma \mid D^{(T)})$ in Algorithm 2).

Changepoint influence analysis

The experiments presented in the previous section can be further analyzed to ascertain the nature of the progression of significant events that lead to a protest. Here we present our analysis for the Brazilian Spring. We found that the detected changepoints (see Table C.1 in Appendix C) for Brazil reveal an interesting progression: significant changes in Energy-related unrest (06/02) propagated to Housing/Other Government unrest (06/16) and culminated in mass Employment-related unrest (08/18). Interestingly, we can analyze the fitted parameters of the weight vector $A_{0/1}^i$ of the rate updates (see equation 6.13) to quantize the changepoint influence of a source (target/surrogate) at time $T-1$ on time $T$.

[Figure 6.8: heatmaps over the event types Employment, Energy and Resources, Housing, Other, Other Economic, and Other Government; (a) influence of lagged targets on current targets, (b) influence of lagged surrogates on current targets]

Figure 6.8: (Brazilian Spring) Heatmap of changepoint influences of targets on targets (a) and of surrogates on targets (b). Darker (lighter) shades indicate higher (lower) changepoint influence. (a) shows the presence of strong off-diagonal elements, indicating strong cross-target changepoint information; (b) shows a mixture of uninformative and informative surrogates.

For each target $S_i$, we can compute the average value of the weight vector component of each target/surrogate separately. Let $h_0$ and $h_1$ denote these averages for one such source. Effectively, $h_0$ measures the effect of the source at time $t-1$ on $S_i$ at $t$ before the change, while $h_1$ captures the same post-change. Their percentage relative change can then be used as a measure of the changepoint influence of a particular target/surrogate source on $S_i$. We plot a heatmap of these percentages in Figure 6.8, for targets and surrogates separately. From Figure 6.8a, we can see that Other Economic and Employment related protests had strong influences from Housing related protests. Furthermore, from Figure 6.8b we can see that Housing and Employment related protests were influenced by similar Twitter chatter clusters (cluster-01 and cluster-26), indicating that the interaction between these protest subtypes can be inferred from the social domain. Conversely, Housing and Other Economic related protests are only weakly correlated through Twitter chatter, exhibiting the robustness of HQCD, which can still detect interactions between targets even when the surrogates fail to explain them. In general, for a particular target we can see linked precursors in other targets (strong off-diagonal elements in Figure 6.8a) and highly specific informative surrogates (few strong cells per row in Figure 6.8b).

6.4 Discussion

We have shown HQCD to be an effective framework for detecting changepoints in an online manner while accommodating multiple sources in a hierarchical framework. HQCD has been validated against both synthetic sources and real-life scenarios. In the next chapter, we will present our efforts at utilizing these changepoints towards robust forecasting models.

Supporting Information

A demo of HQCD and the datasets used in this chapter can be found in the supplementary materials. The attached appendix provides additional details on SMC².

Chapter 7

Concept Drift Adaptation for Google Flu Trends

Early detection of disease outbreaks can lead to prompt response strategies and effective implementation of counter-measures. Syndromic surveillance mechanisms hold great promise in improving the lead time to detection. Google Flu Trends (GFT) was one of the most celebrated examples of syndromic surveillance and emerged as one of the most popular mechanisms involving non-clinical data. Recent work, including at Google, has shown that systems like GFT, just like other surveillance and forecasting strategies, require periodic re-training and adaptation every year. In particular, GFT estimates tend to be locally spiky in nature, which often leads to difficulties in regression w.r.t. CDC ILI surveillance data. In addition to local variations, we posit that the fundamental cause of major seasonal performance variations of GFT lies in dynamic patterns in user search behavior. Such a phenomenon can be analyzed under the framework of concept drift. Our proposed approach is to explicitly model concept drift to make such ILI estimates from surrogate sources such as GFT more robust, in an online manner.

7.1 Background

Google Flu Trends first came into the limelight with Ginsberg et al.'s seminal work [24] on mining indicators for disease surveillance from social media activity. This work has spurred a flurry of research in this domain, such as [38]. GFT ILI estimates were available for several countries and regions, and can be used by epidemiologists to gain quick insight into the prevalent influenza state. However, as noted in recent studies such as [6, 33, 32], GFT has been under-performing against official surveillance data. In spite of updates to the GFT system, which attempt to rescale search query terms in response to sudden spikes in the search data, the drifting performance issue hasn't been completely resolved [32].

[Figure 7.1: GFT-Argentina series (left) and its 52-week rolling mean (right)]

Figure 7.1: Evidence of concept drift. For the Google Flu Trends data for Argentina (left), the corresponding 52-week rolling mean (right) exhibits a saddle point, indicating a possible mean-shift drift in GFT for Argentina.

Parts I and II have outlined our efforts at short-term and long-term forecasting of infectious diseases using surrogate sources. Such forecasts were also generated for the IARPA OSI challenge, for which our winning team developed Early Model Based Event Recognition using Surrogates (EMBERS) [48], an automated continuous surveillance and predictive system that monitors, among other things, epidemic and rare disease outbreaks. During this effort we came to better understand the inherent drift in surrogate-target relationships for diseases, and we had to continuously monitor and adapt our models, focusing equal attention on robustness and efficacy. From this experience, we learned that the effective usage of open source data, in the presence of ever-changing data patterns, necessitates the incorporation of adaptivity into models. Specifically, for ILI we have been monitoring a set of keywords in several media, such as Google search data, news, and Twitter, and found evidence of evolving correlations of such keyword counts with surveillance data [12]. We have also been closely collaborating with the CDC for the past three years, providing forecasts of US national and region-level ILINet percentages as well as seasonal indicators such as peaks. Such efforts led us to run a market for flu predictions under Scicast (https://scicast.org/flu). These experiences corroborate our EMBERS observations, and we have made similar observations about ILI disease surveillance in general.

Focusing on GFT, we conducted experiments for six Latin American countries, namely Argentina, Bolivia, Chile, Mexico, Peru, and Paraguay. Figure 7.1 shows the GFT data for Argentina and the corresponding rolling mean (over a 52-week window). As can be seen, the rolling mean indicates that the average activity of flu trends showed a major shift. Apart from this major change, similar smaller local changes in mean can also be observed. Rolling statistics over standard deviations and kurtosis provide similar insights. In general, a combination of these measures indicates that the GFT data distribution is non-stationary.
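Such rolling diagnostics are simple to reproduce. The following sketch uses a hypothetical series with an injected mean shift in place of the actual GFT-Argentina data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)

# Hypothetical weekly GFT-like series with a mean shift halfway through,
# standing in for the Argentina series of Figure 7.1.
gft = pd.Series(np.r_[rng.normal(100, 10, 156), rng.normal(140, 10, 156)])

rolling = gft.rolling(window=52)
stats = pd.DataFrame({
    "mean": rolling.mean(),       # drifting level, as in Figure 7.1 (right)
    "std": rolling.std(),         # scale changes
    "kurtosis": rolling.kurt(),   # shape changes
})
# A sustained trend in the rolling mean is evidence of non-stationarity.
print(stats.dropna().tail())
```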

[Figure 7.2: flowchart — target sources (surveillance data) and surrogate sources (GST data, GFT data, weather, HealthMap) feed a Concept Drift Detector (target/source mismatch), whose drift probabilities feed a Drift Adaptation stage (drift-adaptive resampling, model re-targeting) producing robust forecasts]

Figure 7.2: Concept drift adaptation framework. The framework ingests target sources such as CDC ILI case count data and surrogate sources such as GFT, and detects changepoints via the Concept Drift Detector stage. Drift probabilities are next passed on to the Drift Adaptation stage, where robust predictions are generated using resampling-based methods.

From a machine learning perspective, such non-stationarity in the independent variables leads to varying statistical correlations with the target variable (here, official surveillance data), a phenomenon also referred to as concept drift. Concept drift is known to cause predictions to become less accurate over time, and identifying and handling such drifts can yield significant improvements in models. We observed similar trends of concept drift in the GFT data for the other five Latin American countries.

7.2 Robust Models via Concept Drift Adaptation

Concept drift is an actively studied problem, and researchers have proposed many different methods to handle it [23]. Some of the more popular methods focus on ensemble models, where ensembles can be created either at the model level or via random resampling of data points to constitute a drift-adapted dataset that can then be passed on to machine learning algorithms. We focus on the random resampling approaches, with an aim towards a computationally inexpensive and generic approach, and propose a two-step formalism to handle concept drift towards a robust GFT estimate.

First, we detect concept drifts in the surrogate-target data relationships using an online nonparametric changepoint detection test (see Chapter 6). We use windowed GLRT approaches, regressing with a Poisson regression model from the surrogate data sources to the ILI surveillance data, and analyze the regression errors (slacks) for changes in distribution. Following the classical CUSUM test and our experiences (see Chapter 6), we propose a rolling window over the series of slacks and identify changepoints based on log-likelihood ratios. These log-likelihood ratios can then be used as probabilities of concept drift for each time point, and we can use weighted resampling of past data, where the weight for sampling the time point $t$ is given as:

$w_t = \frac{1 - L_{\text{drift}}(t)}{\sum_{t'} \left(1 - L_{\text{drift}}(t')\right)}, \quad (7.1)$

where $L_{\text{drift}}(t)$ quantifies the drift at time $t$ in terms of the likelihood of a change at the said time point. The second component involves fitting a Poisson regression once more, but this time on the resampled dataset, to find updated model parameters and generate the adapted GFT estimates. We use random resampling without replacement with the drift probabilities from equation 7.1 and fit our Poisson regression model on the resampled data. The framework is shown schematically in Figure 7.2. We can also employ a feedback mechanism where past accuracies of the adapted GFT against ILI surveillance data are used to update the computed log-likelihood of drift. A minimal sketch of this resample-and-refit step is given below.
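The sketch assumes hypothetical surrogate features and drift likelihoods; scikit-learn's PoissonRegressor stands in for the Poisson regression used in the framework:

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(11)

# Hypothetical data: X holds surrogate features (e.g., scaled GFT values),
# y the ILI surveillance counts; L_drift holds the per-week drift likelihoods
# produced by the changepoint detector.
n = 200
X = rng.normal(size=(n, 3))
y = rng.poisson(np.exp(0.5 + X @ np.array([0.3, -0.2, 0.1])))
L_drift = np.clip(rng.beta(1, 10, size=n), 0, 1)

# Equation 7.1: time points with high drift likelihood are down-weighted.
w = (1.0 - L_drift) / (1.0 - L_drift).sum()

# Resample (without replacement) a drift-adapted subset of past data, then
# refit the Poisson regression on it to obtain the adapted estimates.
idx = rng.choice(n, size=int(0.7 * n), replace=False, p=w)
model = PoissonRegressor(alpha=1e-4).fit(X[idx], y[idx])
adapted = model.predict(X)
print(adapted[:5])
```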

7.2.1 Experimental evaluation and comparing surrogate sources

The proposed method, outlined in the previous section, can capture drifts using aggregated surrogate activity. As in Parts I and II, we intend to compare different surrogate sources for their drift correction ability. Our results from Part II indicated that long-term forecasts for Mexico were especially noisy; as such, we focus on Mexico for the following assay. We applied our drift adaptation framework for the season, and Table 7.1 presents our findings. As can be seen, the incorporation of surrogate sources via drift adapters significantly improves forecasting accuracy. GST contributes most significantly towards the drift adaptation, while a combination of all sources produces the best overall forecasting accuracy. Significant drift adaptation can also be seen for HealthMap; however, the absolute value of its forecasting accuracy renders the HealthMap source insignificant for the country of interest.

[Table 7.1: Comparison of surrogate sources pre- and post-drift adaptation, reporting pre-drift correction accuracy, post-drift correction accuracy, and percentage correction for GST, GFT, HealthMap, Weather, and all sources combined.]

We also plot the quality score and deviance distributions for the pre- and post-drift-corrected forecasts for GFT, GST, HealthMap, weather, and all sources in Figures 7.3, 7.4, 7.5, 7.6, and 7.7, respectively. As can be seen, the quality score distribution of the forecasts shows a marked improvement, both in terms of higher absolute values and tighter bounds, for the post-drift-corrected models. The figures also show the distribution of the residual deviance. In terms of concept drift, a narrow distribution indicates a well-fitted problem and hence better drift correction, whereas a more spread-out deviance distribution indicates a sub-optimal correction. The deviance plots also exhibit the efficacy of our methods: especially GST and the combined sources show marked improvements, indicating that these are the best methods for correcting for drift.

7.3 Discussion

We have proposed a computationally inexpensive method of drift adaptation for disease sources for Mexico. Our results indicate that significant improvements in forecasting, as well as modeling, accuracy can be achieved by including surrogates via the proposed framework. Furthermore, a combination of all sources performs best in terms of drift adaptation, exhibiting the importance of considering diverse sources. In the future, we would extend this analysis to more regions and ascertain the relative importance of such sources w.r.t. the regions.

[Figure 7.3: Drift adaptation for Mexico using GFT. (a) Quality score distribution of forecasts before and after drift correction. (b) Residual deviance distributions for the drift-uncorrected and drift-corrected models.]

[Figure 7.4: Drift adaptation for Mexico using GST. (a) Quality score distribution of forecasts before and after drift correction. (b) Residual deviance distributions for the drift-uncorrected and drift-corrected models.]

[Figure 7.5: Drift adaptation for Mexico using HealthMap. (a) Quality score distribution of forecasts before and after drift correction. (b) Residual deviance distributions for the drift-uncorrected and drift-corrected models.]

[Figure 7.6: Drift adaptation for Mexico using weather sources. (a) Quality score distribution of forecasts before and after drift correction. (b) Residual deviance distributions for the drift-uncorrected and drift-corrected models.]

[Figure 7.7: Drift adaptation for Mexico using all sources. (a) Quality score distribution of forecasts before and after drift correction. (b) Residual deviance distributions for the drift-uncorrected and drift-corrected models.]

Chapter 8

Conclusion

We have presented the problem of time series prediction using surrogates and motivated our efforts by examining the particular case of influenza forecasting. We identified three major thrusts for this problem, viz. (i) short-term forecasting, (ii) long-term forecasting, and (iii) concept drift. We presented our approaches for each of these thrusts in this thesis and communicated our findings in [12, 62, 13]. Our results showcase the efficacy of using surrogates to forecast disease characteristics. In the following sections, we discuss the importance of surrogate information available from open source indicators for public health surveillance and conclude with some key insights on how such surrogates can be used towards an integrated surveillance mechanism.

8.1 Importance of Open Source Indicators for Public Health

Our results indicate that open source indicators (OSI) are extremely useful for forecasting various facets of disease characteristics, such as peak intensity and short-term case counts. One of the key advantages of using surrogates can be attributed to the real-time nature of such sources as well as their ready availability. However, such surrogates are in general noisy and may exhibit changing relationships with the disease characteristics of interest. For example, the volume of search queries for the term flu may have been more indicative of ILI case counts in the population for the years preceding 2011 than post-2012, for the United States. Thus, this work motivates the use of algorithms that in principle are aware of the possibility of such changing patterns and, more importantly, are adaptable to such circumstances. In general, surrogate sources, especially the non-physical ones, can be considered sensors of disease spread in the population rather than actual indicators of disease characteristics. Surrogates from a particular source (see Part I) may contain more information about a certain stage of the disease spread than others.

For example, Figure 8.1 indicates that disease keywords from the HealthMap news corpus are more indicative during the start of the season, whereas search query volumes as accessed by Google Search Trends exhibit a sub-optimal but stable correlation throughout the season. Thus a single OSI source may not be suitable for robust disease forecasting. However, as seen in Parts I and II, combining multiple surrogates can lead to a more robust and stable forecasting framework. It can be argued that multiple surrogates may provide better coverage over the different stages of the season. Also, noise such as spikes in search query activity may be better compensated for by using a variety of OSI sources, and a consensus of increased/decreased activity may better inform a forecasting framework.

Another crucial aspect of public health surveillance is the fact that the ground truth information available at a particular point of time is subject to noise. Consequently, models for disease forecasting should be aware of such noise, which can often be systematic. Such flexibility in modeling is all the more important while using OSI sources, as such sources may themselves be subject to noise. In this work, we have shown that forecasting with an ability to model the surveillance uncertainty increases the final forecasting accuracy manifold.

This work has focused on influenza as a primer for endemic disease forecasting. One of the key advantages of using influenza as an application is the fact that it is one of the most common infectious diseases worldwide, exhibiting evolving patterns over regions and time, and, more importantly, has significant public health impact. In this work, we have found physical sources such as temperature and humidity to be more useful in forecasting influenza conditions in the population. Non-physical sources were found to contribute to the overall forecasting accuracy to varying degrees w.r.t. countries and disease characteristics. For short-term forecasts, Twitter chatter and disease-related news were found to be significantly useful for a number of countries of interest. For long-term forecasts, surrogates were found to be more useful at the initial stages of the disease season, where disease information from traditional surveillance is sparse and more noisy. For the later part of the season, simulation-based models worked better than data assimilation models, and the inclusion of surrogates improved the overall forecasting accuracy to a lesser degree.

8.2 Guidelines for Using Surrogates for Health Surveillance

In this section, we combine the insights presented in the previous section with our experience in forecasting infectious diseases into a list of guidelines that may be followed while using surrogates for disease surveillance, as follows:

Surrogates are more useful for forecasting diseases in regions where historical data for such sources, as well as surveillance data for the said diseases, are available for at least a few disease seasons. For emerging diseases, such surrogates may still be useful but may

[Figure 8.1: counts of influenza-related keywords from (a) HealthMap and (b) GST plotted against PAHO influenza case counts for Argentina]

Figure 8.1: Correlation of surrogate sources with disease incidence. Counts of influenza-related keywords from (a) HealthMap and (b) GST are compared against influenza case counts for Argentina as available from PAHO. HealthMap keywords capture the start of the season more accurately, while GST keywords exhibit a sub-optimal but consistent correlation with the PAHO counts.


Lec 02: Estimation & Hypothesis Testing in Animal Ecology Lec 02: Estimation & Hypothesis Testing in Animal Ecology Parameter Estimation from Samples Samples We typically observe systems incompletely, i.e., we sample according to a designed protocol. We then

More information

Pandemic (H1N1) (August 28, h GMT; 12 h EST) Update on the qualitative indicators

Pandemic (H1N1) (August 28, h GMT; 12 h EST) Update on the qualitative indicators Regional Update Pandemic (H1N1) 29 (August 28, 29-17 h GMT; 12 h EST) Update on the qualitative indicators For Epidemiological Week 33 (EW 33), from 16 August to 22 August, 22 countries reported updated

More information

USING SOCIAL MEDIA TO STUDY PUBLIC HEALTH MICHAEL PAUL JOHNS HOPKINS UNIVERSITY MAY 29, 2014

USING SOCIAL MEDIA TO STUDY PUBLIC HEALTH MICHAEL PAUL JOHNS HOPKINS UNIVERSITY MAY 29, 2014 USING SOCIAL MEDIA TO STUDY PUBLIC HEALTH MICHAEL PAUL JOHNS HOPKINS UNIVERSITY MAY 29, 2014 WHAT IS PUBLIC HEALTH? preventing disease, prolonging life and promoting health WHAT IS PUBLIC HEALTH? preventing

More information

Counteracting structural errors in ensemble forecast of influenza outbreaks

Counteracting structural errors in ensemble forecast of influenza outbreaks DOI:.38/s4467-7-33- Counteracting structural errors in ensemble forecast of influenza outbreaks Sen Pei & Jeffrey Shaman OPEN For influenza forecasts generated using dynamical models, forecast inaccuracy

More information

Project title: Using Non-Local Connectivity Information to Identify Nascent Disease Outbreaks

Project title: Using Non-Local Connectivity Information to Identify Nascent Disease Outbreaks MIDAS Pilot Study Final Report Project title: Using Non-Local Connectivity Information to Identify Nascent Disease Outbreaks Tongbo Huang, James Sharpnack and Aarti Singh Carnegie Mellon University 1 Specific

More information

Influenza Vaccine Use In the Americas

Influenza Vaccine Use In the Americas Influenza Vaccine Use In the Americas Network for Evaluation of Influenza Vaccine Effectiveness REVELAC-i Alba Maria Ropero-Alvarez Comprehensive Family Immunization Unit PAHO/FGL/IM Global Vaccine and

More information

Type and quantity of data needed for an early estimate of transmissibility when an infectious disease emerges

Type and quantity of data needed for an early estimate of transmissibility when an infectious disease emerges Research articles Type and quantity of data needed for an early estimate of transmissibility when an infectious disease emerges N G Becker (Niels.Becker@anu.edu.au) 1, D Wang 1, M Clements 1 1. National

More information

Influenza forecast optimization when using different surveillance data types and geographic scale

Influenza forecast optimization when using different surveillance data types and geographic scale Received: 12 February 2018 Accepted: 11 July 2018 DOI: 10.1111/irv.12594 ORIGINAL ARTICLE Influenza forecast optimization when using different surveillance data types and geographic scale Haruka Morita

More information

Bayesian Nonparametric Methods for Precision Medicine

Bayesian Nonparametric Methods for Precision Medicine Bayesian Nonparametric Methods for Precision Medicine Brian Reich, NC State Collaborators: Qian Guan (NCSU), Eric Laber (NCSU) and Dipankar Bandyopadhyay (VCU) University of Illinois at Urbana-Champaign

More information

CHAPTER 6 HUMAN BEHAVIOR UNDERSTANDING MODEL

CHAPTER 6 HUMAN BEHAVIOR UNDERSTANDING MODEL 127 CHAPTER 6 HUMAN BEHAVIOR UNDERSTANDING MODEL 6.1 INTRODUCTION Analyzing the human behavior in video sequences is an active field of research for the past few years. The vital applications of this field

More information

Jonathan D. Sugimoto, PhD Lecture Website:

Jonathan D. Sugimoto, PhD Lecture Website: Jonathan D. Sugimoto, PhD jons@fredhutch.org Lecture Website: http://www.cidid.org/transtat/ 1 Introduction to TranStat Lecture 6: Outline Case study: Pandemic influenza A(H1N1) 2009 outbreak in Western

More information

Modelling Research Productivity Using a Generalization of the Ordered Logistic Regression Model

Modelling Research Productivity Using a Generalization of the Ordered Logistic Regression Model Modelling Research Productivity Using a Generalization of the Ordered Logistic Regression Model Delia North Temesgen Zewotir Michael Murray Abstract In South Africa, the Department of Education allocates

More information

isc ove ring i Statistics sing SPSS

isc ove ring i Statistics sing SPSS isc ove ring i Statistics sing SPSS S E C O N D! E D I T I O N (and sex, drugs and rock V roll) A N D Y F I E L D Publications London o Thousand Oaks New Delhi CONTENTS Preface How To Use This Book Acknowledgements

More information

A Vision-based Affective Computing System. Jieyu Zhao Ningbo University, China

A Vision-based Affective Computing System. Jieyu Zhao Ningbo University, China A Vision-based Affective Computing System Jieyu Zhao Ningbo University, China Outline Affective Computing A Dynamic 3D Morphable Model Facial Expression Recognition Probabilistic Graphical Models Some

More information

INTRODUCTION TO MACHINE LEARNING. Decision tree learning

INTRODUCTION TO MACHINE LEARNING. Decision tree learning INTRODUCTION TO MACHINE LEARNING Decision tree learning Task of classification Automatically assign class to observations with features Observation: vector of features, with a class Automatically assign

More information

Forecasting Influenza Levels using Real-Time Social Media Streams

Forecasting Influenza Levels using Real-Time Social Media Streams 2017 IEEE International Conference on Healthcare Informatics Forecasting Influenza Levels using Real-Time Social Media Streams Kathy Lee, Ankit Agrawal, Alok Choudhary EECS Department, Northwestern University,

More information

Tom Hilbert. Influenza Detection from Twitter

Tom Hilbert. Influenza Detection from Twitter Tom Hilbert Influenza Detection from Twitter Influenza Detection from Twitter Addressed Problem Contributions Technical Approach Model Data Results/Findings Page 2 Influenza Detection from Twitter Addressed

More information

Spatial and Temporal Data Fusion for Biosurveillance

Spatial and Temporal Data Fusion for Biosurveillance Spatial and Temporal Data Fusion for Biosurveillance Karen Cheng, David Crary Applied Research Associates, Inc. Jaideep Ray, Cosmin Safta, Mahmudul Hasan Sandia National Laboratories Contact: Ms. Karen

More information

Pandemic (H1N1) (August 14, h GMT; 12 h EST) Update on the Qualitative Indicators

Pandemic (H1N1) (August 14, h GMT; 12 h EST) Update on the Qualitative Indicators Regional Update Pandemic (H1N1) 2009 (August 14, 2009-17 h GMT; 12 h EST) Update on the Qualitative Indicators For epidemiological week 31 (EW 31, August 2 to August 8) 17 countries have reported updated

More information

CSC2130: Empirical Research Methods for Software Engineering

CSC2130: Empirical Research Methods for Software Engineering CSC2130: Empirical Research Methods for Software Engineering Steve Easterbrook sme@cs.toronto.edu www.cs.toronto.edu/~sme/csc2130/ 2004-5 Steve Easterbrook. This presentation is available free for non-commercial

More information

International Journal of Research in Science and Technology. (IJRST) 2018, Vol. No. 8, Issue No. IV, Oct-Dec e-issn: , p-issn: X

International Journal of Research in Science and Technology. (IJRST) 2018, Vol. No. 8, Issue No. IV, Oct-Dec e-issn: , p-issn: X CLOUD FILE SHARING AND DATA SECURITY THREATS EXPLORING THE EMPLOYABILITY OF GRAPH-BASED UNSUPERVISED LEARNING IN DETECTING AND SAFEGUARDING CLOUD FILES Harshit Yadav Student, Bal Bharati Public School,

More information

Spatio-temporal modeling of weekly malaria incidence in children under 5 for early epidemic detection in Mozambique

Spatio-temporal modeling of weekly malaria incidence in children under 5 for early epidemic detection in Mozambique Spatio-temporal modeling of weekly malaria incidence in children under 5 for early epidemic detection in Mozambique Katie Colborn, PhD Department of Biostatistics and Informatics University of Colorado

More information

Quantitative Evaluation of Edge Detectors Using the Minimum Kernel Variance Criterion

Quantitative Evaluation of Edge Detectors Using the Minimum Kernel Variance Criterion Quantitative Evaluation of Edge Detectors Using the Minimum Kernel Variance Criterion Qiang Ji Department of Computer Science University of Nevada Robert M. Haralick Department of Electrical Engineering

More information

Regional Update Pandemic (H1N1) 2009 (May 24, h GMT; 12 h EST)

Regional Update Pandemic (H1N1) 2009 (May 24, h GMT; 12 h EST) Regional Update Pandemic (H1N1) 2009 (May 24, 2010-17 h GMT; 12 h EST) The information contained within this update is obtained from data provided by Ministries of Health of Member States and National

More information

Regional Update Pandemic (H1N1) 2009

Regional Update Pandemic (H1N1) 2009 Regional Update Pandemic (H1N1) 2009 (September 18, 2009-22 h GMT; 17 h EST) Update on the Qualitative Indicators For Epidemiological Week 36 (EW 36), from 6 September to 12 September, 17 countries reported

More information

List of Figures. List of Tables. Preface to the Second Edition. Preface to the First Edition

List of Figures. List of Tables. Preface to the Second Edition. Preface to the First Edition List of Figures List of Tables Preface to the Second Edition Preface to the First Edition xv xxv xxix xxxi 1 What Is R? 1 1.1 Introduction to R................................ 1 1.2 Downloading and Installing

More information

White Paper Estimating Complex Phenotype Prevalence Using Predictive Models

White Paper Estimating Complex Phenotype Prevalence Using Predictive Models White Paper 23-12 Estimating Complex Phenotype Prevalence Using Predictive Models Authors: Nicholas A. Furlotte Aaron Kleinman Robin Smith David Hinds Created: September 25 th, 2015 September 25th, 2015

More information

Performance of Median and Least Squares Regression for Slightly Skewed Data

Performance of Median and Least Squares Regression for Slightly Skewed Data World Academy of Science, Engineering and Technology 9 Performance of Median and Least Squares Regression for Slightly Skewed Data Carolina Bancayrin - Baguio Abstract This paper presents the concept of

More information

Director s Update Brief novel 2009-H1N1. Tuesday 21 JUL EDT Day 95. Week of: Explaining the burden of disease and aligning resources

Director s Update Brief novel 2009-H1N1. Tuesday 21 JUL EDT Day 95. Week of: Explaining the burden of disease and aligning resources Director s Update Brief novel 9-H1N1 Tuesday 1 JUL 9 815 EDT Day 95 Week of: Explaining the burden of disease and aligning resources Key Events novel 9-H1N1 Declarations WHO: Pandemic Phase (11 JUN 9 1

More information

SIS-SEIQR Adaptive Network Model for Pandemic Influenza

SIS-SEIQR Adaptive Network Model for Pandemic Influenza SIS-SEIQR Adaptive Network Model for Pandemic Influenza WANNIKA JUMPEN,2, SOMSAK ORANKITJAROEN,2, PICHIT BOONKRONG,2 BOONMEE WATTANANON, BENCHAWAN WIWATANAPATAPHEE,2 Department of Mathematics, Faculty

More information

Regional Update EW 46 Influenza (November 28, h GMT; 12 h EST)

Regional Update EW 46 Influenza (November 28, h GMT; 12 h EST) Regional Update EW 46 Influenza (November 28, - 17 h GMT; 12 h EST) The information presented in this update is based on data provided by Ministries of Health and National Influenza Centers of Member States

More information

Predicting Sleep Using Consumer Wearable Sensing Devices

Predicting Sleep Using Consumer Wearable Sensing Devices Predicting Sleep Using Consumer Wearable Sensing Devices Miguel A. Garcia Department of Computer Science Stanford University Palo Alto, California miguel16@stanford.edu 1 Introduction In contrast to the

More information

Sensory Cue Integration

Sensory Cue Integration Sensory Cue Integration Summary by Byoung-Hee Kim Computer Science and Engineering (CSE) http://bi.snu.ac.kr/ Presentation Guideline Quiz on the gist of the chapter (5 min) Presenters: prepare one main

More information

Nature Neuroscience: doi: /nn Supplementary Figure 1. Behavioral training.

Nature Neuroscience: doi: /nn Supplementary Figure 1. Behavioral training. Supplementary Figure 1 Behavioral training. a, Mazes used for behavioral training. Asterisks indicate reward location. Only some example mazes are shown (for example, right choice and not left choice maze

More information

MEA DISCUSSION PAPERS

MEA DISCUSSION PAPERS Inference Problems under a Special Form of Heteroskedasticity Helmut Farbmacher, Heinrich Kögel 03-2015 MEA DISCUSSION PAPERS mea Amalienstr. 33_D-80799 Munich_Phone+49 89 38602-355_Fax +49 89 38602-390_www.mea.mpisoc.mpg.de

More information

OHDSI Tutorial: Design and implementation of a comparative cohort study in observational healthcare data

OHDSI Tutorial: Design and implementation of a comparative cohort study in observational healthcare data OHDSI Tutorial: Design and implementation of a comparative cohort study in observational healthcare data Faculty: Martijn Schuemie (Janssen Research and Development) Marc Suchard (UCLA) Patrick Ryan (Janssen

More information

SUPPLEMENTAL MATERIAL

SUPPLEMENTAL MATERIAL 1 SUPPLEMENTAL MATERIAL Response time and signal detection time distributions SM Fig. 1. Correct response time (thick solid green curve) and error response time densities (dashed red curve), averaged across

More information

The Context of Detection and Attribution

The Context of Detection and Attribution 10.2.1 The Context of Detection and Attribution In IPCC Assessments, detection and attribution involve quantifying the evidence for a causal link between external drivers of climate change and observed

More information

Regional Update Pandemic (H1N1) 2009 (December 1, h GMT; 12 h EST)

Regional Update Pandemic (H1N1) 2009 (December 1, h GMT; 12 h EST) Regional Update Pandemic (H1N1) 2009 (December 1, 2009-17 h GMT; 12 h EST) The information contained within this update is obtained from data provided by Ministries of Health of Member States and National

More information

T. R. Golub, D. K. Slonim & Others 1999

T. R. Golub, D. K. Slonim & Others 1999 T. R. Golub, D. K. Slonim & Others 1999 Big Picture in 1999 The Need for Cancer Classification Cancer classification very important for advances in cancer treatment. Cancers of Identical grade can have

More information

Rhonda L. White. Doctoral Committee:

Rhonda L. White. Doctoral Committee: THE ASSOCIATION OF SOCIAL RESPONSIBILITY ENDORSEMENT WITH RACE-RELATED EXPERIENCES, RACIAL ATTITUDES, AND PSYCHOLOGICAL OUTCOMES AMONG BLACK COLLEGE STUDENTS by Rhonda L. White A dissertation submitted

More information

Introduction of New Vaccines in Latin America and the Caribbean: Decision-Making LUCIA HELENA DE OLIVEIRA AND BARBARA JAUREGUI

Introduction of New Vaccines in Latin America and the Caribbean: Decision-Making LUCIA HELENA DE OLIVEIRA AND BARBARA JAUREGUI Introduction of New Vaccines in Latin America and the Caribbean: Decision-Making LUCIA HELENA DE OLIVEIRA AND BARBARA JAUREGUI 2 Introduction of New Vaccines in Latin America and the Caribbean: Decision-Making

More information

List of Figures. 3.1 Endsley s Model of Situational Awareness Endsley s Model Applied to Biosurveillance... 64

List of Figures. 3.1 Endsley s Model of Situational Awareness Endsley s Model Applied to Biosurveillance... 64 List of Figures 1.1 Select Worldwide Disease Occurrences in Recent Decades... 5 1.2 Biosurveillance Taxonomy... 6 1.3 Public Health Surveillance Taxonomy... 7 1.4 Improving the Probability of Detection

More information

A Learning Method of Directly Optimizing Classifier Performance at Local Operating Range

A Learning Method of Directly Optimizing Classifier Performance at Local Operating Range A Learning Method of Directly Optimizing Classifier Performance at Local Operating Range Lae-Jeong Park and Jung-Ho Moon Department of Electrical Engineering, Kangnung National University Kangnung, Gangwon-Do,

More information

From Biostatistics Using JMP: A Practical Guide. Full book available for purchase here. Chapter 1: Introduction... 1

From Biostatistics Using JMP: A Practical Guide. Full book available for purchase here. Chapter 1: Introduction... 1 From Biostatistics Using JMP: A Practical Guide. Full book available for purchase here. Contents Dedication... iii Acknowledgments... xi About This Book... xiii About the Author... xvii Chapter 1: Introduction...

More information

Inter-session reproducibility measures for high-throughput data sources

Inter-session reproducibility measures for high-throughput data sources Inter-session reproducibility measures for high-throughput data sources Milos Hauskrecht, PhD, Richard Pelikan, MSc Computer Science Department, Intelligent Systems Program, Department of Biomedical Informatics,

More information

Introduction to Machine Learning. Katherine Heller Deep Learning Summer School 2018

Introduction to Machine Learning. Katherine Heller Deep Learning Summer School 2018 Introduction to Machine Learning Katherine Heller Deep Learning Summer School 2018 Outline Kinds of machine learning Linear regression Regularization Bayesian methods Logistic Regression Why we do this

More information

Special Topic. The ten leading causes of death in countries of the Americas

Special Topic. The ten leading causes of death in countries of the Americas The ten leading causes of death in countries of the Americas Table 1: Country specific information on the ten leading causes of death in broad age groups, by sex, for the latest two or three data years

More information

Linear and Nonlinear Optimization

Linear and Nonlinear Optimization Linear and Nonlinear Optimization SECOND EDITION Igor Griva Stephen G. Nash Ariela Sofer George Mason University Fairfax, Virginia Society for Industrial and Applied Mathematics Philadelphia Contents Preface

More information

A Brief Introduction to Bayesian Statistics

A Brief Introduction to Bayesian Statistics A Brief Introduction to Statistics David Kaplan Department of Educational Psychology Methods for Social Policy Research and, Washington, DC 2017 1 / 37 The Reverend Thomas Bayes, 1701 1761 2 / 37 Pierre-Simon

More information

A micropower support vector machine based seizure detection architecture for embedded medical devices

A micropower support vector machine based seizure detection architecture for embedded medical devices A micropower support vector machine based seizure detection architecture for embedded medical devices The MIT Faculty has made this article openly available. Please share how this access benefits you.

More information