Detecting Devastating Diseases in Search Logs

Size: px

Start display at page:

Download "Detecting Devastating Diseases in Search Logs"

Stuart Bradford
6 years ago
Views:

1 Detecting Devastating Diseases in Search Logs John Paparrizos Columbia University Ryen W. White Microsoft Research Eric Horvitz Microsoft Research ABSTRACT Web search queries can o er a unique population-scale window onto streams of evidence that are useful for detecting the emergence of health conditions. We explore the promise of harnessing behavioral signals in search logs to provide advance warning about the presence of devastating diseases such as pancreatic cancer. Pancreatic cancer is often diagnosed too late to be treated e ectively as the cancer has usually metastasized by the time of diagnosis. Symptoms of the early stages of the illness are often subtle and nonspecific. We identify searchers who issue credible, first-person diagnostic queries for pancreatic cancer and we learn models from prior search histories that predict which searchers will later input such queries. We show that we can infer the likelihood of seeing the rise of diagnostic queries months before they appear and characterize the tradeo between predictivity and false positive rate. The findings highlight the potential of harnessing search logs for the early detection of pancreatic cancer and more generally for harnessing search systems to reduce health risks for individuals. Keywords Health screening; Search logs; Behavioral data; Digital disease detection; Pancreatic cancer; Temporal analysis 1. INTRODUCTION Web search is a primary resource for people concerned about the significance of health-related symptoms [16]. Researchers have studied symptom and illness-related searches in pursuit of insights about how people search about health concerns, including patterns of querying and review of information in pursuit of diagnoses [54], healthcare utilization signals [55], traces of therapeutic decision making for challenging illnesses [40], and identification of new adverse effects of medications [56, 57]. Prior studies have examined how population-level signals in social media can be used to detect the emergence of diseases [10, 19]. Work conducted during a Microsoft Research internship. John Paparrizos is an Alexander S. Onassis Foundation Scholar. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. KDD 16, August 13-17, 2016, San Francisco, CA, USA c 2016 ACM. ISBN /16/08... $15.00 DOI: Pancreatic cancer symptom searchers (C) Negative cases Pancreatic cancer searchers (A) Experiential pancreatic cancer searchers (B) Positive cases Searchers in A who query for symptoms but have no experiential queries. Figure 1: Venn diagram depicting sets of users employed in analyses: pancreatic cancer searchers (A), pancreatic cancer searchers exhibiting experiential diagnostic queries (B), and those who search for the symptoms of pancreatic cancer (C). A[C (i.e., number of users in original, pre-filtered dataset) is 9.2 million. Positives and negatives are sourced from B\C and C \ A, respectively. Relative set sizes are not to scale. We explore the prospect of harnessing anonymized longterm sequences of health-related search queries to yield information that could provide valuable signals for detection of illness in advance of traditional diagnosis. Leveraging online behavioral data to provide earlier detection of a disease or of the raised risk of illness on a large scale can make significant contributions to healthcare. Better outcomes can be achieved by earlier confirmation of illnesses and risks via gaining access to more timely diagnoses, treatments, and other proactive interventions. As an example, such capabilities might help to identify those at significant risk of suffering the onset of advance of chronic disease processes such as diabetes or heart disease or the rise of acute processes such as atrial fibrillation or more severe cardiac arrhythmias. Interventional programs ranging from changes in diet or exercise to taking (or avoiding) certain medications can yield significant health benefits. The diagnosis for certain medical conditions can be particularly devastating if the chances of survival are typically low at the time of diagnosis. Survival rates may be improved significantly via earlier detection and treatment. With many cancers, screening methods can be e ective for early detection and therapy [41], but they involve explicit testing prompted by policies around risk factors such as family history [33] or medical history [14, 32]. Ideally, health screening systems would be able to observe people passively as they engage in their normal activities and alert them (per their preferences on vigilance) to potential health risks with-

2 out requiring the investment of time and e ort in special screening activities. Individual-level postings on social media have been mined for this purpose [10, 46], but may not have representative coverage of symptoms associated with social stigma [11]. We present a study of the feasibility of doing early detection of devastating diseases based on large-scale logs of health-related Web search activity. We consider the content of queries over time, and the prospect that temporal relationships and patterns among queries over multiple sessions over several months provide subtle fingerprints of lurking illness. We focus on the early detection of the presence of pancreatic cancer, a devastating diagnosis given the typical progression of the disease to an inoperable situation by the time it is found. Our work showcases the potential value of innovative approaches for speeding up the time to diagnosis of this deadly disease. Pancreatic cancer is the fourth leading cause of cancer death in men and women in the United States and the sixth leading cause in Europe [36]. The illness is widely known as being di cult to detect and is frequently diagnosed too late to be treated e ectively [22, 30]. A recent study found that the progression of pancreatic cancer from stage I to stage IV happens in just over one year [59]. Approximately 75% of pancreatic cancer patients die within a year of diagnosis, and only about 4% survive for five years post diagnosis. Exploration of the possibility that a patient has pancreatic cancer involves a careful and costly consideration of history, labwork, and imaging studies (in contrast to the passive screening methods described in this paper) [45]. Screening is largely performed to detect the disease at an early phase (pre-invasive or early invasive) when it is still curable by surgical intervention and chemotherapy. Earlier diagnosis of pancreatic cancer improves the feasibility of discovering the illness at an earlier stage [7]. For patients diagnosed early who undergo curative surgery (e.g., a Whipple procedure), five-year survival rate is higher, but it remains less than 25% [58]. We take as a proxy for ground truth of a diagnosis of pancreatic cancer the detection of experiential diagnostic queries issued by searchers. Experiential queries show strong evidence of being linked to the actual presence of symptomatology or conditions versus less directly involved, more distant exploratory queries seeking information about symptoms or diseases [40]. Experiential diagnostic queries for pancreatic cancer are identified via consideration of the structure of queries and of patterns of information gathering over multiple users in search logs. Experiential queries often include first-person assertions such as [i wasjustdiagnosedwithpancreatic cancer], which when associated with prior queries about symptoms identifies the positive cases. We construct models to predict the future rise of experiential queries from longitudinal search data. Figure 1 shows the di erent subsets of users in our analysis, including people who search for pancreatic cancer (A), the subset of these searchers who issue experiential diagnostic queries (B), and those who search for a set of symptoms linked to pancreatic cancer (C). Those who only search for one or more related symptoms with no evidence of pancreatic cancer searching constitute the negative cases. We find that our methods can detect cases where people show evidence of being diagnosed with pancreatic cancer many months in advance of their experiential diagnostic queries. We make the following contributions with this research: Introduce early detection of diseases as a promising new application of search log mining and machine learning that scales to millions of searchers. Present a case study on the early detection of pancreatic cancer from longitudinal individual search activity. Forecast with significant lead times that users will later input experiential queries for pancreatic cancer. Explore the influence of di erent factors, such as the lead time or the presence of specific symptoms in the search activity, on the predictive performance of our learned models, including true positive rates when false positive rates are strictly controlled. Controlling false positives is especially important to reduce unnecessary costs and concerns given potential future applications such as providing early warnings and suggestions to searchers about undertaking more formal screenings. We now describe related research in this important area. 2. RELATED WORK Related research in a number of areas is relevant to our work. These include (i) health searching; (ii) large-scale analysis of search behavior; and (iii) methods for the early detection of disease, with a focus on pancreatic cancer. The Web is an important source of health-related information for many people. To better understand how people pursue health information, studies have examined online health search using a variety of methods, including interviews [42], surveys [49], and analyses of large-scale search log data [2, 5]. According to a 2013 survey, 59% of American adults had used the Web to find health information in the year preceding the survey, 35% of those adults engaged in self-diagnosis, and over half of these self-diagnosing searchers then discussed the matter with a clinician following the search [16]. Despite the potential benefits, concerns have been raised about the quality of online health information [8]. In a large-scale survey of the use of search for self-diagnosis, White and Horvitz [53] found that almost 40% of participants experienced increased anxiety from searching health information online. Studies have characterized problems with symptom search, including the influence of poor accounting for base rates of diseases and people s bias to focus on results covering serious illnesses versus more likely benign explanations. Such biases can lead to inappropriate anxiety [28, 53] and highlight the criticality of studying how patients use the Web, including the nature and dynamics of queries, and content delivered in response. There has been a large amount of research on the analysis of search behavior from search engine logs. Log analysis provides insights to understand how people engage in information seeking in online settings [51], while also having applications for tasks such as result ranking [1, 24], query suggestion [25], prediction of future search actions and interests [13, 27], and detection of real-world events and activities [44]. Given access to population-scale data on how people search for health information, this can be applied for important tasks such as the detection of influenza [19], the detection of adverse drug reactions [56], population-scale studies of nutrition [50], epidemiology [19], and studies of chronic medical conditions such as pregnancy [15]. Related to this research, but focused on activity post-diagnosis, are studies of cancer-related searching [3, 6, 21], some of which have revealed strong similarities between temporal patterns

3 in search logs and those in practice [38, 40]. Studies have leveraged online behavioral signals for early disease detection at the population level [19], and individually [11, 46]. Screening high-risk individuals for pancreatic cancer is the only practical approach to detect precancerous or cancerous changes in the pancreas at the phase in which surgical intervention will have a high chance of cure [26]. Risk level can be determined by factors such as race [9], family history [33], and a history of pancreatitis [32]. Imaging studies via methods such as endoscopic ultrasound, computer tomography scans, and magnetic resonance imaging [35, 37] have been useful to diagnose pancreatic cancer once the tumor is large enough to cause unusual, salient symptoms that induce people to seek medical attention (e.g., yellow eyes, changes in stool), but at this point the disease is more likely to be at an advanced and unresectable stage (i.e., locally advanced or metastatic, when it cannot be removed by surgery) [29]. Common, seemingly innocuous symptoms such as back pain, abdominal pain, itchy skin, unexplained weight loss and nausea (and combinations and temporal patterns of these and other symptoms) may also be observed in the query stream. Such symptom searches can provide patterns of symptoms that might one day be employed in new kinds of health surveillance systems. Such systems could be used to alert people who would otherwise not feel moved to see a healthcare professional. Active, explicit screening for early signs of pancreatic cancer is not cost e ective unless there is a reasonable probability of detecting invasive or pre-invasive disease (at least 16% according to one study [45]). A log-based methodology provides scale that is not achievable with more traditional epidemiological studies, which tend to be on the order of tens or hundreds of participants, e.g., [23, 43]. 3. DATASET CREATION We now describe the data used, starting with a description of the logs (Section 3.1). We then discuss the creation of an ontology with symptoms commonly experienced by people with pancreatic cancer (Section 3.2) and provide details on extracting pancreatic cancer and symptom searchers (Section 3.3). We review the augmentation, tagging, and filtering steps for our dataset (Section 3.4). Finally, we summarize the creation of query timelines for the positive cases (i.e., experiential diagnostic searchers who also search for pancreatic cancer symptoms) and the negative cases (i.e., those who only search for the symptoms) (Section 3.5). Since reliable labels cannot be determined for the non-experiential pancreatic cancer searchers, we exclude them to create a cleaner dataset for training and testing. We show later (see Section 5.6) that predictive performance is largely unchanged if these searchers are included as negative examples during the application of the model in a realistic scenario. 3.1 Anonymized Web Search Engine Logs Search engines track various characteristics during their interaction with users so as to better capture information needs, improve their responses, and personalize the content. Every such interaction corresponds to a log entry that includes a unique, anonymized user identifier based on a Web browser cookie. This enables the extraction of the search history comprising queries and clicks from an identifier for up to 18 months. Note that the identifier may comprise the search activity of multiple users on shared machines and does not consolidate activity from a user across multiple machines. We use the logs of a randomly-selected subset of Bing search engine users in the English-speaking United States locale from October 2013 to May 2015 inclusive. 3.2 Symptoms and Risk Factors Warning signs and symptoms for pancreatic cancer usually include generic, subtle signs and symptoms, such as abdominal and back pain, loss of appetite, and unexplained weight loss. We performed an extensive review of possible signs, symptoms, and risk factors associated with pancreatic cancer and developed an ontology with 21 categories of symptoms. This manually-curated ontology consists of two levels. The first level includes the names of the symptoms and the second level includes multiple names, synonyms, and expressions with which the corresponding symptom in the first level may appear in our data. We performed multiple iterations of refinements of this ontology to remove noise and to minimize erroneous query matches. Table 1 presents the 21 symptom categories with some representative examples of associated query expressions. Also shown are 12 risk factors and associated synonyms, derived from the literature (e.g., [31]), describing attributes, characteristics, or exposures that may increase the likelihood of pancreatic cancer. The symptoms and the risk factors are featurized in predictive models, and they are also used in policies to determine when predictive models should be applied (see Section 5.5). 3.3 Extraction of Searchers In order to identify positive and negative cases for the generation of our learned model, we built a dataset comprising two groups of users (Figure 1). The pancreatic cancer searchers group, denoted as A in the figure, includes all searchers with at least one query explicitly on pancreatic cancer (i.e., a query matches this expression [( pancreas OR pancreatic ) AND cancer ]). The symptom searchers group, denoted as C, includes all users with at least one query related to symptoms linked to pancreatic cancer, as captured by the symptoms and synonyms described in Section 3.2. Having unique identifiers for each user in the union of A and C (i.e., A [ C) permits the extraction of the full query histories of 9.2 million searchers. We first sought to remove searchers who are likely healthcare professionals (HCPs). To do this, we employed a proprietary Bing classifier that identifies health-related queries to remove users from the study for whom 20% or more of queries are health related. This threshold was based on a prior analysis of identifying health professionals in search logs [52]. 3.4 Dataset Augmentation Age and gender are important factors associated with developing pancreatic cancer [31, 36]. As such, we augmented the dataset with demographic information from proprietary search engine classifiers that estimate age (discretized as < 18, 18 24, 25 34, 35 50, or 50 85) and the gender for each user. The classifiers are trained on data where ground truth of demographic details are provided explicitly by users. The predictions are based on signals derived from searchers longterm search activity, including their search queries and Web domains of their clicked results. Since pancreatic cancer incidence rates vary by geographic location, we also annotated searchers with the U.S. state from which they searched most (based on reverse Internet provider (IP) lookup data).

4 Table 1: Ontology with symptoms, risk factors, and examples of associated synonyms for pancreatic cancer. Type Name Example synonyms back pain pain in lower back, lowback pain blood clot blood clots, thrombosis dark or tarry stool dark poop, tarry feces dark urine orange pee, brown urine enlarged gall bladder swollen gallbladder, inflamed gallbladder floating stool floating stool, floaters greasy stool greasy poop, oily feces high blood sugar frequent urination, sudden diabetes itchy skin skin itching, hands itchy yellow skin or eyes jaundice, yellow eyes Symptom light stool pale stool, white crap loss of appetite poor appetite, decreased appetite, not hungry nausea or vomiting throwing up, nauseous smelly stool stinky feces, smelly poop sudden weight loss unexplained weight loss, weight loss sudden taste changes changes in taste, dysgeusia loose stool loose feces, loose stool, diarrhea constipation constipated, backed up indigestion acid reflux, heartburn abdominal swelling or pressure swollen stomach, pressure abdomen abdominal pain belly pain, stomach ache alcoholism heavy drinking, alcoholics anonymous, alcoholic hepatitis hep b, hep c pancreatitis ulcers ulcer obesity obese, very fat, extremely fat Risk factor smoking smoker, cigarette, cigar chills or fever chills, fever multiple endocrine neoplasia men1 hereditary nonpolyposis colorectal cancer lynch syndrome, hnpcc von hippel-lindau syndrome hippel-lindau syndrome hereditary intestinal polyposis syndrome peutz-jeghers syndrome familial atypical multiple mole melanoma syndrome fammm, b-k mole syndrome Beyond the demographic information, we are also interested in the subject matter of the queries and results that were visited over searchers timelines. We augmented each query and corresponding clicked websites with their estimated Open Directory Project (ODP, dmoz.org) category. We used a text-based classifier, similar to [4], that uses logistic regression to predict the ODP categories. When optimized for the score in each category, this classifier has a micro-averaged F1 score of For queries, the ODP category is that of the top-ranked search result. The remaining users after the augmentation and filtering steps total 7.4million, from which 479, 787 are pancreatic cancer searchers. 3.5 Positive and Negative Cases We create query timelines for experiential pancreatic cancer searchers and experiential symptom searchers which we then featurize for the early detection task. Figure 2 summarizes the strategies for identifying positives and negatives. To avoid including users with very short histories, we filter out all users with less than five search sessions 1 spanning five di erent days. This reduced the population to 6.4 million users, with a mean total duration (time from first to last query for a user) of days, standard deviation (SD) of days, and interquartile range of 120 days. Positive Cases: To identify experiential pancreatic cancer users, we created a set of first-person diagnostic queries for pancreatic cancer (denoted Exp 0). Some examples of such diagnostic queries are [just diagnosed with pancreatic cancer ], [why did i get cancer in pancreas], and [i wastoldi have pancreatic cancer what to expect]. From the set of 479, 787 pancreatic cancer searchers, 3, 203 match the pattern of diagnostic queries. In order to consider them as experiential users, we require them to have searched at least for one symptom prior to the diagnosis query. This 1 Session is a query sequence with apple 30 minutes between queries [51]. Positive cases: S 0 Negative cases: S 0 α Symptom lookup period Search engine activity data over time α No pancreatic cancer searching Symptom lookup period Search engine activity data over time β Period of diagnosis Figure 2: Schematic illustration of the query timelines used in the selection of positive and negative cases. S 0 refers to the first symptom query and Exp 0 is the first experiential diagnostic query. is the duration of the symptom lookup period, which was approximately equal in the aggregate for the positives and negatives. is the duration of the period of diagnosis, set to one week in this study. step generates a set of 1, 072 query timelines of experiential searchers that contain periods of symptom lookup followed by the diagnostic query. The symptom lookup period starts when the first symptom is detected as matching terms represented in the symptom ontology. For positive cases, the symptom lookup period completes at least one week before diagnosis (i.e., we consider the week before the diagnostic query as the period of diagnosis in real life and do not count queries in that time since they may be polluted with ground truth signals). For reliability, especially given the need to compute Temporal features (Section 4.2), the minimum time period for which our features are computed is four weeks. Negative Cases: To generate the negative set, we sample from the users who searched for pancreatic cancer symptoms but did not search for pancreatic cancer anywhere in their timeline (i.e., C \ A), either before or after the symptom lookup period. Performing this additional check on data Exp 0

5 Table 2: Summary statistics, namely, mean (M), standard deviation (SD), and number of cases (N), of durations and numbers of queries in positive and negative datasets. Class Duration (days) Total # queries N M SD M SD Positives ,072 Negatives ,025,046 from outside the lookup period is required to increase the likelihood that the negative cases are indeed negative. Users who searched for pancreatic cancer and its symptoms, but did not issue an experiential query (the gray subset in Figure 1 representing (A \ C) \ B), were excluded since a label could not be reliably determined. In Section 5.6 we describe an additional experiment where pancreatic cancer searchers were included during model testing. We were concerned that rudimentary behavioral di erences that may reflect artifacts in the data could invalidate the learning task. For example, if our experiential users were just more active generally, then a feature that computed the total number of queries would have strong predictive value, yet would be uninteresting scientifically. We sought to address this by downsampling the negative cases to attain a similar distribution of symptom lookup periods in terms of the temporal duration and query volume as observed for the positive cases. 2 We did this by selecting users with a symptom lookup period duration within three standard deviations of the mean of the positive cases. This reduces the number of negative cases to 3,025,046. Table 2 presents the summary statistics on the symptom lookup periods in terms of the number of days and the number of queries in the two datasets. The table shows that the distributions for positive and negative cases (in terms of number of days and number of queries) are similar. The distributions are statistically indistinguishable using two-sample Kolmogorov-Smirnov tests for temporal duration (D = 0.005;p = ) and number of queries (D =0.003; p =0.7681), even though the latter was not a filtering criterion. We note that query timelines are not aligned: the absolute point in time where people issue the experiential diagnostic query, and the accompanying symptom lookup period can di er between searchers. 4. EARLY DETECTION We now present the problem, summarize features extracted from query timelines, and review the prediction model. 4.1 Problem Description We address the problem of early detection of experiential searchers for pancreatic cancer via anonymized Web search engine logs. We cast this as a binary classification task, where the model is trained on features extracted from search log query timelines of experiential pancreatic cancer searchers and symptom-only searchers. We focus on maintaining very low false-positive rates (i.e., 1 misprediction in 100k correctly identified cases) while keeping high the imbalance ratio of positive and negative cases (i.e., one thousand positives vs. millions of negative cases); these properties are important for potential future large-scale real-world applications such as an alerting mechanism in search engines. 2 We could also have addressed this by making all features relative percentages. Sampling gave us more flexibility in feature construction. As an additional check, we included features such as the number of queries in the symptom lookup period; those were found to carry little evidential weight in the learned model. 4.2 Features We now describe the features extracted from query timelines. We group our features into five di erent categories: (i) demographic information about the user; (ii) characteristics about user sessions, query classes, and URL classes; (iii) characteristics about symptoms; (iv) features that capture the temporal dynamics, and (v) risk factors. Demographics: Cancer statistics from the U.S. National Cancer Institute 3 show that pancreatic cancer is more common with increasing age, is slightly more common in men than in women, and varies by geographic location. As such, we develop features related to the demographics of the users. In particular, we use the estimated age bucket and gender (see Section 3.4) along with the classifier s probabilities as confidence values. The dominant location (U.S. state) of a searcher is also included as a feature. Search Characteristics: People express their information needs and preferences through queries and click behavior (i.e., the website visits). We extract various features to capture these search and retrieval activities. As we discussed previously, the queries, as well as the visited websites, were tagged with their ODP category in an attempt to identify domains of interest (see Section 3.4). A first set of features SearchHistory contains several generic statistics, such as counts, ratios, and percentages, which are characteristics of the global behavior of the user. For example, we compute the number of queries, sessions, and clicks, as well as ratios of clicks per query for each user. Then, we compute a large number of features with respect to the ODP categories of queries QueryTopic, clicked search results URLTopic, and the combination QueryURLTopic. These include compute counts and percentage of queries and sessions in each ODP category, the average time until queries appear in the same category, as well as the time of the day that queries appear in each category. Similar features are also computed for each category of the visited websites (e.g., counts, ratios, and percentages of visited websites that belong to each category). We additionally compute features to characterize the user sessions, including features that capture the click behavior of users associated with queries. For example, we compute counts and percentages of all the combinations of query categories that led to visits in website categories. Symptoms: Features described above attempt to capture generic characteristics from user sessions. However, for the problem of interest, we seek to also leverage features from queries containing terms captured in the symptom ontology for pancreatic cancer (Section 3.2). The symptom features are divided into two classes: (i) SymptomGeneric and (ii) SymptomSpecific. Generic symptom features contain counts and percentages for the queries and sessions matching symptoms in our ontology, the average time between symptom queries, as well as the average number of symptom queries that are issued daily. Specific symptom features are generated per symptom category. For example, for each symptom, we compute counts and percentages of appearance, the time between distinct symptoms, and the time of day such symptom queries are issued. As with the user session features, we combine symptoms to capture the click behavior and, hence, we compute counts and percentages of each symptom query leading to a visit on a website belonging to particular ODP categories. Finally, we define features that capture the se- 3

6 Table 3: Performance at four-week intervals for users where features can be computed from Exp 0 1weektoExp 0 21 weeks. Values averaged across the ten folds of crossvalidation. The significance of di erences in AUROC and TPR using paired t-tests for each week versus Exp 0 1 indicated * p<0.01, ** p<0.001, and *** p< Weeks denote lead time before Exp 0 ( in Figure 2). Weeks before TPR (as %) at FPRs ranging from AUROC Exp week weeks weeks * * * 13 weeks * * * * 17 weeks * * ** ** ** 21 weeks 6.528* 9.199* ** ** *** ** quence in which symptoms appear in query timelines. Temporal: All previous features produce aggregated statistics over the full time window under consideration. Following [34], we include a set of features to capture the temporal variation of these statistics over misaligned query timelines with noise and missing values. For every feature, we generate a time series with points that represent aggregated values for intervals of the time window. For example, each feature can be computed per month, per week, or per day, depending on the level of granularity we seek to capture. Since the occurrence of specific features can be sparse, we set the time window to four weeks for temporal features. For features that are not percentages or ratios, we also compute the cumulative time series. For each time series, we use the first coe cient of the linear least-square estimates to devise features that capture the trend (i.e., increasing, decreasing, and unchanged) and the rate of change (i.e., slope). Risk Factors: This class contains features related to the presence of terms representing risk factors in the symptom lookup period. For each risk factor, we note its presence or absence, and also the number of queries containing that risk factor and the fraction of all queries from that user that these risk factor queries represents. Total number of distinct risk factors in the symptom lookup period is also a feature. 4.3 Prediction Model The prediction model uses the features outlined in the previous section, computed for each searcher, to make predictions about the future occurrence of experiential diagnostic searches in each searcher s query timeline. We use gradient boosted trees [17], which employ an ensemble of decision trees to construct a better learned model. Advantages include the ability to capture non-linear relationships, model interpretability (e.g., a ranked list of important features is generated), facility for rapid training and testing, and robustness against noisy labels and missing values. We experimented with di erent learning algorithms, but gradient boosted trees yielded superior accuracy. 5. EXPERIMENTAL RESULTS We now present the findings of our experiments. We report the overall performance in Section 5.1 and the performance as we increase the lead time before the first experiential diagnostic query (Section 5.2). We then inspect the model to understand the contributions that each feature makes towards early detection (Section 5.3) and the performance of di erent feature classes (Section 5.4). We examine the e ect on model performance of conditioning on symptoms and risk factors (Section 5.5) and consider a realistic deployment scenario (Section 5.6). We use the area True positive rate False positive rate Weeks before 1 week 5 weeks 9 weeks 13 weeks 17 weeks 21 weeks Figure 3: Average partial ROC curves in the FPR range , for models learned using data up to 21 weeks before the first experiential diagnostic query (error bars excluded for clarity). Variance in FPR and TPR is minor. under the receiver operating characteristic curve (AUROC) and recall (TPR, true positive rate) at fixed, extremely low false positive rates (FPRs) as our primary evaluation metrics. We applied 10-fold cross validation, stratified by user to evaluate the generalizability of the model when applied to new users. Significance level is p<0.05 unless stated. 5.1 Overall The overall performance of the classifier in making predictions based on data up to the beginning of the period of diagnosis (i.e., Exp 0 1week)inAUROCis Given that low error rates would be vital in practice to avoid unnecessary patient alarm, we focus on the true positive rate (fraction of all positives that are recalled by the model) at low false positive rates (FPR). Focusing on FPR in the range , the model is able to recall 5 30% of the positive cases, depending on the specific FPR. We see this performance as promising given the limited information (primarily search-related activity) available to the model. 5.2 Performance by Week A key part of early detection is being able to predict the emergence of the disease well in advance. To understand how prediction performance changed as we move further back in time before the first experiential diagnostic query we selected the set of 337 positive searchers and 945,394 negative searchers who were still observed in the logs many weeks prior to the experiential diagnostic query. We report results from one week before the experiential diagnostic query, all the way up to 21 weeks before the diagnostic query. To count as being present at Exp 0 21 weeks, a searcher needs to have symptom queries extending back at least four weeks before that point (i.e., to Exp 0 25 weeks, or approximately six months before the first experiential diagnostic query). We trained a model for these users in the same way as we did for Section 5.1. The ratio between positive and negative searchers remains similar to that for all users (i.e., approximately 1:3000). Table 3 reports the TPR at di erent false positive rates for this same set of users at di erent four-week increments, as well as the AUROC. The general trend is that the performance drops fairly consistently as we increase the lead time, but even 20 or so weeks before the first experiential diagnostic query the predictive perfor-

7 Table 4: Top 10 features by evidential weight relative to the top feature. Positive or Negative direction means that a feature correlates positively or negatively, respectively, with the rise of experiential queries. Feature Weight Direction Class NumOfDistinctSymptoms Positive SymptomGeneral NumOfQueriesInHealthCategory Positive QueryTopic NumOfDistinctSymptomsVariants Positive SymptomGeneral AgeClassProbability Positive Demographic HasBackPain Negative SymptomSpecific HasIndigestion Negative SymptomSpecific HasIndigestionThenAbdominalPain Positive Temporal SlopeNumOfDistinctSymptoms Positive Temporal HasBackPainThenYellowSkinOrEyes Positive Temporal AgeClassProbability LT Negative Demographic mance is still quite strong (AUROC=0.8315, TPR=6.528% at FPR= ). Assuming that pancreatic cancer progresses steadily from stage I to stage IV in just over one year (as has been previously reported [59]), accurate predictions 20 weeks in advance of the diagnostic period could lead to a sizable increase in the five-year survival rate (e.g., moving the point of diagnosis from Stage III to Stage II could increase the survival rate from 3% to 5 7% [47]). Focusing on the FPR region from 0 to 0.01 (i.e., false positives occur less than 1 in 100 times) and visualizing that part of the ROC curve (Figure 3) we observe some clear di erences in the performance of the models in this important region. The average normalized partial AUROC ranges from (Exp 0 1week)to0.231(Exp 0 21weeks). All di erences in AUROC for Exp 0 5 or more weeks versus Exp 0 1 week are significant (p <0.01 using paired t-tests). 5.3 Feature Contributions In addition to understanding the overall performance, we are also interested in understanding the features that are most important in the learned model for predicting the future issuance of experiential queries. Table 4 shows the top 10 features with the highest weight, along with their weight relative to the top-ranked feature (NumOfDistinct- Symptoms) and the feature class. The direction is based on the correlation between the feature value and the labels in the training data, using Pearson biserial correlation or the phi coe cient, depending on whether or not the feature data is binary. Table 4 shows that there is a broad range of features. The number of distinct pancreatic cancer symptoms was the most important feature. Temporal features representing changes over time and sequence ordering of symptom pairs are also important. Age is important, and it is positively correlated if the searcher is older and is negatively correlated if they are younger. Individual symptom features related to back pain and indigestion are important but have a negative influence on predicting future experiential queries, likely because (i) there are many explanations for why these symptoms appear in a query timeline, and (ii) they are positive for many negative cases (16.7% of negatives search for back pain, 7.4% search for indigestion). 5.4 Feature Classes Beyond the individual features, we can also consider the accuracy of the models based on feature classes. This can be particularly important when some classes of features are easy to obtain in practice, e.g., the demographic features may be available for all searchers without the need to perform temporal modeling of query patterns. Table 5 presents the AUROC for models trained on each of the feature classes. The findings show that the Temporal class is particularly important, signifying the key role of temporal dynamics for this prediction task. The model is still accurate solely with access Table 5: Performance of individual feature classes (as AU- ROC and TPR at a FPR of ) averaged across the 10 experimental folds. The performance di erences against the Overall performance are statistically significant using paired t-tests at * p<0.01, ** p<0.001, and *** p< Feature Class AUROC TPR (as a %) at FPR= Demographic *** 0.280%*** RiskFactor *** 0.653%** SearchHistory *** 1.399%** QueryURLTopic ** 2.052%** SymptomGeneral ** 2.146%** URLTopic ** 2.332%* SymptomSpecific * 2.800%* QueryTopic * 2.892%* Temporal * 2.985%* Overall % to demographics and basic features about general searching. However, performance improves considerably if we consider the specifics of the symptoms searched (SymptomSpecific) or the topics of the queries and results clicked (QueryTopic). 5.5 Symptoms and Risk Factors We also considered the impact of the presence of symptoms and risk factors on the performance of the model. Symptoms: We filtered the positive and negative cases to those where a symptom was present in query timelines. Risk factors: These are risk factors corresponding to the presence of factors such as pancreatitis, smoking, and obesity, as well as cancer syndromes such as hereditary intestinal polyposis syndrome or familial atypical multiple mole melanoma syndrome (i.e., genetic disorders that predispose individuals to develop pancreatic cancer), all of which have been shown to lead to increased likelihood of developing pancreatic cancer [18, 20, 32, 48]. Recall that our cross-validation was stratified by user. During cross validation, we learned a model on the users in the training folds and then for testing we limited to users with evidence of the specific symptoms or risk factors in their search history prior to the experiential diagnostic query. In each case the number of positives and negatives is less than the full set. 4 Table 6 presents statistics on the performance for each model where the number of positive examples was at least 10 (to help ensure that AUROC calculations were meaningful). The table also presents TPRs at di erent false positive rates, as well as the percentage of positive or negative cases that have the symptom or risk factor searches. Finally, the last three columns shows the estimated number of true positives (capture) and false positives (cost) that would be observed, assuming a FPR of , and the associated capture-cost ratio. Ideal targets for rates of capture versus cost in a deployed service can be derived via a decision analysis that considers the net expected value of the early detection and the expected costs of unnecessary anxiety. Such an optimization would leverage a careful characterization of the value of early intervention and details of designs of methods for engaging people. Table 6 shows that focusing on users who search for risk factors such as smoking, hepatitis, and obesity leads to better overall performance. There were fewer than ten users searching for each of the cancer syndromes (e.g., hereditary nonpolyposis colorectal cancer) and, hence, they were excluded from Table 6. Focusing on the percentage of posi- 4 An alternative would be to train a separate model for symptom or risk factor. An issue with doing that is there are insu cient positive examples about each dataset with which to train a robust model.

8 Table 6: Performance of the models conditioned on a variety of symptom (S) and risk factors (RF). Values below the dashed line have a higher AUROC than Overall. Capture represents the number of TP cases in the cohort of positives [ negatives at FPR= Cost is computed as the target FPR (.00001) multiplied by the size of the negative set in each subgroup. Since this exact FPR may not be attainable in each subgroup, cost may not be an integer. A capture-cost ratio of > 1.0 means that more people would benefit from an alert than would be mistakenly alerted. Statistically significant di erences with Overall model (DeLong s test [12]) are marked ** p<0.001 and *** p< (where following a Bonferroni correction is 0.002). Symptom or Risk Factor TPR at FPRs ranging from AUROC # pos (%) # neg (%) FPR = Capture Cost Capture/Cost Dark or tarry stool (S) *** 13 (1.2%) 58,597 (1.9%) Abdominal swelling (S) *** 24 (2.2%) (1.5%) Ulcers (RF) *** 38 (3.5%) 16,081 (0.5%) Dark urine (S) ** 18 (1.7%) 51,236 (1.7%) Pancreatitis (RF) ** 33 (3.1%) 34,184 (1.1%) Abdominal pain (S) ** 130 (12.1%) 311,266 (10.3%) Enlarged gallbladder (S) ** 113 (10.5%) 98,454 (3.3%) Constipation (S) ** 85 (7.9%) 317,300 (10.5%) Smoking (RF) (2.4%) 27,817 (0.9%) Blood clot (S) (8.3%) 351,385 (11.6%) High blood sugar (S) (30.4%) 429,543 (14.2%) Nausea or vomiting (S) (11.7%) 639,502 (21.1%) Chills or fever (RF) (10.3%) 357,536 (11.8%) Loose stool (S) (6%) 74,720 (2.5%) Indigestion (S) (9.9%) 504,462 (16.7%) Itchy skin (S) (1.5%) 79,448 (2.6%) Back pain (S) (13.2%) 223,586 (7.4%) Yellow skin or eyes (S) (8.6%) 85,805 (2.8%) Hepatitis (RF) (3.6%) 25,158 (0.8%) Alcoholism (RF) ** 48 (4.5%) 32,333 (1.1%) Obesity (RF) ** 29 (2.7%) 22,153 (0.7%) Overall ,072 (100%) 3,025,046 (100%) tives and negatives that contain each of the symptoms or risk factors, we observe that there are some that are much more likely to occur in positives (e.g., pancreatitis and smoking are 5.9 and 3.6 times as likely, respectively). Focusing on the utility, we find that if we set the FPR to , overall we would find 52 positives in the union of positives and negatives at the expense of 30 negatives, who would be altered mistakenly. There are some symptoms and risk factors for which the capture-cost is more favorable. For example, in the case of alcoholism or obesity, we would find times as many TPs as FPs. There are others symptoms such as nausea or vomiting, or chills or fever, where the costs in mistakenly alerting users equal or outweigh the benefits. Presence of symptoms or risk factors could help decide whether to apply early detection models for a searcher. 5.6 Applying Learned Model in Practice Up to now, our model considers experiential diagnostic users as positives and symptom-only users as negatives. This is a clean dataset for algorithm training and testing but it ignores the symptom searchers who issue non-experiential pancreatic cancer searches (gray region in Figure 1). These users may have been diagnosed or may simply be exploring. Regardless, they should be considered in practice. We perform an additional experiment on a separate set of symptom searchers that included non-experiential pancreatic cancer searchers as negatives. We trained a model on all data described thus far and applied it to identify (i) experiential and (ii) experiential+treatment users in a new held out dataset in advance of their first experiential diagnostic query. We generated the test set from logs of a separate randomly selected subset of Bing users, over an 18-month period from August 2014 to January 2016 inclusive. There was no overlap in users with the set used for training. We identified positive cases as earlier and expanded the definition of negatives to include pancreatic cancer searchers. This resulted in 2.9 million negatives, including 48,221 nonexperiential pancreatic cancer searchers, and 945 experiential searchers with preceding symptom searches. To help target the identification of cases where experiential queries are issued, we created a subset of the positives who issued treatment-related queries following Exp 0 (e.g., whipple procedure, 5-fu); in total, 494 users (52%) met this requirement. Table 7: Average AUROC and average TPR (as %) at FPR= for identifying experiential users and experiential+treatment users from held out dataset. Di erences in AUROC and TPR between Exp 0 1 week and other weeks noted (*** p<0.001, ** p<0.001, and * p<0.01). Weeks before Experiential Experiential+Treatment Exp 0 AUROC TPR (%) AUROC TPR (%) 1week weeks * weeks ** 8.278* ** * 13 weeks ** 8.018** ** ** 17 weeks *** 7.666** *** *** 21 weeks *** 7.438*** *** 9.645*** The symptom lookup durations for positives and negatives were similar to Section 3.5. We randomly split the test data into ten equally-sized subsets for significance testing. Table 7 reports the predictive performance at di erent lead times. Table 7 shows that the performance of the model remains strong on this held-out set and is comparable to that reported in the earlier sections. The performance decreases with increased lead time as noted previously (see Table 3). Interestingly, the performance in identifying the subset of experiential diagnostic users who subsequently searched for treatments is higher than for the experiential-only set. This is promising confirmatory evidence as these users are assumed to have experienced a cancer diagnosis, per definitions of experiential queries [40] DISCUSSION AND CONCLUSIONS We studied the potential feasibility of learning from search engine logs to predict future issuance of experiential queries about pancreatic cancer at a low error rate. The success of these methods has implications for online methods that would provide passive screening of searchers to provide early warning about potential signs of pancreatic cancer and other devastating diseases. We discovered that conditionalization on di erent symptoms and risk factors can enhance predictive power. We found capture-cost tradeo s associated with 5 For completeness, we also trained a model on all data from earlier in the paper, including the non-experiential pancreatic cancer searchers as negative cases, and tested it on this held-out set. The performance is around 5% lower than reported in Table 7, for both AUROC and TPR, across all weeks. Including the non-experiential pancreatic cancer searchers may add noise to model training.

The incidence of pancreatic cancer is rising in India and is higher in the urban male population in the western and northern parts of India.

The incidence of pancreatic cancer is rising in India and is higher in the urban male population in the western and northern parts of India. Published on: 9 Jun 2015 Pancreatic Cancer What Is Cancer? The body is made up of cells, which grow and die in a controlled way. Sometimes, cells keep on growing without control, causing an abnormal growth