Detecting Devastating Diseases in Search Logs

Size: px
Start display at page:

Download "Detecting Devastating Diseases in Search Logs"

Transcription

1 Detecting Devastating Diseases in Search Logs John Paparrizos Columbia University Ryen W. White Microsoft Research Eric Horvitz Microsoft Research ABSTRACT Web search queries can o er a unique population-scale window onto streams of evidence that are useful for detecting the emergence of health conditions. We explore the promise of harnessing behavioral signals in search logs to provide advance warning about the presence of devastating diseases such as pancreatic cancer. Pancreatic cancer is often diagnosed too late to be treated e ectively as the cancer has usually metastasized by the time of diagnosis. Symptoms of the early stages of the illness are often subtle and nonspecific. We identify searchers who issue credible, first-person diagnostic queries for pancreatic cancer and we learn models from prior search histories that predict which searchers will later input such queries. We show that we can infer the likelihood of seeing the rise of diagnostic queries months before they appear and characterize the tradeo between predictivity and false positive rate. The findings highlight the potential of harnessing search logs for the early detection of pancreatic cancer and more generally for harnessing search systems to reduce health risks for individuals. Keywords Health screening; Search logs; Behavioral data; Digital disease detection; Pancreatic cancer; Temporal analysis 1. INTRODUCTION Web search is a primary resource for people concerned about the significance of health-related symptoms [16]. Researchers have studied symptom and illness-related searches in pursuit of insights about how people search about health concerns, including patterns of querying and review of information in pursuit of diagnoses [54], healthcare utilization signals [55], traces of therapeutic decision making for challenging illnesses [40], and identification of new adverse effects of medications [56, 57]. Prior studies have examined how population-level signals in social media can be used to detect the emergence of diseases [10, 19]. Work conducted during a Microsoft Research internship. John Paparrizos is an Alexander S. Onassis Foundation Scholar. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. KDD 16, August 13-17, 2016, San Francisco, CA, USA c 2016 ACM. ISBN /16/08... $15.00 DOI: Pancreatic cancer symptom searchers (C) Negative cases Pancreatic cancer searchers (A) Experiential pancreatic cancer searchers (B) Positive cases Searchers in A who query for symptoms but have no experiential queries. Figure 1: Venn diagram depicting sets of users employed in analyses: pancreatic cancer searchers (A), pancreatic cancer searchers exhibiting experiential diagnostic queries (B), and those who search for the symptoms of pancreatic cancer (C). A[C (i.e., number of users in original, pre-filtered dataset) is 9.2 million. Positives and negatives are sourced from B\C and C \ A, respectively. Relative set sizes are not to scale. We explore the prospect of harnessing anonymized longterm sequences of health-related search queries to yield information that could provide valuable signals for detection of illness in advance of traditional diagnosis. Leveraging online behavioral data to provide earlier detection of a disease or of the raised risk of illness on a large scale can make significant contributions to healthcare. Better outcomes can be achieved by earlier confirmation of illnesses and risks via gaining access to more timely diagnoses, treatments, and other proactive interventions. As an example, such capabilities might help to identify those at significant risk of suffering the onset of advance of chronic disease processes such as diabetes or heart disease or the rise of acute processes such as atrial fibrillation or more severe cardiac arrhythmias. Interventional programs ranging from changes in diet or exercise to taking (or avoiding) certain medications can yield significant health benefits. The diagnosis for certain medical conditions can be particularly devastating if the chances of survival are typically low at the time of diagnosis. Survival rates may be improved significantly via earlier detection and treatment. With many cancers, screening methods can be e ective for early detection and therapy [41], but they involve explicit testing prompted by policies around risk factors such as family history [33] or medical history [14, 32]. Ideally, health screening systems would be able to observe people passively as they engage in their normal activities and alert them (per their preferences on vigilance) to potential health risks with-

2 out requiring the investment of time and e ort in special screening activities. Individual-level postings on social media have been mined for this purpose [10, 46], but may not have representative coverage of symptoms associated with social stigma [11]. We present a study of the feasibility of doing early detection of devastating diseases based on large-scale logs of health-related Web search activity. We consider the content of queries over time, and the prospect that temporal relationships and patterns among queries over multiple sessions over several months provide subtle fingerprints of lurking illness. We focus on the early detection of the presence of pancreatic cancer, a devastating diagnosis given the typical progression of the disease to an inoperable situation by the time it is found. Our work showcases the potential value of innovative approaches for speeding up the time to diagnosis of this deadly disease. Pancreatic cancer is the fourth leading cause of cancer death in men and women in the United States and the sixth leading cause in Europe [36]. The illness is widely known as being di cult to detect and is frequently diagnosed too late to be treated e ectively [22, 30]. A recent study found that the progression of pancreatic cancer from stage I to stage IV happens in just over one year [59]. Approximately 75% of pancreatic cancer patients die within a year of diagnosis, and only about 4% survive for five years post diagnosis. Exploration of the possibility that a patient has pancreatic cancer involves a careful and costly consideration of history, labwork, and imaging studies (in contrast to the passive screening methods described in this paper) [45]. Screening is largely performed to detect the disease at an early phase (pre-invasive or early invasive) when it is still curable by surgical intervention and chemotherapy. Earlier diagnosis of pancreatic cancer improves the feasibility of discovering the illness at an earlier stage [7]. For patients diagnosed early who undergo curative surgery (e.g., a Whipple procedure), five-year survival rate is higher, but it remains less than 25% [58]. We take as a proxy for ground truth of a diagnosis of pancreatic cancer the detection of experiential diagnostic queries issued by searchers. Experiential queries show strong evidence of being linked to the actual presence of symptomatology or conditions versus less directly involved, more distant exploratory queries seeking information about symptoms or diseases [40]. Experiential diagnostic queries for pancreatic cancer are identified via consideration of the structure of queries and of patterns of information gathering over multiple users in search logs. Experiential queries often include first-person assertions such as [i wasjustdiagnosedwithpancreatic cancer], which when associated with prior queries about symptoms identifies the positive cases. We construct models to predict the future rise of experiential queries from longitudinal search data. Figure 1 shows the di erent subsets of users in our analysis, including people who search for pancreatic cancer (A), the subset of these searchers who issue experiential diagnostic queries (B), and those who search for a set of symptoms linked to pancreatic cancer (C). Those who only search for one or more related symptoms with no evidence of pancreatic cancer searching constitute the negative cases. We find that our methods can detect cases where people show evidence of being diagnosed with pancreatic cancer many months in advance of their experiential diagnostic queries. We make the following contributions with this research: Introduce early detection of diseases as a promising new application of search log mining and machine learning that scales to millions of searchers. Present a case study on the early detection of pancreatic cancer from longitudinal individual search activity. Forecast with significant lead times that users will later input experiential queries for pancreatic cancer. Explore the influence of di erent factors, such as the lead time or the presence of specific symptoms in the search activity, on the predictive performance of our learned models, including true positive rates when false positive rates are strictly controlled. Controlling false positives is especially important to reduce unnecessary costs and concerns given potential future applications such as providing early warnings and suggestions to searchers about undertaking more formal screenings. We now describe related research in this important area. 2. RELATED WORK Related research in a number of areas is relevant to our work. These include (i) health searching; (ii) large-scale analysis of search behavior; and (iii) methods for the early detection of disease, with a focus on pancreatic cancer. The Web is an important source of health-related information for many people. To better understand how people pursue health information, studies have examined online health search using a variety of methods, including interviews [42], surveys [49], and analyses of large-scale search log data [2, 5]. According to a 2013 survey, 59% of American adults had used the Web to find health information in the year preceding the survey, 35% of those adults engaged in self-diagnosis, and over half of these self-diagnosing searchers then discussed the matter with a clinician following the search [16]. Despite the potential benefits, concerns have been raised about the quality of online health information [8]. In a large-scale survey of the use of search for self-diagnosis, White and Horvitz [53] found that almost 40% of participants experienced increased anxiety from searching health information online. Studies have characterized problems with symptom search, including the influence of poor accounting for base rates of diseases and people s bias to focus on results covering serious illnesses versus more likely benign explanations. Such biases can lead to inappropriate anxiety [28, 53] and highlight the criticality of studying how patients use the Web, including the nature and dynamics of queries, and content delivered in response. There has been a large amount of research on the analysis of search behavior from search engine logs. Log analysis provides insights to understand how people engage in information seeking in online settings [51], while also having applications for tasks such as result ranking [1, 24], query suggestion [25], prediction of future search actions and interests [13, 27], and detection of real-world events and activities [44]. Given access to population-scale data on how people search for health information, this can be applied for important tasks such as the detection of influenza [19], the detection of adverse drug reactions [56], population-scale studies of nutrition [50], epidemiology [19], and studies of chronic medical conditions such as pregnancy [15]. Related to this research, but focused on activity post-diagnosis, are studies of cancer-related searching [3, 6, 21], some of which have revealed strong similarities between temporal patterns

3 in search logs and those in practice [38, 40]. Studies have leveraged online behavioral signals for early disease detection at the population level [19], and individually [11, 46]. Screening high-risk individuals for pancreatic cancer is the only practical approach to detect precancerous or cancerous changes in the pancreas at the phase in which surgical intervention will have a high chance of cure [26]. Risk level can be determined by factors such as race [9], family history [33], and a history of pancreatitis [32]. Imaging studies via methods such as endoscopic ultrasound, computer tomography scans, and magnetic resonance imaging [35, 37] have been useful to diagnose pancreatic cancer once the tumor is large enough to cause unusual, salient symptoms that induce people to seek medical attention (e.g., yellow eyes, changes in stool), but at this point the disease is more likely to be at an advanced and unresectable stage (i.e., locally advanced or metastatic, when it cannot be removed by surgery) [29]. Common, seemingly innocuous symptoms such as back pain, abdominal pain, itchy skin, unexplained weight loss and nausea (and combinations and temporal patterns of these and other symptoms) may also be observed in the query stream. Such symptom searches can provide patterns of symptoms that might one day be employed in new kinds of health surveillance systems. Such systems could be used to alert people who would otherwise not feel moved to see a healthcare professional. Active, explicit screening for early signs of pancreatic cancer is not cost e ective unless there is a reasonable probability of detecting invasive or pre-invasive disease (at least 16% according to one study [45]). A log-based methodology provides scale that is not achievable with more traditional epidemiological studies, which tend to be on the order of tens or hundreds of participants, e.g., [23, 43]. 3. DATASET CREATION We now describe the data used, starting with a description of the logs (Section 3.1). We then discuss the creation of an ontology with symptoms commonly experienced by people with pancreatic cancer (Section 3.2) and provide details on extracting pancreatic cancer and symptom searchers (Section 3.3). We review the augmentation, tagging, and filtering steps for our dataset (Section 3.4). Finally, we summarize the creation of query timelines for the positive cases (i.e., experiential diagnostic searchers who also search for pancreatic cancer symptoms) and the negative cases (i.e., those who only search for the symptoms) (Section 3.5). Since reliable labels cannot be determined for the non-experiential pancreatic cancer searchers, we exclude them to create a cleaner dataset for training and testing. We show later (see Section 5.6) that predictive performance is largely unchanged if these searchers are included as negative examples during the application of the model in a realistic scenario. 3.1 Anonymized Web Search Engine Logs Search engines track various characteristics during their interaction with users so as to better capture information needs, improve their responses, and personalize the content. Every such interaction corresponds to a log entry that includes a unique, anonymized user identifier based on a Web browser cookie. This enables the extraction of the search history comprising queries and clicks from an identifier for up to 18 months. Note that the identifier may comprise the search activity of multiple users on shared machines and does not consolidate activity from a user across multiple machines. We use the logs of a randomly-selected subset of Bing search engine users in the English-speaking United States locale from October 2013 to May 2015 inclusive. 3.2 Symptoms and Risk Factors Warning signs and symptoms for pancreatic cancer usually include generic, subtle signs and symptoms, such as abdominal and back pain, loss of appetite, and unexplained weight loss. We performed an extensive review of possible signs, symptoms, and risk factors associated with pancreatic cancer and developed an ontology with 21 categories of symptoms. This manually-curated ontology consists of two levels. The first level includes the names of the symptoms and the second level includes multiple names, synonyms, and expressions with which the corresponding symptom in the first level may appear in our data. We performed multiple iterations of refinements of this ontology to remove noise and to minimize erroneous query matches. Table 1 presents the 21 symptom categories with some representative examples of associated query expressions. Also shown are 12 risk factors and associated synonyms, derived from the literature (e.g., [31]), describing attributes, characteristics, or exposures that may increase the likelihood of pancreatic cancer. The symptoms and the risk factors are featurized in predictive models, and they are also used in policies to determine when predictive models should be applied (see Section 5.5). 3.3 Extraction of Searchers In order to identify positive and negative cases for the generation of our learned model, we built a dataset comprising two groups of users (Figure 1). The pancreatic cancer searchers group, denoted as A in the figure, includes all searchers with at least one query explicitly on pancreatic cancer (i.e., a query matches this expression [( pancreas OR pancreatic ) AND cancer ]). The symptom searchers group, denoted as C, includes all users with at least one query related to symptoms linked to pancreatic cancer, as captured by the symptoms and synonyms described in Section 3.2. Having unique identifiers for each user in the union of A and C (i.e., A [ C) permits the extraction of the full query histories of 9.2 million searchers. We first sought to remove searchers who are likely healthcare professionals (HCPs). To do this, we employed a proprietary Bing classifier that identifies health-related queries to remove users from the study for whom 20% or more of queries are health related. This threshold was based on a prior analysis of identifying health professionals in search logs [52]. 3.4 Dataset Augmentation Age and gender are important factors associated with developing pancreatic cancer [31, 36]. As such, we augmented the dataset with demographic information from proprietary search engine classifiers that estimate age (discretized as < 18, 18 24, 25 34, 35 50, or 50 85) and the gender for each user. The classifiers are trained on data where ground truth of demographic details are provided explicitly by users. The predictions are based on signals derived from searchers longterm search activity, including their search queries and Web domains of their clicked results. Since pancreatic cancer incidence rates vary by geographic location, we also annotated searchers with the U.S. state from which they searched most (based on reverse Internet provider (IP) lookup data).

4 Table 1: Ontology with symptoms, risk factors, and examples of associated synonyms for pancreatic cancer. Type Name Example synonyms back pain pain in lower back, lowback pain blood clot blood clots, thrombosis dark or tarry stool dark poop, tarry feces dark urine orange pee, brown urine enlarged gall bladder swollen gallbladder, inflamed gallbladder floating stool floating stool, floaters greasy stool greasy poop, oily feces high blood sugar frequent urination, sudden diabetes itchy skin skin itching, hands itchy yellow skin or eyes jaundice, yellow eyes Symptom light stool pale stool, white crap loss of appetite poor appetite, decreased appetite, not hungry nausea or vomiting throwing up, nauseous smelly stool stinky feces, smelly poop sudden weight loss unexplained weight loss, weight loss sudden taste changes changes in taste, dysgeusia loose stool loose feces, loose stool, diarrhea constipation constipated, backed up indigestion acid reflux, heartburn abdominal swelling or pressure swollen stomach, pressure abdomen abdominal pain belly pain, stomach ache alcoholism heavy drinking, alcoholics anonymous, alcoholic hepatitis hep b, hep c pancreatitis ulcers ulcer obesity obese, very fat, extremely fat Risk factor smoking smoker, cigarette, cigar chills or fever chills, fever multiple endocrine neoplasia men1 hereditary nonpolyposis colorectal cancer lynch syndrome, hnpcc von hippel-lindau syndrome hippel-lindau syndrome hereditary intestinal polyposis syndrome peutz-jeghers syndrome familial atypical multiple mole melanoma syndrome fammm, b-k mole syndrome Beyond the demographic information, we are also interested in the subject matter of the queries and results that were visited over searchers timelines. We augmented each query and corresponding clicked websites with their estimated Open Directory Project (ODP, dmoz.org) category. We used a text-based classifier, similar to [4], that uses logistic regression to predict the ODP categories. When optimized for the score in each category, this classifier has a micro-averaged F1 score of For queries, the ODP category is that of the top-ranked search result. The remaining users after the augmentation and filtering steps total 7.4million, from which 479, 787 are pancreatic cancer searchers. 3.5 Positive and Negative Cases We create query timelines for experiential pancreatic cancer searchers and experiential symptom searchers which we then featurize for the early detection task. Figure 2 summarizes the strategies for identifying positives and negatives. To avoid including users with very short histories, we filter out all users with less than five search sessions 1 spanning five di erent days. This reduced the population to 6.4 million users, with a mean total duration (time from first to last query for a user) of days, standard deviation (SD) of days, and interquartile range of 120 days. Positive Cases: To identify experiential pancreatic cancer users, we created a set of first-person diagnostic queries for pancreatic cancer (denoted Exp 0). Some examples of such diagnostic queries are [just diagnosed with pancreatic cancer ], [why did i get cancer in pancreas], and [i wastoldi have pancreatic cancer what to expect]. From the set of 479, 787 pancreatic cancer searchers, 3, 203 match the pattern of diagnostic queries. In order to consider them as experiential users, we require them to have searched at least for one symptom prior to the diagnosis query. This 1 Session is a query sequence with apple 30 minutes between queries [51]. Positive cases: S 0 Negative cases: S 0 α Symptom lookup period Search engine activity data over time α No pancreatic cancer searching Symptom lookup period Search engine activity data over time β Period of diagnosis Figure 2: Schematic illustration of the query timelines used in the selection of positive and negative cases. S 0 refers to the first symptom query and Exp 0 is the first experiential diagnostic query. is the duration of the symptom lookup period, which was approximately equal in the aggregate for the positives and negatives. is the duration of the period of diagnosis, set to one week in this study. step generates a set of 1, 072 query timelines of experiential searchers that contain periods of symptom lookup followed by the diagnostic query. The symptom lookup period starts when the first symptom is detected as matching terms represented in the symptom ontology. For positive cases, the symptom lookup period completes at least one week before diagnosis (i.e., we consider the week before the diagnostic query as the period of diagnosis in real life and do not count queries in that time since they may be polluted with ground truth signals). For reliability, especially given the need to compute Temporal features (Section 4.2), the minimum time period for which our features are computed is four weeks. Negative Cases: To generate the negative set, we sample from the users who searched for pancreatic cancer symptoms but did not search for pancreatic cancer anywhere in their timeline (i.e., C \ A), either before or after the symptom lookup period. Performing this additional check on data Exp 0

5 Table 2: Summary statistics, namely, mean (M), standard deviation (SD), and number of cases (N), of durations and numbers of queries in positive and negative datasets. Class Duration (days) Total # queries N M SD M SD Positives ,072 Negatives ,025,046 from outside the lookup period is required to increase the likelihood that the negative cases are indeed negative. Users who searched for pancreatic cancer and its symptoms, but did not issue an experiential query (the gray subset in Figure 1 representing (A \ C) \ B), were excluded since a label could not be reliably determined. In Section 5.6 we describe an additional experiment where pancreatic cancer searchers were included during model testing. We were concerned that rudimentary behavioral di erences that may reflect artifacts in the data could invalidate the learning task. For example, if our experiential users were just more active generally, then a feature that computed the total number of queries would have strong predictive value, yet would be uninteresting scientifically. We sought to address this by downsampling the negative cases to attain a similar distribution of symptom lookup periods in terms of the temporal duration and query volume as observed for the positive cases. 2 We did this by selecting users with a symptom lookup period duration within three standard deviations of the mean of the positive cases. This reduces the number of negative cases to 3,025,046. Table 2 presents the summary statistics on the symptom lookup periods in terms of the number of days and the number of queries in the two datasets. The table shows that the distributions for positive and negative cases (in terms of number of days and number of queries) are similar. The distributions are statistically indistinguishable using two-sample Kolmogorov-Smirnov tests for temporal duration (D = 0.005;p = ) and number of queries (D =0.003; p =0.7681), even though the latter was not a filtering criterion. We note that query timelines are not aligned: the absolute point in time where people issue the experiential diagnostic query, and the accompanying symptom lookup period can di er between searchers. 4. EARLY DETECTION We now present the problem, summarize features extracted from query timelines, and review the prediction model. 4.1 Problem Description We address the problem of early detection of experiential searchers for pancreatic cancer via anonymized Web search engine logs. We cast this as a binary classification task, where the model is trained on features extracted from search log query timelines of experiential pancreatic cancer searchers and symptom-only searchers. We focus on maintaining very low false-positive rates (i.e., 1 misprediction in 100k correctly identified cases) while keeping high the imbalance ratio of positive and negative cases (i.e., one thousand positives vs. millions of negative cases); these properties are important for potential future large-scale real-world applications such as an alerting mechanism in search engines. 2 We could also have addressed this by making all features relative percentages. Sampling gave us more flexibility in feature construction. As an additional check, we included features such as the number of queries in the symptom lookup period; those were found to carry little evidential weight in the learned model. 4.2 Features We now describe the features extracted from query timelines. We group our features into five di erent categories: (i) demographic information about the user; (ii) characteristics about user sessions, query classes, and URL classes; (iii) characteristics about symptoms; (iv) features that capture the temporal dynamics, and (v) risk factors. Demographics: Cancer statistics from the U.S. National Cancer Institute 3 show that pancreatic cancer is more common with increasing age, is slightly more common in men than in women, and varies by geographic location. As such, we develop features related to the demographics of the users. In particular, we use the estimated age bucket and gender (see Section 3.4) along with the classifier s probabilities as confidence values. The dominant location (U.S. state) of a searcher is also included as a feature. Search Characteristics: People express their information needs and preferences through queries and click behavior (i.e., the website visits). We extract various features to capture these search and retrieval activities. As we discussed previously, the queries, as well as the visited websites, were tagged with their ODP category in an attempt to identify domains of interest (see Section 3.4). A first set of features SearchHistory contains several generic statistics, such as counts, ratios, and percentages, which are characteristics of the global behavior of the user. For example, we compute the number of queries, sessions, and clicks, as well as ratios of clicks per query for each user. Then, we compute a large number of features with respect to the ODP categories of queries QueryTopic, clicked search results URLTopic, and the combination QueryURLTopic. These include compute counts and percentage of queries and sessions in each ODP category, the average time until queries appear in the same category, as well as the time of the day that queries appear in each category. Similar features are also computed for each category of the visited websites (e.g., counts, ratios, and percentages of visited websites that belong to each category). We additionally compute features to characterize the user sessions, including features that capture the click behavior of users associated with queries. For example, we compute counts and percentages of all the combinations of query categories that led to visits in website categories. Symptoms: Features described above attempt to capture generic characteristics from user sessions. However, for the problem of interest, we seek to also leverage features from queries containing terms captured in the symptom ontology for pancreatic cancer (Section 3.2). The symptom features are divided into two classes: (i) SymptomGeneric and (ii) SymptomSpecific. Generic symptom features contain counts and percentages for the queries and sessions matching symptoms in our ontology, the average time between symptom queries, as well as the average number of symptom queries that are issued daily. Specific symptom features are generated per symptom category. For example, for each symptom, we compute counts and percentages of appearance, the time between distinct symptoms, and the time of day such symptom queries are issued. As with the user session features, we combine symptoms to capture the click behavior and, hence, we compute counts and percentages of each symptom query leading to a visit on a website belonging to particular ODP categories. Finally, we define features that capture the se- 3

6 Table 3: Performance at four-week intervals for users where features can be computed from Exp 0 1weektoExp 0 21 weeks. Values averaged across the ten folds of crossvalidation. The significance of di erences in AUROC and TPR using paired t-tests for each week versus Exp 0 1 indicated * p<0.01, ** p<0.001, and *** p< Weeks denote lead time before Exp 0 ( in Figure 2). Weeks before TPR (as %) at FPRs ranging from AUROC Exp week weeks weeks * * * 13 weeks * * * * 17 weeks * * ** ** ** 21 weeks 6.528* 9.199* ** ** *** ** quence in which symptoms appear in query timelines. Temporal: All previous features produce aggregated statistics over the full time window under consideration. Following [34], we include a set of features to capture the temporal variation of these statistics over misaligned query timelines with noise and missing values. For every feature, we generate a time series with points that represent aggregated values for intervals of the time window. For example, each feature can be computed per month, per week, or per day, depending on the level of granularity we seek to capture. Since the occurrence of specific features can be sparse, we set the time window to four weeks for temporal features. For features that are not percentages or ratios, we also compute the cumulative time series. For each time series, we use the first coe cient of the linear least-square estimates to devise features that capture the trend (i.e., increasing, decreasing, and unchanged) and the rate of change (i.e., slope). Risk Factors: This class contains features related to the presence of terms representing risk factors in the symptom lookup period. For each risk factor, we note its presence or absence, and also the number of queries containing that risk factor and the fraction of all queries from that user that these risk factor queries represents. Total number of distinct risk factors in the symptom lookup period is also a feature. 4.3 Prediction Model The prediction model uses the features outlined in the previous section, computed for each searcher, to make predictions about the future occurrence of experiential diagnostic searches in each searcher s query timeline. We use gradient boosted trees [17], which employ an ensemble of decision trees to construct a better learned model. Advantages include the ability to capture non-linear relationships, model interpretability (e.g., a ranked list of important features is generated), facility for rapid training and testing, and robustness against noisy labels and missing values. We experimented with di erent learning algorithms, but gradient boosted trees yielded superior accuracy. 5. EXPERIMENTAL RESULTS We now present the findings of our experiments. We report the overall performance in Section 5.1 and the performance as we increase the lead time before the first experiential diagnostic query (Section 5.2). We then inspect the model to understand the contributions that each feature makes towards early detection (Section 5.3) and the performance of di erent feature classes (Section 5.4). We examine the e ect on model performance of conditioning on symptoms and risk factors (Section 5.5) and consider a realistic deployment scenario (Section 5.6). We use the area True positive rate False positive rate Weeks before 1 week 5 weeks 9 weeks 13 weeks 17 weeks 21 weeks Figure 3: Average partial ROC curves in the FPR range , for models learned using data up to 21 weeks before the first experiential diagnostic query (error bars excluded for clarity). Variance in FPR and TPR is minor. under the receiver operating characteristic curve (AUROC) and recall (TPR, true positive rate) at fixed, extremely low false positive rates (FPRs) as our primary evaluation metrics. We applied 10-fold cross validation, stratified by user to evaluate the generalizability of the model when applied to new users. Significance level is p<0.05 unless stated. 5.1 Overall The overall performance of the classifier in making predictions based on data up to the beginning of the period of diagnosis (i.e., Exp 0 1week)inAUROCis Given that low error rates would be vital in practice to avoid unnecessary patient alarm, we focus on the true positive rate (fraction of all positives that are recalled by the model) at low false positive rates (FPR). Focusing on FPR in the range , the model is able to recall 5 30% of the positive cases, depending on the specific FPR. We see this performance as promising given the limited information (primarily search-related activity) available to the model. 5.2 Performance by Week A key part of early detection is being able to predict the emergence of the disease well in advance. To understand how prediction performance changed as we move further back in time before the first experiential diagnostic query we selected the set of 337 positive searchers and 945,394 negative searchers who were still observed in the logs many weeks prior to the experiential diagnostic query. We report results from one week before the experiential diagnostic query, all the way up to 21 weeks before the diagnostic query. To count as being present at Exp 0 21 weeks, a searcher needs to have symptom queries extending back at least four weeks before that point (i.e., to Exp 0 25 weeks, or approximately six months before the first experiential diagnostic query). We trained a model for these users in the same way as we did for Section 5.1. The ratio between positive and negative searchers remains similar to that for all users (i.e., approximately 1:3000). Table 3 reports the TPR at di erent false positive rates for this same set of users at di erent four-week increments, as well as the AUROC. The general trend is that the performance drops fairly consistently as we increase the lead time, but even 20 or so weeks before the first experiential diagnostic query the predictive perfor-

7 Table 4: Top 10 features by evidential weight relative to the top feature. Positive or Negative direction means that a feature correlates positively or negatively, respectively, with the rise of experiential queries. Feature Weight Direction Class NumOfDistinctSymptoms Positive SymptomGeneral NumOfQueriesInHealthCategory Positive QueryTopic NumOfDistinctSymptomsVariants Positive SymptomGeneral AgeClassProbability Positive Demographic HasBackPain Negative SymptomSpecific HasIndigestion Negative SymptomSpecific HasIndigestionThenAbdominalPain Positive Temporal SlopeNumOfDistinctSymptoms Positive Temporal HasBackPainThenYellowSkinOrEyes Positive Temporal AgeClassProbability LT Negative Demographic mance is still quite strong (AUROC=0.8315, TPR=6.528% at FPR= ). Assuming that pancreatic cancer progresses steadily from stage I to stage IV in just over one year (as has been previously reported [59]), accurate predictions 20 weeks in advance of the diagnostic period could lead to a sizable increase in the five-year survival rate (e.g., moving the point of diagnosis from Stage III to Stage II could increase the survival rate from 3% to 5 7% [47]). Focusing on the FPR region from 0 to 0.01 (i.e., false positives occur less than 1 in 100 times) and visualizing that part of the ROC curve (Figure 3) we observe some clear di erences in the performance of the models in this important region. The average normalized partial AUROC ranges from (Exp 0 1week)to0.231(Exp 0 21weeks). All di erences in AUROC for Exp 0 5 or more weeks versus Exp 0 1 week are significant (p <0.01 using paired t-tests). 5.3 Feature Contributions In addition to understanding the overall performance, we are also interested in understanding the features that are most important in the learned model for predicting the future issuance of experiential queries. Table 4 shows the top 10 features with the highest weight, along with their weight relative to the top-ranked feature (NumOfDistinct- Symptoms) and the feature class. The direction is based on the correlation between the feature value and the labels in the training data, using Pearson biserial correlation or the phi coe cient, depending on whether or not the feature data is binary. Table 4 shows that there is a broad range of features. The number of distinct pancreatic cancer symptoms was the most important feature. Temporal features representing changes over time and sequence ordering of symptom pairs are also important. Age is important, and it is positively correlated if the searcher is older and is negatively correlated if they are younger. Individual symptom features related to back pain and indigestion are important but have a negative influence on predicting future experiential queries, likely because (i) there are many explanations for why these symptoms appear in a query timeline, and (ii) they are positive for many negative cases (16.7% of negatives search for back pain, 7.4% search for indigestion). 5.4 Feature Classes Beyond the individual features, we can also consider the accuracy of the models based on feature classes. This can be particularly important when some classes of features are easy to obtain in practice, e.g., the demographic features may be available for all searchers without the need to perform temporal modeling of query patterns. Table 5 presents the AUROC for models trained on each of the feature classes. The findings show that the Temporal class is particularly important, signifying the key role of temporal dynamics for this prediction task. The model is still accurate solely with access Table 5: Performance of individual feature classes (as AU- ROC and TPR at a FPR of ) averaged across the 10 experimental folds. The performance di erences against the Overall performance are statistically significant using paired t-tests at * p<0.01, ** p<0.001, and *** p< Feature Class AUROC TPR (as a %) at FPR= Demographic *** 0.280%*** RiskFactor *** 0.653%** SearchHistory *** 1.399%** QueryURLTopic ** 2.052%** SymptomGeneral ** 2.146%** URLTopic ** 2.332%* SymptomSpecific * 2.800%* QueryTopic * 2.892%* Temporal * 2.985%* Overall % to demographics and basic features about general searching. However, performance improves considerably if we consider the specifics of the symptoms searched (SymptomSpecific) or the topics of the queries and results clicked (QueryTopic). 5.5 Symptoms and Risk Factors We also considered the impact of the presence of symptoms and risk factors on the performance of the model. Symptoms: We filtered the positive and negative cases to those where a symptom was present in query timelines. Risk factors: These are risk factors corresponding to the presence of factors such as pancreatitis, smoking, and obesity, as well as cancer syndromes such as hereditary intestinal polyposis syndrome or familial atypical multiple mole melanoma syndrome (i.e., genetic disorders that predispose individuals to develop pancreatic cancer), all of which have been shown to lead to increased likelihood of developing pancreatic cancer [18, 20, 32, 48]. Recall that our cross-validation was stratified by user. During cross validation, we learned a model on the users in the training folds and then for testing we limited to users with evidence of the specific symptoms or risk factors in their search history prior to the experiential diagnostic query. In each case the number of positives and negatives is less than the full set. 4 Table 6 presents statistics on the performance for each model where the number of positive examples was at least 10 (to help ensure that AUROC calculations were meaningful). The table also presents TPRs at di erent false positive rates, as well as the percentage of positive or negative cases that have the symptom or risk factor searches. Finally, the last three columns shows the estimated number of true positives (capture) and false positives (cost) that would be observed, assuming a FPR of , and the associated capture-cost ratio. Ideal targets for rates of capture versus cost in a deployed service can be derived via a decision analysis that considers the net expected value of the early detection and the expected costs of unnecessary anxiety. Such an optimization would leverage a careful characterization of the value of early intervention and details of designs of methods for engaging people. Table 6 shows that focusing on users who search for risk factors such as smoking, hepatitis, and obesity leads to better overall performance. There were fewer than ten users searching for each of the cancer syndromes (e.g., hereditary nonpolyposis colorectal cancer) and, hence, they were excluded from Table 6. Focusing on the percentage of posi- 4 An alternative would be to train a separate model for symptom or risk factor. An issue with doing that is there are insu cient positive examples about each dataset with which to train a robust model.

8 Table 6: Performance of the models conditioned on a variety of symptom (S) and risk factors (RF). Values below the dashed line have a higher AUROC than Overall. Capture represents the number of TP cases in the cohort of positives [ negatives at FPR= Cost is computed as the target FPR (.00001) multiplied by the size of the negative set in each subgroup. Since this exact FPR may not be attainable in each subgroup, cost may not be an integer. A capture-cost ratio of > 1.0 means that more people would benefit from an alert than would be mistakenly alerted. Statistically significant di erences with Overall model (DeLong s test [12]) are marked ** p<0.001 and *** p< (where following a Bonferroni correction is 0.002). Symptom or Risk Factor TPR at FPRs ranging from AUROC # pos (%) # neg (%) FPR = Capture Cost Capture/Cost Dark or tarry stool (S) *** 13 (1.2%) 58,597 (1.9%) Abdominal swelling (S) *** 24 (2.2%) (1.5%) Ulcers (RF) *** 38 (3.5%) 16,081 (0.5%) Dark urine (S) ** 18 (1.7%) 51,236 (1.7%) Pancreatitis (RF) ** 33 (3.1%) 34,184 (1.1%) Abdominal pain (S) ** 130 (12.1%) 311,266 (10.3%) Enlarged gallbladder (S) ** 113 (10.5%) 98,454 (3.3%) Constipation (S) ** 85 (7.9%) 317,300 (10.5%) Smoking (RF) (2.4%) 27,817 (0.9%) Blood clot (S) (8.3%) 351,385 (11.6%) High blood sugar (S) (30.4%) 429,543 (14.2%) Nausea or vomiting (S) (11.7%) 639,502 (21.1%) Chills or fever (RF) (10.3%) 357,536 (11.8%) Loose stool (S) (6%) 74,720 (2.5%) Indigestion (S) (9.9%) 504,462 (16.7%) Itchy skin (S) (1.5%) 79,448 (2.6%) Back pain (S) (13.2%) 223,586 (7.4%) Yellow skin or eyes (S) (8.6%) 85,805 (2.8%) Hepatitis (RF) (3.6%) 25,158 (0.8%) Alcoholism (RF) ** 48 (4.5%) 32,333 (1.1%) Obesity (RF) ** 29 (2.7%) 22,153 (0.7%) Overall ,072 (100%) 3,025,046 (100%) tives and negatives that contain each of the symptoms or risk factors, we observe that there are some that are much more likely to occur in positives (e.g., pancreatitis and smoking are 5.9 and 3.6 times as likely, respectively). Focusing on the utility, we find that if we set the FPR to , overall we would find 52 positives in the union of positives and negatives at the expense of 30 negatives, who would be altered mistakenly. There are some symptoms and risk factors for which the capture-cost is more favorable. For example, in the case of alcoholism or obesity, we would find times as many TPs as FPs. There are others symptoms such as nausea or vomiting, or chills or fever, where the costs in mistakenly alerting users equal or outweigh the benefits. Presence of symptoms or risk factors could help decide whether to apply early detection models for a searcher. 5.6 Applying Learned Model in Practice Up to now, our model considers experiential diagnostic users as positives and symptom-only users as negatives. This is a clean dataset for algorithm training and testing but it ignores the symptom searchers who issue non-experiential pancreatic cancer searches (gray region in Figure 1). These users may have been diagnosed or may simply be exploring. Regardless, they should be considered in practice. We perform an additional experiment on a separate set of symptom searchers that included non-experiential pancreatic cancer searchers as negatives. We trained a model on all data described thus far and applied it to identify (i) experiential and (ii) experiential+treatment users in a new held out dataset in advance of their first experiential diagnostic query. We generated the test set from logs of a separate randomly selected subset of Bing users, over an 18-month period from August 2014 to January 2016 inclusive. There was no overlap in users with the set used for training. We identified positive cases as earlier and expanded the definition of negatives to include pancreatic cancer searchers. This resulted in 2.9 million negatives, including 48,221 nonexperiential pancreatic cancer searchers, and 945 experiential searchers with preceding symptom searches. To help target the identification of cases where experiential queries are issued, we created a subset of the positives who issued treatment-related queries following Exp 0 (e.g., whipple procedure, 5-fu); in total, 494 users (52%) met this requirement. Table 7: Average AUROC and average TPR (as %) at FPR= for identifying experiential users and experiential+treatment users from held out dataset. Di erences in AUROC and TPR between Exp 0 1 week and other weeks noted (*** p<0.001, ** p<0.001, and * p<0.01). Weeks before Experiential Experiential+Treatment Exp 0 AUROC TPR (%) AUROC TPR (%) 1week weeks * weeks ** 8.278* ** * 13 weeks ** 8.018** ** ** 17 weeks *** 7.666** *** *** 21 weeks *** 7.438*** *** 9.645*** The symptom lookup durations for positives and negatives were similar to Section 3.5. We randomly split the test data into ten equally-sized subsets for significance testing. Table 7 reports the predictive performance at di erent lead times. Table 7 shows that the performance of the model remains strong on this held-out set and is comparable to that reported in the earlier sections. The performance decreases with increased lead time as noted previously (see Table 3). Interestingly, the performance in identifying the subset of experiential diagnostic users who subsequently searched for treatments is higher than for the experiential-only set. This is promising confirmatory evidence as these users are assumed to have experienced a cancer diagnosis, per definitions of experiential queries [40] DISCUSSION AND CONCLUSIONS We studied the potential feasibility of learning from search engine logs to predict future issuance of experiential queries about pancreatic cancer at a low error rate. The success of these methods has implications for online methods that would provide passive screening of searchers to provide early warning about potential signs of pancreatic cancer and other devastating diseases. We discovered that conditionalization on di erent symptoms and risk factors can enhance predictive power. We found capture-cost tradeo s associated with 5 For completeness, we also trained a model on all data from earlier in the paper, including the non-experiential pancreatic cancer searchers as negative cases, and tested it on this held-out set. The performance is around 5% lower than reported in Table 7, for both AUROC and TPR, across all weeks. Including the non-experiential pancreatic cancer searchers may add noise to model training.

The incidence of pancreatic cancer is rising in India and is higher in the urban male population in the western and northern parts of India.

The incidence of pancreatic cancer is rising in India and is higher in the urban male population in the western and northern parts of India. Published on: 9 Jun 2015 Pancreatic Cancer What Is Cancer? The body is made up of cells, which grow and die in a controlled way. Sometimes, cells keep on growing without control, causing an abnormal growth

More information

From web search to healthcare utilization: privacy-sensitive studies from mobile data

From web search to healthcare utilization: privacy-sensitive studies from mobile data From web search to healthcare utilization: privacy-sensitive studies from mobile data Ryen White, Eric Horvitz Research and applications Microsoft Research, Redmond, Washington, USA Correspondence to Dr

More information

TITLE: A Data-Driven Approach to Patient Risk Stratification for Acute Respiratory Distress Syndrome (ARDS)

TITLE: A Data-Driven Approach to Patient Risk Stratification for Acute Respiratory Distress Syndrome (ARDS) TITLE: A Data-Driven Approach to Patient Risk Stratification for Acute Respiratory Distress Syndrome (ARDS) AUTHORS: Tejas Prahlad INTRODUCTION Acute Respiratory Distress Syndrome (ARDS) is a condition

More information

OVARIAN CANCER STATISTICS

OVARIAN CANCER STATISTICS bioprognos OncoOVARIAN Dx Non-invasive blood test useful to suggest a possible diagnosis in patients with suspected malignancy in the ovarians, reduce inappropriate diagnostic tests, days of hospitalization,

More information

What Is Pancreatitis?

What Is Pancreatitis? What Is Pancreatitis? Pancreatitis is inflammation (swelling) of the pancreas that is most often caused by gallstones or alcohol abuse. There are other causes that your gastroenterologist will look for,

More information

X-Plain Pancreatic Cancer Reference Summary

X-Plain Pancreatic Cancer Reference Summary X-Plain Pancreatic Cancer Reference Summary Introduction Pancreatic cancer is the 4th leading cause of cancer deaths in the U.S. About 37,000 new cases of pancreatic cancer are diagnosed each year in the

More information

PANCREAS DUCTAL ADENOCARCINOMA PDAC

PANCREAS DUCTAL ADENOCARCINOMA PDAC CONTENTS PANCREAS DUCTAL ADENOCARCINOMA PDAC I. What is the pancreas? II. III. IV. What is pancreas cancer? What is the epidemiology of Pancreatic Ductal Adenocarcinoma (PDAC)? What are the risk factors

More information

What is Liver Cancer? About the Liver

What is Liver Cancer? About the Liver Your liver is important and it has many functions. The top three are that it cleans your blood of toxins, gives you energy and produces bile for digestion. What is Liver Cancer? Cancer starts when cells

More information

Predictive Diagnosis. Clustering to Better Predict Heart Attacks x The Analytics Edge

Predictive Diagnosis. Clustering to Better Predict Heart Attacks x The Analytics Edge Predictive Diagnosis Clustering to Better Predict Heart Attacks 15.071x The Analytics Edge Heart Attacks Heart attack is a common complication of coronary heart disease resulting from the interruption

More information

Anamnesis via the Internet - Prospects and Pilot Results

Anamnesis via the Internet - Prospects and Pilot Results MEDINFO 2001 V. Patel et al. (Eds) Amsterdam: IOS Press 2001 IMIA. All rights reserved Anamnesis via the Internet - Prospects and Pilot Results Athanasios Emmanouil and Gunnar O. Klein Centre for Health

More information

Information for You and Your Family

Information for You and Your Family Information for You and Your Family What is Prevention? Cancer prevention is action taken to lower the chance of getting cancer. In 2017, more than 1.6 million people will be diagnosed with cancer in the

More information

Predicting Breast Cancer Survival Using Treatment and Patient Factors

Predicting Breast Cancer Survival Using Treatment and Patient Factors Predicting Breast Cancer Survival Using Treatment and Patient Factors William Chen wchen808@stanford.edu Henry Wang hwang9@stanford.edu 1. Introduction Breast cancer is the leading type of cancer in women

More information

Chapter 1. Introduction

Chapter 1. Introduction Chapter 1 Introduction 1.1 Motivation and Goals The increasing availability and decreasing cost of high-throughput (HT) technologies coupled with the availability of computational tools and data form a

More information

More skilled internet users behave (a little) more securely

More skilled internet users behave (a little) more securely More skilled internet users behave (a little) more securely Elissa Redmiles eredmiles@cs.umd.edu Shelby Silverstein shelby93@umd.edu Wei Bai wbai@umd.edu Michelle L. Mazurek mmazurek@umd.edu University

More information

11. NATIONAL DAFNE CLINICAL AND RESEARCH DATABASE

11. NATIONAL DAFNE CLINICAL AND RESEARCH DATABASE 11. NATIONAL DAFNE CLINICAL AND RESEARCH DATABASE The National DAFNE Clinical and Research database was set up as part of the DAFNE QA programme (refer to section 12) to facilitate Audit and was updated

More information

FAQ ON QUANTUM HEALTH ANALYZER

FAQ ON QUANTUM HEALTH ANALYZER FAQ ON QUANTUM HEALTH ANALYZER What is Quantum Health Analyzer? The Quantum Health Analyzer measures the degree and type of response of a matter under test, and by comparison with reference matter it assists

More information

How Does Analysis of Competing Hypotheses (ACH) Improve Intelligence Analysis?

How Does Analysis of Competing Hypotheses (ACH) Improve Intelligence Analysis? How Does Analysis of Competing Hypotheses (ACH) Improve Intelligence Analysis? Richards J. Heuer, Jr. Version 1.2, October 16, 2005 This document is from a collection of works by Richards J. Heuer, Jr.

More information

Philippine Cancer Society Forum: Cancer can be cured!

Philippine Cancer Society Forum: Cancer can be cured! Philippine Cancer Society Forum: Cancer can be cured! Throughout history, doctors and scientists have extensively studied Their researchers have not only yielded a wealth of information on the disease,

More information

Liver Cancer And Tumours

Liver Cancer And Tumours Liver Cancer And Tumours What causes liver cancer? Many factors may play a role in the development of cancer. Because the liver filters blood from all parts of the body, cancer cells from elsewhere can

More information

Weight Adjustment Methods using Multilevel Propensity Models and Random Forests

Weight Adjustment Methods using Multilevel Propensity Models and Random Forests Weight Adjustment Methods using Multilevel Propensity Models and Random Forests Ronaldo Iachan 1, Maria Prosviryakova 1, Kurt Peters 2, Lauren Restivo 1 1 ICF International, 530 Gaither Road Suite 500,

More information

Treatment for cancer of the gall bladder

Treatment for cancer of the gall bladder Treatment for cancer of the gall bladder Hepatobiliary Services Information for Patients Liver i Stomach Pancreas Gall bladder Introduction The aim of this booklet is to help you understand more about

More information

A Predictive Chronological Model of Multiple Clinical Observations T R A V I S G O O D W I N A N D S A N D A M. H A R A B A G I U

A Predictive Chronological Model of Multiple Clinical Observations T R A V I S G O O D W I N A N D S A N D A M. H A R A B A G I U A Predictive Chronological Model of Multiple Clinical Observations T R A V I S G O O D W I N A N D S A N D A M. H A R A B A G I U T H E U N I V E R S I T Y O F T E X A S A T D A L L A S H U M A N L A N

More information

Technical Specifications

Technical Specifications Technical Specifications In order to provide summary information across a set of exercises, all tests must employ some form of scoring models. The most familiar of these scoring models is the one typically

More information

4. Model evaluation & selection

4. Model evaluation & selection Foundations of Machine Learning CentraleSupélec Fall 2017 4. Model evaluation & selection Chloé-Agathe Azencot Centre for Computational Biology, Mines ParisTech chloe-agathe.azencott@mines-paristech.fr

More information

The diagnosis of Chronic Pancreatitis

The diagnosis of Chronic Pancreatitis The diagnosis of Chronic Pancreatitis 1. Background The diagnosis of chronic pancreatitis (CP) is challenging. Chronic pancreatitis is a disease process consisting of: fibrosis of the pancreas (potentially

More information

A Comparison of Collaborative Filtering Methods for Medication Reconciliation

A Comparison of Collaborative Filtering Methods for Medication Reconciliation A Comparison of Collaborative Filtering Methods for Medication Reconciliation Huanian Zheng, Rema Padman, Daniel B. Neill The H. John Heinz III College, Carnegie Mellon University, Pittsburgh, PA, 15213,

More information

Combatting Pancreatic Cancer: Keys to Early Recognition and Diagnosis

Combatting Pancreatic Cancer: Keys to Early Recognition and Diagnosis Transcript Details This is a transcript of an educational program accessible on the ReachMD network. Details about the program and additional media formats for the program are accessible by visiting: https://reachmd.com/programs/clinicians-roundtable/combatting-pancreatic-cancer-keys-earlyrecognition-and-diagnosis/7286/

More information

Case-based reasoning using electronic health records efficiently identifies eligible patients for clinical trials

Case-based reasoning using electronic health records efficiently identifies eligible patients for clinical trials Case-based reasoning using electronic health records efficiently identifies eligible patients for clinical trials Riccardo Miotto and Chunhua Weng Department of Biomedical Informatics Columbia University,

More information

BlueBayCT - Warfarin User Guide

BlueBayCT - Warfarin User Guide BlueBayCT - Warfarin User Guide December 2012 Help Desk 0845 5211241 Contents Getting Started... 1 Before you start... 1 About this guide... 1 Conventions... 1 Notes... 1 Warfarin Management... 2 New INR/Warfarin

More information

BIOSTATISTICAL METHODS

BIOSTATISTICAL METHODS BIOSTATISTICAL METHODS FOR TRANSLATIONAL & CLINICAL RESEARCH PROPENSITY SCORE Confounding Definition: A situation in which the effect or association between an exposure (a predictor or risk factor) and

More information

The Impact of Relative Standards on the Propensity to Disclose. Alessandro Acquisti, Leslie K. John, George Loewenstein WEB APPENDIX

The Impact of Relative Standards on the Propensity to Disclose. Alessandro Acquisti, Leslie K. John, George Loewenstein WEB APPENDIX The Impact of Relative Standards on the Propensity to Disclose Alessandro Acquisti, Leslie K. John, George Loewenstein WEB APPENDIX 2 Web Appendix A: Panel data estimation approach As noted in the main

More information

MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES OBJECTIVES

MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES OBJECTIVES 24 MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES In the previous chapter, simple linear regression was used when you have one independent variable and one dependent variable. This chapter

More information

OCW Epidemiology and Biostatistics, 2010 Michael D. Kneeland, MD November 18, 2010 SCREENING. Learning Objectives for this session:

OCW Epidemiology and Biostatistics, 2010 Michael D. Kneeland, MD November 18, 2010 SCREENING. Learning Objectives for this session: OCW Epidemiology and Biostatistics, 2010 Michael D. Kneeland, MD November 18, 2010 SCREENING Learning Objectives for this session: 1) Know the objectives of a screening program 2) Define and calculate

More information

The 5 Warning Signs That Could Mean Cancer in Your Body

The 5 Warning Signs That Could Mean Cancer in Your Body The 5 Warning Signs That Could Mean Cancer in Your Body I hope this article is a wake up call for anyone who s ever dealt with: Heartburn Difficulty digesting certain foods Acid reflux Inflammation in

More information

Colon Cancer Screening and Surveillance. Louis V. Antignano, M.D. Wilson Gastroenterology October 11, 2011

Colon Cancer Screening and Surveillance. Louis V. Antignano, M.D. Wilson Gastroenterology October 11, 2011 Colon Cancer Screening and Surveillance Louis V. Antignano, M.D. Wilson Gastroenterology October 11, 2011 Colorectal Cancer Preventable cancer Number 2 cancer killer in the USA Often curable if detected

More information

Predicting Breast Cancer Survivability Rates

Predicting Breast Cancer Survivability Rates Predicting Breast Cancer Survivability Rates For data collected from Saudi Arabia Registries Ghofran Othoum 1 and Wadee Al-Halabi 2 1 Computer Science, Effat University, Jeddah, Saudi Arabia 2 Computer

More information

The Long Tail of Recommender Systems and How to Leverage It

The Long Tail of Recommender Systems and How to Leverage It The Long Tail of Recommender Systems and How to Leverage It Yoon-Joo Park Stern School of Business, New York University ypark@stern.nyu.edu Alexander Tuzhilin Stern School of Business, New York University

More information

38 Int'l Conf. Bioinformatics and Computational Biology BIOCOMP'16

38 Int'l Conf. Bioinformatics and Computational Biology BIOCOMP'16 38 Int'l Conf. Bioinformatics and Computational Biology BIOCOMP'16 PGAR: ASD Candidate Gene Prioritization System Using Expression Patterns Steven Cogill and Liangjiang Wang Department of Genetics and

More information

Ebele C. Chira, MD 1055 Clarksville Street, Suite 190, Paris, TX Phone (903) Fax (903)

Ebele C. Chira, MD 1055 Clarksville Street, Suite 190, Paris, TX Phone (903) Fax (903) Ebele C. Chira, MD 1055 Clarksville Street, Suite 190, Paris, TX 75460 Phone (903) 905-4609 Fax (903) 905-4611 Enclosed are forms for you to complete prior to your appointment. Please bring these completed

More information

Keeping Abreast of Breast Imagers: Radiology Pathology Correlation for the Rest of Us

Keeping Abreast of Breast Imagers: Radiology Pathology Correlation for the Rest of Us SIIM 2016 Scientific Session Quality and Safety Part 1 Thursday, June 30 8:00 am 9:30 am Keeping Abreast of Breast Imagers: Radiology Pathology Correlation for the Rest of Us Linda C. Kelahan, MD, Medstar

More information

International Journal of Research in Science and Technology. (IJRST) 2018, Vol. No. 8, Issue No. IV, Oct-Dec e-issn: , p-issn: X

International Journal of Research in Science and Technology. (IJRST) 2018, Vol. No. 8, Issue No. IV, Oct-Dec e-issn: , p-issn: X CLOUD FILE SHARING AND DATA SECURITY THREATS EXPLORING THE EMPLOYABILITY OF GRAPH-BASED UNSUPERVISED LEARNING IN DETECTING AND SAFEGUARDING CLOUD FILES Harshit Yadav Student, Bal Bharati Public School,

More information

Colorectal Cancer How to reduce your risk

Colorectal Cancer How to reduce your risk Prevention Series Colorectal Cancer How to reduce your risk Let's Make Cancer History 1 888 939-3333 cancer.ca Colorectal Cancer How to reduce your risk Colorectal cancer is the third most commonly diagnosed

More information

Together, putting patients first

Together, putting patients first The Role of a Gastroenterologist in the Diagnosis and Management of Pancreatic Cancer Sarah Jowett, Consultant Gastroenterologist Bradford Teaching Hospitals Trust Leeds Regional Study Day, 12 September

More information

Hernia. emoryhealthcare.org

Hernia. emoryhealthcare.org Hernia Have you noticed a bulge or pain in your abdominal wall or groin? If so you may have a hernia. You may be in the process of confirming this diagnosis with your Primary Care Physician or already

More information

Identifying Parkinson s Patients: A Functional Gradient Boosting Approach

Identifying Parkinson s Patients: A Functional Gradient Boosting Approach Identifying Parkinson s Patients: A Functional Gradient Boosting Approach Devendra Singh Dhami 1, Ameet Soni 2, David Page 3, and Sriraam Natarajan 1 1 Indiana University Bloomington 2 Swarthmore College

More information

Cirrhosis. A Chronic Liver Problem

Cirrhosis. A Chronic Liver Problem Cirrhosis A Chronic Liver Problem What Is Cirrhosis? Cirrhosis is a chronic (long-lasting) liver problem. It results from damaged and scarred liver tissue. Cirrhosis can t be cured, but it can be treated.

More information

Chapter IR:VIII. VIII. Evaluation. Laboratory Experiments Logging Effectiveness Measures Efficiency Measures Training and Testing

Chapter IR:VIII. VIII. Evaluation. Laboratory Experiments Logging Effectiveness Measures Efficiency Measures Training and Testing Chapter IR:VIII VIII. Evaluation Laboratory Experiments Logging Effectiveness Measures Efficiency Measures Training and Testing IR:VIII-1 Evaluation HAGEN/POTTHAST/STEIN 2018 Retrieval Tasks Ad hoc retrieval:

More information

For the Patient: USMAVNIV

For the Patient: USMAVNIV For the Patient: USMAVNIV Other Names: Treatment of Unresectable or Metastatic Melanoma Using Nivolumab U = Undesignated (requires special approval) SM = Skin and Melanoma AV = AdVanced NIV = NIVolumab

More information

CSE 255 Assignment 9

CSE 255 Assignment 9 CSE 255 Assignment 9 Alexander Asplund, William Fedus September 25, 2015 1 Introduction In this paper we train a logistic regression function for two forms of link prediction among a set of 244 suspected

More information

Funnelling Used to describe a process of narrowing down of focus within a literature review. So, the writer begins with a broad discussion providing b

Funnelling Used to describe a process of narrowing down of focus within a literature review. So, the writer begins with a broad discussion providing b Accidental sampling A lesser-used term for convenience sampling. Action research An approach that challenges the traditional conception of the researcher as separate from the real world. It is associated

More information

READ THIS FOR SAFE AND EFFECTIVE USE OF YOUR MEDICINE PATIENT MEDICATION INFORMATION

READ THIS FOR SAFE AND EFFECTIVE USE OF YOUR MEDICINE PATIENT MEDICATION INFORMATION READ THIS FOR SAFE AND EFFECTIVE USE OF YOUR MEDICINE PATIENT MEDICATION INFORMATION BAVENCIO (buh-ven-see-oh) Avelumab for Injection Solution for Intravenous Infusion Read this carefully before you start

More information

The Social Norms Review

The Social Norms Review Volume 1 Issue 1 www.socialnorm.org August 2005 The Social Norms Review Welcome to the premier issue of The Social Norms Review! This new, electronic publication of the National Social Norms Resource Center

More information

PDRF About Propensity Weighting emma in Australia Adam Hodgson & Andrey Ponomarev Ipsos Connect Australia

PDRF About Propensity Weighting emma in Australia Adam Hodgson & Andrey Ponomarev Ipsos Connect Australia 1. Introduction It is not news for the research industry that over time, we have to face lower response rates from consumer surveys (Cook, 2000, Holbrook, 2008). It is not infrequent these days, especially

More information

General Biostatistics Concepts

General Biostatistics Concepts General Biostatistics Concepts Dongmei Li Department of Public Health Sciences Office of Public Health Studies University of Hawai i at Mānoa Outline 1. What is Biostatistics? 2. Types of Measurements

More information

Visualizing Temporal Patterns by Clustering Patients

Visualizing Temporal Patterns by Clustering Patients Visualizing Temporal Patterns by Clustering Patients Grace Shin, MS 1 ; Samuel McLean, MD 2 ; June Hu, MS 2 ; David Gotz, PhD 1 1 School of Information and Library Science; 2 Department of Anesthesiology

More information

Chronic Pancreatitis (1 of 4) i

Chronic Pancreatitis (1 of 4) i Chronic Pancreatitis (1 of 4) i If you need this information in another language or medium (audio, large print, etc) please contact the Customer Care Team on 0800 374 208 email: customercare@ salisbury.nhs.uk.

More information

Pancreatic Cancer (1 of 5)

Pancreatic Cancer (1 of 5) i If you need your information in another language or medium (audio, large print, etc) please contact Customer Care on 0800 374 208 or send an email to: customercare@ salisbury.nhs.uk You are entitled

More information

BowelGene. How do I know if I am at risk? Families with hereditary bowel cancer generally show one or more of the following clues:

BowelGene. How do I know if I am at risk? Families with hereditary bowel cancer generally show one or more of the following clues: BowelGene BowelGene What is hereditary bowel cancer? Bowel cancer (also known as colorectal cancer) is the fourth most common cancer in the UK. Unfortunately 1 in 19 women and 1 in 14 men will develop

More information

Assessment of Reliability of Hamilton-Tompkins Algorithm to ECG Parameter Detection

Assessment of Reliability of Hamilton-Tompkins Algorithm to ECG Parameter Detection Proceedings of the 2012 International Conference on Industrial Engineering and Operations Management Istanbul, Turkey, July 3 6, 2012 Assessment of Reliability of Hamilton-Tompkins Algorithm to ECG Parameter

More information

Modeling Sentiment with Ridge Regression

Modeling Sentiment with Ridge Regression Modeling Sentiment with Ridge Regression Luke Segars 2/20/2012 The goal of this project was to generate a linear sentiment model for classifying Amazon book reviews according to their star rank. More generally,

More information

LAPAROSCOPIC GALLBLADDER SURGERY

LAPAROSCOPIC GALLBLADDER SURGERY LAPAROSCOPIC GALLBLADDER SURGERY Treating Gallbladder Problems with Laparoscopy A Common Problem If you ve had an attack of painful gallbladder symptoms, you re not alone. Gallbladder disease is very common.

More information

Learning to Cook: An Exploration of Recipe Data

Learning to Cook: An Exploration of Recipe Data Learning to Cook: An Exploration of Recipe Data Travis Arffa (tarffa), Rachel Lim (rachelim), Jake Rachleff (jakerach) Abstract Using recipe data scraped from the internet, this project successfully implemented

More information

Mammogram Analysis: Tumor Classification

Mammogram Analysis: Tumor Classification Mammogram Analysis: Tumor Classification Literature Survey Report Geethapriya Raghavan geeragh@mail.utexas.edu EE 381K - Multidimensional Digital Signal Processing Spring 2005 Abstract Breast cancer is

More information

Unit 1 Exploring and Understanding Data

Unit 1 Exploring and Understanding Data Unit 1 Exploring and Understanding Data Area Principle Bar Chart Boxplot Conditional Distribution Dotplot Empirical Rule Five Number Summary Frequency Distribution Frequency Polygon Histogram Interquartile

More information

Measuring Focused Attention Using Fixation Inner-Density

Measuring Focused Attention Using Fixation Inner-Density Measuring Focused Attention Using Fixation Inner-Density Wen Liu, Mina Shojaeizadeh, Soussan Djamasbi, Andrew C. Trapp User Experience & Decision Making Research Laboratory, Worcester Polytechnic Institute

More information

RAG Rating Indicator Values

RAG Rating Indicator Values Technical Guide RAG Rating Indicator Values Introduction This document sets out Public Health England s standard approach to the use of RAG ratings for indicator values in relation to comparator or benchmark

More information

SISCR Module 7 Part I: Introduction Basic Concepts for Binary Biomarkers (Classifiers) and Continuous Biomarkers

SISCR Module 7 Part I: Introduction Basic Concepts for Binary Biomarkers (Classifiers) and Continuous Biomarkers SISCR Module 7 Part I: Introduction Basic Concepts for Binary Biomarkers (Classifiers) and Continuous Biomarkers Kathleen Kerr, Ph.D. Associate Professor Department of Biostatistics University of Washington

More information

Take on IPF progression with OFEV

Take on IPF progression with OFEV Every breath matters Take on IPF progression with OFEV Learn more at www.ofev.com IPF=idiopathic pulmonary fibrosis. Please see throughout Understand how IPF affects you Idiopathic pulmonary fibrosis (IPF)

More information

GHUK BowelGene_2017.qxp_Layout 1 22/02/ :22 Page 3 BowelGene

GHUK BowelGene_2017.qxp_Layout 1 22/02/ :22 Page 3 BowelGene GHUK BowelGene_2017.qxp_Layout 1 22/02/2017 10:22 Page 3 BowelGene BowelGene What is hereditary bowel cancer? Bowel cancer (also known as colorectal cancer) is the fourth most common cancer in the UK.

More information

Tobacco Use Percent (%)

Tobacco Use Percent (%) Tobacco Use 1 8 6 2 23 25 27 Lifetime cigarette use 64.8 62. 59.9 Current cigarette smoker 3.2 25.7 24.2 Current cigar smoker 19.4 21.3 18.9 First cigarette before age 13 24.7 2. 18. Current spit tobacco

More information

Asthma Surveillance Using Social Media Data

Asthma Surveillance Using Social Media Data Asthma Surveillance Using Social Media Data Wenli Zhang 1, Sudha Ram 1, Mark Burkart 2, Max Williams 2, and Yolande Pengetnze 2 University of Arizona 1, PCCI-Parkland Center for Clinical Innovation 2 {wenlizhang,

More information

MN 400: Research Methods. PART II The Design of Research

MN 400: Research Methods. PART II The Design of Research MN 400: Research Methods PART II The Design of Research 1 MN 400: Research Methods CHAPTER 6 Research Design 2 What is Research Design? A plan for selecting the sources and types of information used to

More information

Predicting About-to-Eat Moments for Just-in-Time Eating Intervention

Predicting About-to-Eat Moments for Just-in-Time Eating Intervention Predicting About-to-Eat Moments for Just-in-Time Eating Intervention CORNELL UNIVERSITY AND VIBE GROUP AT MICROSOFT RESEARCH Motivation Obesity is a leading cause of preventable death second only to smoking,

More information

Innovative Risk and Quality Solutions for Value-Based Care. Company Overview

Innovative Risk and Quality Solutions for Value-Based Care. Company Overview Innovative Risk and Quality Solutions for Value-Based Care Company Overview Meet Talix Talix provides risk and quality solutions to help providers, payers and accountable care organizations address the

More information

Pathway Project Team

Pathway Project Team Semantic Components: A model for enhancing retrieval of domain-specific information Lecture 22 CS 410/510 Information Retrieval on the Internet Pathway Project Team Susan Price, MD Portland State University

More information

Pancreatic Cancer. What is pancreatic cancer?

Pancreatic Cancer. What is pancreatic cancer? Scan for mobile link. Pancreatic Cancer Pancreatic cancer is a tumor of the pancreas, an organ that is located behind the stomach in the abdomen. Pancreatic cancer does not always cause symptoms until

More information

INTRODUCTION TO STATISTICS SORANA D. BOLBOACĂ

INTRODUCTION TO STATISTICS SORANA D. BOLBOACĂ INTRODUCTION TO STATISTICS SORANA D. BOLBOACĂ OBJECTIVES Definitions Stages of Scientific Knowledge Quantification and Accuracy Types of Medical Data Population and sample Sampling methods DEFINITIONS

More information

Family history of pancreatic cancer

Family history of pancreatic cancer Family history of pancreatic cancer _ This fact sheet is for people who want to find out about the hereditary risk of pancreatic cancer. You can speak to our specialist nurses on our Support Line about

More information

NHS Training for Physiotherapy Support Workers. Workbook 13 The digestive system

NHS Training for Physiotherapy Support Workers. Workbook 13 The digestive system NHS Training for Physiotherapy Support Workers Workbook 13 The digestive system Contents Workbook 13 The digestive system 1 13.1 Aim 3 13.2 Learning outcomes 3 13.3 Digestive system 4 13.4 The endocrine

More information

Anxiety and Information Seeking: Evidence From Large-Scale Mouse Tracking

Anxiety and Information Seeking: Evidence From Large-Scale Mouse Tracking Anxiety and Information Seeking: Evidence From Large-Scale Mouse Tracking ABSTRACT Brit Youngmann Microsoft Research Herzliya, Israel t-bryoun@microsoft.com People seeking information through search engines

More information

How preferred are preferred terms?

How preferred are preferred terms? How preferred are preferred terms? Gintare Grigonyte 1, Simon Clematide 2, Fabio Rinaldi 2 1 Computational Linguistics Group, Department of Linguistics, Stockholm University Universitetsvagen 10 C SE-106

More information

Social Determinants of Health

Social Determinants of Health FORECAST HEALTH WHITE PAPER SERIES Social Determinants of Health And Predictive Modeling SOHAYLA PRUITT Director Product Management Health systems must devise new ways to adapt to an aggressively changing

More information

Chapter 2: Identification and Care of Patients With CKD

Chapter 2: Identification and Care of Patients With CKD Chapter 2: Identification and Care of Patients With CKD Over half of patients in the Medicare 5% sample (aged 65 and older) had at least one of three diagnosed chronic conditions chronic kidney disease

More information

Diagnoses, Decisions, and Outcomes: Web Search as Decision Support for Cancer

Diagnoses, Decisions, and Outcomes: Web Search as Decision Support for Cancer Diagnoses, Decisions, and Outcomes: Web Search as Decision Support for Cancer Michael J. Paul, Johns Hopkins University Ryen W. White and Eric Horvitz, Microso> Research Decisions, Decisions People frequently

More information

Driving forces of researchers mobility

Driving forces of researchers mobility Driving forces of researchers mobility Supplementary Information Floriana Gargiulo 1 and Timoteo Carletti 1 1 NaXys, University of Namur, Namur, Belgium 1 Data preprocessing The original database contains

More information

JOHN MICHAEL ROACH, MD

JOHN MICHAEL ROACH, MD GASTROENTEROLOGY JOHN MICHAEL ROACH, MD 520 N. 4 TH AVE. PASCO, WA 99301 Phone: (509) 546-8383 Name: Date of Birth: First Middle (full) Last m/d/yr Primary care provider: Referring physician: Local Pharmacy:

More information

Public Reporting and Self-Regulation: Hospital Compare s Effect on Sameday Brain and Sinus Computed Tomography Utilization

Public Reporting and Self-Regulation: Hospital Compare s Effect on Sameday Brain and Sinus Computed Tomography Utilization Public Reporting and Self-Regulation: Hospital Compare s Effect on Sameday Brain and Sinus Computed Tomography Utilization Darwyyn Deyo and Danny R. Hughes December 1, 2017 Motivation Widespread concerns

More information

Text mining for lung cancer cases over large patient admission data. David Martinez, Lawrence Cavedon, Zaf Alam, Christopher Bain, Karin Verspoor

Text mining for lung cancer cases over large patient admission data. David Martinez, Lawrence Cavedon, Zaf Alam, Christopher Bain, Karin Verspoor Text mining for lung cancer cases over large patient admission data David Martinez, Lawrence Cavedon, Zaf Alam, Christopher Bain, Karin Verspoor Opportunities for Biomedical Informatics Increasing roll-out

More information

EARLY STAGE DIAGNOSIS OF LUNG CANCER USING CT-SCAN IMAGES BASED ON CELLULAR LEARNING AUTOMATE

EARLY STAGE DIAGNOSIS OF LUNG CANCER USING CT-SCAN IMAGES BASED ON CELLULAR LEARNING AUTOMATE EARLY STAGE DIAGNOSIS OF LUNG CANCER USING CT-SCAN IMAGES BASED ON CELLULAR LEARNING AUTOMATE SAKTHI NEELA.P.K Department of M.E (Medical electronics) Sengunthar College of engineering Namakkal, Tamilnadu,

More information

A NOVEL VARIABLE SELECTION METHOD BASED ON FREQUENT PATTERN TREE FOR REAL-TIME TRAFFIC ACCIDENT RISK PREDICTION

A NOVEL VARIABLE SELECTION METHOD BASED ON FREQUENT PATTERN TREE FOR REAL-TIME TRAFFIC ACCIDENT RISK PREDICTION OPT-i An International Conference on Engineering and Applied Sciences Optimization M. Papadrakakis, M.G. Karlaftis, N.D. Lagaros (eds.) Kos Island, Greece, 4-6 June 2014 A NOVEL VARIABLE SELECTION METHOD

More information

FREQUENTLY ASKED QUESTIONS MINIMAL DATA SET (MDS)

FREQUENTLY ASKED QUESTIONS MINIMAL DATA SET (MDS) FREQUENTLY ASKED QUESTIONS MINIMAL DATA SET (MDS) Date in parentheses is the date the question was added to the list or updated. Last update 6/25/05 DEFINITIONS 1. What counts as the first call? (6/24/05)

More information

QUESTIONNAIRE USER'S GUIDE

QUESTIONNAIRE USER'S GUIDE QUESTIONNAIRE USER'S GUIDE OUTBREAK INVESTIGATION Arthur L. Reingold's article "Outbreak Investigations--A Perspective" 1 provides an excellent synopsis of the procedures, challenges, and unique attributes

More information

Pancreatic Cancer Treatment

Pancreatic Cancer Treatment Scan for mobile link. Pancreatic Cancer Treatment What is pancreatic cancer? Pancreatic cancer begins in the pancreas, an organ located deep in the abdomen behind the stomach. The pancreas releases hormones

More information

Redirecting People with Complex Conditions to Effective Care

Redirecting People with Complex Conditions to Effective Care Redirecting People with Complex Conditions to Effective Care October 6, 2016 Johnson County Services Mental Health Center EMS Jail, Court, Probation County Services 127,000 575,000 people Proportion of

More information

Sum of Neurally Distinct Stimulus- and Task-Related Components.

Sum of Neurally Distinct Stimulus- and Task-Related Components. SUPPLEMENTARY MATERIAL for Cardoso et al. 22 The Neuroimaging Signal is a Linear Sum of Neurally Distinct Stimulus- and Task-Related Components. : Appendix: Homogeneous Linear ( Null ) and Modified Linear

More information

Introduction. Lecture 1. What is Statistics?

Introduction. Lecture 1. What is Statistics? Lecture 1 Introduction What is Statistics? Statistics is the science of collecting, organizing and interpreting data. The goal of statistics is to gain information and understanding from data. A statistic

More information

This is a licensed product of Ken Research and should not be copied

This is a licensed product of Ken Research and should not be copied 1 TABLE OF CONTENTS 1. The US Diabetes Care Devices Market Introduction 1.1. What is Diabetes and it s Types? 2. The US Diabetes Care Devices Market Size, 2007-2013 3. The US Diabetes Care Devices Market

More information

Gallstones Information Leaflet THE DIGESTIVE SYSTEM. Gutscharity.org.uk

Gallstones Information Leaflet THE DIGESTIVE SYSTEM.  Gutscharity.org.uk THE DIGESTIVE SYSTEM http://healthfavo.com/digestive-system-for-kids.html This factsheet is about gallstones Gall is an old-fashioned word for bile, a liquid made in the liver and stored in the gall bladder

More information

About Reading Scientific Studies

About Reading Scientific Studies About Reading Scientific Studies TABLE OF CONTENTS About Reading Scientific Studies... 1 Why are these skills important?... 1 Create a Checklist... 1 Introduction... 1 Abstract... 1 Background... 2 Methods...

More information