Using Internet data to learn in the health domain
Carla Teixeira Lopes - ctl@fe.up.pt
SSIM, MIEIC, 2016/17
Based on slides from Yom-Tov et al. (2015)
Agenda
- Internet data for health research
- Data sources
- Research works
Internet data
Internet data When should we use it for health research? Why is it useful? Any advantages over the data collected in the physical world?
Advantages of Internet data
- Easier to collect than in the physical world
- Larger sample
- More trustworthy than surveys
Easier to collect http://blog.okcupid.com/index.php/the-biggest-lies-in-online-dating/
Easier to collect (Pelleg et al., 2012)
Larger sample
Survey problem: in a survey, we depend on the quality of the answers.
Associations are hard to predict
Data sources
Data sources
- Web search
- General social media: Twitter, Facebook, Flickr
- Medical social media: ehealthme, PatientsLikeMe
- Medical Internet aggregators: HealthMap
- Actively collecting data: crowdsourcing, online advertisements, online surveys
- Other data: smartphone interaction, fitness monitors
Web search http://www.internetlivestats.com/google-search-statistics/
Health web search
- Searching for health information is the third most popular online activity (Fox, 2011)
- It is done by 72% of American Internet users (Fox and Duggan, 2013)
Health web search http://healthdecide.orcahealth.com/2012/12/10/how-health-consumers-use-the-web-infographic/
Obtaining a search log
- From a company
- Via crowdsourcing
- By using other datasets
General social media http://healthdecide.orcahealth.com/2012/12/10/how-health-consumers-use-the-web-infographic/
General social media
- Small-scale data is generally available (e.g., in collated datasets or through crawling)
http://anadouglas.com/which-social-media-platform-are-you-on/
Medical social media
- People gathering to discuss their specific predicament
- Examples: ehealthme, PatientsLikeMe
- Truthfulness is usually high
- Data availability can be a (legal) problem
Medical Internet aggregators: HealthMap
Crowdsourcing
Online advertisements
Online surveys To validate findings
Other data
- Smartphone interaction
- Fitness monitors
- Internet of Things (IoT)
http://healthdecide.orcahealth.com/2012/12/10/how-health-consumers-use-the-web-infographic/
Characteristics of data sources
- Truthfulness: are people providing real information?
- Anonymity and usefulness: what do people say on each? What do they feel comfortable discussing?
  - Personal interest (news, gossip) versus personal medical need
  - Real or imagined?
- Metadata: demographics, medical diagnosis, etc.
- Explicit vs. implicit creation: patient groups versus location data
- Accessibility for research
Summary (Yom-Tov et al., 2015)
Source | Truthfulness | Anonymity and usefulness | Metadata | Creation | Accessibility for research
Web search | High | High | Rare | Implicit | Within companies or via toolbars
General social media | Low | Low-medium | Available | Explicit | Through hoses or scraping
Medical social media | Medium-High | High | Common | Explicit | Usually via scraping
Medical Internet aggregators | High | Medium | -- | Explicit? | Easy
Smartphone interaction | High | Medium | None | Implicit | Very difficult
Actively collecting data | Variable | Medium | Available | Explicit | Make your own!
Research works
Postmarket drug safety surveillance via search queries
Why?
- Current postmarket drug surveillance mechanisms depend on patient reports
- Adverse reactions are hard to identify when they appear only after the drug has been taken for a long period
- Adverse reactions are hard to identify when several medications are taken at the same time
Therefore: could we complement this process by looking at search queries?
(Yom-Tov and Gabrilovich, 2013)
Postmarket drug safety surveillance via search queries
Data
- Queries submitted to the Yahoo search engine during 6 months in 2010
- 176 million unique users (search logs anonymized)
Drugs under investigation
- 20 top-selling drugs (in the US)
Symptoms lexicon
- 195 symptoms from the International Statistical Classification of Diseases and Related Health Problems (ICD, WHO)
- Filtered by Wikipedia (http://en.wikipedia.org/wiki/list_of_medical_symptoms )
- Expanded with synonyms acquired through an analysis of the most frequently returned web page when the symptom was used as the query
Aim
- Quantify the prevalence of adverse drug reactions (ADRs) for a given drug
(Yom-Tov and Gabrilovich, 2013)
Postmarket drug safety surveillance via search queries
- Ground truth: reports to repositories for safety surveillance of approved drugs, mapped to the same list of symptoms
- Score of a drug-symptom pair, with n_ij: how many times a symptom was searched
- Day 0: the first day the user searched for drug D; if the user has not searched for the drug, day 0 is the midpoint of their history
(Yom-Tov and Gabrilovich, 2013)
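The day 0 rule and the symptom counts can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the log format (a list of `(date, query_text)` tuples per user), the substring matching, and the names `day_zero` and `symptom_counts` are hypothetical and not taken from the paper, and the actual scoring formula is not reproduced here.

```python
from datetime import date


def day_zero(history, drug):
    """Return day 0 for one user.

    history: list of (date, query_text) tuples (hypothetical log format).
    drug: lowercase drug name, matched as a substring of the query text.
    """
    drug_days = [day for day, query in history if drug in query.lower()]
    if drug_days:
        # First day on which the user searched for the drug.
        return min(drug_days)
    # No drug query: fall back to the midpoint of the user's history.
    first = min(day for day, _ in history)
    last = max(day for day, _ in history)
    return first + (last - first) / 2


def symptom_counts(history, symptom, day0):
    """Count queries mentioning the symptom before day 0 and from day 0 on."""
    before = sum(1 for day, q in history if symptom in q.lower() and day < day0)
    after = sum(1 for day, q in history if symptom in q.lower() and day >= day0)
    return before, after


# Invented example data; "drugx" is a placeholder drug name.
example = [
    (date(2010, 1, 1), "headache"),
    (date(2010, 1, 5), "drugx dosage"),
    (date(2010, 1, 7), "drugx headache"),
]
d0 = day_zero(example, "drugx")
print(d0, symptom_counts(example, "headache", d0))
```

Comparing the counts before and after day 0 is what lets symptom searches be related to the start of drug use; how those counts are turned into a score is left to the paper.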
Postmarket drug safety surveillance via search queries
- Comparison of drug-symptom scores based on query logs with the ground truth
- Which symptoms reduce this correlation the most? (most discordant ADRs)
- Goal: discover previously unknown ADRs that patients do not tend to report
(Yom-Tov and Gabrilovich, 2013)
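One way to look for the most discordant symptoms is a leave-one-out check: drop each symptom in turn and see how much the correlation between the two score lists improves. The sketch below assumes Pearson correlation and plain dictionaries of per-symptom scores for a single drug. Both are illustrative choices, not necessarily what the authors used, and the numbers are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length lists (non-constant values)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def rank_discordant(log_scores, truth_scores):
    """Rank symptoms by how much removing them raises the correlation.

    Both arguments map symptom name -> score for one drug
    (hypothetical data layout). Needs at least four symptoms so that
    each leave-one-out subset still has three points.
    """
    symptoms = list(log_scores)
    base = pearson([log_scores[s] for s in symptoms],
                   [truth_scores[s] for s in symptoms])
    gains = {}
    for s in symptoms:
        rest = [t for t in symptoms if t != s]
        r = pearson([log_scores[t] for t in rest],
                    [truth_scores[t] for t in rest])
        gains[s] = r - base
    # Largest gain first: the symptom that hurts the agreement the most.
    return sorted(gains, key=gains.get, reverse=True)


# Invented scores: "fatigue" is searched a lot but rarely reported.
logs = {"nausea": 1, "headache": 2, "rash": 3, "fatigue": 10}
truth = {"nausea": 1, "headache": 2, "rash": 3, "fatigue": 0}
print(rank_discordant(logs, truth))
```

A symptom ranked first here is searched for far more (or less) than the reports suggest, which makes it a candidate for a reaction that patients do not tend to report.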
Predicting depression via social media Mental illness leading cause of disability worldwide 300 million people suffer from depression (WHO, 2001) Services for identifying and treating mental illnesses: NOT adequate Can content from social media (Twitter) assist? Focus on Major Depressive Disorder (MDD) low mood low self-esteem loss of interest or pleasure in normally enjoyable activities (De Choudhury et al., 2013)
Predicting depression via social media
Data set formation
- Crowdsourced depression survey in which participants share their Twitter username
- Depression score determined via a formalized questionnaire (Center for Epidemiologic Studies Depression Scale, CES-D): from 0 (no symptoms) to 60
- 476 people diagnosed with depression, with onset between September 2011 and June 2012, agreed to have their public Twitter profile monitored
- 36% with CES-D > 22 (definite depression)
Twitter feed collection
- ~2.1 million tweets
- Depression-positive users: from onset and one year back
- Depression-negative users: from survey date and one year back
(De Choudhury et al., 2013)
Predicting depression via social media
Examples of feature categories (overall 47)
- Engagement: daily volume of tweets, proportion of @reply posts, retweets, links, question-centric posts, normalized difference between night and day posts (insomnia index)
- Social network properties (ego-centric): followers, followees, reciprocity (average number of replies of U to V divided by number of replies from V to U), graph density (edges / nodes in a user's ego-centric graph)
- Linguistic Inquiry and Word Count (LIWC - http://www.liwc.net)
  - Features for emotion: positive/negative affect, activation, dominance
  - Features for linguistic style: functional words, negation, adverbs, certainty
- Depression lexicon
  - Mental health in Yahoo! Answers
  - Pointwise-Mutual-Information + Likelihood-ratio between depress* and all other tokens (top 1%)
  - TF-IDF of these terms in Wikipedia to remove very frequent terms: 1,000 depression words
- Anti-depression language: lexicon of antidepressant drug names
(De Choudhury et al., 2013)
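A few of these features are simple enough to sketch directly. The functions below are a minimal illustration, not the authors' implementation: the night window (21:00 to 06:00), the exact normalization of the insomnia index, and the function names are assumptions. The graph density follows the edges / nodes definition given on the slide, and the PMI is the standard base-2 definition from corpus counts.

```python
import math


def insomnia_index(post_hours, night_start=21, night_end=6):
    """Normalized difference between night and day posts.

    post_hours: hour of day (0-23) of each post.
    The 21:00-06:00 night window is an assumption for this sketch.
    """
    night = sum(1 for h in post_hours if h >= night_start or h < night_end)
    day = len(post_hours) - night
    total = night + day
    return (night - day) / total if total else 0.0


def graph_density(n_edges, n_nodes):
    """Edges divided by nodes in a user's ego-centric graph."""
    return n_edges / n_nodes if n_nodes else 0.0


def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information (base 2) from co-occurrence counts."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))


# Invented example values.
print(insomnia_index([23, 1, 2, 14]))
print(graph_density(6, 4))
print(pmi(10, 20, 50, 1000))
```

In the lexicon step, a token with a high PMI against depress* co-occurs with it more often than chance would predict, which is why the top-scoring tokens are kept as candidate depression words.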
Predicting depression via social media
Depressive patterns
- Decrease in user engagement (volume and replies)
- Higher Negative Affect (NA)
- Low activation (loneliness, exhaustion, lack of energy, sleep deprivation)
[Figure legend: depression class vs. non-depression class]
(De Choudhury et al., 2013)
Predicting depression via social media
Depressive patterns
- Increased presence of 1st person pronouns
- Decreased use of 3rd person pronouns
- Higher use of depression terms (examples: anxiety, withdrawal, fun, play, helped, medication, side-effects, home, woman)
[Figure legend: depression class vs. non-depression class]
(De Choudhury et al., 2013)
Other works using social media
- Twitter: HIV detection; modeling influenza rates; modeling health topics; modeling disease spread
- Flickr: pro-anorexia and pro-recovery
- Google Flu Trends: forecasting influenza
- Wikipedia: nowcasting and forecasting diseases
Does Sustained Participation in an Online Health Community Affect Sentiment?
- Large breast cancer community
- Impact of different factors on post sentiment: time since joining the community, posting activity, age, cancer stage
(Zhang et al., 2014)
Does Sustained Participation in an Online Health Community Affect Sentiment?
Dataset: breastcancer.org
- 291,528 posts in 31,034 threads published by 12,819 community members between May 2004 and September 2010
- Metadata including user profiles were also extracted
Automated sentiment analysis
- Built a classifier
- 1,000 posts were manually annotated (positive or negative)
(Zhang et al., 2014)
Does Sustained Participation in an Online Health Community Affect Sentiment?
- For each post, a sentiment score (probability of the post being positive) was calculated
- Significant increase in the sentiment of posts through time
- Different patterns for initial posts and reply posts
- Factors play a role
(Zhang et al., 2014)
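To see how sentiment evolves through time, the per-post scores can be averaged by time since the author joined. The sketch below assumes each post is already reduced to a `(months_since_joining, p_positive)` pair. That layout, the monthly granularity, and the function name are hypothetical, and the classifier that produces the probabilities is not shown.

```python
from collections import defaultdict


def mean_sentiment_by_month(posts):
    """Average the positive-sentiment probability per month since joining.

    posts: iterable of (months_since_joining, p_positive) pairs
    (hypothetical layout; p_positive comes from a sentiment classifier).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for month, p_positive in posts:
        sums[month] += p_positive
        counts[month] += 1
    return {month: sums[month] / counts[month] for month in sorted(sums)}


# Invented scores for illustration only.
example_posts = [(0, 0.4), (0, 0.6), (1, 0.7), (2, 0.8), (2, 0.9)]
print(mean_sentiment_by_month(example_posts))
```

The same grouping can be run separately for initial posts and reply posts to compare their patterns.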
A global compendium of human dengue virus occurrence
- Database comprising occurrence data linked to point or polygon locations
- Goal: generate a global risk map and associated burden estimates
- Data collection
  - Search for "dengue" in PubMed, ISI Web of Science and ProMED
  - Publications between 1960 and 2012
  - Data from HealthMap
(Messina et al., 2014)
A global compendium of human dengue virus occurrence
Geo-positioning of the data
- Location extracted from the articles
- Latitude and longitude coordinates determined using Google Maps
(Messina et al., 2014)
A global compendium of human dengue virus occurrence (Messina et al., 2014)
Tracking Flu-Related Searches on the Web for Syndromic Surveillance
- Campaign using a keyword-triggered sponsored link in Google Adsense, for Canadian searchers
- Keywords: flu or flu symptoms
- Number of impressions roughly proportional to the number of searches containing the keywords
- Daily statistics on impressions and clicks aggregated to match the time periods of the FluWatch reports
(Eysenbach, 2006)
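Aggregating daily statistics to the reporting periods amounts to summing the daily counts that fall inside each period. A minimal sketch follows, assuming the daily counts are held in a `{date: count}` dictionary and the reporting periods are given as inclusive `(start, end)` date pairs. Both the layout and the numbers are invented for illustration.

```python
from datetime import date


def aggregate_by_period(daily, periods):
    """Sum daily counts (clicks or impressions) over each reporting period.

    daily: {date: count} (hypothetical layout).
    periods: list of (start, end) dates, both inclusive.
    Returns one total per period, in the same order as `periods`.
    """
    return [
        sum(count for day, count in daily.items() if start <= day <= end)
        for start, end in periods
    ]


# Invented click counts and two one-week reporting periods.
clicks = {date(2005, 1, 2): 5, date(2005, 1, 3): 7, date(2005, 1, 9): 4}
weeks = [
    (date(2005, 1, 2), date(2005, 1, 8)),
    (date(2005, 1, 9), date(2005, 1, 15)),
]
print(aggregate_by_period(clicks, weeks))
```

Once both series share the same periods, the per-period totals can be compared directly with the counts in the surveillance reports.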
Tracking Flu-Related Searches on the Web for Syndromic Surveillance (Eysenbach, 2006)
Measuring the impact of epidemic alerts on human mobility using cell-phone network data
- Measure the impact of the alerts issued by the Mexican government during the H1N1 flu outbreak in 2009
- Mobility characterized using anonymized Call Detail Record (CDR) traces
(Frias-Martinez et al., 2012)
Measuring the impact of epidemic alerts on human mobility using cell-phone network data (Frias-Martinez et al., 2012)
How the Napa earthquake affected Bay Area sleepers https://jawbone.com/blog/napa-earthquake-effect-on-sleep/
Topics for SSIM
Topics for SSIM
- The use of Wikipedia for automatic translation in the health domain: using a set of Portuguese health queries, the goal of this work is to evaluate whether, and how well, Wikipedia can be used to automatically translate Portuguese medical expressions into English. A further goal is to compare the Wikipedia-based approaches with other well-established approaches.
Topics for SSIM
- Assessing and comparing the readability of online topics: using a set of search queries previously classified into topics, the goal of this work is to analyze and compare the readability of the initial documents retrieved with those queries.
- Evaluation of query expansion approaches using the CLEF eHealth 2016 test collection: the goal of this work is to evaluate the query expansion approaches proposed in a previous work using a newly formed test collection. The evaluation should focus on the relevance, understandability and credibility of the obtained results.
Topics for SSIM
- The use of Data Mining to understand behaviour dynamics in online health forums (state of the art): do a survey and write a scientific article on the use of Data Mining to understand behaviour dynamics in online health forums.
- Automatic text simplification in the health domain (state of the art): do a survey and write a scientific article on current techniques for automatic text simplification in the health domain.
References
- Dan Pelleg, Elad Yom-Tov, Yoelle Maarek (2012). Can you believe an anonymous contributor? On truthfulness in Yahoo! Answers.
- Elad Yom-Tov, Evgeniy Gabrilovich (2013). Postmarket Drug Surveillance Without Trial Costs: Discovery of Adverse Drug Reactions Through Large-Scale Analysis of Web Search Queries.
- Elad Yom-Tov, Ingemar Cox, Vasileios Lampos (2015). Learning about health and medicine from Internet data.
- Gunther Eysenbach (2006). Tracking flu-related searches on the Web for syndromic surveillance.
- Jane P Messina, Oliver J Brady, David M Pigott, John S Brownstein, Anne G Hoen, Simon I Hay (2014). A global compendium of human dengue virus occurrence.
- Munmun De Choudhury, Michael Gamon, Scott Counts, Eric Horvitz (2013). Predicting depression via social media.
- Shaodian Zhang, Erin Bantum, Jason Owen, Noémie Elhadad (2014). Does Sustained Participation in an Online Health Community Affect Sentiment?
- Susannah Fox (2011). Health Topics. Pew Internet Project.
- Susannah Fox, Maeve Duggan (2013). Health Online 2013. Pew Internet Project.
- Vanessa Frias-Martinez, Alberto Rubio, Enrique Frias-Martinez (2012). Measuring the impact of epidemic alerts on human mobility using cell-phone network data.