Seminar Talk at the Chadwick Building 102, University College London (UCL), London, Date: July 8 (Fri), 3pm. 2016. Big Data and Human Dynamics: A Human Centered Approach to Analyze Human Activities and Movements with Social Media and GIS data Prof. Ming-Hsiang Tsou Twitter @mingtsou mtsou@mail.sdsu.edu, Director of the Center for Human Dynamics in the Mobile Age Professor, Department of Geography, San Diego State University
The Center for Human Dynamics in the Mobile Age at San Diego State University Advancing Transdisciplinary Research on Big Data, Human Dynamics, and the Social Web http://humandynamics.sdsu.edu/
HDMA Team Members Dr. Ming-Hsiang Tsou (Director) Dr. Heather Corliss Dr. John Elder Dr. Lourdes Martinez Jiue-An (Jay) Yang Jessica Dozier Dr. Sheldon Zhang Dr. Atsushi Nara Su Yeon Han Dr. Brian Spitzberg Dr. Jay Lee Alejandra Coronado Dr. Jean Mark Gawron Dr. Xuan Shi Dr. Piotr Jankowski Dr. Eric Buhi Chris Allen Hao Zhang Joey Lee Rich Zhang Stephanie Nowinski Jared Jashinsky Dr. Michael Peddecord NSF Projects Dr. Xinyue Ye Dr. Bruce Appleyard HDMA Center Dr. Joseph Gibbons Elias Issa Graduate Students
HDMA Center has been hosting two large NSF projects (CDI and IBSS): 1. Mapping Ideas from Cyberspace to Realspace. Funded by NSF Cyber-Enabled Discovery and Innovation (CDI) program. Award # 1028177. $1.3 million (2010-2015) http://mappingideas.sdsu.edu/ 2. Spatiotemporal Modeling of Human Dynamics Across Social Media and Social Networks. Funded by Interdisciplinary Behavioral and Social Science Research (IBSS) program. Award#: 1416509. $1 million (2014-2019). Http://socialmedia.sdsu.edu Principal Investigator: Dr. Ming-Hsiang Tsou mtsou@mail.sdsu.edu, (Geography), Co-PIs: Dr. Dipak K Gupta (Political Science), Dr. Jean Marc Gawron (Linguistic), Dr. Brian Spitzberg (Communication), Dr. Li An (Geography). Dr. Jay Lee (Kent State, Geography), Dr. Ruoming Jin (Kent State, Computer Science), Dr. Xinyue Ye (Kent State, Geography, Dr. Heather Corliss (Public Health, SDSU), Dr. Xuan Shi (Geoscience, U of Arkansas). San Diego State University, Kent State University, University of Arkansas, USA.
What is Human Dynamics? Smart Phones - 2007 (The Mobile Age) The most important scientific instrument in the 21 st Century. (2014 Sales: 1.2 billion units) Human Dynamics -- is a transdisciplinary research field focusing on the understanding of dynamic patterns, relationships, narratives, changes, and transitions of human activities, behaviors, and communications.
Animated Image created by the HDMA Center (Hao Zhang).
Big Data is Human-Centered Data Big Data is a large dynamic dataset created by or derived from human activities, communications, movements, and behaviors. (Tsou, 2015). The term, Big Data, refers to big ideas, big impacts, and big changes for our society in addition to a big volume of datasets. Tsou, M. H. (2015). Research challenges and opportunities in mapping social media and Big Data. Cartography and Geographic Information Science, 42:sup1, 70-74. doi: 10.1080/15230406.2015.1059251. http://www.tandfonline.com/doi/full/10.1080/15230406.2015.1059251#.vecvyplvhbd
The Challenge of Big Data Analytics: Big Data are very Messy, Noisy, and Unstructured! Image Source: http://www.contentverse.com/office-pains/10-messy-desks-successful-people/ Require collaboration efforts from linguistics, geographers (GIS experts), computer scientists, data mining experts, statisticians, physicists, modelers, and domain experts. Human Dynamic in the Mobile Age (HDMA)
Geography (place and time) is the KEY for Understanding and Integrating Big Data Big Data (information) Time Place (Tsou and Leitner, 2013) KDC (Knowledge Discovery in Cyberspace) framework Tsou, M. H. and Leitner, M. (2013). Editorial: Visualization of Social Media: Seeing a Mirage or a Message? In Special Content Issue: "Mapping Cyberspace and Social Media". Cartography and Geographic Information Science. 40(2), pp. 55-60. DOI: 10.1080/15230406.2013.776754
Data Integration / Data Fusion Explore their spatiotemporal relationships in both network space (cyberspace) and geographical space (real world). Health or Disaster Data Layer Image provided by Dr. Atsushi Nara (Associate Director of HDMA Center).
Big Data Category (Tsou, 2015). Social life data: social media services (Twitter, Flickr, Snapchat, YouTube, Foursquare, etc.), online forums, online video games, and web blogs. Health data: electronic medical records (EMR) from hospitals and health centers, cancer registry data, disease outbreak tracking and epidemiology data. Business and commercial data: credit card transactions, online business reviews (such as Yelp and Amazon reviews), supermarket membership records, shopping mall transaction records, credit card fraud examination data, enterprise management data, and marketing analysis data. Transportation and human traffic data: GPS tracks (from taxi, buses, Uber, bike sharing programs, and mobile phones), traffic censor data (from subways, trolleys, buses, bike lanes, highways), and mobile phone data (from data transmission records and cellular network data). Scientific research data include earthquakes sensors, weather sensors, satellite images, crowd sourcing data for biodiversity research (inaturalist), volunteered geographic information, and census data. Geography (place and time) is the KEY for understanding Big Data!
Research Showcase #1: Geo-Targeted Social Media (Twitter) Analytics for Tracking Flu Outbreaks in U.S.
Data Filtering, Mining, and Visualization Application Programming Interfaces (API) Geo-Targeting Data Collection (Twitter APIs) Filter Machine Learning Trend Analysis Spatial Analysis Analysis SMART Dashboard Visualization
What we can get from Twitter data? Where to find geospatial information? Example: Use Twitter Search API to search for keyword HIV test or HIV testing Only 1% - 7% of Tweets have X, Y GEO-coordinates (from GPS or Geo-tagged devices). But 50% - 60% Tweets have city-level locations provided by their User Profiles. 80% Tweets have Time Zone (limited spatial meaning)
The HDMA Center has built our own Internal GeoCoder Engine for User Location Profile: using GeoNames.org gazetteers (Creative Commons Data).+ User defined rules. Enable Flexible or Self-defined Geo-Target Boundaries (California, Santa Barbara, Los Angeles, San Diego bounding boxes, or State boundaries) Geocoding Engine for Social Media
SMART Dashboard Social Media Analytic and Research Testbed http://vision.sdsu.edu/hdma/smart/ YouTube Video for 3 Mins Real-time social media analytics (Trend Analysis, Word Clouds, Top URL, web pages, Top Hashtags/Mentions/Stories).
Collect Tweets from Top 31 U.S. Cities (17 miles radius) 31 different cities across the United States (chosen based on their population sizes): Atlanta, Austin, Baltimore, Boston, Chicago, Cleveland, Columbus, Dallas, Denver, Detroit, El Paso, Fort Worth, Houston, Indianapolis, Jacksonville, Los Angeles, Memphis, Milwaukee, Nashville-Davidson, New Orleans, New York, Oklahoma City, Philadelphia, Phoenix, Portland, San Antonio, San Diego, San Francisco, San Jose, Seattle, and Washington, D.C. Human Dynamic in the Mobile Age (HDMA)
Filter and Refine Big Data (Remove Noises) Number of tweets 10,678 5,398 4,947 4,944 3279 Machine Learning Total Flu tweets collected: 307,070. Final valid flu tweets: 88,979.
Real-Time Monitoring of Flu Outbreaks in U.S. (National Scale combined 31 Cities), 2013 2014 flu season RED Line: National ILI data (Influenza-like illness) (provided by CDC) Purple Line: Weekly Tweeting Rate (two weeks earlier than CDC data) ILI: Influenza-like Illness (R) value = 0.8494
Trend Analysis at the Municipal Scale (San Diego) with the Lab-tested confirmed flu cases San Diego: Lab confirmed Flu Cases vs Tweeting Rate: (R) value = 0.9331
Two research papers in the Journal of Medical Internet Research 2013 2014 Human Dynamic in the Mobile Age (HDMA)
Tracking Flu Outbreaks in 2014/2015 Flu Season # of Filtered ILI Tweets, Top 30 US Cities, as of February 9, 2015 (from SMART dashboard) Only 1% -4% tweets has Geo-tagged coordinates. Problems!!! Twitter broke its Search APIs on 11/20/2014 and only returned Geo-tagged tweets only. (Reduce 90% -95% of tweets collected) CDC Influenza Positive Tests, National Data Summary, through Weeks 40-3, 2014-2015 Season
2014-2015 Comparison between ILI and Geo-tagged-only Tweets (4%) among 30 U.S. Cities R= 0.90559 Human Dynamic in the Mobile Age (HDMA)
2016 Flu Tweets vs CDC ILI data R= 0.5566 The comparison between National ILI Rate and the 32 Cities Tweeting Rate, with prediction up to Week 15. Red National ILI, Purple Tweet Rate for 2015-2016.
How to Build a Flu Prediction Model? Daily Patterns or Weekly Patterns? Time Scale: Daily, Weekly, Monthly. 2015 Flu Tweets (Daily Pattern) 2016 Flu Tweets (Daily Pattern)
SMART will be OPEN SOURCE soon! Available in GitHub Soon. But need to install both the SMART and the Geocoding Engine (MongoDB and GeoName.ORG gazetteer databases). New Journal Paper: http://bds.sagepub.com/content/3/1/2053951716652914 SMART Dashboard Won the BEST METHOD PAPER Award in the 2015 International Conference of Social Media and Society, Toronto, Canada. http://dl.acm.org/citation.cfm?doid=2789187.2789196
SMART Dashboard Client/Server System Design Framework Next Step: Open Source Initiative (GPL license copy-left)
Research Showcase #2: Real-time Situation Awareness Viewer for Monitoring Disaster Impacts Using Location-Based Social Media Messages (Twitter). CBS8 News Video (4 mins) San Diego County: Office of Emergency Services (OES)
Geo-targeted Event Observation (GEO) Viewer 1.0 http://vision.sdsu.edu/hdma/wildfire/ Provide dynamic map display about event ground truth observation -- linking GPS locations, texts, photos, and time. (Geo-tagged Tweets) Monitor Disaster impacts, Recovery Activities and Victims Needs
Spatial Clustering (Wildfire Tweets) Spatial clusters (hotspots) of tweets are nearby the actual locations of wildfires (events)
GeoViewer Tool v.2.2 (Video demo) EC2: http://vision.sdsu.edu/ec2/geoviewer/sandiego (Live) Real Time (Streaming API), Hot Spots: Kernel Density Estimation (KDE) method, Auto-Three-Keywords labels with Cluster Maps.
Monitoring Emergency Responses and Rescue Efforts How to find out critical information from thousands of tweets? Nepal Earthquake Example: (keyword search: trap ) One Possible Solution: Manual labeling (first 1000 tweets by volunteers) + Machine Learning Classification (built-in). Human Dynamic in the Mobile Age (HDMA)
Digital Volunteers may help us identify and select important Tweets (for machine learning) during and after the disaster events. Need Some programming and design help from OES, RedCross, and 211: 1. How to combine multiple volunteers Inputs and Integration Systems (ranking system). 2. Which category and color schemes/labels should we use for each types of disasters (flooding, wildfires, earthquake, hurricanes). 3. Which tags might be useful? 4. Who are the target users? What kinds of Output system should we create? (for OES staff? For RedCross staff?) 5. Other suggestions? Human Dynamic in the Mobile Age (HDMA)
Donald Trump visited San Diego on May 27, 2016. (GeoViewer search for Trump from 5/27-5/28)
Spatial Analysis Geo-tagged Tweets (in GeoViewer) with keyword Drunk in San Diego
Big Data Fusion (Integration) Comparing Spatial Cluster of DUI Records (Red dots, Left side) and Tweets with Drunk keyword (Right side). GIS Map with DUI Records GeoViewer (Search drunk for two months) Similar Spatial Pattern? (Dynamic monitoring in REAL TIME?)
Errors and Noises in the Geo-tagged Tweets Detect robot tweets or advertisement tweets (noises) in geotagged tweets by examining the source metadata field. The portion of data noises is significant (29.42%) in our case study. The number of Tweets produced by different platforms inside San Diego Bounding Box during the month of November, 2015.
2014 San Diego Wildfire Tweets Social Network Analysis (SNA) @10News @ReadySanDiego @SanDiegoCounty @KPBSnews @UTsandiego Identify the network influence for each individuals (who are the opinion leaders?) Predicting the Spreads (Speed, Scale, and Range) of Social Media Messages in Different Social Networks. (following, retweets, and mentions relationships)
Hyperlocal Relationship From Online Connections to Offline Locations Master Thesis by Jessica Dozier, 2016
2015 Nepal Earthquake Master Thesis by Jessica Dozier, 2016
[ReadySD Social] Mobile App Development Recruit 1000 local volunteers for Broadcasting Disaster Information
Use Ionic platform for both Android and ios Apps. https://ionic.io/
Users will receive Push Notifications. Then they can decide to RETWEET the message or not.
Gamification + Peer Pressure
Download from the Android Play Store Search for readysd
ios: App Store It will take a few weeks to apply for being an App in App Store. Current side-loading method: TestFlight. (sending invitation installation email to test users)
The Limitations and Challenges of Social Media and Big Data Research
Social Media User Profiles Social Media messages can NOT represent all population, but it can provide warning signals and real-time updates. 2014 Survey (Business Insider) Twitter Users are Young (60% are between 16 34 years old). More Urban residents than rural Higher adoption% in African Americans Many Journalists and Mass Media staff. 20% are not real human beings (robots): many advertisement and marketing activities. Using Different Keywords can get different demographic groups: #Healthcare: include more senior people (Very few teenagers will tweet about healthcare ). (We need more background study). Keywords could be used as a sampling tool for social media users.
Who are the Users? Humans or robots (bots)? Use SMART dashboard to track E-cigarette topics High Peak on Feb 11, 2016 (Why?)
1,553 Twitter Accounts Said the Exact Sentence! In One Day (2/11/2016), From to 11114 9561 = 1553 (Mummy or Ghost Twitter Accounts?) for Advertisement?
Who are they? How they post the messages? Are They Mummies and Ghosts (Zombie)?
User Privacy Issue Concerns about Big Brother. Although all the tweets collected from APIs are public tweets (everyone can search them and retrieve them). Some content of tweets may contain personal private information (real names, locations of homes, offices, private conversations, medical situations, etc.) * HDMA center conceals tweet locations by randomly selecting a coordinate in a 100m radius of the original location to protect Twitter users' privacy.
Ethical and Sensitive Issues SMART Dashboard for Marijuana Legalization Issues http://vision.sdsu.edu/hdma/smart/marijuana Public Available Data (Twitter Data) may include many sensitive personal information. How to deal with these information? Orignal Big Data (not sensitive) Process + Filter + Noise Reducing Sensitive Data
Final Remark: Big Data = Transdisciplinary Geospatial Technology is essential for Big Data Science. We will transform Science and Technology in the age of Big Data -- from isolated instruments (disciplines) into an epic orchestra (collaboration). Image source: wikipedia.org Human Dynamic in the Mobile Age (HDMA)
http://humandynamics.sdsu.edu/ Thank You Q & A Director: Dr. Ming-Hsiang (Ming) Tsou mtsou@mail.sdsu.edu Twitter @mingtsou Funded by NSF Cyber-Enabled Discovery and Innovation (CDI) program. Award # 1028177. (2010-2015) http://mappingideas.sdsu.edu/ NSF Interdisciplinary Behavioral and Social Science (IBSS) Program, Award #1416509 (2014-2018): Spatiotemporal Modeling of Human Dynamics Across Social Media and Social Networks. http://socialmedia.sdsu.edu/ Human Dynamic in the Mobile Age (HDMA)
Why Choose Twitter? 80% academic researchers are using Twitter APIs to get their social media data. 1. Free and Open Access Data from APIs (you can write a program in your desktop to download Twitter data (tweets) automatically). But the free APIs has the 1% data limit. 2. Large User Base (+500 million users) and very popular in U.S., Europe, and Japan. But not in China, Taiwan, and Korea (China has a similar platform called Weibo ). 3. Easy to program in Python or PHP (Tweepy, TwitterSearch, etc.). Many available API libraries to use now. 4. Historical data and 100% data can be purchased from Twitter (but very expensive). 5. Rich [Metadata] tags in each tweet (time stamp, user, follower, platform, time zone, text, URL, Retweet, language, devices). Other possible social media APIs: Flickr, Instagram, Foursquare, Yelp, YouTube. Why not Facebook? (Facebook Graph APIs are VERY LIMITED and PROTECTIVE. No Public data feed). You need to have internal connections to Facebook staff to conduct research.
Next Step: Syndromic Surveillance (Underdevelopment) (tracking multiple Symptoms: fever, cold, cough, vomiting, etc. ) http://vision.sdsu.edu/hdma/smart/syndromic Designed for Early Detection of unknown disease outbreaks, such as Swine Flu and SARS
Other Examples Building a transformative research agenda for Big Data Science Examples of Research Topics: 1. Use Social Media as Human Sensors to Monitor Our Environment (Air Pollution, Water Quality, Temperature) and Social Movement (Depression, Social Problems, Election, etc.). 2. Visualizing dynamic spatiotemporal patterns of geographic awareness 3. Mapping Dynamic Urban Land Use Patterns using Points of Interests (POI) and Social Media. 4. Building a Comprehensive Food Environment Databases (using Yelp, Google Places, and Foursquare Check-Ins). 5. Building a Dynamic Urban Population Model (Twitter, Instagram, and Flickr).
Use Social Media as Human Sensors to Monitor Air Pollution in China. http://journals.plos.org/plosone/article?id=10. 1371/journal.pone.0141185 Chinese Twitter: Sina Weibo 微博 The worst month of air quality (April, 2012) generated the highest correlation coefficient between the filtered social media messages and AQI.
Mapping Dynamic Urban Land Use Patterns (using Temporal Pattern Cluster Analysis K-means) Mapping Dynamic Urban Land Use Patterns with Crowdsourced Geo-tagged Social Media (Sina-Weibo) and Commercial Points of Interest (POI) Collections in Beijing, China (submitted to the journal of CaGIS, under review now). Hourly Temporal Patterns (24 hours) Residential Areas + College Dormitory
Building a Comprehensive Food Environment Databases Yelp, Foursquare, Flickr, and Google Places.
Building a Dynamic Urban Population Model: Building a hourly-based population density model in Urban area (for disaster responses and management) (Su Han, Ph.D. student in the HDMA Center).