Big Data in Healthcare: motivation, current state and specific use cases Alejandro Rodríguez González Centro de Tecnología Biomédica Universidad Politécnica de Madrid <alejandro.rg@upm.es>
Who are we? MEDAL laboratory
Our expertise. What we do now: Data analytics: expertise in applying data mining to several domains, with special emphasis on the biomedical domain. Natural language processing: retrieving knowledge mainly from medical texts and Electronic Health Records, but also from other domains (for example, the legal domain). Image analysis: automatic analysis and annotation of images, mainly medical but also geospatial and from other fields.
Big Data in health domain: "It's far more important to know what person the disease has than what disease the person has." Hippocrates
Big Data Source: http://www.terem.com.au/blog/big-data-needs-small-data/
Big Data Source: http://bit.ly/2g7vmp8
Big Data Source: http://andressilvaa.tumblr.com/post/87206443764/big-data-refers-to-5vs-volume
Who generates Big Data? Scientific instruments (generate any kind of data); mobile devices (continuous device tracking); social media (we all generate data); sensor technology and networks (continuous measuring).
Big Data in Biomedicine. Two main objectives: genomic-driven data (next-generation sequencing, gene expression, ...); payer-provider data (electronic health records, drug prescription, insurance prescription, ...).
Big Data in Biomedicine Source: http://www.nature.com/scitable/topicpage/genomic-data-resourceschallenges-and-promises-743721#
Big Data in Biomedicine. The size of this data varies between roughly 200 GB and 4 TB for a single individual. With thousands of individuals, it reaches the order of petabytes.
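As a rough worked example of this scaling, the sketch below multiplies the per-individual range by a cohort size. The cohort of 1,000 individuals and the decimal units (1 PB = 1,000 TB) are assumptions made for illustration only.

```python
GB_PER_TB = 1_000  # decimal units, assumed for simplicity
TB_PER_PB = 1_000


def cohort_storage_pb(n_individuals, gb_per_individual):
    """Total storage in petabytes for a cohort of individuals."""
    total_gb = n_individuals * gb_per_individual
    return total_gb / GB_PER_TB / TB_PER_PB


# Hypothetical cohort of 1,000 individuals at both ends of the range.
low = cohort_storage_pb(1_000, 200)     # 200 GB per individual
high = cohort_storage_pb(1_000, 4_000)  # 4 TB per individual
print(f"1,000 individuals: {low} PB to {high} PB")
```

Under these assumptions, a cohort of a thousand individuals already spans a fraction of a petabyte to several petabytes.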
Motivation. In 2012, worldwide digital healthcare data was estimated at 500 petabytes, and it is expected to reach 25,000 petabytes by 2020. Can we learn from the past to become better in the future? Healthcare data is becoming more complex: several types of data, unstructured, structured, ...
Motivation. The problem: millions of reports, tasks, incidents, events, images, ...; availability; lack of protocols and structure; organization-oriented processes; from information to knowledge.
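To put the two estimates in perspective, the following sketch computes the compound annual growth rate they imply. It assumes smooth exponential growth between 2012 and 2020, which is a simplification and not a claim made by the source of the estimates.

```python
def implied_annual_growth(start, end, years):
    """Compound annual growth rate that takes `start` to `end` in `years`."""
    return (end / start) ** (1 / years) - 1


# 500 PB in 2012 and 25,000 PB in 2020, as quoted in the slide.
rate = implied_annual_growth(500, 25_000, 2020 - 2012)
print(f"Implied growth: {rate:.0%} per year")
```

A fifty-fold increase over eight years corresponds to growth of well over half the total volume every year.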
Examples. Clopidogrel (Plavix) is a drug that prevents blood clots that can cause heart attacks or strokes. There was concern that other drugs (proton pump inhibitors, which reduce stomach acid) could interfere with the activation of clopidogrel. Medco used its database to look for differences between two cohorts: patients taking only one of the drugs, and patients taking both, in whom the two could interact. The study found that patients taking both drugs had a 50% higher rate of the events that clopidogrel is meant to prevent.
Examples. Another, similar study showed that antidepressants blocked the effectiveness of tamoxifen, a drug used to prevent the recurrence of breast cancer, so that patients taking both drugs doubled their chances of the cancer recurring. In any case, these examples are hypothesis-driven. One of the goals of Big Data analysis is to be able to run studies without a given hypothesis.
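Cohort comparisons of this kind come down to comparing event rates between two groups. The sketch below computes a relative risk from event counts. The counts are invented for illustration, chosen so that the ratio equals 1.5 (a "50% higher" rate), and they are not data from either study.

```python
def relative_risk(events_exposed, n_exposed, events_unexposed, n_unexposed):
    """Ratio of the event rate in the exposed cohort to the unexposed cohort."""
    risk_exposed = events_exposed / n_exposed
    risk_unexposed = events_unexposed / n_unexposed
    return risk_exposed / risk_unexposed


# Hypothetical counts: patients taking both drugs vs. only one of them.
rr = relative_risk(events_exposed=150, n_exposed=1_000,
                   events_unexposed=100, n_unexposed=1_000)
print(f"Relative risk: {rr:.2f}")
```

A real analysis would also need confidence intervals and adjustment for confounders, which this sketch leaves out.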
More examples: http://www.journalofbigdata.com/content/1/1/2
Use case 1: predictive medicine. Provide the right intervention to the right patient at the right time. ACQUIRE, PROCESS, ANALYZE; UNDERSTAND; PREDICT.
What clinicians aim for: evidence-based medicine. Correlations and associations among symptoms, family history, habits, and diseases. Impact of certain biomedical factors (genome structure, clinical variables) on the evolution of certain diseases.
What clinicians aim for: evidence-based medicine. Automatic classification of images (prioritization of X-ray images to help diagnosis). Automatic annotation of images. Natural-language (Google-style) diagnostic aid tools.
What researchers aim for. Find early indicators of diseases. Design clinical trials. Search the literature automatically, using not only keywords but also analysis of the text of the papers. Use analytics services available on the web. Use cloud data and services to obtain knowledge from other hospitals, countries, ...
At what stage do we compute prediction? [Figure: expenses on health care across patient stages: healthy, low risk, at risk, high risk, early symptoms.] 20% of the population generates 80% of the costs. Analyze data so as to act as soon as possible: early detection, personalized evidence.
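The 20/80 statement can be checked on any cost dataset by ranking patients by cost and measuring the share generated by the top fraction. The sketch below does this; the cost figures are invented and deliberately constructed to reproduce the 20/80 split.

```python
def top_share(costs, fraction=0.2):
    """Share of total cost generated by the most expensive `fraction` of patients."""
    ranked = sorted(costs, reverse=True)
    k = max(1, round(len(ranked) * fraction))  # number of patients in the top group
    return sum(ranked[:k]) / sum(ranked)


# Hypothetical annual cost per patient (ten patients, arbitrary currency).
annual_costs = [4000, 250, 250, 4000, 250, 250, 250, 250, 250, 250]
share = top_share(annual_costs)
print(f"Top 20% of patients generate {share:.0%} of the costs")
```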
From data to knowledge. Steps: data acquisition, data processing, modelling, validation, apply.
1st step: Data acquisition
EHR, structured data:
- Lab tests (LOINC): many lab systems still use local dictionaries to encode labs; diverse numeric scales across labs; missing data.
- Clinical and demographic data (ICD): ICD stands for International Classification of Diseases. It is a hierarchical terminology of diseases, signs, symptoms, and procedure codes maintained by the World Health Organization (WHO). Pros: universally available. Cons: medium recall and medium precision for characterizing patients.
EHR, non-structured data:
- Images
- Clinical notes
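The lab-test issues listed above (local dictionaries, differing numeric scales, missing data) can be sketched as a small normalization step. Everything named here is hypothetical: the lab system names, the local codes, and the `normalize` helper are assumptions for illustration. The only external facts used are the LOINC code 2345-7 (glucose, mass/volume, in serum or plasma) and the glucose conversion factor derived from its molar mass.

```python
from dataclasses import dataclass

# Hypothetical local dictionaries: (lab system, local code) -> LOINC code.
LOCAL_TO_LOINC = {
    ("lab_a", "GLU"): "2345-7",
    ("lab_b", "GLUC-S"): "2345-7",
}

# Glucose-only unit conversion to mg/dL (molar mass of glucose is about 180.16 g/mol).
GLUCOSE_TO_MG_DL = {"mg/dL": 1.0, "mmol/L": 18.016}


@dataclass
class LabResult:
    loinc: str
    value_mg_dl: float


def normalize(lab, local_code, value, unit):
    """Map a local glucose record to LOINC and mg/dL; None if it cannot be mapped."""
    loinc = LOCAL_TO_LOINC.get((lab, local_code))
    factor = GLUCOSE_TO_MG_DL.get(unit)
    if loinc is None or factor is None or value is None:
        return None  # unknown code, unknown unit, or missing value
    return LabResult(loinc, round(value * factor, 1))


result = normalize("lab_b", "GLUC-S", 7.8, "mmol/L")
missing = normalize("lab_a", "GLU", None, "mg/dL")
print(result, missing)
```

Returning `None` for unmappable records keeps missing data explicit, so that later steps can decide whether to impute or discard it.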
Standards
- MeSH (Medical Subject Headings): a thesaurus for indexing articles for PubMed.
- UMLS (Unified Medical Language System): integrates key terminology among different coding standards.
- SNOMED CT: standard for clinical terminology.
- DICOM (Digital Imaging and Communications in Medicine): standard for storing, transmitting, and processing medical images.
- GS1 standards: used to uniquely identify medical products.
- LOINC (Logical Observation Identifiers Names and Codes): standard for identifying laboratory and clinical observations.
- RxNorm: standard that normalizes the names of pharmacy and drug products.
Text Processing: NLP
Some specific challenges: acronyms; entity recognition for diagnosis terms; numbers and metrics.
Acronym meaning. How to decide the meaning of an acronym: it is context-dependent; use UMLS; use machine learning to learn it.
Example of acronym disambiguation
Proposed solution. Components: clinical notes; learning algorithms; detection and expansion of acronyms (context-dependent); enriched clinical notes; feature extraction; database; models.
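The idea of context-dependent acronym expansion can be illustrated with a very small sketch. It uses plain cue-word overlap rather than the learning algorithms mentioned in the slide, and the sense inventory, cue words, and function names are all hypothetical.

```python
import re

# Hypothetical sense inventory: acronym -> {expansion: context cue words}.
SENSES = {
    "RA": {
        "rheumatoid arthritis": {"joint", "joints", "swelling", "methotrexate", "stiffness"},
        "right atrium": {"echocardiogram", "ventricle", "valve", "dilated", "cardiac"},
    },
}


def tokenize(text):
    """Lowercase bag of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def expand_acronym(acronym, note):
    """Pick the expansion whose cue words overlap most with the note."""
    words = tokenize(note)
    best, best_score = None, 0
    for expansion, cues in SENSES.get(acronym, {}).items():
        score = len(words & cues)
        if score > best_score:
            best, best_score = expansion, score
    return best  # None when no cue word matches


def enrich(note):
    """Append the chosen expansion after each known acronym in the note."""
    def repl(match):
        expansion = expand_acronym(match.group(0), note)
        return f"{match.group(0)} ({expansion})" if expansion else match.group(0)

    pattern = r"\b(" + "|".join(map(re.escape, SENSES)) + r")\b"
    return re.sub(pattern, repl, note)


cardio = enrich("Echocardiogram shows dilated RA and normal valve function.")
rheuma = enrich("Patient with RA reports joint stiffness; continue methotrexate.")
print(cardio)
print(rheuma)
```

A learned model would replace the hand-written cue sets, but the input and output (a note in, an enriched note out) would stay the same.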
Comparison of results [Chart: precision, recall, and F1 for Approaches 1 to 4; axis range 0.961 to 0.968.]
Diagnosis terms. The identification of diagnosis elements in medical texts is a crucial task, mainly used for the development of medical diagnosis systems. Other relevant uses can be found in the construction of human symptoms-disease networks, a challenging area of research where this information is very important.
cTAKES and MetaMap. Goal: compare the accuracy of both tools when extracting general medical terms, restricted to terms used in the diagnosis context. The experiment was performed using our framework, in which Apache cTAKES is used as the NER component. We manually analyzed the results and compared MetaMap with Apache cTAKES.
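A comparison of this kind is usually scored with precision, recall, and F1 against a manually reviewed gold standard. The sketch below shows that computation on term sets. The gold terms and the two tool outputs are invented examples and do not represent actual cTAKES or MetaMap results.

```python
def precision_recall_f1(extracted, gold):
    """Score a set of extracted terms against a gold-standard set."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical gold standard and tool outputs for one document.
gold = {"fever", "cough", "dyspnea", "chest pain"}
tool_a = {"fever", "cough", "dyspnea", "hospital"}  # one false positive, one miss
tool_b = {"fever", "cough"}                         # precise but incomplete

scores_a = precision_recall_f1(tool_a, gold)
scores_b = precision_recall_f1(tool_b, gold)
print("tool A:", scores_a)
print("tool B:", scores_b)
```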
Comparison of results
Numbers and metrics. Treatments: "ibuprofen 2 cp/d". Laboratory tests: "glucose 140mg". Blood pressure measures: "140/92 mmhg". Dates, absolute: "MRI on 14/02/2016". Dates, relative: "patient suffered from headache two days ago".
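Patterns such as blood pressure readings and dates can be approached with regular expressions as a first step. The sketch below is a minimal illustration on the slide's own examples. The patterns, the day/month/year assumption, and the small number-word table are assumptions, and real clinical text would need far broader coverage.

```python
import re
from datetime import date, timedelta

BP_RE = re.compile(r"\b(\d{2,3})\s*/\s*(\d{2,3})\s*mmhg\b", re.IGNORECASE)
ABS_DATE_RE = re.compile(r"(\d{1,2})/(\d{1,2})/(\d{4})")  # assumes day/month/year
REL_DATE_RE = re.compile(r"\b(\w+)\s+days?\s+ago\b", re.IGNORECASE)
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}


def blood_pressure(text):
    """Return (systolic, diastolic) or None."""
    m = BP_RE.search(text)
    return (int(m.group(1)), int(m.group(2))) if m else None


def absolute_date(text):
    """Return the first day/month/year date found, or None."""
    m = ABS_DATE_RE.search(text)
    if not m:
        return None
    day, month, year = map(int, m.groups())
    try:
        return date(year, month, day)
    except ValueError:  # e.g. 31/02/2016
        return None


def relative_date(text, note_date):
    """Resolve 'N days ago' against the date the note was written."""
    m = REL_DATE_RE.search(text)
    if not m:
        return None
    token = m.group(1).lower()
    days = int(token) if token.isdigit() else NUMBER_WORDS.get(token)
    return note_date - timedelta(days=days) if days is not None else None


bp = blood_pressure("Blood pressure measure 140/92 mmhg.")
mri = absolute_date("MRI on14/02/2016")
onset = relative_date("patient suffered from headache two days ago", date(2016, 2, 14))
print(bp, mri, onset)
```

Relative dates can only be resolved when the date of the note itself is known, which is why `relative_date` takes it as an argument.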
Use case 3: Data Analytics (the real insight) Source: https://akshaykher.wordpress.com/2015/08/18/how-to-start-a-career-inanalytics-for-free-3/
Data Science
Numbers and metrics [Figure: knowledge discovery process. Steps: selection, cleaning, codification, data mining, interpretation and evaluation. Intermediate products: data, objective data, processed data, transformed data, models, knowledge.]
Some interesting facts for data science [4]
Profile of respondents of the survey
Finding 1: There's Still a Shortage of Data Scientists (And It Might Be Getting Worse)
Finding 2: Data Scientists Love Their Jobs
How a Data Scientist Spends Their Day
Why That's a Problem. Simply put, data wrangling isn't fun. It takes forever. In fact, a few years back, the New York Times estimated that up to 80% of a data scientist's time is spent doing this sort of work.
Do Data Scientists Have What They Need?
The Top 10 In-Demand Data Science Skills. We looked at nearly 4,000 data science job postings on LinkedIn to find out what skills organizations wanted from their new hires. We ran those job postings through the CrowdFlower platform and had our contributors mark which skills showed up in which jobs.
What's Next for Data Science? To put it simply: machine learning.
Projects: NDMonitor. Integrated low-cost platform for monitoring and assisting patients with neurodegenerative diseases that affect mental capabilities. Track patients' movements at home. Monitor and analyse their behaviours and actions. Track via GPS when the patient leaves home. Allow therapists and families to see reports.
Projects: FACET (Frailty care and well function). Integration of human phenotypic data. Early detection of frailty. Focus on intervention. Prevent or delay disability. Estimated impact on the quality of life of 13.05 million people. Integration with already developed data lakes and services layer by GMV and BULL.
Projects: PAPHOS (Platform for advanced prescriptive health operational system). ICT to support both professionals and patients. Two use cases: medical imaging (cellular mitosis in lung cancer) and data analytics (apnoea improvement with CPAP). Fully functional platform that includes interoperability capabilities. Security layer to protect the data. Integration with already developed data lakes and services layer by GMV and BULL.
Use case 3: Large-scale HSDN. Analysis of disease networks from a large-scale perspective. Extraction of phenotypical knowledge from multiple sources (text mining/NLP). Identification and use of well-known biological databases. Creation of large disease networks. Mapping and analysis of common features for drug repurposing.
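One simple way to picture how a disease network is created from extracted phenotypic knowledge is to link diseases that share symptoms. The sketch below weights each link with the Jaccard similarity of the two symptom sets. Jaccard is chosen here only for simplicity and is not claimed to be the measure used in [3] or in this project, and the disease-symptom data is invented and heavily simplified.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_network(disease_symptoms, threshold=0.0):
    """Return {(disease_a, disease_b): similarity} for pairs above the threshold."""
    edges = {}
    for d1, d2 in combinations(sorted(disease_symptoms), 2):
        similarity = jaccard(disease_symptoms[d1], disease_symptoms[d2])
        if similarity > threshold:
            edges[(d1, d2)] = similarity
    return edges


# Hypothetical disease -> symptom sets, as might come out of text mining.
disease_symptoms = {
    "influenza": {"fever", "cough", "fatigue", "headache"},
    "pneumonia": {"fever", "cough", "dyspnea"},
    "migraine": {"headache", "nausea"},
}

network = build_network(disease_symptoms)
for (d1, d2), weight in network.items():
    print(f"{d1} -- {d2}: {weight:.2f}")
```

Raising `threshold` prunes weak links, which matters at large scale where most disease pairs share at least one generic symptom.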
Human disease complex networks. Source: [3]
Conclusions. Big Data is more than BIG data: it is not only about size. Extraction from unstructured data (such as text) accounts for a large part of Big Data in eHealth. Analytics is the point where everything converges.
Any questions?
References
1. Harkema et al. (2009). ConText: An algorithm for determining negation, experiencer, and temporal status from clinical reports. Journal of Biomedical Informatics. Vol. 42(5), pp. 839-851.
2. Goh et al. (2007). The human disease network. Proceedings of the National Academy of Sciences. Vol. 104(21), pp. 8685-8690.
3. Zhou et al. (2014). Human symptoms-disease network. Nature Communications. Vol. 5.
4. https://visit.crowdflower.com/2015-data-scientist-report