Session 35: Text Analytics: You Need More than NLP Eric Just Senior Vice President Health Catalyst
Learning Objectives
  Why text search is an important part of clinical text analytics
  The fundamentals of how search works
  How clinical text search can be refined with natural language processing (NLP) and other techniques
Poll Question #1
For my organization, I see text analytics as:
  a) Completely unnecessary for analytics
  b) A nice to have for analytics
  c) Very important for a few key areas of analytics
  d) Mission critical across nearly all areas of analytics
  e) Unsure or not applicable
High-Risk Population: Peripheral Arterial Disease (PAD)
  PAD affects over 3 million patients per year
  Narrowed arteries reduce blood flow to limbs
  Patients with PAD are considered high risk
  For organizations trying to understand their risk, not being able to find high-risk patients is a problem.
  Figure: patients found by Natural Language Processing (NLP), N=41,741; patients found by ICD/CPT codes, N=9,592. Terms shown: peripheral artery disease, claudication, rest pain, ischemic limb.
  Source: Duke J, Chase M, Ring N, Martin J, Fuhr R, Chatterjee A, Hirsh AT. Use of natural language processing of unstructured data significantly increases the detection of peripheral arterial disease in observational data. American College of Cardiology Scientific Session. Chicago, IL, April 2016.
Analytics has a problem
  Most organizations ignore text analytics because it is expensive and difficult
  Up to 80% of clinical data is stored in text
  Most text analytics requires advanced technical skillsets
Typical Scenario
  "As a healthcare system administrator, I want to understand my high-risk population better. I want to find all patients with peripheral arterial disease (PAD). I know there are more patients than I was able to find by simply querying diagnosis and procedure codes."
  1. Data scientist develops PAD text mining algorithm
  2. Algorithm validated
  3. The patient cohort is defined
  4. Results returned to investigator
Better Scenario
  "As a healthcare system administrator, I want to understand my high-risk population better. I want to find all patients with peripheral arterial disease (PAD). I know there are more patients than I was able to find by simply querying diagnosis and procedure codes."
  1. Data scientist develops PAD algorithm
  2. Algorithm validated
  3. PAD algorithm run nightly and stored in data warehouse
  4. PAD algorithm output combined with coded data to create PAD registry
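The last step of the better scenario, combining algorithm output with coded data, can be sketched as a simple union that records which source found each patient. This is a minimal illustration, not the presenter's implementation: the function name, the field names, and the patient identifiers are all hypothetical.

```python
def build_registry(coded_ids, nlp_ids):
    """Union of patients found by codes and by text, tagged by source."""
    coded = set(coded_ids)
    from_text = set(nlp_ids)
    registry = {}
    for patient_id in sorted(coded | from_text):
        registry[patient_id] = {
            "found_by_codes": patient_id in coded,
            "found_by_nlp": patient_id in from_text,
        }
    return registry


# Hypothetical identifiers: p2 is found by both sources, p3 only in text.
registry = build_registry(coded_ids=["p1", "p2"], nlp_ids=["p2", "p3"])
text_only = [pid for pid, row in registry.items() if not row["found_by_codes"]]
```

Keeping the source flags makes it possible to report how many registry patients would have been missed by a codes-only query.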
How to Best Leverage Text Analytics
  PAD Algorithm
  Diabetes Algorithm
  Ejection Fraction Algorithm
  Pre-Diabetes Algorithm
  CHF Algorithm
  Hypertension
  Breast Cancer
Poll Question #2
I see text analytics being most important in the area of: (Choose 3, if applicable)
  a) Clinical care improvement
  b) Regulatory reporting
  c) Research
  d) Operational improvement
  e) Financial analytics
  f) Unsure or not applicable
Google
Why do we love Google?
  Simple, effective interface
  Fast
  Accurate
How Would You Build Google For Clinical Text?
The Basis of Text Search: The Inverted Index
  Document 0: Patient is a 67 year old female with NIDDM and hypertension.
  Document 1: The patient has no diabetes or hypertension.
  Document 2: Patient's mother is diabetic. Patient's sister is diabetic.
  Each index entry is a (document, position) pair; positions count words from 0.
  Word      Documents  Inverted index
  67        0          {(0,3)}
  diabet    1,2        {(1,4),(2,3),(2,7)}
  female    0          {(0,6)}
  hyperten  0,1        {(0,10),(1,6)}
  mother    2          {(2,1)}
  niddm     0          {(0,8)}
  no        1          {(1,3)}
  old       0          {(0,5)}
  patient   0,1,2      {(0,0),(1,1),(2,0),(2,4)}
  sister    2          {(2,5)}
  year      0          {(0,4)}
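The index on this slide can be reproduced with a few lines of Python. This is a minimal sketch under stated assumptions: the stop-word list and the suffix-stripping stem() function are hypothetical stand-ins chosen only to yield the stems shown above ("diabet", "hyperten"); production engines such as Lucene use real analyzers and stemmers.

```python
import re
from collections import defaultdict

# Assumption: a tiny stop-word list. "no" is kept because it is indexed above.
STOP_WORDS = {"is", "a", "with", "and", "the", "has", "or"}
# Assumption: crude suffix stripping, just enough for this example.
SUFFIXES = ("sion", "es", "ic")


def stem(token):
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 4:
            return token[: -len(suffix)]
    return token


def tokenize(text):
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    # Drop possessives so "patient's" is indexed as "patient".
    return [t[:-2] if t.endswith("'s") else t for t in tokens]


def build_inverted_index(documents):
    """Map each stemmed word to a list of (document, position) pairs."""
    index = defaultdict(list)
    for doc_id, text in enumerate(documents):
        for position, token in enumerate(tokenize(text)):
            if token not in STOP_WORDS:
                index[stem(token)].append((doc_id, position))
    return dict(index)


def search(index, word):
    """Return the sorted ids of documents containing the stemmed word."""
    return sorted({doc_id for doc_id, _ in index.get(stem(word.lower()), [])})


DOCUMENTS = [
    "Patient is a 67 year old female with NIDDM and hypertension.",
    "The patient has no diabetes or hypertension.",
    "Patient's mother is diabetic. Patient's sister is diabetic.",
]
index = build_inverted_index(DOCUMENTS)
diabetes_hits = search(index, "diabetes")  # documents 1 and 2; NIDDM is missed
```

Because a search is a dictionary lookup rather than a scan of every document, response time stays low as the collection grows.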
Tools To Quickly Index Text and Provide Search Capability
  SOLR
    Originally written in 2004
    Open Source Enterprise Search
    Built on Lucene
    Scalability (distributed indexes)
    REST APIs
    Plugin architecture
    Additional features over Lucene
  ElasticSearch
    Originally written in 2010
    Open Source Enterprise Search
    Built on Lucene
    Scalability (distributed indexes)
    REST APIs
    Plugin architecture
    Additional features over Lucene
  Lucene
    Originally written in 1999
    Open-Source Java API
    Create index, maintain index, search index, hit ranking, result sorting, and much more
    Provides the foundation for more advanced search engine capabilities.
    Most users use it through SOLR or ElasticSearch.
    Used directly by Twitter.
Search: diabetes
Results: 2 records, 0.0 ms
  Document 2: Patient's mother is diabetic. Patient's sister is diabetic.
  Document 1: The patient has no diabetes or hypertension.
Observations
  Found both "diabetes" and "diabetic" (word stemming)
  Missed the mention of NIDDM (synonyms)
  Neither result is relevant to a medical cohort query for diabetics (context)
What Works?
  Simple, familiar interface
  Using an inverted index means fast results
What Doesn't
  Results display not optimized for use cases
    Need better ability to view aggregate results
  Want more results!
    Medical language has many synonyms. (How do we find NIDDM?)
  Want fewer results!
    Context matters for different search types. (How do we exclude "no diabetes"?)
Showing the results
  Many users are more interested in exploring aggregate results than reviewing individual records
  Aggregating results opens the data up to users without access to PHI
Get More Results: Synonyms
When you say "diabetes", what do you really mean?
"diabetes" OR "diabetes mellitus" OR "diabetic" OR "brittle diabetes" OR "diabetes brittle" OR "diabetes mellitus insulin-dependent" OR "diabetes mellitus juvenile onset" OR "iddm" OR "insulin dependent diabetic" OR "insulin-dependent diabetes mellitus" OR "juvenile diabetes" OR "ketosis-prone diabetes mellitus" OR "type i diabetes mellitus" OR "type i diabetes mellitus without mention of complication" OR "type 1 diabetes mellitus" OR "diabetes mellitus maturity onset" OR "diabetes mellitus non insulin-dep" OR "diabetes mellitus non-insulin-dependent" OR "maturity onset diabetes" OR "maturity-onset diabetes of the young" OR "niddm" OR "non-insulin-dependent diabetes mellitus" OR "type ii diabetes mellitus" OR "type ii diabetes mellitus without mention of complication" OR "type 2 diabetes mellitus"
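The effect of synonym expansion can be shown on the three example documents from the inverted index slide. This is a hedged sketch: the SYNONYMS dictionary is a hypothetical, abbreviated stand-in for a lookup against a medical terminology, and matching is done by plain phrase search rather than through an index.

```python
import re

# Assumption: a small hand-picked subset of the synonym list above.
SYNONYMS = {
    "diabetes": [
        "diabetes",
        "diabetes mellitus",
        "diabetic",
        "iddm",
        "niddm",
        "type 2 diabetes mellitus",
    ],
}


def expand(term):
    """Return the term's synonyms, or just the term if none are known."""
    return SYNONYMS.get(term.lower(), [term.lower()])


def matching_documents(documents, phrases):
    """Ids of documents containing any phrase as a whole word or phrase."""
    hits = []
    for doc_id, text in enumerate(documents):
        lowered = text.lower()
        if any(re.search(r"\b" + re.escape(p) + r"\b", lowered) for p in phrases):
            hits.append(doc_id)
    return hits


DOCUMENTS = [
    "Patient is a 67 year old female with NIDDM and hypertension.",
    "The patient has no diabetes or hypertension.",
    "Patient's mother is diabetic. Patient's sister is diabetic.",
]
literal_hits = matching_documents(DOCUMENTS, ["diabetes"])
expanded_hits = matching_documents(DOCUMENTS, expand("diabetes"))
```

Expansion recovers the NIDDM document that the literal search misses. It also still returns the negated and family-history mentions, which is the problem the ConText slides address.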
Leveraging Medical Terminologies
Expanding Search With Terminologies
A more complex example: Diabetic patients who are on an ACE/ARB or who had their microalbumin checked during the calendar year
  Queries free text for all reports that contain Diabetes AND (ace OR arb) AND microalbumin
  Filtered for reports within the last year
  Note: terms are selected by the synonym finder, or are grouped terms of all trade names, generic names, or active medication ingredients
("diabetes" OR "diabetes mellitus" OR "diabetic" OR "brittle diabetes" OR "diabetes brittle" OR "diabetes mellitus insulin-dependent" OR "diabetes mellitus juvenile onset" OR "iddm" OR "insulin dependent diabetic" OR "insulin-dependent diabetes mellitus" OR "juvenile diabetes" OR "ketosis-prone diabetes mellitus" OR "type i diabetes mellitus" OR "type i diabetes mellitus without mention of complication" OR "type 1 diabetes mellitus" OR "diabetes mellitus maturity onset" OR "diabetes mellitus non insulin-dep" OR "diabetes mellitus non-insulin-dependent" OR "maturity onset diabetes" OR "maturity-onset diabetes of the young" OR "niddm" OR "non-insulin-dependent diabetes mellitus" OR "type ii diabetes mellitus" OR "type ii diabetes mellitus without mention of complication" OR "type 2 diabetes mellitus")
AND (
  ("benazepril" OR "lotensin" OR "captopril" OR "capoten" OR "enalapril" OR "vasotec" OR "epaned" OR "fosinopril" OR "monopril" OR "lisinopril" OR "prinivil" OR "zestril" OR "moexipril" OR "univasc" OR "perindopril" OR "aceon" OR "quinapril" OR "accupril" OR "ramipril" OR "altace" OR "trandolapril" OR "mavik")
  OR ("azilsartan" OR "edarbi" OR "candesartan" OR "atacand" OR "eprosartan" OR "teveten" OR "irbesartan" OR "avapro" OR "telmisartan" OR "micardis" OR "valsartan" OR "diovan" OR "losartan" OR "cozaar" OR "olmesartan" OR "benicar")
)
AND ("albumin urine" OR "urine microalbumin" OR "urine microalbumin present")
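A query of this shape is tedious to write by hand but easy to assemble from term groups. The sketch below builds the boolean query string from lists of terms; the helper names are hypothetical, and the term lists are abbreviated from the slide for brevity. The resulting string would still need to be adapted to the syntax of whichever search engine runs it.

```python
def quote(term):
    return f'"{term}"'


def any_of(*clauses):
    """Join clauses with OR inside parentheses."""
    return "(" + " OR ".join(clauses) + ")"


def all_of(*clauses):
    """Join clauses with AND inside parentheses."""
    return "(" + " AND ".join(clauses) + ")"


def terms(words):
    """An OR group of quoted terms."""
    return any_of(*(quote(w) for w in words))


# Abbreviated term groups; the full lists appear on the slide.
DIABETES = ["diabetes", "niddm"]
ACE = ["lisinopril", "zestril"]
ARB = ["losartan", "cozaar"]
MICROALBUMIN = ["albumin urine", "urine microalbumin"]

query = all_of(
    terms(DIABETES),
    any_of(terms(ACE), terms(ARB)),
    terms(MICROALBUMIN),
)
```

Building the query from named groups keeps the clinical intent readable (diabetes, ACE or ARB, microalbumin) even when each group expands to dozens of terms.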
Get Fewer Results: ConText Matters
  ConText is an NLP pattern-matching algorithm published in 2009
  "To be useful for clinical applications such as looking for genotype/phenotype correlations, retrieving patients eligible for a clinical trial, or identifying disease outbreaks, simply identifying clinical conditions in the text is not sufficient; information described in the context of the clinical condition is critical for understanding the patient's state." J Biomed Inform. 2009 Oct; 42(5): 839-851.
  Detects conditions and whether they are
    Negated (e.g., "ruled out pneumonia")
    Historical (e.g., "past history of pneumonia")
    Experienced by someone else (e.g., "family history of pneumonia")
ConText Algorithm
  Wendy W. Chapman, David Chu, John N. Dowling. J Biomed Inform. 2009 Oct; 42(5): 839-851.
  Example sentence: No history of chest tightness but family history of CHF.
  Annotations on the sentence: negation trigger, historical trigger, condition, termination, historical trigger, condition, termination, other experiencer trigger
  Before ConText
    Chest tightness: Negation: affirmed; Experiencer: patient; Temporality: recent
    CHF: Negation: affirmed; Experiencer: patient; Temporality: recent
  After ConText
    Chest tightness: Negation: negated; Experiencer: patient; Temporality: historical
    CHF: Negation: affirmed; Experiencer: other; Temporality: historical
ConText: Negation
  Example sentence: The patient had no diabetes or hypertension.
  Annotations on the sentence: experiencer, negation trigger, clinical conditions, termination
  Diabetes: Negation: negated; Experiencer: patient; Temporality: recent
  Hypertension: Negation: negated; Experiencer: patient; Temporality: recent
ConText: Experiencer
  Example sentence: Patient's mother has diabetes.
    Annotations on the sentence: experiencer, clinical conditions, termination
    Diabetes: Negation: affirmed; Experiencer: other; Temporality: recent
  Example sentence: Patient's sister has hypertension.
    Annotations on the sentence: experiencer, clinical conditions, termination
    Hypertension: Negation: affirmed; Experiencer: other; Temporality: recent
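The trigger-and-termination idea behind these examples can be sketched in a few lines. This is a deliberately simplified illustration, not the published ConText algorithm: the trigger lists and condition list are hypothetical, every trigger applies only to conditions that follow it, and a termination word simply resets the context to its defaults.

```python
import re

# Assumption: tiny illustrative lexicons, far smaller than a real trigger set.
NEGATION_TRIGGERS = {"no", "denies", "without"}
HISTORICAL_TRIGGERS = {"history"}
OTHER_EXPERIENCER_TRIGGERS = {"family", "mother", "father", "sister", "brother"}
TERMINATIONS = {"but", "however"}
CONDITIONS = [("chest", "tightness"), ("diabetes",), ("hypertension",), ("chf",)]


def default_context():
    return {"negation": "affirmed", "experiencer": "patient", "temporality": "recent"}


def annotate(sentence):
    """Assign negation, experiencer, and temporality to each condition found."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    context = default_context()
    findings = []
    i = 0
    while i < len(tokens):
        # Longest conditions are listed first so multi-word terms win.
        matched = next(
            (c for c in CONDITIONS if tuple(tokens[i:i + len(c)]) == c), None
        )
        if matched:
            findings.append({"condition": " ".join(matched), **context})
            i += len(matched)
            continue
        token = tokens[i]
        if token in TERMINATIONS:
            context = default_context()  # a termination ends every open scope
        elif token in NEGATION_TRIGGERS:
            context["negation"] = "negated"
        elif token in HISTORICAL_TRIGGERS:
            context["temporality"] = "historical"
        elif token in OTHER_EXPERIENCER_TRIGGERS:
            context["experiencer"] = "other"
        i += 1
    return findings


example = annotate("No history of chest tightness but family history of CHF.")
```

On the slide's sentence, "but" resets the negation opened by "No", so chest tightness is negated while CHF stays affirmed and is attributed to someone other than the patient.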
How?
  Analysis of context uses a sentence as an operand
    Identifying sentences in clinical text is not straightforward
    Have you ever seen punctuation in a clinical note?
  An NLP analysis pipeline ties it all together
    1. Search results
    2. Sentence detection
    3. Entity recognition (e.g., diabetes)
    4. Context algorithm
    5. Present user with additional filters
  NLP Pipeline Frameworks
    Apache Unstructured Information Management Architecture (UIMA)
    General Architecture for Text Engineering (GATE)
    Natural Language Toolkit (NLTK)
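The pipeline stages can be sketched as plain functions chained together. This is an assumption-laden toy, not UIMA, GATE, or NLTK: sentence detection splits on end punctuation or line breaks (since clinical notes often lack punctuation), entity recognition is a fixed term list, and the context step is reduced to a sentence-level negation check. All names are hypothetical.

```python
import re

ENTITY_TERMS = ("diabetes", "hypertension")  # assumption: a fixed term list


def split_sentences(note):
    """Split on sentence-ending punctuation or on line breaks."""
    parts = re.split(r"(?<=[.!?])\s+|\n+", note)
    return [p.strip() for p in parts if p.strip()]


def find_entities(sentence):
    lowered = sentence.lower()
    return [t for t in ENTITY_TERMS if re.search(r"\b" + re.escape(t) + r"\b", lowered)]


def is_negated(sentence):
    """Crude stand-in for the context step: any negation word in the sentence."""
    return re.search(r"\b(no|denies|without)\b", sentence.lower()) is not None


def run_pipeline(note):
    """Sentence detection, then entity recognition, then context, one row per mention."""
    rows = []
    for sentence in split_sentences(note):
        for entity in find_entities(sentence):
            rows.append(
                {"sentence": sentence, "entity": entity, "negated": is_negated(sentence)}
            )
    return rows


NOTE = "Pt seen today\nNo diabetes or hypertension. Mother has diabetes."
rows = run_pipeline(NOTE)
affirmed = [r for r in rows if not r["negated"]]
```

Each row carries its own flags, so the search interface can offer them as filters instead of discarding results outright.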
ConText: Apply to Search Results
  Filter Diabetes Results
Other Pieces to the NLP Pipeline: Extract Values
  ef_phrase                                     qualifiers               ef_low  ef_high  ef_mid  ef_word
  ejection fraction is at least 70-75           is at least              70      75       72.5    NULL
  ejection fraction of about 20                 of about                 20      20       20      NULL
  ejection fraction of 60                       of                       60      60       60      NULL
  ejection fraction of greater than 65          of greater than          65      65       65      NULL
  ejection fraction of 55                       of                       55      55       55      NULL
  ejection fraction by visual inspection is 65  by visual inspection is  65      65       65      NULL
  LVEF is normal                                is                       NULL    NULL     NULL    normal
  Regular expression (alternation bars restored; they were lost in extraction):
  \b(((lv)?ef)|(Ejection\s+Fraction))\s+(?<qualifiers>([^\s\d]+\s+){0,5})\(?(((?<ef_low>\d+)-(?<ef_high>\d+))|(?<ef_mid_txt>\d+)|(?<ef_word>([^\s]*?normal)|(moderate)|(severe)))
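The extraction above can be tried in Python with a lightly adapted pattern. This is a sketch, not the presenter's code: Python spells named groups as (?P<name>...), the pattern is matched case-insensitively, and the rule that a single value fills ef_low, ef_high, and ef_mid alike is inferred from the table rows.

```python
import re

EF_PATTERN = re.compile(
    r"\b(?:(?:lv)?ef|ejection\s+fraction)\s+"
    r"(?P<qualifiers>(?:[^\s\d]+\s+){0,5})"   # up to five non-numeric words
    r"\(?"
    r"(?:(?P<ef_low>\d+)\s*-\s*(?P<ef_high>\d+)"  # a range such as 70-75
    r"|(?P<ef_mid_txt>\d+)"                       # a single value such as 60
    r"|(?P<ef_word>\S*?normal|moderate|severe))",  # a descriptive word
    re.IGNORECASE,
)


def extract_ef(text):
    """Return the first ejection fraction mention as a dict, or None."""
    match = EF_PATTERN.search(text)
    if match is None:
        return None
    if match.group("ef_low"):
        low, high = float(match.group("ef_low")), float(match.group("ef_high"))
    elif match.group("ef_mid_txt"):
        low = high = float(match.group("ef_mid_txt"))
    else:
        low = high = None
    return {
        "qualifiers": match.group("qualifiers").strip(),
        "ef_low": low,
        "ef_high": high,
        "ef_mid": None if low is None else (low + high) / 2,
        "ef_word": match.group("ef_word"),
    }


range_result = extract_ef("ejection fraction is at least 70-75")
word_result = extract_ef("LVEF is normal")
```

Keeping the qualifier text ("at least", "about", "greater than") alongside the numbers lets downstream analytics decide how to treat approximate or bounded values.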
Other Extraction Projects
  Aortic Root Size
  Blood Pressure
  Breast Cancer ER Biomarker
  Cancer Staging, TNM, and stage
  Abdominal fistula
  Height/Weight/BMI
  Hypoglycemia with low blood sugars
  Microalbumin
  Ankle Brachial Index
High-Risk Population: Peripheral Arterial Disease (PAD)
  PAD affects over 3 million patients per year
  Narrowed arteries reduce blood flow to limbs
  Patients with PAD are considered high risk
  Measured by Ankle Brachial Index (ABI)
  This is a precise patient registry!
  Figure: patients found by Natural Language Processing (NLP), N=41,741; patients with ABI < 0.9, N=4,349; patients found by ICD/CPT codes, N=9,592. Terms shown: peripheral artery disease, claudication, rest pain, ischemic limb.
  Source: Duke J, Chase M, Ring N, Martin J, Fuhr R, Chatterjee A, Hirsh AT. Use of natural language processing of unstructured data significantly increases the detection of peripheral arterial disease in observational data. American College of Cardiology Scientific Session. Chicago, IL, April 2016.
Validation
  Build studies to review results of query
  Assign to team members to review results
  Randomly selects records to represent study
  Highlights key words for easy chart review
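The random selection and review steps can be sketched as follows. This is an illustration under assumptions, not the tool shown in the session: the function names are hypothetical, a fixed seed is used only so the sample is reproducible, and precision (the share of reviewed records confirmed by the reviewer) is one plausible performance measure among several.

```python
import random


def sample_for_review(record_ids, sample_size, seed=0):
    """Draw a reproducible random sample of query results for chart review."""
    ids = sorted(record_ids)
    rng = random.Random(seed)
    return rng.sample(ids, min(sample_size, len(ids)))


def precision(review_labels):
    """Fraction of reviewed records that the reviewer confirmed as correct."""
    if not review_labels:
        raise ValueError("no reviewed records")
    return sum(1 for label in review_labels if label) / len(review_labels)


# Hypothetical study: 100 query results, 10 assigned for review.
study_sample = sample_for_review(range(100), sample_size=10)
# Hypothetical reviewer decisions for four records.
confirmed_share = precision([True, True, False, True])
```

Reviewing a random sample rather than the top-ranked results avoids overstating how well the query performs on the full result set.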
Text Analytics Must Be Interoperable!
  Validated Text Analytics: Diabetes Cohort, PAD Cohort, Ejection Fraction, Tumor Sizes
  Data Warehouse
  Population Analytics, Care Improvement, Operational Improvement, Financial Improvement, Research
A Late Binding Approach to Text Analytics
  Search
    Easy starting point
  Synonym Finding
    Uses terminologies
    Allow user to find synonyms
  Context Filtering
    Excludes negated concepts
    Good for cohort queries
  Regular Expression
    Extraction of discrete values: Ejection Fractions, ABI
  Validation
    Expert review of algorithm output
    Performance measurement
  Integration
    Operationalize algorithm
    Incorporate into analytics
  Many More Techniques
    Section tagging
    Entity recognition
    N-gram analysis
    Document clustering
Final Thought
  To leverage the power of text analytics, make the data accessible first!
Lessons Learned
  Using search technology for clinical text is an engaging and accessible entry point for text analytics problems.
  Searching clinical text is powered by an inverted index that catalogs words present in the documents, which documents they are present in, and their position in the documents.
  Medical terminologies provide a dictionary of relevant terms, synonyms, and logical structures that can enhance clinical text exploration.
  NLP algorithms that are based on the context surrounding clinical terms can identify when the term is negated ("no evidence of pneumonia") or applies to another person ("patient's grandmother had breast cancer").
  Regular expressions can be applied to text to identify patterns and extract discrete values, like ejection fraction and ankle brachial index, that are stored in text.
  Text analytics should be validated and integrated with an enterprise data warehouse where the information extracted from text can be combined with discrete, coded data.
Analytic Insights
Questions & Answers
What You Learned
  Write down the key things you've learned related to each of the learning objectives after attending this session.
Thank You