Evaluation of Clinical Text Segmentation to Facilitate Cohort Retrieval
Session S105: Enhanced Cohort Identification and Retrieval
Tracy Edinger, ND, MS, Oregon Health & Science University
Twitter: #AMIA2017
Co-Authors
- Dina Demner-Fushman, MD, PhD (National Library of Medicine)
- Aaron Cohen, MD, MS (Oregon Health & Science University)
- Steven Bedrick, PhD (Oregon Health & Science University)
- William Hersh, MD (Oregon Health & Science University)
Acknowledgements
- National Library of Medicine, grant NLM 2 T15 LM 7088-21
- NLM scientists, staff, and fellows
- OHSU DMICE faculty, staff, and students
Disclosure
I and my spouse/partner have no relevant relationships with commercial interests to disclose.
Learning Objectives
After participating in this session, the learner should be better able to:
- Understand the importance of identifying document section headings for natural language processing
- Understand rule-based identification of document section headings
Use of Clinical Data
Secondary uses of EHR data include:
- Quality improvement
- Disease surveillance
- Regulatory reporting
- Research
To use these data, it is important to be able to retrieve specific patient cohorts.
Image from http://epidemiologystudy.com/study.php
Structured and Unstructured Data for Cohort Retrieval
- Structured data, including diagnosis and procedure codes, are commonly used to identify clinical cohorts
- Relying solely on structured data may not retrieve the full cohort (e.g., patients who had colonoscopies during the last 10 years)
Denny JC (2012). Chapter 13: Mining Electronic Health Records in the Genomics Era. PLoS Comput Biol 8(12): e1002823. doi:10.1371/journal.pcbi.1002823
Cohort Retrieval from Clinical Text
Cohort retrieval from clinical text is difficult because of:
- Terminology and spelling differences
- Multiple meanings for terms
- Temporality
- Negation
- References to illnesses in other people
Clinical text may provide clues to help resolve some of these issues.
Structure of Clinical Text
SOAP format:
S: Patient reports not much sleep last night; no complaints this morning.
O: T 99 F, HR 68, RR 16, BP 107/75. Chest CTA, bilateral breath sounds. CV RRR without murmur.
A: Ovarian carcinoma, POD #1 for staging laparotomy. Adequate UOP, incision in good condition.
P: Clear liquids today. D/C Foley catheter.
Structure of Clinical Text
Chief Complaint: Sent from NWH with left sided hemorrhage
History of Present Illness: The pt is a 44 year-old right handed woman with no significant PMH and family history significant for stroke (father, paternal uncle and sister @ 46 years) who was transferred from [**Hospital 1771**] Hospital with a left sided intraparenchymal hemorrhage. The patient was in her USOH...
Past Medical History: Had an ulcer at age 10
Social History: Works at the [**Last Name (un) 10457**] Laboratories in [**Location (un) 2997**]. Married. Has a son. No ETOH, TOBACCO, or Drugs.
Family History: Father died of multiple strokes at age 63. Paternal Uncle died of stroke. Patient sister died of stroke at age 46.
Facilitating Retrieval by Segmenting Clinical Text
Past Medical History: Had an ulcer at age 10
Family History: Father died of multiple strokes at age 63. Paternal uncle died of stroke. Patient's sister died of stroke at age 46.
Sections provide clues that may avoid some retrieval issues:
- Temporal differences
- References to illnesses in other people
Several algorithms have been published that segment clinical documents:
- The segmenting itself was validated
- But no published studies evaluate whether segmenting improves recall and precision
Project Overview
- Segmented a set of clinical documents
- Developed topics for several patient cohorts
- Developed queries with and without sections
- Judged a subset of documents for performance
- Analyzed results
Methods - Data
MIMIC-II database:
- De-identified ICU records for neonatal and adult patients (approximately 25,000 patients)
- Developed by MIT, Philips Medical Systems, and Beth Israel Deaconess Medical Center
- Relational database containing structured data and unstructured documents: discharge summaries, MD notes, radiology reports, and nursing notes
Methods - Segmenting Documents
Identified section indicators, which vary in form across notes. For example, the same allergies section may begin:
- Allergies: Penicillin
- Allergies - penicillin
- Allergic to penicillin
Searched for indicators and inserted XML tags:
Admission Date: [**3391-5-21**] Discharge Date: [**3391-6-1**]
Sex: M
Service: SURGERY
<allergies>allergic to penicillin</allergies>
Attending:[**First Name3 (LF) 2679**]
Addendum: Pt is discharged to
Methods - Segmenting Documents
Original format:
<TEXT>Admission Date: [**3391-5-21**] Discharge Date: [**3391-6-1**]
Date of Birth: [**3312-11-5**] Sex: M
Service: SURGERY
Allergies: Penicillin
Attending:[**First Name3 (LF) 2679**]
Addendum: Pt is discharged to [**Hospital3 **] Hospital [**3391-6-1**]. This is an updated medication list, which has been faxed to [**Hospital3 **].
Discharge Medications: 1. Acetaminophen 325 mg Tablet Sig: 1-2 Tablets PO Q6H (every 6 hours) as needed. 2. Atorvastatin 20 mg Tablet Sig: One (1) Tablet PO DAILY (Daily). 3. Insulin Lispro 100 unit/ml Solution Sig: One (1) injection Subcutaneous ASDIR (AS DIRECTED).
Discharge Disposition: Extended Care Facility: [**Hospital6 694**] [**Location (un) 695**]
[**First Name11 (Name Pattern1) 531**] [**Last Name (NamePattern1) 2684**] MD [**MD Number 2685**]</TEXT>

Segmented text:
<TEXT>
<preamble>admission Date: [**3391-5-21**] Discharge Date: [**3391-6-1**] Date of Birth: [**3312-11-5**] Sex: M Service: SURGERY</preamble>
<allergies>allergies: Penicillin</allergies>
<addendum>addendum: Pt is discharged to [**Hospital3 **] Hospital [**3391-6-1**]. This is an updated medication list, which has been faxed to [**Hospital3 **].</addendum>
<dc_meds>discharge Medications: 1. Acetaminophen 325 mg Tablet Sig: 1-2 Tablets PO Q6H (every 6 hours) as needed. 2. Atorvastatin 20 mg Tablet Sig: One (1) Tablet PO DAILY (Daily). 3. Insulin Lispro 100 unit/ml Solution Sig: One (1) injection Subcutaneous ASDIR (AS DIRECTED).</dc_meds>
<dc_disposition>discharge Disposition: Extended Care Facility: [**Hospital6 694**] [**Location (un) 695**] [**First Name11 (Name Pattern1) 531**] [**Last Name (NamePattern1) 2684**] MD [**MD Number 2685**]</dc_disposition>
</TEXT>
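The rule-based tagging step described on these slides can be sketched roughly as follows. The section names, regular expressions, and tag set here are illustrative assumptions (based on the three allergy-heading variants shown above), not the study's actual rule set:

```python
import re

# Illustrative section indicators; each maps several surface forms to one
# canonical tag, e.g. "Allergies:", "Allergies -", and "Allergic to" all
# open an <allergies> section. The real rule set would be much larger.
SECTION_PATTERNS = [
    ("allergies", re.compile(r"\b(allergies\s*[:\-]|allergic to)", re.IGNORECASE)),
    ("addendum", re.compile(r"\baddendum\s*:", re.IGNORECASE)),
    ("dc_meds", re.compile(r"\bdischarge medications\s*:", re.IGNORECASE)),
    ("dc_disposition", re.compile(r"\bdischarge disposition\s*:", re.IGNORECASE)),
]

def segment(text: str) -> str:
    """Wrap each detected section in XML tags; text before the first
    indicator becomes the <preamble> section."""
    # Find every indicator match, in document order.
    hits = sorted(
        (m.start(), tag) for tag, pat in SECTION_PATTERNS for m in pat.finditer(text)
    )
    if not hits:
        return f"<TEXT><preamble>{text}</preamble></TEXT>"
    out = []
    first = hits[0][0]
    if text[:first].strip():
        out.append(f"<preamble>{text[:first].strip()}</preamble>")
    # Each section runs from its indicator to the next indicator (or end of text).
    bounds = hits + [(len(text), None)]
    for (start, tag), (end, _) in zip(bounds, bounds[1:]):
        out.append(f"<{tag}>{text[start:end].strip()}</{tag}>")
    return "<TEXT>" + "".join(out) + "</TEXT>"
```

A search engine can then index each tagged span separately, so a query can be restricted to (or steered away from) particular sections.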
Methods - Search Engine
NLM's Essie:
- Developed to facilitate searching of the medical literature by non-clinicians through use of the UMLS
- The UMLS relates terms by concept, allowing matches even when different words are used
- Essie maps the text corpus to the UMLS and indexes the corpus on these concepts
- Maps the search concepts to the UMLS
- Returns a ranked, scored list of documents
Methods - Clinical Topics
- Began with topics from TRECMed 2012 and adapted them to the MIMIC ICU data
- Modified or eliminated topics that retrieved few documents
Methods - Clinical Topic Examples
- Patients who develop thrombocytopenia in pregnancy
- Patients taking atypical antipsychotics without a diagnosis of schizophrenia or bipolar depression
- Patients with delirium, hypertension, and tachycardia
- Patients with thyrotoxicosis treated with beta-blockers
The final set included 22 topics.
Methods - Query Development
- Developed an initial query without sections; ran it against the data; examined retrieved documents to refine the query
- Rewrote the query using sections; ran it against the data; examined retrieved documents to refine the query
- Ran all queries and recorded the documents returned and their scores
Methods - Query Development
Topic: Patients with diabetes who also have thrombocytosis
Baseline query:
diabetes AND thrombocytosis
With sections, we could avoid matches in Family History:
thrombocytosis AND (AREA[AdmissionDiagnosis] diabetes OR AREA[ChiefComplaint] diabetes OR AREA[Course] diabetes)
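Queries in this style can be composed mechanically: keep the co-occurring terms unscoped and restrict the diagnosis term to sections that describe the patient's own condition. The `AREA[...]` syntax comes from the slide; the helper function and section list below are hypothetical illustrations, not part of Essie's API:

```python
# Sections assumed (for illustration) to describe the patient's own diagnosis.
PATIENT_DX_SECTIONS = ["AdmissionDiagnosis", "ChiefComplaint", "Course"]

def sectioned_query(term, other_terms, sections=PATIENT_DX_SECTIONS):
    """Build an Essie-style query string: AND the unscoped terms with a
    disjunction that restricts `term` to the given sections."""
    scoped = " OR ".join(f"AREA[{s}] {term}" for s in sections)
    return " AND ".join(other_terms + [f"({scoped})"])

print(sectioned_query("diabetes", ["thrombocytosis"]))
# thrombocytosis AND (AREA[AdmissionDiagnosis] diabetes OR AREA[ChiefComplaint] diabetes OR AREA[Course] diabetes)
```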
Methods - Document Sampling
Samples were selected for each topic based on the difference in scores:
- Queries of segmented documents: 0-10 documents with high score differences, 0-10 with low score differences
- Queries of whole documents: 0-10 documents
Total sample size was 574 documents; per-topic samples ranged from 10 to 40 (average 26).
Methods - Document Evaluation
1. Was the document relevant to the topic?
2. Why were non-relevant documents retrieved?
3. Did segmentation help retrieval, and why?
Results - Document Relevance
574 documents analyzed:
- Retrieved only by queries of segmented documents: 26
- Retrieved by both query types: 328
- Retrieved only by queries of whole documents: 220
Results - Document Relevance
343 relevant documents:
- Retrieved only by queries of segmented documents: 20
- Retrieved by both query types: 246
- Retrieved only by queries of whole documents: 77
231 non-relevant documents:
- Retrieved only by queries of segmented documents: 6
- Retrieved by both query types: 82
- Retrieved only by queries of whole documents: 143
Results - Reasons for Retrieving Non-relevant Documents
- Non-relevant reference to the condition: 84
- Past or possible future condition: 70
- Condition mentioned but not diagnosed: 23
- Condition denied or ruled out: 22
- Issue with term mapping: 20
- Query issue: 11
Results - Effect of Segmenting on Document Retrieval
- Segmenting avoided retrieval of a non-relevant document by avoiding specific sections: 132
- Segmenting allowed retrieval of a relevant document by focusing on specific sections: 20
- Performance unrelated to segmenting: 320
- Query error (did not look in the right section): 80
- Document not segmented correctly: 18
- Condition included in an incorrect section of the notes: 1
Results - Segmenting Avoided Retrieval of Non-relevant Documents
Topic: Patients who develop thrombocytopenia in pregnancy
Issue: Neonatal notes often document the mother's pregnancy history
Solution: Look in sections containing the patient's diagnosis
Results - Segmenting Allowed Retrieval of Relevant Documents by Focusing on Specific Sections
Topic: Patients taking atypical antipsychotics without a diagnosis of schizophrenia or bipolar depression
Issue: Need to ignore mentions of these conditions in family members
Solution: Look in sections containing the patient's diagnosis; avoid the family-history section
Quantitative Analysis
- Correlation, to indicate whether querying the segmented documents impacted performance
- Precision and recall
Analysis - Matthews Correlation Coefficient
Each judged document was classified by relevance and by whether the segmented query scored it higher than the baseline:
- True positive (TP): relevant, segmented score higher than base
- False negative (FN): relevant, segmented score lower than base
- False positive (FP): not relevant, segmented score higher than base
- True negative (TN): not relevant, segmented score lower than base
MCC = (TP x TN - FP x FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))
Values range from -1 to 1.
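As a quick sanity check on the formula, a minimal MCC implementation (the zero-denominator convention is a common choice, not something stated on the slide):

```python
import math

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a 2x2 table. In this study a
    'positive' means the segmented query scored the document higher than
    the baseline, and the gold label is the relevance judgment."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # conventional value when any marginal count is zero
    return (tp * tn - fp * fn) / denom
```

Perfect agreement gives +1, perfect disagreement gives -1, and chance-level agreement gives values near 0.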
Analysis - Matthews Correlation Coefficient
[Bar chart: MCC for each topic plus the average; several topics reached statistical significance (* p<0.05, ** p<0.01)]
Analysis - Recall and Precision
Recall = (number of relevant documents retrieved) / (all relevant documents judged)
Precision = (number of relevant documents retrieved) / (all documents judged)
Values range from 0 to 1.
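These pooled-judgment definitions can be sketched as follows; note that, per the slide, the denominators are the judged pools for this study rather than the full collection:

```python
def precision_recall(retrieved, relevant):
    """Precision and recall over a judged pool of documents.

    retrieved: set of document ids returned (and judged) for a query
    relevant:  set of judged document ids marked relevant for the topic
    """
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```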
Analysis - Recall
[Bar chart: recall for each topic plus the average, comparing whole-document and segmented-document queries]
Analysis - Precision
[Bar chart: precision for each topic plus the average, comparing whole-document and segmented-document queries]
Discussion
- Queries of segmented documents retrieved fewer documents
- The documents they retrieved were more likely to be relevant and less likely to be non-relevant
- Some queries performed better than others
- Some documents were easier to segment accurately
Limitations
- Small sample size
- Only one person wrote the queries and made the relevance judgments
- Inaccuracies in identifying note segments
- Some queries did not perform well
Future Work
- Use a validated algorithm to segment the text
- Use a larger sample and independent relevance judges
- Develop queries for specific types of clinical notes
- Identify specific types of information that benefit from searching specific sections
- Search unstructured and structured data together, to reflect real-world EHR data use
AMIA is the professional home for more than 5,400 informatics professionals, representing frontline clinicians, researchers, public health experts and educators who bring meaning to data, manage information and generate new knowledge across the research and healthcare enterprise.
@AMIAInformatics #WhyInformatics
Thank you!
Email me at: edingert@ohsu.edu