Text mining for lung cancer cases over large patient admission data. David Martinez, Lawrence Cavedon, Zaf Alam, Christopher Bain, Karin Verspoor

Opportunities for Biomedical Informatics
- Increasing roll-out of EHR and HI systems is creating analysis opportunities in biomedical informatics and eHealth
  - Prediction: factors in disease and effective treatment
  - Detection: observables indicative of disease
  - Prevention: what factors circumvent those related to prediction
- Linked hospital data allows multiple sources to be leveraged for complex analytic tasks
  - Radiology, Pathology, Pharmacology, Admission, Emergency Room, etc.

Alfred Health's REASON Discovery Platform
- Initiative by the Alfred Health Informatics Department
- Technologies, tools and large-scale data sources to support REsearch, AnalysiS, and OperatioNs
- Integrates data sources from multiple hospital departments
- Historical patient data linked by unique Unit Record (UR) number
- Large-scale data architecture
  - (Parts of) 14+ years of data from Cerner HI implementations
  - 171,000+ updates to the Clinical Events table each day
  - 62.4 million updates per annum

Language Technology and Decision Support
- Much information remains, and will remain, in text form
  - Inter-departmental reports
  - Clinical notes and narratives
  - In-patient and discharge summaries
  - EHRs/EMRs
- Such data can't be leveraged without language technology/NLP
- Tasks:
  - Monitoring (adverse) clinical events, surveillance
  - Providing best up-to-date evidence
  - Creating knowledge bases
  - Converting text into actionable data that can be mined

Disease Recognition from Clinical Reports
- Task: classify records according to a specified disease
  - Enables retrieval of specific cases
  - Detect patterns of disease occurrence
  - Support creation of patient cohorts
  - Prelude to automated ICD encoding
- Disease: Lung Cancer
  - Identified by ICD-10 code C34: Malignant neoplasm of bronchus and lung
- Classification task: assign code to patient admission record

Hybrid Text Mining Processing Framework
- Machine learning algorithm
- Annotated data set
- Classification model
- Language processing: words and linguistic structure; names of entities; context features; domain concepts
- Biomedical knowledge sources

Previous Applications
- Fungal infection surveillance by classifying CT scan reports
- Extracting information from pathology reports

Method
- Data: radiology reports for 2 (financial) years (2011-2013)
  - 756,502 reports, plus associated metadata
  - Each report linked to an admission record
  - Metadata: ICD-10 codes (manually assigned) used as ground truth; demographics, reason for admission, etc.
- Extracted from the REASON platform
- Data pre-processed to remove ICD-10 codes and extract features
- Challenge: the real distribution is highly skewed; only 0.8% of the data are positive for lung cancer
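A short arithmetic sketch of why the skew is a challenge. The report count and positive rate come from the slide; the always-negative classifier is a hypothetical strawman, and the positive count is only approximate because 0.8% is a rounded figure.

```python
TOTAL_REPORTS = 756_502   # number of radiology reports (from the slide)
POSITIVE_RATE = 0.008     # 0.8% positive for lung cancer (rounded, from the slide)

# Approximate class sizes implied by the rounded rate.
positives = round(TOTAL_REPORTS * POSITIVE_RATE)
negatives = TOTAL_REPORTS - positives

# Hypothetical strawman: label every report as negative.
tp, fp, fn, tn = 0, 0, positives, negatives
accuracy = (tp + tn) / TOTAL_REPORTS
recall = tp / (tp + fn)

print(f"approx. positives: {positives}")
print(f"always-negative accuracy: {accuracy:.3f}, recall: {recall:.1f}")
```

Accuracy looks high for a classifier that finds no cases at all, which is one reason to report precision, recall and F-score on the positive class instead.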

Method
- Features:
  - Bags-of-words from report text
  - Bags-of-phrases identified by MetaMap
  - Negative context identified by NegEx
  - Metadata associated with the linked admission record
    - NAME, DOB, SEX, MARITALSTATUS, RELIGION, ...
    - ADMISSIONREASON, ADMISSIONUNIT, ADMISSIONTYPE
    - ALLERGIES, DRUGCODE, DRUGDESC, ...
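A minimal sketch of how such a feature set might be assembled for one report. It is not the authors' pipeline and does not call MetaMap or NegEx: the negation handling is a crude window rule standing in for NegEx, phrase features are omitted, and the trigger list, window size, function names and metadata values are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed toy trigger list and scope terminators; real NegEx uses a much larger lexicon.
NEGATION_TRIGGERS = {"no", "not", "without"}
PUNCTUATION = set(".,;:?")

def tokenize(text):
    """Lowercase the text and split it into words and punctuation marks."""
    return re.findall(r"[a-z]+|[.,;:?]", text.lower())

def text_features(text, window=5):
    """Bag-of-words in which words following a negation trigger get a NEG_ prefix."""
    features = Counter()
    remaining = 0  # how many upcoming words are still inside the negation scope
    for tok in tokenize(text):
        if tok in PUNCTUATION:
            remaining = 0          # punctuation closes the negation scope
        elif tok in NEGATION_TRIGGERS:
            remaining = window     # open a new scope after the trigger
        elif remaining > 0:
            features["NEG_" + tok] += 1
            remaining -= 1
        else:
            features[tok] += 1
    return features

def metadata_features(record):
    """One binary FIELD=value feature per metadata field."""
    return Counter({f"{field}={value}": 1 for field, value in record.items()})

def extract_features(text, record):
    """Combine text and metadata features into one sparse feature vector."""
    return text_features(text) + metadata_features(record)

# Field names are from the slide; the values are made up.
feats = extract_features(
    "No evidence of lung malignancy. Nodule in right lung.",
    {"SEX": "F", "ADMISSIONTYPE": "EMERGENCY"},
)
print(sorted(feats.items()))
```

Keeping negated and affirmed mentions as separate features lets a classifier learn that "NEG_malignancy" and "malignancy" point in different directions.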

Method
- Machine learning algorithms
  - Support Vector Machines (as implemented in the Weka toolkit)
  - Correlation-based feature subset selection filter
- Baseline: term-matching approach
  - "lung cancer", "lung malignancy", "lung malignant", "lung neoplasm", "lung tumour", "lung carcinoma"
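The term-matching baseline can be sketched in a few lines. The term list is the one on the slide; the slides do not say how matching was done, so case-insensitive whole-phrase matching and the function name `baseline_predict` are assumptions.

```python
import re

# Term list from the slide.
TERMS = [
    "lung cancer", "lung malignancy", "lung malignant",
    "lung neoplasm", "lung tumour", "lung carcinoma",
]

# Assumption: case-insensitive match on whole phrases.
PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(term) for term in TERMS) + r")\b",
    re.IGNORECASE,
)

def baseline_predict(report_text):
    """Return True if any baseline term occurs in the report text."""
    return PATTERN.search(report_text) is not None

# Two sentences quoted later in the deck as problematic for this baseline.
print(baseline_predict("Clinical: Metastatic lung cancer"))
print(baseline_predict("ill-defined nodular density superior to the right lung hilum (?)"))
```

A matcher like this has no notion of context, so a mention in a clinical indication counts the same as a finding, and a paraphrased suspicion is missed entirely.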

Results: different systems / configurations
- Evaluation: stratified 10-fold cross-validation

Classifier                              Precision  Recall  F-score
Text features only                      0.855      0.800   0.825
Full feature set (including metadata)   0.871      0.820   0.843
Term-matching baseline                  0.643      0.742   0.689

* Results not using feature selection, which reduced performance
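A minimal sketch of the evaluation set-up, assuming a plain-Python stand-in rather than Weka's own implementation. `stratified_folds` and `f_score` are hypothetical helper names, and the toy label list (2% positives) is made up to keep the example small.

```python
import random

def stratified_folds(labels, k=10, seed=0):
    """Assign item indices to k folds so each fold keeps the overall class ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds

def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy skewed data set: 20 positives among 1,000 items.
labels = [1] * 20 + [0] * 980
folds = stratified_folds(labels, k=10)
positives_per_fold = [sum(labels[i] for i in fold) for fold in folds]
print(positives_per_fold)

# Precision and recall of the term-matching baseline, taken from the table.
baseline_f = round(f_score(0.643, 0.742), 3)
print(baseline_f)
```

Stratification matters with this much skew: with plain random folds, some folds could end up with very few positive cases, making per-fold precision and recall unstable.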

Results: temporal-based training sets

Conclusions
- Promising results using a straightforward machine-learning-based approach over heavily skewed distributions
- Volume of data means training time can be lengthy
  - Mainly feature extraction, due to MetaMap
  - However, this is off-line, and record classification is fast
- Use of metadata improved performance
  - Data linking helps

Next Steps
- Expand to more ICD codes / diseases
- Incorporate more data sources into decision-making

Thanks! Questions?

Problematic sentences for phrase-matching baseline
False positives:
- Clinical indications: Surveillance of lung cancer
- A small primary lung neoplasm would have to be considered
- ? small primary lung neoplasm
- Clinical: Metastatic lung cancer
False negatives:
- I suspect this is more likely be carcinoma and lung primary are suspected in this highly likely
- ill-defined nodular density superior to the right lung hilum (?)