Text mining for lung cancer cases over large patient admission data. David Martinez, Lawrence Cavedon, Zaf Alam, Christopher Bain, Karin Verspoor

Opportunities for Biomedical Informatics
- Increasing roll-out of EHR and HI systems is creating analysis opportunities in biomedical informatics and eHealth
  - Prediction: factors in disease and effective treatment
  - Detection: observables indicative of disease
  - Prevention: what factors circumvent those related to prediction
- Linked hospital data allows multiple sources to be leveraged for complex analytic tasks
  - Radiology, Pathology, Pharmacology, Admission, Emergency Room, etc.

Alfred Health's REASON Discovery Platform
- Initiative by Alfred Health Informatics Department
- Technologies, tools and large-scale data sources to support REsearch, AnalysiS, and OperatioNs
- Integrates data sources from multiple hospital departments
- Historical patient data linked by unique Unit Record (UR) number
- Large-scale data architecture
  - (Parts of) 14+ years of data from Cerner HI implementations
  - 171,000+ updates to the Clinical Events table each day
  - 62.4 million updates per annum

Language Technology and Decision Support
- Much information remains, and will remain, in text form
  - Inter-departmental reports
  - Clinical notes and narratives
  - In-patient and discharge summaries
  - EHRs/EMRs
- Such data can't be leveraged without language technology/NLP
- Tasks:
  - Monitoring (adverse) clinical events, surveillance
  - Providing best up-to-date evidence
  - Creating knowledge bases
  - Converting text into actionable data that can be mined

Disease Recognition from Clinical Reports
- Task: classify records according to a specified disease
  - Enables retrieval of specific cases
  - Detect patterns of disease occurrence
  - Support creation of patient cohorts
  - Prelude to automated ICD encoding
- Disease: Lung Cancer
  - Identified by ICD-10 code C34: Malignant neoplasm of bronchus and lung
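The labelling rule on this slide (a record is positive if it carries ICD-10 code C34) can be sketched as follows. This is a minimal illustration, not the authors' code: the record layout, the UR values, and the function name `is_lung_cancer` are assumptions made for the example.

```python
from typing import Iterable

# ICD-10 category named on the slide: malignant neoplasm of bronchus and lung.
LUNG_CANCER_PREFIX = "C34"


def is_lung_cancer(icd10_codes: Iterable[str]) -> bool:
    """Return True if any code falls under category C34 (e.g. C34, C34.1, C349)."""
    for code in icd10_codes:
        # Normalise case and drop the optional dot before comparing prefixes.
        normalised = code.strip().upper().replace(".", "")
        if normalised.startswith(LUNG_CANCER_PREFIX):
            return True
    return False


# Hypothetical admission records: UR number -> manually assigned ICD-10 codes.
admissions = {
    "UR0001": ["C34.1", "J18.9"],
    "UR0002": ["I21.0"],
    "UR0003": ["c349"],
}
labels = {ur: is_lung_cancer(codes) for ur, codes in admissions.items()}
```

Matching on the category prefix means all C34 subcodes count as positive, which is one plausible reading of "identified by ICD-10 code C34".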

Hybrid Text Mining Processing Framework
- Machine learning algorithm
- Annotated data set
- Classification model
- Language processing: words and linguistic structure; names of entities; context features; domain concepts
- Biomedical knowledge sources

Previous Applications
- Fungal infection surveillance by classifying CT scan reports
- Extracting information from pathology reports

Method
- Data: radiology reports for 2 (financial) years (2011-2013)
  - 756,502 reports, plus associated metadata
  - Each report linked to an admission record
  - Metadata: ICD-10 codes (manually assigned) used as ground truth; demographics, reason for admission, etc.
- Extracted from the REASON platform
- Data pre-processed to remove ICD-10 codes and extract features
- Challenge: the real distribution is highly skewed: only 0.8% of the data are positive for lung cancer
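One way to picture the pre-processing step is to separate the ICD-10 field from everything else, so that it serves only as the ground-truth label and cannot leak into the features. The sketch below assumes a dictionary-shaped record with a hypothetical `ICD10` field; the actual REASON schema is not shown in the slides.

```python
from typing import Dict, List, Sequence, Tuple


def split_label(record: Dict[str, object],
                label_field: str = "ICD10") -> Tuple[Dict[str, object], bool]:
    """Remove the ICD-10 field from a record and derive the lung-cancer label.

    The field name "ICD10" is an assumption made for this example.
    """
    features = {k: v for k, v in record.items() if k != label_field}
    codes: List[str] = list(record.get(label_field, []))
    label = any(c.strip().upper().startswith("C34") for c in codes)
    return features, label


def prevalence(labels: Sequence[bool]) -> float:
    """Fraction of positive labels; shows how skewed the data set is."""
    return sum(labels) / len(labels) if labels else 0.0


# With the slide's figures, 0.8% of 756,502 reports is roughly 6,000 positives.
approx_positives = round(0.008 * 756_502)
```

Because 0.8% is itself a rounded figure, `approx_positives` is only an order-of-magnitude estimate of the number of positive reports.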

Method
- Features:
  - Bags-of-Words from report text
  - Bags-of-Phrases identified by MetaMap
  - Negative context identified by NegEx
  - Metadata associated with the linked admission record
    - NAME, DOB, SEX, MARITALSTATUS, RELIGION, ...
    - ADMISSIONREASON, ADMISSIONUNIT, ADMISSIONTYPE
    - ALLERGIES, DRUGCODE, DRUGDESC, ...
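The feature families above can be illustrated with a small sketch that builds a bag of words, marks words in a negated context, and appends metadata as key=value features. It is a deliberate simplification: the three-word trigger list and the "negate to end of sentence" rule stand in for NegEx, MetaMap phrase extraction is not reproduced, and the `BOW_`, `NEG_` and `META_` prefixes are naming choices made for this example.

```python
import re
from collections import Counter
from typing import Dict

# Tiny illustrative trigger list; the real NegEx lexicon is much larger.
NEGATION_TRIGGERS = {"no", "not", "without"}


def text_features(report: str) -> Counter:
    """Bag of words, with words following a negation trigger marked NEG_."""
    feats: Counter = Counter()
    for sentence in re.split(r"[.!?;\n]+", report.lower()):
        negated = False  # negation scope is reset at each sentence boundary
        for tok in re.findall(r"[a-z]+", sentence):
            if tok in NEGATION_TRIGGERS:
                negated = True
                continue
            feats[("NEG_" if negated else "BOW_") + tok] += 1
    return feats


def metadata_features(metadata: Dict[str, str]) -> Counter:
    """Encode each metadata field as a binary key=value feature."""
    return Counter({f"META_{key}={value}": 1 for key, value in metadata.items()})


def all_features(report: str, metadata: Dict[str, str]) -> Counter:
    """Combine text and metadata features into one sparse feature vector."""
    feats = text_features(report)
    feats.update(metadata_features(metadata))
    return feats


example = all_features(
    "No evidence of lung mass. Right hilum is enlarged.",
    {"SEX": "F", "ADMISSIONTYPE": "EMERGENCY"},
)
```

Keeping negated words as separate features lets a classifier treat "no ... mass" differently from "mass", which is the purpose of the negative-context features on the slide.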

Method
- Machine learning algorithms
  - Support Vector Machines (as implemented in the Weka toolkit)
  - Correlation-based feature subset selection filter
- Baseline: term-matching approach
  - lung cancer, lung malignancy, lung malignant, lung neoplasm, lung tumour, lung carcinoma
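The term-matching baseline is simple enough to sketch directly. The six terms are those listed on the slide; treating the match as a case-insensitive substring search over whitespace-normalised text is an assumption, since the slides do not say how matching was done.

```python
# The six baseline terms listed on the slide.
BASELINE_TERMS = (
    "lung cancer",
    "lung malignancy",
    "lung malignant",
    "lung neoplasm",
    "lung tumour",
    "lung carcinoma",
)


def baseline_predict(report: str) -> bool:
    """Flag a report as lung cancer if it contains any baseline term."""
    # Lower-case and collapse whitespace so line breaks do not hide a term.
    text = " ".join(report.lower().split())
    return any(term in text for term in BASELINE_TERMS)


flagged = baseline_predict("Clinical indications: Surveillance of lung cancer")
missed = baseline_predict("ill-defined nodular density superior to the right lung hilum")
```

Here `flagged` is `True` and `missed` is `False`: the matcher fires whenever a term appears, regardless of context, and stays silent when the finding is described without any of the six terms.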

Results: different systems / configurations
- Evaluation: stratified 10-fold cross-validation

Classifier                              Precision  Recall  F-score
Text features only                      0.855      0.800   0.825
Full feature set (including metadata)   0.871      0.820   0.843
Term-matching baseline                  0.643      0.742   0.689

* Results not using feature selection, which reduced performance
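For readers unfamiliar with the evaluation setup, the sketch below shows one way to build stratified folds and compute precision, recall and F-score for the positive class. It is a generic illustration, not the Weka procedure used in the study; the function names and the round-robin fold assignment are choices made for this example.

```python
import random
from typing import List, Sequence, Tuple


def stratified_folds(labels: Sequence[bool], k: int = 10, seed: int = 0) -> List[List[int]]:
    """Split indices into k folds that each keep roughly the overall class ratio."""
    rng = random.Random(seed)
    folds: List[List[int]] = [[] for _ in range(k)]
    for cls in (True, False):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        # Deal each class out round-robin so rare positives are spread evenly.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def precision_recall_f1(gold: Sequence[bool], pred: Sequence[bool]) -> Tuple[float, float, float]:
    """Precision, recall and F-score for the positive class."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy data: 20 positives among 200 records, split into 10 folds.
toy_labels = [True] * 20 + [False] * 180
toy_folds = stratified_folds(toy_labels)
```

Stratification matters with only 0.8% positives, since a plain random split could leave some folds with very few positive cases. When scores are averaged over folds, the averaged F-score need not equal the harmonic mean of the averaged precision and recall.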

Results: temporal-based training sets

Conclusions
- Promising results using a straightforward machine-learning-based approach over heavily skewed distributions
- Volume of data means training time can be lengthy
  - Mainly feature extraction, due to MetaMap
  - However, this is off-line, and record classification is fast
- Use of metadata improved performance
  - Data linking helps

Thanks! Questions?

Problematic sentences for phrase-matching baseline
False positives:
- Clinical indications: Surveillance of lung cancer
- A small primary lung neoplasm would have to be considered
- ? small primary lung neoplasm
- Clinical: Metastatic lung cancer
False negatives:
- I suspect this is more likely be carcinoma and lung primary are suspected in this highly likely
- ill-defined nodular density superior to the right lung hilum (?)