Text mining for lung cancer cases over large patient admission data
David Martinez, Lawrence Cavedon, Zaf Alam, Christopher Bain, Karin Verspoor
Opportunities for Biomedical Informatics
- Increasing roll-out of EHR and HI systems is creating analysis opportunities in biomedical informatics and eHealth
  - Prediction: factors in disease and effective treatment
  - Detection: observables indicative of disease
  - Prevention: what factors circumvent those related to prediction
- Linked hospital data allows multiple sources to be leveraged for complex analytic tasks
  - Radiology, Pathology, Pharmacology, Admission, Emergency Room, etc.
Alfred Health's REASON Discovery Platform
- Initiative by Alfred Health Informatics Department
- Technologies, tools and large-scale data sources to support REsearch, AnalysiS, and OperatioNs
- Integrates data sources from multiple hospital departments
- Historical patient data linked by unique Unit Record (UR) number
- Large-scale data architecture
  - (parts of) 14+ years of data from Cerner HI implementations
  - 171,000+ updates to Clinical Events table each day
  - 62.4 million updates per annum
Language Technology and Decision Support
- Much information remains, and will remain, in text form
  - Inter-departmental reports
  - Clinical notes and narratives
  - In-patient and discharge summaries
  - EHRs/EMRs
- Such data can't be leveraged without language technology/NLP
- Tasks:
  - Monitoring (adverse) clinical events, surveillance
  - Providing best up-to-date evidence
  - Creating knowledge bases
  - Converting text into actionable data that can be mined
Disease Recognition from Clinical Reports
- Task: classify records according to specified disease
  - Enables retrieval of specific cases
  - Detect patterns of disease occurrence
  - Support creation of patient cohorts
  - Prelude to automated ICD-encoding
- Disease: Lung Cancer
  - Identified by ICD-10 code C34: Malignant neoplasm of bronchus and lung
- Classification task: assign code to patient admission record
Hybrid Text Mining Processing Framework
- Machine learning algorithm
- Annotated data set
- Classification model
- Language processing: words and linguistic structure; names of entities; context features; domain concepts
- Biomedical knowledge sources
Previous Applications
- Fungal infection surveillance by classifying CT scan reports
- Extracting information from pathology reports
Method
- Data: radiology reports for 2 (financial) years (2011-2013)
  - 756,502 reports, plus associated metadata
  - Each report linked to an admission record
  - Metadata: ICD-10 (manually assigned) used as ground truth; demographics, reason for admission, etc.
- Extracted from REASON platform
- Data pre-processed to remove ICD-10 codes and extract features
- Challenge: the real distribution is highly skewed; only 0.8% of data are positive for lung cancer
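To make the skew concrete, the short sketch below turns the two figures on this slide into approximate class counts. It also shows why accuracy alone would be a misleading measure here: a classifier that always answers "not lung cancer" would score about 99.2% accuracy while finding no cases. The 0.8% figure is rounded on the slide, so the counts are approximate, and the function name is illustrative only.

```python
TOTAL_REPORTS = 756_502   # number of radiology reports, from the slide
POSITIVE_RATE = 0.008     # "only 0.8%" positive, rounded on the slide


def skew_summary(total, positive_rate):
    """Approximate class counts and the trivial majority-class scores."""
    positives = round(total * positive_rate)
    negatives = total - positives
    return {
        "positives": positives,
        "negatives": negatives,
        # Accuracy of always predicting the negative (majority) class.
        "majority_accuracy": negatives / total,
        # That same classifier never retrieves a positive case.
        "majority_recall": 0.0,
    }


summary = skew_summary(TOTAL_REPORTS, POSITIVE_RATE)
for key, value in summary.items():
    print(key, value)
```

This is one reason precision, recall and F-score on the positive class are the natural measures for this task.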
Method
- Features:
  - Bags-of-Words from report text
  - Bags-of-Phrases identified by MetaMap
  - Negative context identified by NegEx
  - Metadata associated with linked admission record
    - NAME, DOB, SEX, MARITALSTATUS, RELIGION, ...
    - ADMISSIONREASON, ADMISSIONUNIT, ADMISSIONTYPE
    - ALLERGIES, DRUGCODE, DRUGDESC, ...
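As a rough illustration of the first and third feature types, the sketch below builds a bag-of-words for a report and prefixes tokens that follow a negation trigger within the same sentence. It is a simplified stand-in written for this example, not NegEx itself and not the system described in the talk: the trigger list, the five-token window, the NEG_ prefix and the function names are all assumptions.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny trigger list. NegEx uses a much larger
# curated set of trigger phrases together with scope-termination rules.
NEGATION_TRIGGERS = ("no", "not", "without")
NEGATION_WINDOW = 5  # assumed scope: number of tokens after a trigger


def bag_of_words(text):
    """Count tokens, marking those inside an assumed negation scope."""
    features = Counter()
    # Treat each sentence separately so negation never crosses a boundary.
    for sentence in re.split(r"[.;!?\n]+", text.lower()):
        remaining = 0
        for token in re.findall(r"[a-z]+", sentence):
            if token in NEGATION_TRIGGERS:
                remaining = NEGATION_WINDOW
                continue
            if remaining > 0:
                features["NEG_" + token] += 1
                remaining -= 1
            else:
                features[token] += 1
    return features


example = "No evidence of lung malignancy. Lung fields clear."
print(bag_of_words(example))
```

Keeping negated and affirmed mentions as separate features lets a classifier weight "NEG_malignancy" differently from "malignancy".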
Method
- Machine learning algorithms
  - Support Vector Machines (as implemented in Weka toolkit)
  - Correlation-based feature subset selection filter
- Baseline: term-matching approach
  - lung cancer, lung malignancy, lung malignant, lung neoplasm, lung tumour, lung carcinoma
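A term-matching baseline of this kind can be sketched in a few lines. The six terms are those listed on the slide; the matching rules (case-insensitive substring search after collapsing whitespace) and the function names are assumptions, since the slides do not say how matching was implemented.

```python
import re

# The six baseline terms listed on the slide.
BASELINE_TERMS = (
    "lung cancer",
    "lung malignancy",
    "lung malignant",
    "lung neoplasm",
    "lung tumour",
    "lung carcinoma",
)


def normalise(text):
    """Lower-case and collapse runs of whitespace (assumed preprocessing)."""
    return re.sub(r"\s+", " ", text.lower())


def matches_baseline(report_text):
    """Return True if any baseline term occurs anywhere in the report."""
    text = normalise(report_text)
    return any(term in text for term in BASELINE_TERMS)


# Two sentences quoted on the "Problematic sentences" slide.
print(matches_baseline("Clinical indications: Surveillance of lung cancer"))
print(matches_baseline(
    "ill-defined nodular density superior to the right lung hilum"))
```

The first sentence matches although it only describes surveillance, and the second does not match although it describes a suspicious finding. Plain term matching has no access to that kind of context.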
Results: different systems / configurations
Evaluation: stratified 10-fold cross-validation

Classifier                               Precision   Recall   F-score
Text features only                       0.855       0.800    0.825
Full feature set (including metadata)    0.871       0.820    0.843
Term-matching baseline                   0.643       0.742    0.689

* Results not using feature selection, which reduced performance
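For readers who want to relate the three columns, the sketch below recomputes the F-score from precision and recall, assuming it is the balanced F1 (harmonic mean); the slides do not state which variant was used. Recomputing from the rounded figures reproduces the baseline row and comes within about 0.002 of the two classifier rows. A gap of that size would be consistent with scores averaged per fold, though the slides do not say how they were aggregated.

```python
def f_score(precision, recall):
    """Balanced F1: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-score) as given in the results table.
REPORTED = {
    "Text features only": (0.855, 0.800, 0.825),
    "Full feature set (including metadata)": (0.871, 0.820, 0.843),
    "Term-matching baseline": (0.643, 0.742, 0.689),
}

for name, (precision, recall, reported) in REPORTED.items():
    recomputed = f_score(precision, recall)
    print(f"{name}: recomputed {recomputed:.3f}, reported {reported:.3f}")
```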
Results: temporal-based training sets
Conclusions
- Promising results using straightforward machine-learning-based approach over heavily-skewed distributions
- Volume of data means training time can be lengthy
  - mainly feature extraction, due to MetaMap
  - however, this is off-line and record-classification is fast
- Use of metadata improved performance
  - Data linking helps
Next Steps
- Expand to more ICD codes / diseases
- Incorporate more data sources into decision-making
Thanks! Questions?
Problematic sentences for phrase-matching baseline
False positives:
- Clinical indications: Surveillance of lung cancer
- A small primary lung neoplasm would have to be considered
- ? small primary lung neoplasm
- Clinical: Metastatic lung cancer
False negatives:
- I suspect this is more likely be carcinoma and lung primary are suspected in this highly likely
- ill-defined nodular density superior to the right lung hilum (?)