Interpretability of Sudden Concept Drift in Medical Informatics Domain
Gregor Stiglic, Peter Kokol
Faculty of Health Sciences, University of Maribor, Slovenia
Presentation Outline
- Visualization of Concept Drift
- National Hospital Discharge Data
- Data Pre-processing
- Experimental Settings
- Results
- Conclusions
Visualization of Concept Drift
- Concept drift: changes in data distribution or in the concept of the predicted class (Tsymbal, 2004)
- Very few research papers deal with concept drift visualization or visual interpretation
- Mostly visual data exploration techniques for:
  - Multivariate stream visualization, or
  - Univariate time-series anomaly detection
- K.B. Pratt and G. Tschapek, Visualizing concept drift, KDD 2003
  - Using brushed parallel histograms
National Hospital Discharge Data
- Hospital discharge records for approximately 1% of US hospitals
- 10 consecutive years from 2000 to 2009 (approx. 300,000 hospitalizations/year)
- Altogether 3,106,176 hospitalization events
- We used 2,509,113 events (after removing non-adults)
National Hospital Discharge Data
Each NHDS record contains:
- Personal characteristics of the patient (age, gender, race, and marital status)
- Administrative information (length of stay, discharge status, etc.)
- Medical information:
  - Up to 7 diagnoses (optional admitting diagnosis)
  - Up to 4 surgical and nonsurgical procedures
Data Pre-processing
- The ICD-9-CM coding of diagnoses: 5-digit codes were collapsed to 3-digit codes
- For example, flu is assigned to category 487:
  - 487 Influenza
  - 487.0 Influenza with pneumonia
  - 487.1 Influenza with other respiratory manifestations
  - etc.
- Altogether 1188 3-digit diagnosis codes
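The collapsing step can be sketched as follows. This is a minimal illustration, not the authors' pre-processing code: the function name `collapse_icd9` is hypothetical, the `D` prefix only mirrors the attribute names used elsewhere in the slides (e.g. D585), and the sketch assumes purely numeric, zero-padded codes, so V and E codes are not handled.

```python
def collapse_icd9(code: str) -> str:
    """Collapse a numeric ICD-9-CM diagnosis code to its 3-digit category.

    Assumption: the code is written either with a decimal point ("487.1")
    or as a bare zero-padded digit string ("4871").
    """
    digits = code.strip().replace(".", "")
    return "D" + digits[:3]


# Hypothetical input codes, chosen only for illustration.
codes = ["487.0", "487.1", "4878", "585.9", "038"]
categories = sorted({collapse_icd9(c) for c in codes})
print(categories)  # ['D038', 'D487', 'D585']
```

All influenza subcodes end up in the single category D487, which is how the number of distinct diagnosis attributes is reduced.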
Data Pre-processing
Sparse matrix (1187 codes); class: D585

ID         Month  Sex  Age  D038 D585 D678
1          1      M    25   A A A A
2          1      F    37   A P A A
21,343     2      F    65   P A A P
21,344     2      M    77   A A A A
21,345     2      M    23   A P A A
2,509,113  120    F    81   A A A A
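One way to hold such a sparse matrix in memory is to store, for each record, only the codes that are present. The sketch below is an assumption-laden illustration: it reads A and P in the table as "absent" and "present" (the slides do not spell this out), and the names `Discharge` and `split_class` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Discharge:
    month: int            # month index as in the table (1 to 120)
    sex: str
    age: int
    diagnoses: frozenset  # 3-digit codes present on the record; absent codes are not stored


def split_class(record, class_code="D585"):
    """Return (attribute codes, class label), with the class code removed from the attributes."""
    label = class_code in record.diagnoses
    return record.diagnoses - {class_code}, label


# Hypothetical record, not taken from the NHDS data.
rec = Discharge(month=1, sex="F", age=37, diagnoses=frozenset({"D585", "D401"}))
features, has_ckd = split_class(rec)
print(sorted(features), has_ckd)  # ['D401'] True
```

Removing the class code from the attribute set is consistent with 1188 diagnosis codes leaving 1187 attribute codes once D585 is used as the class.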
Human Disease Network Metrics
We use two co-morbidity measures for visualization:
- Relative risk: $RR_{ij} = \dfrac{C_{ij} N}{M_i M_j}$
- Phi-correlation: $\phi_{ij} = \dfrac{C_{ij} N - M_i M_j}{\sqrt{M_i M_j (N - M_i)(N - M_j)}}$
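A small worked example of both measures, assuming the usual reading of the symbols: $C_{ij}$ is the number of records containing both codes i and j, $M_i$ and $M_j$ are the numbers of records containing each code, and N is the total number of records. The function names and the counts are hypothetical.

```python
import math


def relative_risk(c_ij, m_i, m_j, n):
    """RR_ij = C_ij * N / (M_i * M_j)."""
    return (c_ij * n) / (m_i * m_j)


def phi_correlation(c_ij, m_i, m_j, n):
    """phi_ij = (C_ij * N - M_i * M_j) / sqrt(M_i * M_j * (N - M_i) * (N - M_j))."""
    numerator = c_ij * n - m_i * m_j
    denominator = math.sqrt(m_i * m_j * (n - m_i) * (n - m_j))
    return numerator / denominator


# Hypothetical counts: 1000 records, 100 with code i, 200 with code j, 50 with both.
rr = relative_risk(50, 100, 200, 1000)
phi = phi_correlation(50, 100, 200, 1000)
print(rr, phi)  # 2.5 0.25
```

A relative risk above 1 and a positive phi both indicate that the two codes co-occur more often than expected under independence.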
Experimental Settings
- Performance comparison of a static vs. a dynamic ensemble of Naïve Bayes classifiers
  - Static: trained only on the first 12 months of data and not changed afterwards
  - Dynamic: updated with each new batch of incoming patients (i.e. each month)
- 25 updateable NB classifiers are built using Random Spread Subsample (balanced instance sampling)
- Simple majority voting was used
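The ensemble idea can be sketched in plain Python. This is not the Weka setup used in the experiments: `UpdateableNB` is a hypothetical Bernoulli-style Naive Bayes with Laplace smoothing, and `balanced_subsample` is a simple stand-in for the balanced instance sampling mentioned above. A static ensemble would call `update` only on the first 12 monthly batches, while a dynamic one would call it on every batch.

```python
import math
import random
from collections import Counter


class UpdateableNB:
    """Naive Bayes over sets of present codes, updated incrementally from counts."""

    def __init__(self, vocabulary):
        self.vocab = list(vocabulary)
        self.class_counts = Counter()
        self.code_counts = {True: Counter(), False: Counter()}

    def update(self, batch):
        for codes, label in batch:
            self.class_counts[label] += 1
            for code in codes:
                self.code_counts[label][code] += 1

    def predict(self, codes):
        total = sum(self.class_counts.values())
        scores = {}
        for label in (True, False):
            n = self.class_counts[label]
            score = math.log((n + 1) / (total + 2))
            for code in self.vocab:
                p = (self.code_counts[label][code] + 1) / (n + 2)  # Laplace smoothing
                score += math.log(p if code in codes else 1 - p)
            scores[label] = score
        return scores[True] > scores[False]


def balanced_subsample(batch, rng):
    """Draw equally many positive and negative instances from a batch."""
    pos = [x for x in batch if x[1]]
    neg = [x for x in batch if not x[1]]
    k = min(len(pos), len(neg))
    return rng.sample(pos, k) + rng.sample(neg, k)


class Ensemble:
    def __init__(self, vocabulary, size=25, seed=0):
        self.rng = random.Random(seed)
        self.members = [UpdateableNB(vocabulary) for _ in range(size)]

    def update(self, batch):
        # Each member sees its own balanced subsample of the batch.
        for member in self.members:
            member.update(balanced_subsample(batch, self.rng))

    def predict(self, codes):
        # Simple majority voting over the members.
        votes = sum(member.predict(codes) for member in self.members)
        return votes > len(self.members) / 2


# Toy batch (hypothetical): 5 positive and 20 negative records.
batch = [(frozenset({"D403"}), True)] * 5 + [(frozenset({"D487"}), False)] * 20
ensemble = Ensemble(vocabulary=["D403", "D487"])
ensemble.update(batch)
print(ensemble.predict({"D403"}), ensemble.predict({"D487"}))  # True False
```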
Experimental Settings
- Performance metrics observed over 119 months
  - AUC (effective for imbalanced classes)
  - F-measure (integration of precision and recall)
- Our own implementation of prequential evaluation in Weka was used
  - Each batch (month) of data is first used for testing, before it can be used for training
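The test-then-train loop can be sketched as follows. This is an illustrative Python outline rather than the Weka implementation used in the study: the `prequential` function, its `warmup` parameter, and the `MajorityClass` placeholder model are hypothetical, and only the F-measure is computed. With one warm-up batch, 120 monthly batches would yield 119 scores, which is one plausible reading of the 119 observed months.

```python
def f_measure(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def prequential(batches, model, warmup=1):
    """Score each batch with the current model, then use it for training."""
    scores = []
    for i, batch in enumerate(batches):
        if i >= warmup:  # the first `warmup` batches are used for training only
            y_true = [label for _, label in batch]
            y_pred = [model.predict(x) for x, _ in batch]
            scores.append(f_measure(y_true, y_pred))
        model.update(batch)
    return scores


class MajorityClass:
    """Placeholder model that always predicts the majority class seen so far."""

    def __init__(self):
        self.pos = 0
        self.neg = 0

    def update(self, batch):
        for _, label in batch:
            if label:
                self.pos += 1
            else:
                self.neg += 1

    def predict(self, x):
        return self.pos > self.neg


# Three hypothetical monthly batches of (features, label) pairs.
batches = [
    [(None, True), (None, True), (None, False)],
    [(None, True), (None, False)],
    [(None, False), (None, False), (None, False), (None, False), (None, True)],
]
scores = prequential(batches, MajorityClass())
print(len(scores))  # 2
```

Because every batch is scored before the model sees its labels, the reported performance is always measured on unseen data.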
Results
[Figure: AUC for D585 (Chronic kidney disease) by month; x-axis ticks from FEB-00 to SEP-09, y-axis from 0.70 to 0.90; series: D_585 AUC Static and D_585 AUC Dynamic]
Visualization
How to explain the concept drift? Using motion charts:
- φ-correlation and relative risk measures w.r.t. the class attribute,
- Chi-square value or significance (p-value),
- Morbidity (support) for different diseases,
- Time.
Only disease codes with support over 100 were visualized.
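The chi-square value and its p-value for one disease code against the class can be sketched as below. The slides do not state which variant of the test was used, so this assumes the plain Pearson chi-square on a 2x2 table without continuity correction (one degree of freedom), written in terms of co-occurrence count, two code counts, and total number of records. The function names and the example supports are hypothetical.

```python
import math


def chi_square_2x2(c_ij, m_i, m_j, n):
    """Pearson chi-square for a 2x2 co-occurrence table (equals N times phi squared)."""
    numerator = n * (c_ij * n - m_i * m_j) ** 2
    denominator = m_i * m_j * (n - m_i) * (n - m_j)
    return numerator / denominator


def p_value_1df(chi2):
    """Upper-tail probability of a chi-square distribution with one degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


def filter_by_support(supports, min_support=100):
    """Keep only codes whose support (number of records) exceeds min_support."""
    return {code: s for code, s in supports.items() if s > min_support}


# Hypothetical counts: 1000 records, 100 with the code, 200 in the class, 50 with both.
chi2 = chi_square_2x2(50, 100, 200, 1000)
print(chi2)  # 62.5

# Hypothetical supports for three codes.
kept = filter_by_support({"D403": 350, "D404": 120, "D588": 40})
print(sorted(kept))  # ['D403', 'D404']
```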
Visualization
[Motion chart with labelled dimensions: Relative Risk, Phi-correlation, Chi-square p-val, Prevalence, Time]
Visualization August 2005
Visualization
August 2005

Diagnosis code  Description
D401            Essential hypertension
D403            Hypertensive chronic kidney disease
D404            Hypertensive heart and chronic kidney disease
D428            Heart failure
D583            Nephritis and nephropathy, not specified as acute or chronic
D584            Acute renal failure
D585            Chronic kidney disease (CKD)
D588            Disorders resulting from impaired renal function
Visualization October 2005
Conclusions
- D403 and D404 stand out
- Further examination uncovers a change of coding in October 2005
  - D403: Hypertensive renal disease (Sep 2005) -> Hypertensive chronic kidney disease (Oct 2005)
- Possible explanation: the new code names brought more attention to D403 and D404 in Oct 2005 and resulted in more accurate coding.
Conclusions
- Visualization can help in the interpretation of concept drift events.
- Possible improvements:
  - Additional variables for visualization of significant changes in variable values ("movement p-values")
  - Educational use (visualization of all classifiers in an ensemble of classifiers)
  - Integration into MOA instead of the Weka + Google Motion Charts implementation
Questions Gregor Stiglic gregor.stiglic@uni-mb.si Demo visualization available at: http://ri.fzv.uni-mb.si/icdm11