Case-based reasoning using electronic health records efficiently identifies eligible patients for clinical trials
Riccardo Miotto and Chunhua Weng
Department of Biomedical Informatics, Columbia University, New York City
Acknowledgments
The research was supported by two grants:
1. R01 LM009886, "Bridging the semantic gap between clinical research eligibility criteria and clinical data", from the National Library of Medicine (PI: Weng)
2. UL1 TR000040, Clinical and Translational Science Award to Columbia University (PI: Ginsberg)
The Top 5 Challenges for Clinical Research
1. Recruitment
2. Recruitment
3. Recruitment
4. Recruitment
5. Recruitment
The Opportunity
Samson W. Tu, Carol A. Kemper, Nancy M. Lane, Robert W. Carlson, and Mark A. Musen. A methodology for determining patients' eligibility for clinical trials. Journal of Methods of Information in Medicine, 32(4):317-325, 1993.
- EHRs contain rich information for identifying eligible patients for clinical trials
- As of 2015, EHRs have become pervasive in medicine
Our Initial Plan
To derive a computable representation of the clinical trial eligibility criteria and to align it with patient EHR records
Free-text Eligibility Criteria -> Text-based Knowledge Acquisition -> Semantic Alignment -> Potentially Eligible Patients
and Attempts [figure slides; images not preserved]
and Challenges
- Data quality issues (Weiskopf, JAMIA 2013)
  - Completeness
  - Concordance
  - Correctness
  - Plausibility
  - Currency
- Data representation heterogeneity: structured, unstructured
- Lack of common information models
- Concept granularity discrepancies even using the same T
- Incomplete knowledge, imprecise disease classification
- Workflow and practical issues: recent visit, belief in research, etc.
An Alternative
Instead of deriving a computable representation of the clinical trial eligibility criteria and aligning it with patient EHR records:
derive a computable representation of the clinical trial target patient and find patients similar to the target
Our Conceptual Framework
1. A set of eligible patients or clinical trial participants is manually identified
  - their EHRs are aggregated to derive the target patient
2. The target patient is applied to any unseen patient of a clinical data warehouse to check the eligibility status
3. For each patient the framework returns a relevance score
  - the higher the score, the more likely the patient is eligible for the trial
4. Patients can be ranked by their relevance scores
  - potentially eligible patients are at the top of the list
  - manual identification performed by the investigator can be done quickly
Example Related Work
Methods: comparison to related work
- State of the art: train a binary classifier to determine whether a patient is eligible
  - the classifier is the target patient
- Limitation of the state of the art
  - training a binary classifier requires a list of participants and also a list of ineligible patients
  - finding ineligible patients can be as difficult and laborious as finding eligible patients
- Alternative approach
  - generate the target patient by modeling only the trial participants
- Advantages
  - relies only on a patient representation using comprehensive data, without formally representing eligibility criteria
A Pilot Study
- Study goal
  - to show the feasibility of using only a minimal set of trial participants to discover new potentially eligible patients
- Our implementation favored a simple design to ensure a focused and correct evaluation
  - more complicated implementations are more likely to introduce mistakes in the process
- Flexible customization of how to:
  - process and summarize patient EHR data
  - represent the clinical trial participants
  - discover potentially eligible patients
Patient EHR Processing (1)
- EHR data types
  - medication orders, ICD-9 diagnoses, laboratory results, clinical notes
- Each participant is represented by four vectors, one per data type
- Medication orders
  - count the presence of each code in every participant
- Laboratory results
  - count the presence of each test if results were categorical or expressed with different scales
  - average the test result values if they were expressed using the same unit of measure
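The per-data-type summaries described above can be sketched roughly as follows. This is only an illustration, not the authors' implementation: the function names and input layout are hypothetical, "count the presence" is assumed to mean the number of occurrences of a code in a record, and each laboratory test is assumed to be either always numeric (same unit) or always categorical.

```python
from collections import Counter, defaultdict


def count_codes(events):
    """Count how often each code (e.g. a medication or ICD-9 code) occurs
    in one patient's record. `events` is a list of code strings."""
    return dict(Counter(events))


def summarize_labs(results):
    """Summarize one patient's laboratory results.

    `results` is a list of (test_name, value) pairs (hypothetical layout).
    Numeric values are assumed to share one unit per test and are averaged;
    any other value (e.g. a categorical string) is only counted.
    """
    numeric = defaultdict(list)
    counts = Counter()
    for test, value in results:
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            numeric[test].append(float(value))
        else:
            counts[test] += 1
    # Average the numeric tests, then add the occurrence counts of the others.
    summary = {test: sum(vals) / len(vals) for test, vals in numeric.items()}
    summary.update(counts)
    return summary


meds = count_codes(["metformin", "metformin", "lisinopril"])
labs = summarize_labs([("glucose", 100.0), ("glucose", 120.0),
                       ("urine_protein", "trace")])
```

Here `meds` maps each code to its occurrence count and `labs` holds an average for `glucose` and a count for `urine_protein`; both are sparse vectors stored as dictionaries.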
Patient EHR Processing (2)
- ICD-9 codes
  - count the presence of each code in every participant
- Clinical notes
  - extract relevant tags (limited presence of stop words, matching to UMLS)
  - use UMLS to normalize tags
    - aggregate synonyms and semantically similar concepts under the same tag
  - remove negated tags
  - remove temporally consecutive similar notes
  - topic modeling
    - unsupervised inference process that captures patterns of word co-occurrences within a heterogeneous set of notes to define topics
    - represent each note as a vector of topic probabilities
Target Patient (1)
Participant EHR representations are aggregated to derive the target patient
Target Patient (2)
- Target patient
  - for each data type, retain only the concepts frequently shared by the participants
    - motivated by the small number of participants available for some trials
    - a concept is frequent when it appears in at least 60% of the participants
  - average concept occurrences over all participants
- Target patient represented by four vectors
  - aggregated common patterns among the trial's participants
  - same structure as a regular patient representation
    - favors direct comparisons with patient data
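For one data type, the aggregation step above might look like the sketch below. It is a minimal illustration under stated assumptions: `build_target` and `min_share` are hypothetical names, each participant is a dictionary mapping concepts to occurrence counts, and a concept missing from a participant is assumed to count as zero when averaging.

```python
def build_target(participants, min_share=0.6):
    """Aggregate participant vectors (dicts concept -> count) for one data type.

    A concept is kept when it appears in at least `min_share` of the
    participants (60% in the slides); its value is the average occurrence
    over all participants, counting absent concepts as zero (an assumption).
    """
    n = len(participants)
    if n == 0:
        raise ValueError("at least one participant is required")
    concepts = set().union(*participants)  # every concept seen in any participant
    target = {}
    for concept in concepts:
        present = sum(1 for p in participants if p.get(concept, 0) > 0)
        if present / n >= min_share:
            target[concept] = sum(p.get(concept, 0) for p in participants) / n
    return target


diagnoses = [
    {"250.00": 2, "401.9": 1},
    {"250.00": 4, "401.9": 2},
    {"250.00": 3, "272.4": 1},
]
target_dx = build_target(diagnoses)
```

In this toy example `272.4` appears in only one of three participants and is dropped, while the other two codes pass the 60% threshold. Running the same function once per data type would give the four vectors of the target patient.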
Finding New Eligible Patients
- EHR data of unseen patients are matched to the target patient of a clinical trial
- Relevance score indicating the likelihood that the patient is eligible
  - pairwise cosine similarity within each data type
  - aggregating the scores using a weighted linear combination (w-comb)
- The higher the relevance score, the more likely the patient is eligible
- List of patients sorted by score
  - the most likely candidates are at the top of the list to speed up manual review
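The scoring and ranking described above can be sketched as follows. The slides do not say how the weights of the linear combination were chosen, so the `weights` values here are placeholders, and the function names and data layout (one sparse dictionary per data type) are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as sharing nothing with the target
    return dot / (norm_u * norm_v)


def relevance(patient, target, weights):
    """Weighted linear combination of the per-data-type cosine similarities."""
    return sum(w * cosine(patient.get(dtype, {}), target.get(dtype, {}))
               for dtype, w in weights.items())


def rank_patients(patients, target, weights):
    """Return patient ids sorted from most to least similar to the target."""
    scored = [(relevance(rec, target, weights), pid) for pid, rec in patients.items()]
    return [pid for _, pid in sorted(scored, key=lambda s: s[0], reverse=True)]


weights = {"meds": 0.5, "dx": 0.5}  # placeholder weights, not from the slides
target = {"meds": {"a": 1}, "dx": {"x": 1}}
patients = {
    "p1": {"meds": {"a": 2}, "dx": {"x": 3}},
    "p2": {"meds": {"b": 1}, "dx": {"x": 1}},
}
ranking = rank_patients(patients, target, weights)
```

Because cosine similarity ignores vector length, `p1` matches the target perfectly in both data types despite having larger counts, and is ranked ahead of `p2`.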
Evaluation: study design
- Dataset
  - 13 clinical trials from Columbia University
  - 262 unique participants
  - additional 30k patients extracted at random from the data warehouse
- 2-fold cross validation
  - half of the participants to derive the target patient
  - the other half of the participants plus the 30k random patients to test
- Ranking experiment
  - for each trial, rank all the patients in the test set by their relevance score with the corresponding target patient
  - measure at which point of the list the participants in the test set are ranked
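The evaluation protocol above can be sketched in a few lines. The helper names are hypothetical and the way participants were actually split is not stated in the slides; a seeded random shuffle is assumed here.

```python
import random


def two_fold_splits(participant_ids, seed=0):
    """Split participants into two halves and return both (train, held_out)
    pairs, so that each half is used once to derive the target patient."""
    ids = list(participant_ids)
    random.Random(seed).shuffle(ids)
    half = len(ids) // 2
    first, second = ids[:half], ids[half:]
    return [(first, second), (second, first)]


def held_out_positions(ranked_ids, held_out):
    """1-based positions at which held-out participants appear in a ranking
    of the test set (held-out participants plus random warehouse patients)."""
    held = set(held_out)
    return [pos for pos, pid in enumerate(ranked_ids, start=1) if pid in held]


splits = two_fold_splits(["p1", "p2", "p3", "p4"])
positions = held_out_positions(["p3", "c1", "p4", "c2"], ["p3", "p4"])
```

In the second call, `c1` and `c2` stand for random warehouse patients; the held-out participants land at positions 1 and 3 of the ranked list.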
Evaluation: 13 multi-site trials
Evaluation: comparisons
- w-comb obtains the best results
  - it takes advantage of the heterogeneous data types
AMIA 2014 Annual Symposium, 15-19 November 2014
Evaluation: AUC
Area under the ROC Curve = 0.952
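For readers unfamiliar with the metric, the area under the ROC curve can be read as the probability that a randomly chosen participant receives a higher relevance score than a randomly chosen non-participant. The sketch below computes it that way; it is a generic illustration with a hypothetical function name, and the slides do not state how the reported value was computed (for example, per trial or pooled).

```python
def roc_auc(scores, labels):
    """AUC as the share of (positive, negative) pairs in which the positive
    has the higher score; ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Two participants (label 1) and two random patients (label 0).
auc = roc_auc([0.9, 0.8, 0.7, 0.1], [1, 0, 1, 0])
```

The pairwise form is quadratic in the number of patients, which is fine for an illustration; a rank-based formulation would be preferable for a test set of 30k patients.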
Evaluation: results
- Precision-at-5 for each trial in each fold
- every trial has at least one potentially eligible patient within the top 5 positions of the ranking list (P5 ≥ 0.2)
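Precision-at-5 is the fraction of the top five ranked patients that are actual participants, so one participant in the top five gives 1/5 = 0.2. A minimal sketch, with a hypothetical function name and toy identifiers:

```python
def precision_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of the top-k ranked patients that are known participants."""
    relevant = set(relevant_ids)
    return sum(1 for pid in ranked_ids[:k] if pid in relevant) / k


# One held-out participant ("p1") appears in the top five; "p2" is ranked sixth.
p5 = precision_at_k(["c1", "p1", "c2", "c3", "c4", "p2"], {"p1", "p2"})
```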
Conclusions
- EHR data of clinical trial participants can be used to recommend new eligible patients
  - ranking results are consistent among multiple trials of different medical conditions
  - satisfactory results regardless of the number of participants used for training
- Potential application scenarios
  - self-standing tool
    - constantly monitoring a clinical data warehouse and alerting investigators when a new potentially eligible patient is identified
    - to be used on request to rank all patients in a data warehouse to find eligible patients for a specific trial
  - component to integrate with approaches that process eligibility criteria
Ongoing Work
- Patient representation
  - test more sophisticated techniques to represent EHR data
    - model laboratory results accounting for the temporal trends of the values
    - model diagnoses and medications using a well-chosen probability distribution to handle the incompleteness problem of EHR data
- Improve the target patient
  - statistical model that can be trained by only estimating the distributions of participants associated with each trial
    - e.g., hidden Markov models, mixture models
- Extend the experimental framework
  - more, and more diverse, clinical trials
  - more patients in the dataset
- Tackle the complexity of trial design, e.g., the need for more than one target model
Chunhua Weng, Ph.D. chunhua@columbia.edu