A Predictive Chronological Model of Multiple Clinical Observations
Travis Goodwin and Sanda M. Harabagiu


1 A Predictive Chronological Model of Multiple Clinical Observations
Travis Goodwin and Sanda M. Harabagiu
The University of Texas at Dallas, Human Language Technology Research Institute
http://www.hlt.utdallas.edu

2 Presentation Outline
1. The Problem: Purpose, Background
2. The Dataset: The corpus, Mathematical representation
3. The Approach: Simple model, Bayesian model, Inference
4. Results: Experiments, Conclusions

3 The Problem: Motivation
"personalized medicine has the potential to [improve] patient care and disease prevention [... and] to positively impact two other important trends: the increasing cost of health care and the decreasing rate of new medical product development. The ability to distinguish in advance those patients who will benefit from a given treatment and those who are likely to suffer important adverse effects could result in meaningful cost savings for the overall health care system. Moreover, the ability to stratify patients by disease susceptibility or likely response to treatment could also reduce the size, duration, and cost of clinical trials, thus facilitating the development of new treatments, diagnostics, and prevention strategies."
- The President's Council of Advisors on Science and Technology

4 The Problem: EHRs
There are an estimated million emergency department visits each year in the United States. Of these, 12% (16.4 million) result in hospital admissions, with an average hospital stay of 4.8 days.
An electronic medical record (EMR) is an individual medical report which documents a variety of clinical observations, such as the patient's diagnoses, risk factors, medications, and test results.
The electronic health record (EHR) for an individual combines all the EMRs generated during the patient's clinical chronology.
EHRs document clinical observations made at different times throughout the health management of a patient. However, the clinical course of a disease continues to progress between the times when a physician examines the patient and updates the patient's EHR.

5 The Problem: EHR Goals
The United States government has outlined four major goals for widespread EHR adoption:
1. Track data over time
2. Identify patients who are due for preventive visits and screenings
3. Monitor how patients measure up to certain parameters, such as vaccinations and blood pressure readings
4. Improve overall quality of care in a practice
In this presentation (and the associated paper), we show how each of those goals can be addressed by defining a novel probabilistic model of patients' clinical chronologies.


7 The Dataset
We considered a collection of 1,304 de-identified, longitudinal narrative electronic medical records (EMRs): 790 in the official training set and 514 in the official test set. This collection was provided by the organizers of the shared tasks on Challenges in Language Processing for Clinical Data sponsored by the 2014 Informatics for Integrating Biology and the Bedside (i2b2) and the University of Texas Health Science Center at Houston (UTHealth).
The EMRs in this collection document the progression of heart disease for 296 diabetic patients, providing between three and five EMRs for each patient. Each EMR is associated with:
1. a patient identifier which uniquely identifies the patient associated with the EMR,
2. a de-identified creation date indicating the approximate creation time of the EMR, and
3. a large body of narrative text.

8 The Dataset: Annotations
Each EMR contains manual annotations made by clinical experts. These manual annotations explicitly document the presence of certain clinical findings and medications relevant to heart disease, i.e.:
Diseases (such as CORONARY ARTERY DISEASE (CAD), DIABETES, or OBESITY)
Risk factors (such as HYPERTENSION, HYPERLIPIDEMIA)
Medications (such as aspirin)
Medication types (such as calcium channel blockers)
Each finding or medication was annotated with a temporal signal:
BEFORE: the finding (or medication) was present at the creation time of the EMR
AFTER: the finding (or medication) was present only after the creation time of the EMR
DURING: the finding (or medication) was present through the entire duration of the EMR
We considered only the clinical findings and medications annotated as BEFORE or DURING; these two temporal signals encompassed 89% of all observations. Rather than modeling the temporal signals themselves, we directly encoded the elapsed time between successive EMRs.

9 The Dataset: Findings

10 The Dataset: Medications


12 The Approach
In order to automatically predict the way a patient's clinical observations might progress based on their medical history, we define a probabilistic temporal prediction model.
Step 1: discover latent trends in the way clinical observations progressed in a provided collection of patient histories.
Step 2: apply these latent trends to the chronology of a new patient in order to predict how his or her clinical findings might progress.
IDEA: when discovering trends, or making predictions, we would prefer to only consider the clinical histories of similar patients.
SOLUTION: learn latent groups, or clusters, of patients based on the trends in the data!

13 The Approach
Given a collection of longitudinal EMRs, we define the following parameters:
$N$ = the number of patients in our dataset (i.e., 128)
$L_n$ = the number of EMRs associated with patient $n$ in our dataset (i.e., between 3 and 5)
$V$ = the number of clinical observations we are modelling (i.e., 5 clinical findings + 22 medications = 27 clinical observations)
$K$ = the number of latent groups, or clusters, to learn from the dataset
The clinical chronology of all patients in a dataset can be represented with 2 mathematical structures:
$O = \{O_{n,v,t}\} \in \{0,1\}^{N \times V \times L_n}$
$E = \{E_{n,t}\} \in \mathbb{R}^{N \times L_n}$
Where:
$n \in 1..N$ denotes the patient
$v \in 1..V$ denotes the clinical finding
$t \in 1..L_n$ denotes the index in the chronologically ordered EMR sequence
Such that:
$O$ is a binary 3rd-order tensor whose entry $O_{n,v,t}$ indicates whether the $v$-th observation was present during the $t$-th EMR in patient $n$'s clinical chronology
$E$ is a real-valued matrix whose entry $E_{n,t}$ indicates the elapsed time (in days) between the $t$-th EMR and the previous, $(t-1)$-th, EMR (with $E_{n,0} = 0$)
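As a rough illustration of this representation, the sketch below encodes a handful of made-up records into the two structures. The record tuples, the `encode` helper, and the four-item vocabulary are assumptions for illustration only; because patients have different numbers of EMRs, the sketch stores one ragged entry per patient rather than a single dense tensor, and it assigns an elapsed time of 0 to each patient's first EMR.

```python
from datetime import date

# Hypothetical toy records: (patient id, creation date, observations present).
records = [
    ("p1", date(2010, 1, 5), {"HYPERTENSION"}),
    ("p1", date(2010, 9, 2), {"CAD", "aspirin"}),
    ("p1", date(2010, 3, 6), {"HYPERTENSION", "aspirin"}),
    ("p2", date(2011, 2, 1), {"DIABETES"}),
    ("p2", date(2011, 2, 15), {"DIABETES", "HYPERTENSION"}),
    ("p2", date(2011, 6, 15), {"DIABETES", "HYPERTENSION"}),
]
vocab = ["CAD", "DIABETES", "HYPERTENSION", "aspirin"]  # V = 4 in this toy case


def encode(records, vocab):
    """Return (O, E) keyed by patient id.

    O[pid][v][t] is 1 if observation vocab[v] is present in the t-th EMR.
    E[pid][t] is the number of days since the previous EMR (0 for the first).
    """
    by_patient = {}
    for pid, created, present in records:
        by_patient.setdefault(pid, []).append((created, present))

    O, E = {}, {}
    for pid, emrs in by_patient.items():
        emrs.sort(key=lambda pair: pair[0])  # chronological order
        O[pid] = [[int(obs in present) for _, present in emrs] for obs in vocab]
        E[pid] = [0] + [
            (emrs[t][0] - emrs[t - 1][0]).days for t in range(1, len(emrs))
        ]
    return O, E


O, E = encode(records, vocab)
```

Here `O["p1"][vocab.index("aspirin")]` gives the presence of aspirin across patient p1's chronologically ordered EMRs, and `E["p1"]` gives the gaps in days between them.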

14 The Approach

15 The Approach: Simple Model
Defining a Probabilistic Graphical Model (PGM) requires:
A set of statistical random variables
A set of statistical dependencies or independencies between these variables
We define the following statistical random variables:
A binary variable for each entry in $O_{n,v,t}$
A continuous variable for each entry in $E_{n,t}$
A discrete variable $z_n$ indicating which of the $1..K$ latent groups patient $n$ is assigned to

16 The Approach: Simple Model

17 The Approach: Simple Model
We represent the chronological influences between clinical observations in successive EMRs using the following quantities:
$F_{trans}(u, v, z)$ = the number of patients in group $z$ whose clinical chronology included observation $u$ immediately following observation $v$
$F_{base}(v, z)$ = the number of patients in group $z$ whose clinical chronology included observation $v$
$F_{group}(z)$ = the number of patients in group $z$
This allows us to represent three statistical influences or dependencies:
The transition probability of an observation $u$ being present given the presence (or absence) of observation $v$ in the previous EMR for a patient in group $z$: $P_{trans}(u \mid v, z) = \frac{F_{trans}(u, v, z)}{F_{base}(v, z)}$
The base probability of an observation $v$ being present for a patient in group $z$: $P_{base}(v \mid z) = \frac{F_{base}(v, z)}{F_{group}(z)}$
The temporal probability of an observation $v$ being observed after an elapsed time $x$ for patients in group $z$: $P_{temp}(v \mid x) \sim \text{Exponential}(x; \lambda_v) = \lambda_v e^{-\lambda_v x}$
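The counting scheme behind these three probabilities can be sketched in a few lines. This is only an illustration under stated assumptions: the toy chronologies and fixed group labels are invented, the helper names are hypothetical, and the transition count follows the formula for $P_{trans}(u \mid v, z)$, i.e. it counts patients with $u$ in an EMR directly after an EMR containing $v$.

```python
import math
from collections import Counter

# Hypothetical toy data: each patient is a chronologically ordered list of
# sets of observations, with a group label assumed to be already known.
chronologies = {
    "p1": [{"HYPERTENSION"}, {"HYPERTENSION", "CAD"}],
    "p2": [{"HYPERTENSION"}, {"HYPERTENSION"}],
    "p3": [{"DIABETES"}, {"DIABETES", "CAD"}],
}
groups = {"p1": 0, "p2": 0, "p3": 1}


def count_statistics(chronologies, groups):
    """Count patients (not occurrences) for F_trans, F_base and F_group."""
    f_trans, f_base, f_group = Counter(), Counter(), Counter()
    for pid, emrs in chronologies.items():
        z = groups[pid]
        f_group[z] += 1
        for v in set().union(*emrs):
            f_base[(v, z)] += 1
        pairs = set()  # (u, v): u appears in the EMR right after one with v
        for prev, curr in zip(emrs, emrs[1:]):
            pairs.update((u, v) for u in curr for v in prev)
        for u, v in pairs:
            f_trans[(u, v, z)] += 1
    return f_trans, f_base, f_group


def p_trans(u, v, z, f_trans, f_base):
    denom = f_base[(v, z)]
    return f_trans[(u, v, z)] / denom if denom else 0.0


def p_base(v, z, f_base, f_group):
    return f_base[(v, z)] / f_group[z] if f_group[z] else 0.0


def p_temp(x, lam):
    """Exponential density lam * exp(-lam * x) for an elapsed time x."""
    return lam * math.exp(-lam * x)


f_trans, f_base, f_group = count_statistics(chronologies, groups)
```

In this toy dataset, one of the two group-0 patients with HYPERTENSION goes on to show CAD in the next EMR, so `p_trans("CAD", "HYPERTENSION", 0, f_trans, f_base)` evaluates to one half.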

18 The Approach
This model operates according to the so-called closed-world assumption: the clinical chronologies in our dataset constitute all the possible clinical chronologies that may ever occur. Clearly, this assumption is not always true. Thus, we relax this assumption by introducing a number of prior distributions over the variables in our model and assume that the clinical histories in our dataset were generated according to these prior distributions:
$O_{n,v,t} \sim \text{Binomial}(\psi_{v,k})$
$E_{n,t} \sim \text{Exponential}(\lambda_{v,k})$
$z \sim \text{Multinomial}(\theta)$
Then, we can encode prior knowledge about these distributions using second-order prior distributions:
$\psi_{v,k} \sim \text{Beta}(\alpha_v, \beta_v)$
$\lambda_{v,k} \sim \text{Gamma}(\gamma_v, \delta_v)$
$\theta \sim \text{Dirichlet}(\eta)$
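To make the generative reading of these distributions concrete, here is a minimal forward-sampling sketch. Several simplifications are assumptions rather than part of the slides: the hyperparameters are scalars shared across observations, the Gamma prior's second parameter is treated as a rate, every patient has the same number of EMRs, and a single exponential rate per group is used, because the elapsed time has no observation index while the slide's rate does.

```python
import random


def sample_dirichlet(eta, rng):
    """Draw from a Dirichlet by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in eta]
    total = sum(draws)
    return [d / total for d in draws]


def generate(num_patients, num_obs, num_groups, num_emrs, rng,
             alpha=1.0, beta=1.0, gamma=2.0, delta=100.0, eta=1.0):
    # Group proportions: theta ~ Dirichlet(eta)
    theta = sample_dirichlet([eta] * num_groups, rng)
    # Presence probabilities: psi[v][k] ~ Beta(alpha, beta)
    psi = [[rng.betavariate(alpha, beta) for _ in range(num_groups)]
           for _ in range(num_obs)]
    # ASSUMPTION: one rate per group; random.gammavariate takes a scale,
    # so the rate delta is converted to a scale of 1 / delta.
    lam = [rng.gammavariate(gamma, 1.0 / delta) for _ in range(num_groups)]

    patients = []
    for _ in range(num_patients):
        z = rng.choices(range(num_groups), weights=theta)[0]  # z ~ Mult(theta)
        # Each binary observation is a single Bernoulli trial with psi[v][z].
        O = [[int(rng.random() < psi[v][z]) for _ in range(num_emrs)]
             for v in range(num_obs)]
        # Elapsed days between successive EMRs; the first EMR has gap 0.
        E = [0.0] + [rng.expovariate(lam[z]) for _ in range(num_emrs - 1)]
        patients.append({"z": z, "O": O, "E": E})
    return theta, psi, lam, patients


theta, psi, lam, patients = generate(
    num_patients=4, num_obs=3, num_groups=2, num_emrs=3, rng=random.Random(0)
)
```

Sampling forward like this is mainly useful as a sanity check: synthetic chronologies with known group labels can be fed to an inference procedure to see whether the labels are recovered.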

19 The Approach: Bayesian Model

20 The Approach: Inference
In order to learn the trends from our dataset, we need to find the values of the latent variables in our model: $\lambda_{v,k}$, $\theta$, $\psi_{v,k}$, and $z_n$. To do this, we used collapsed Gibbs sampling.
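The slides do not give the sampler's update equations, so the following is only a generic sketch of collapsed Gibbs sampling for a simplified version of the model, not the authors' implementation. It assumes scalar hyperparameters, ignores the elapsed times entirely, and summarises each patient by how many of their EMRs did or did not contain each observation. With $\theta$ and $\psi$ integrated out, each group assignment is resampled from a weight proportional to the group's current size plus $\eta$, times a ratio of Beta functions per observation.

```python
import math
import random


def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def collapsed_gibbs(ones, zeros, num_groups, sweeps, rng,
                    alpha=1.0, beta=1.0, eta=1.0):
    """ones[n][v] / zeros[n][v]: number of patient n's EMRs with / without v."""
    num_patients, num_obs = len(ones), len(ones[0])
    z = [rng.randrange(num_groups) for _ in range(num_patients)]
    size = [0] * num_groups
    A = [[0] * num_obs for _ in range(num_groups)]  # per-group presence counts
    B = [[0] * num_obs for _ in range(num_groups)]  # per-group absence counts

    def update(n, k, sign):
        """Add (sign=+1) or remove (sign=-1) patient n from group k."""
        size[k] += sign
        for v in range(num_obs):
            A[k][v] += sign * ones[n][v]
            B[k][v] += sign * zeros[n][v]

    for n in range(num_patients):
        update(n, z[n], +1)

    for _ in range(sweeps):
        for n in range(num_patients):
            update(n, z[n], -1)  # condition on all other patients
            log_w = []
            for k in range(num_groups):
                lw = math.log(size[k] + eta)
                for v in range(num_obs):
                    a, b = alpha + A[k][v], beta + B[k][v]
                    lw += log_beta(a + ones[n][v], b + zeros[n][v]) - log_beta(a, b)
                log_w.append(lw)
            top = max(log_w)  # subtract the max for numerical stability
            weights = [math.exp(lw - top) for lw in log_w]
            z[n] = rng.choices(range(num_groups), weights=weights)[0]
            update(n, z[n], +1)
    return z


# Hypothetical toy data: three patients always show observation 0 and never
# observation 1 across 5 EMRs; three other patients show the opposite pattern.
ones = [[5, 0] for _ in range(3)] + [[0, 5] for _ in range(3)]
zeros = [[0, 5] for _ in range(3)] + [[5, 0] for _ in range(3)]
assignments = collapsed_gibbs(ones, zeros, num_groups=2, sweeps=50,
                              rng=random.Random(0))
```

On data this cleanly separated the sampler should place the two patterns in different groups; real chronologies are far noisier, and in practice one would average over many samples rather than read off a single final state.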

21 The Approach: Inference
Predicting new patient outcomes:
1. Encode the patient's history using statistical random variables so that we can leverage our probabilistic model:
$O_{v,t}$ = binary matrix indicating which clinical findings were present in each of the patient's EMRs
$E_t$ = continuous vector of the elapsed time between the patient's EMRs
2. Use our model to assign a latent group to the patient based on his or her medical history: $\hat{z} = \arg\max_{z} P(z \mid O, E)$
3. Use the transition, temporal, and base probabilities associated with that latent group to predict the presence (1) or absence (0) of each clinical finding ($v$): $w = \arg\max_{w \in \{0,1\}} P_{base}(v = w \mid z_n) \prod_{u=1}^{V} P_{trans}(v = w \mid u, z_n)$
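Step 3 can be sketched as a small scoring function. The sketch makes several assumptions that are not confirmed by the slides: the base and transition terms are combined as a product over all observations $u$, the probability of absence is taken as one minus the probability of presence, the transition table is indexed by the state of $u$ in the patient's most recent EMR, and a small `eps` guards against taking the logarithm of zero. The probability tables are invented numbers for a single, already assigned group.

```python
import math


def predict_observation(v, last_emr, p_base, p_trans, eps=1e-9):
    """Return 1 (present) or 0 (absent) for observation v.

    last_emr[u]      : 0/1 state of observation u in the most recent EMR
    p_base[v]        : P_base(v = 1 | z) for the assigned group z
    p_trans[v][u][s] : P_trans(v = 1 | u had state s in the previous EMR, z)
    """
    scores = {}
    for w in (0, 1):
        prob = p_base[v] if w == 1 else 1.0 - p_base[v]
        log_score = math.log(max(prob, eps))
        for u, state in enumerate(last_emr):
            p_present = p_trans[v][u][state]
            prob = p_present if w == 1 else 1.0 - p_present
            log_score += math.log(max(prob, eps))
        scores[w] = log_score
    return max(scores, key=scores.get)  # ties resolve to 0 (absent)


# Hypothetical tables for V = 2 observations in one latent group.
p_base = [0.5, 0.8]
p_trans = [
    [[0.1, 0.9], [0.2, 0.6]],   # rows for v = 0
    [[0.1, 0.9], [0.3, 0.95]],  # rows for v = 1
]

prediction = [predict_observation(v, [1, 0], p_base, p_trans) for v in range(2)]
```

Working in log space avoids numerical underflow when the product runs over all 27 observations; the argmax is unchanged because the logarithm is monotonic.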



23 Results: Experiments
In our experiments we used the official train/test split of the 2014 i2b2/UTHealth dataset for evaluating risk factor identification:
Training set: 790 EMRs for 178 patients
Testing set: 514 EMRs for 118 patients
We attempted to predict the set of observations in the last EMR for each patient, given all the previous EMRs for that patient.
Note: we also performed leave-one-out cross-validation; there were no statistically significant differences.

24 Results: Experiments
For each patient we considered each observation as:
true positive (TP) if it was predicted by the model and mentioned in the EMR
false positive (FP) if it was predicted by the model but not mentioned in the EMR
false negative (FN) if it was not predicted by the model but was mentioned in the EMR
true negative (TN) if it was not predicted by the model and was not mentioned in the EMR
We considered a variety of performance measures:
Accuracy (Acc.): $\frac{TP + TN}{TP + FP + FN + TN}$
Positive Predictive Value (PPV, also known as Precision): $\frac{TP}{TP + FP}$
False Negative Rate (FNR, also known as the miss rate): $\frac{FN}{FN + TP}$
False Positive Rate (FPR, also known as the fall-out): $\frac{FP}{FP + TN}$
True Negative Rate (TNR, also known as Specificity): $\frac{TN}{FP + TN}$
True Positive Rate (TPR, also known as the hit rate or Recall): $\frac{TP}{TP + FN}$
$F_1$-Measure: $\frac{2TP}{2TP + FP + FN}$
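These definitions translate directly into code. The sketch below is illustrative only: the function names, the four-item vocabulary, and the example counts are invented, and returning 0.0 for an empty denominator is a convention chosen here, not something the slides specify.

```python
def confusion_counts(predicted, actual, vocab):
    """Count TP, FP, FN, TN for one patient's final EMR."""
    tp = sum(1 for v in vocab if v in predicted and v in actual)
    fp = sum(1 for v in vocab if v in predicted and v not in actual)
    fn = sum(1 for v in vocab if v not in predicted and v in actual)
    tn = sum(1 for v in vocab if v not in predicted and v not in actual)
    return tp, fp, fn, tn


def evaluate(tp, fp, fn, tn):
    """Compute the seven measures from the confusion counts."""
    def ratio(num, den):
        return num / den if den else 0.0  # convention for empty denominators

    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "ppv": ratio(tp, tp + fp),
        "fnr": ratio(fn, fn + tp),
        "fpr": ratio(fp, fp + tn),
        "tnr": ratio(tn, fp + tn),
        "tpr": ratio(tp, tp + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


# Hypothetical example: one patient, four possible observations.
vocab = ["CAD", "DIABETES", "HYPERTENSION", "aspirin"]
counts = confusion_counts({"CAD", "aspirin"}, {"CAD", "DIABETES"}, vocab)
scores = evaluate(8, 2, 2, 8)
```

In an evaluation over a whole test set, the counts would be summed over all patients (or computed per observation) before calling `evaluate`.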

25 Results: Number of Patient Groups

26 Results: Individual Observations

27 Results: Conclusions
In this presentation (and the associated paper), we presented a novel method for constructing a data-driven probabilistic graphical model of patients' clinical chronologies. We have shown how this model can be used to:
1. Infer latent groups of similar patients from a dataset
2. Discover trends in how clinical observations evolve over time from a dataset
3. Assign new patients to the most similar group in a dataset
4. Predict the most likely progression of clinical findings for a patient
The model we presented does not depend on any a priori knowledge about any particular clinical findings, and instead discovers trends based on latent statistical information. We have shown that this model yields promising performance for predicting risk factors of heart disease in a dataset of diabetic patients.

28 Questions?
