Visualizing Data for Hypothesis Generation Using Large-Volume Health Care Claims Data

Eberechukwu Onukwugha PhD, School of Pharmacy, UMB
Margret Bjarnadottir PhD, Smith School of Business, UMCP
Shujia Zhou PhD, Computer Science, UMBC

Acknowledgement
- Catherine Plaisant PhD, Human Computer Interaction Lab (HCIL), UMCP
- Sana Malik, Computer Science, UMCP
- Ran Qi, UMBC
- Jinani Jayasekera, UMB
A picture is worth
- What hypotheses would you want to test?

Motivation
- Few tools for hypothesis generation and data-driven insight
- Limited guidance on how to generate insight
- Identify available tools
- Two case studies
Available tools
- Proportional symbols
- Treemap
- Choropleth mapping

Proportional symbols
Source: http://canceratlas.cancer.org/risk-factors/ [Accessed 5/6/2016]

Treemap
Source: http://vizhub.healthdata.org/gbd-compare/ [Accessed 5/16/2016]

Choropleth Map
Source: http://vizhub.healthdata.org/gbd-compare/ [Accessed 5/6/2016]
Ideal tool
- Suited to large-volume health utilization data
- High-level abstractions and individual detail
- Integrated into population health analyses

Cost Prediction Using a Survival Grouping Algorithm: An Application to Incident Prostate Cancer Cases
E Onukwugha, R Qi, J Jayasekera, S Zhou
PharmacoEconomics 2016 Feb;34(2):207-16.
Outline
- Motivation
- Grouping Algorithm
- Interpretation
- Hypothesis generation

Objective
- To illustrate how a grouping algorithm can be used to generate hypotheses regarding cost accumulation
Prognostic systems
- Prognostic systems are utilized in clinical practice to predict survival and outcomes
- Example: TNM classification system for classifying primary tumors
- Patients within a TNM class would have similar disease progression and survival

Diagram labels:
- Patient demographics
- Clinical predictors
- Cancer histology
- Age at diagnosis
- Comorbid conditions
- Survival Curves
- Average survival pattern
Cost Curves
- We can apply this approach to build cost curves across patient groups
- Identify cost predictors over time

Prognostic Systems
- Identify groups of patients who have similar clinical prognostic factors
- TNM classification scheme is a bin model
  - Mutually exclusive bins
  - Exhaustive partitioning of patients
  - Bins are grouped into stages
- Use the mean survival of patients in a bin to predict the survival of a new patient placed in that bin
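The bin-model prediction rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the TNM system itself: the bin labels and survival times are hypothetical, the function names are made up for this example, and censoring is ignored for simplicity.

```python
from collections import defaultdict
from statistics import mean


def fit_bin_means(patients):
    """Map each bin label to the mean observed survival of its patients.

    `patients` is a list of (bin_label, survival_months) pairs.
    Censoring is ignored here, which a real analysis would not do.
    """
    by_bin = defaultdict(list)
    for bin_label, survival_months in patients:
        by_bin[bin_label].append(survival_months)
    return {label: mean(values) for label, values in by_bin.items()}


def predict_survival(bin_means, bin_label):
    """Predict survival for a new patient as the mean of the bin they fall into."""
    return bin_means[bin_label]


# Hypothetical training data: bins are mutually exclusive and every patient has one.
training = [
    ("T1N0M0", 60.0),
    ("T1N0M0", 72.0),
    ("T3N1M0", 30.0),
    ("T3N1M0", 18.0),
]
bin_means = fit_bin_means(training)
print(predict_survival(bin_means, "T1N0M0"))
```

Because every added prognostic factor multiplies the number of bins, the per-bin sample size shrinks quickly, which is the motivation for merging bins with a grouping algorithm.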
Prognostic Systems
- TNM puts constraints on the number of prognostic factors
- The Ensemble Algorithm for Clustering Cancer Data (EACCD) [1] admits more prognostic factors
  - Increased prediction accuracy
  - Long computational time
- Increasing the number of prognostic factors increases the number of bins dramatically
[1] Chen D, Xing K, Henson D, Sheng L, Schwartz AM, Cheng X. Developing prognostic systems of cancer patients by ensemble clustering. BioMed Research International. 2009.

Prognostic Systems and Cost Curves
- Grouping Algorithm for Cancer Data (GACD) [1] uses a clustering algorithm
  - Reduces computational time
  - Increases clustering accuracy
[1] Qi R, Zhou S, editors. Simulated Annealing Partitioning: An Algorithm for Optimizing Grouping in Cancer Data. Data Mining Workshops (ICDMW), 2013 IEEE 13th International Conference on; 2013: IEEE.
Overview of SEER-Medicare
SEER
- Cancer registry data
- Cause of death
- Area characteristics
- Cases from 2000 to 2007
Medicare
- Health care claims
- Treatment dates
- Cost data
- Claims from 1999 to 2009
SEER-Medicare

Methods: Datasets
Application of the GACD required two sets of data
1. Survival data
- Survival time
- Clinical variables
- Demographics
- Indicator for censoring
Methods: Datasets (cont.)
2. Cost data
- Healthcare costs reimbursed by Medicare
  - Physician
  - Skilled nursing facility
  - Outpatient care
  - Home health care
  - Hospice care
  - Durable medical equipment
- Costs in 2009 US dollars using the Consumer Price Index (CPI)

Methods: Prognostic Factors
Prognostic factors and cost drivers
- Identified from literature reviews
  - Cancer stage
  - Urban residence
  - Age
  - Poor performance status (proxy indicator)
  - Race
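Expressing costs in 2009 US dollars with the CPI amounts to scaling each nominal cost by the ratio of the 2009 index to the index of the year in which the cost was incurred. The sketch below shows that arithmetic only; the index values are hypothetical placeholders, not actual CPI figures, and the slides do not say which CPI series was used.

```python
def to_2009_dollars(nominal_cost, cpi_cost_year, cpi_2009):
    """Rescale a nominal cost to 2009 dollars using a price-index ratio."""
    if cpi_cost_year <= 0:
        raise ValueError("index value must be positive")
    return nominal_cost * cpi_2009 / cpi_cost_year


# Hypothetical index values, for illustration only (not real CPI data).
HYPOTHETICAL_CPI = {2005: 90.0, 2009: 100.0}

adjusted = to_2009_dollars(900.0, HYPOTHETICAL_CPI[2005], HYPOTHETICAL_CPI[2009])
print(adjusted)
```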
Methods: Data Processing
- Patients with similar profiles were grouped into natural clusters
- Example

Data Grouping Example
Variable                                        Unformatted data   Formatted data
Stage                                           Stage 0            0
                                                Stage I            1
                                                Stage II           2
                                                Stage III          3
                                                Stage IV           4
                                                Unknown stage      5
Urban/rural location at the time of diagnosis   Rural              0
                                                Urban              1
Age at the time of diagnosis                    65-69              1
                                                70-74              2
                                                75-79              3
                                                80-84              4
                                                85+                5
Data Grouping Example (cont.)
Variable / Unformatted data                                                  Formatted data
Performance status proxy measured in the 12 months prior to prostate cancer diagnosis
  No claims for use of walking aids, wheelchairs, oxygen and related
  supplies, skilled nursing facility or hospitalizations 12 months
  prior to prostate cancer diagnosis                                         0
  At least one claim for the above                                           1
Race
  Non-Hispanic White                                                         1
  Non-Hispanic African American                                              2
  Hispanic                                                                   3
  Other                                                                      4
Grade
  Well differentiated, moderately differentiated, unknown                    0
  Poorly differentiated or undifferentiated                                  1

Data Grouping Example (cont.)
Factors                                                   Mean                       Level
Stage                                                     Prostate cancer stage II   2
Age                                                       75-79                      3
Race                                                      White                      1
Performance status proxy measured in the 12 months
  prior to prostate cancer diagnosis                      No relevant indicator      0
Urban                                                     Urban                      1
Combination: 23101 (natural cluster)
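The formatting step above can be sketched as a small encoder that turns a patient profile into a combination code. The code tables follow the slides; the function and argument names are illustrative assumptions, and the digit order (stage, age, race, performance status proxy, urban) is inferred from the 23101 example. Grade is left out because it does not appear in that example.

```python
# Code tables copied from the Data Grouping Example slides.
STAGE_CODES = {
    "Stage 0": 0, "Stage I": 1, "Stage II": 2,
    "Stage III": 3, "Stage IV": 4, "Unknown stage": 5,
}
RACE_CODES = {
    "Non-Hispanic White": 1, "Non-Hispanic African American": 2,
    "Hispanic": 3, "Other": 4,
}
AGE_BANDS = [(65, 69, 1), (70, 74, 2), (75, 79, 3), (80, 84, 4)]


def age_code(age):
    """Return the formatted age level (1-5) for an age at diagnosis of 65 or older."""
    for low, high, code in AGE_BANDS:
        if low <= age <= high:
            return code
    if age >= 85:
        return 5
    raise ValueError("age at diagnosis must be 65 or older")


def encode_combination(stage, age, race, has_proxy_claim, urban):
    """Build the combination string: stage, age, race, performance proxy, urban."""
    digits = [
        STAGE_CODES[stage],
        age_code(age),
        RACE_CODES[race],
        int(has_proxy_claim),
        int(urban),
    ]
    return "".join(str(d) for d in digits)


# The worked example from the slide: stage II, age 75-79, White, no proxy claim, urban.
print(encode_combination("Stage II", 77, "Non-Hispanic White", False, True))  # 23101
```

Patients who share a combination string form one natural cluster, which is the unit the grouping algorithm then merges.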
Methods: Grouping by Survival Similarity
- GACD applied to the formatted data (e.g., 23101)
- Patients grouped according to their survival similarities
- Generated cost curves for the resulting groups

Methods: Grouping Algorithm
Step 1:
- Extract the combinations based on all possible combinations, removing the combinations 100
Methods: Grouping Algorithm (cont.)
Step 2:
- Use the log-rank test to initialize the dissimilarity between combinations
- Apply a sequence of non-randomized clustering procedures (e.g., the PAM algorithm)
- Redefine the dissimilarity between combinations by assigning weights to clustering results

Methods: Grouping Algorithm (cont.)
Step 3:
- Perform agglomerative hierarchical clustering to obtain the affinity to the survival curves as well as groups of patients
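To make the Step 2 initialization concrete, the sketch below computes the standard two-sample log-rank chi-square statistic, which can serve as a dissimilarity between two combinations: identical survival experience gives 0, and larger values indicate more separated survival curves. This is a plain-Python illustration under the assumption that the raw statistic is used directly; the slides do not specify how GACD scales or weights it.

```python
def logrank_statistic(group_a, group_b):
    """Two-sample log-rank chi-square statistic.

    Each group is a list of (time, event) pairs, where event is 1 for a
    death and 0 for a censored observation.
    """
    pooled = group_a + group_b
    event_times = sorted({t for t, event in pooled if event == 1})
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        at_risk_a = sum(1 for time, _ in group_a if time >= t)
        at_risk_b = sum(1 for time, _ in group_b if time >= t)
        deaths_a = sum(1 for time, event in group_a if time == t and event == 1)
        deaths_b = sum(1 for time, event in group_b if time == t and event == 1)
        n = at_risk_a + at_risk_b
        d = deaths_a + deaths_b
        if n < 2:
            continue  # the variance term is undefined with one subject at risk
        observed_minus_expected += deaths_a - d * at_risk_a / n
        variance += d * (at_risk_a / n) * (at_risk_b / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0
    return observed_minus_expected ** 2 / variance


# Hypothetical survival data (time in months, event indicator).
early = [(1, 1), (2, 1), (3, 1)]
late = [(10, 1), (11, 1), (12, 1)]
print(logrank_statistic(early, late))
```

Computing this statistic for every pair of combinations yields a symmetric dissimilarity matrix that the clustering procedures can then refine.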
Methods: Grouping Algorithm (cont.)
- The hierarchical clustering result is represented by a dendrogram
  - Average linkage method
  - Closest survival curves are merged first
  - Connected components form clusters
[Figure: Dendrogram of nine combinations]

Methods: Grouping Algorithm (cont.)
Step 4:
- Identify groups of patients for plotting cost curves
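The average-linkage merging behind the dendrogram can be sketched as follows. This is a naive pure-Python version written for readability, not the GACD implementation: it takes a precomputed dissimilarity matrix (the values here are hypothetical) and repeatedly merges the two clusters with the smallest average pairwise dissimilarity until the requested number of groups remains.

```python
def average_linkage(dissimilarity, n_clusters):
    """Agglomerative clustering with average linkage.

    `dissimilarity` is a symmetric n x n list of lists. Returns a list of
    clusters, each a sorted list of item indices.
    """
    clusters = [[i] for i in range(len(dissimilarity))]
    while len(clusters) > n_clusters:
        best = None  # (average distance, index i, index j)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [dissimilarity[a][b] for a in clusters[i] for b in clusters[j]]
                distance = sum(pairs) / len(pairs)
                if best is None or distance < best[0]:
                    best = (distance, i, j)
        _, i, j = best
        # Closest clusters are merged first, as in the dendrogram.
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]
    return clusters


# Hypothetical dissimilarities between four combinations:
# 0 and 1 are similar, 2 and 3 are similar, the two pairs are far apart.
D = [
    [0, 1, 10, 10],
    [1, 0, 10, 10],
    [10, 10, 0, 1],
    [10, 10, 1, 0],
]
print(average_linkage(D, 2))  # [[0, 1], [2, 3]]
```

Cutting the merge sequence at a chosen number of clusters corresponds to cutting the dendrogram at a height, which yields the patient groups used for the cost curves.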
[Figure: Original bushy curves]
[Figure: Intermediate trimmed curves]
[Figure: Final trimmed curves]
[Figure: Cost curves for final survival curves (axis: $US)]
Training Sample
- 50,091 men with incident prostate cancer
- Result from grouping algorithm
  - Review
  - Interpret
  - Generate hypotheses

Results: Interpret
Full sample N[a] = 50,091; Group 0 N[a] = 8,897; Group 1 N[a] = 7,572; Group 2 N[a] = 29,006; Group 3 N[a] = 2,727; Group 4 N[a] = 1,889. Group columns are Col. %[b].

Stage at prostate cancer diagnosis (P-value[c] <0.01)
Stage      N        Full sample %   Group 0   Group 1   Group 2   Group 3   Group 4
0          -        -               -         -         -         -         -
1          -        -               -         -         -         -         -
2          27,556   55.01           45.80     42.45     68.62     13.35     0.00
3          2,269    4.53            3.62      1.85      6.23      0.00      0.00
4          3,393    6.77            0.00      15.21     0.00      28.31     77.77
Unknown    16,873   33.68           50.58     40.49     25.15     58.34     22.23
Results: Interpret (cont.)
Full sample N[a] = 50,091; Group 0 N[a] = 8,897; Group 1 N[a] = 7,572; Group 2 N[a] = 29,006; Group 3 N[a] = 2,727; Group 4 N[a] = 1,889. Group columns are Col. %[b].

Demographic Characteristics
Variable               N        Full sample % (or mean)   Group 0      Group 1      Group 2      Group 3      Group 4      P-value[c]
Age at diagnosis                                                                                                           <0.01
  65-69                12,543   25.04                     3.61         9.90         39.55        0.00         0.00
  70-74                15,070   30.09                     22.06        9.67         41.80        9.17         0.00
  75-79                11,994   23.94                     44.76        27.26        18.17        19.14        8.26
  80-84                7,018    14.01                     29.57        40.89        0.48         17.64        35.52
  85+                  3,466    6.92                      0.00         12.28        0.00         54.05        56.22
Age (mean, SD)[d]      50,091   74.5 (6.1)                76.9 (4.1)   78.8 (5.8)   71.1 (3.7)   83.3 (6.0)   85.5 (4.6)   <0.01
Race                                                                                                                       <0.01
  White non-Hispanic   42,988   85.82                     86.50        83.49        84.40        95.34        100.00
  African American     3,774    7.53                      8.42         15.10        6.05         4.66         0.00
  Other[e]             1,508    3.01                      1.65         1.41         4.32         0.00         0.00
Location                                                                                                                   <0.01
  Urban                46,525   92.88                     92.24        95.83        91.66        94.79        100.00
  Rural                3,566    7.12                      7.76         4.17         8.34         5.21         0.00
Callout: 92% are of age 80

Results: Interpret (cont.)
Clinical Characteristics (Pre-Period)
Variable                      N        Full sample %   Group 0   Group 1   Group 2   Group 3   Group 4   P-value[c]
Charlson comorbidity index                                                                               <0.01
  Zero                        32,486   64.85           61.28     54.52     70.80     54.75     46.37
  One                         10,202   20.37           22.94     22.97     18.59     21.38     23.61
  Two or higher               5,399    10.78           12.99     18.89     6.21      19.44     25.46
  Missing                     2,004    4.00            2.79      3.63      4.39      4.44      4.55
Performance status proxies    7,366    14.71           21.67     36.56     2.46      35.50     52.30     <0.01
Results: Interpret (cont.)
Full sample N[a] = 50,091; Group 0 N[a] = 8,897; Group 1 N[a] = 7,572; Group 2 N[a] = 29,006; Group 3 N[a] = 2,727; Group 4 N[a] = 1,889. Group columns are Col. %[b].

Characteristics (Post-Period)
Variable                                   N        Full sample % (or mean)   Group 0       Group 1       Group 2       Group 3       Group 4     P-value[c]
Charlson comorbidity index                                                                                                                        <0.01
  Zero                                     27,653   55.21                     53.95         45.77         61.55         38.10         26.26
  One                                      10,898   21.76                     22.88         22.39         21.88         19.66         15.03
  Two or higher                            7,129    14.23                     16.74         21.17         10.83         20.50         17.79
  Missing                                  4,411    8.81                      6.43          10.67         5.74          21.75         40.92
Performance status proxies                 22,354   44.63                     38.27         49.51         42.30         57.79         71.68       <0.01
All cause death                            12,434   24.82                     27.00         43.90         10.91         70.30         86.02       <0.01
Prostate cancer related death              3,100    28.47                     15.90         27.48         16.77         36.12         57.30       <0.01
Time-to death (in days) (mean, SD)         12,434   1,092 (799)               1,274 (827)   1,126 (786)   1,263 (789)   940 (755)     601 (593)   <0.01
Length of followup (in days) (mean, SD)    50,091   1,562 (847)               1,716 (886)   1,495 (884)   1,618 (786)   1,216 (901)   742 (725)   <0.01
Callout: Second longest

Hypothesis generation
- Generate hypotheses from grouping results on full sample
- Consider subsamples
- Inpatient costs
- Outpatient costs
- Prescription costs
Hypotheses: Predictors of cost accumulation
Group 4, highest
- Mortality rate
- Proportion with CCI of 2 or higher in the pre-period
- Proportion with at least one performance status proxy indicator
- No health services use prior to diagnosis

Hypotheses (contd.)
Group 1:
- Highest proportion of African Americans (15%)
Group 3:
- Largest proportion of men with an unknown cancer stage (58%)
Groups 1 and 3:
- Highest proportion of men with CCI score of 2 or higher in the post-period only
[Figure: Cost, White non-Hispanic (WNH) sample (axis: $US)]
[Figure: Cost, African-American (AA) sample (axis: $US)]
[Figure: Inpatient Cost, WNH (axis: $US)]
[Figure: Inpatient Cost, AA (axis: $US)]
Limitations
- The incorporation of other available claims-based measures may lead to different hypotheses
- Utilizing electronic medical records and other linked datasets could impact the grouping results and hypotheses

Discussion
- Prognostic tools
  - Provide information on health status over time
  - Describe cost accumulation over time
- Grouping algorithm can characterize groups associated with higher future costs
- Grouping algorithm can be used to generate hypotheses for future research
Human Computer Interaction Lab
Visualization with EventFlow
Margrét Vilborg Bjarnadóttir, Robert H. Smith School of Business, University of Maryland
With Eberechukwu Onukwugha, Sana Malik, Catherine Plaisant and Tanisha Gooden

Adherence
- Morbidity
- Mortality
- Costs
- $13.35 billion in hospitalization costs annually due to medication non-adherence (Sullivan et al 1990)
- The medication possession ratio (MPR)
Michael A. Kane, Margrét V. Bjarnadóttir, Sanjay Ghimire. 2012. Study of compliance in hypertension treatment. American Society of Hypertension, Annual Scientific Meeting. Poster Presentation, New York, NY, May 2012.
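For readers unfamiliar with the MPR, a minimal sketch follows. It uses the common definition (total days of medication supplied divided by the number of days in the observation period); the slides do not state which variant the study used, so the function name, the uncapped ratio, and the sample fills are assumptions for illustration.

```python
def medication_possession_ratio(days_supplied, period_days):
    """Total days of supply dispensed divided by days in the observation period.

    `days_supplied` is a list with the days' supply of each fill in the period.
    The ratio is not capped at 1.0 here; some analyses do cap it.
    """
    if period_days <= 0:
        raise ValueError("period_days must be positive")
    return sum(days_supplied) / period_days


# Hypothetical patient: three 30-day fills over a 120-day period.
print(medication_possession_ratio([30, 30, 30], 120))  # 0.75
```

A single ratio like this hides when the gaps occurred, which is part of the motivation for looking at the temporal patterns visually.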
The Data
- 900,000 individuals
- 16 million prescription claims
- 5 drug classes:
  - Angiotensin-Converting Enzyme Inhibitors (ACE)
  - Angiotensin II Receptor Blockers (ARB)
  - Calcium Channel Blockers (CCB)
  - Beta blockers (Beta)
  - Diuretics

The Research Questions
- Can we use visualization to understand adherence patterns?
- What are the effects of modeling decisions on our outcome measures?
[Figure: Hypertension Treatment (axis: time)]
[Figure: Event Flow]
EventFlow: Behind the GUI
Constructing the EventFlow Overview
- RECORD, RECORD, RECORD
- AGGREGATE: Merge multiple records into tree
- VISUALIZE: Display the aggregation
Diagram event labels: A, B, C, D, E
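The aggregate step ("merge multiple records into tree") can be illustrated with a prefix tree over event sequences, where each node counts how many records share the sequence of events leading to it. This is a hedged sketch of the general idea, not EventFlow's actual implementation; the data structure, function name, and sample records are assumptions.

```python
def aggregate(records):
    """Merge event sequences into a prefix tree.

    Each record is a list of event labels in temporal order. Every node
    stores how many records pass through it, which is what an overview
    display would use to size its bars.
    """
    root = {"count": 0, "children": {}}
    for sequence in records:
        node = root
        node["count"] += 1
        for event in sequence:
            node = node["children"].setdefault(event, {"count": 0, "children": {}})
            node["count"] += 1
    return root


# Hypothetical records using the event labels from the diagram.
records = [
    ["A", "B"],
    ["A", "C"],
    ["A", "B", "D"],
    ["E"],
]
tree = aggregate(records)
print(tree["children"]["A"]["count"])  # 3 records start with event A
```

Records with a common prefix collapse into one branch, so common patterns show up as wide branches and rare ones as thin ones.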
[Figure: Event Flow (axes: Number of Records, Time)]
USING EVENTFLOW WITH CLAIMS DATA
Confetti to visualizations that answer questions
[Figure: Hypertension Treatment (axis: time)]

Gaps & Overlaps
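Gaps and overlaps between prescription fills can be computed directly from claims once each fill is treated as an interval of coverage. The sketch below is one simple way to do this for a single drug class, comparing each fill only with the next one; the (start_day, days_supply) representation and the function name are assumptions, and how the study actually handled gaps and overlaps is a modeling decision not specified on the slide.

```python
def gaps_and_overlaps(fills):
    """Sum gap days and overlap days between consecutive fills.

    `fills` is a list of (start_day, days_supply) pairs. A fill starting on
    day s with k days of supply covers days s through s + k - 1.
    """
    ordered = sorted(fills)
    gap_days = 0
    overlap_days = 0
    for (start, supply), (next_start, _) in zip(ordered, ordered[1:]):
        run_out = start + supply  # first day with no supply from this fill
        if next_start > run_out:
            gap_days += next_start - run_out
        else:
            overlap_days += run_out - next_start
    return gap_days, overlap_days


# Hypothetical patient: second fill is 5 days late, third fill is 5 days early.
print(gaps_and_overlaps([(0, 30), (35, 30), (60, 30)]))  # (5, 5)
```

Whether an early refill should extend later coverage or be discarded is exactly the kind of modeling decision that changes the resulting adherence picture.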
Gaps & Overlaps

Case study
UNDERSTANDING TEMPORAL PATTERNS IN HYPERTENSIVE THERAPY
The Patients

Event Flow: Visualizing Patterns
Diuretics: General Patterns

The Research Questions
- What are the effects of modeling decisions on our outcome measures?
- Can we use visualization to understand adherence patterns?
[Figure: Diuretics Only - Gaps]
[Figure: Diuretics Only - Gaps, Modeling Decisions*]
Diuretics Only & Gaps

The Research Questions
- What are the effects of modeling decisions on our outcome measures?
- Can we use visualization to understand adherence patterns?
  - Can we understand patient behavior?
  - Can we identify good vs bad patterns?
  - Can we identify early non-adherence?
  - Patterns vs. medical outcomes
  - Pattern stabilization
Event Flow: Drilling Down
Non-compliance with guidelines for members with a history of heart failure
- Non-dihydropyridines (Acceptable)
- Dihydropyridines (Not acceptable)
EventFlow Summary
On our way to understanding adherence behavior in hypertension therapy:
- Patterns are far from ideal
- How should adherence be described?
EventFlow is a great tool to:
- Understand the big picture
- Drill down
- Generate hypotheses
More information: www.hcil.umd.edu/eventflow
Cohort Comparison
http://hcil.umd.edu/coco

Thank You!

Q & A

Appendix
Dataset: Linked Surveillance, Epidemiology and End Results (SEER)-Medicare database.

Methods: Cost Curves and Prediction
- The curve of each group reflects cumulative inverse-probability weighted (IPW) costs
- To evaluate prediction accuracy
  - Split survival data into two data sets (D0 and D1) of equal size
  - D0: training data set
  - D1: testing data set
Methods: Cost Curves and Prediction (cont.)
- D0: develop patient groups using GACD
- D1: group the combinations in the dataset on the basis of the results from D0
- Difference between the predicted cost (based on D0) and the actual cost from D1
  - With and without the application of the GACD
[Figure: Predicted Costs with Grouping]
[Figure: Predicted Cost without Grouping]

Methods: Cost Curves and Prediction (cont.)
Between the actual cost and predicted cost, we calculated:
- Average difference
- Root mean squared error (RMSE)
- Mean absolute error (MAE) and 95% confidence interval
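The three accuracy measures listed above can be computed as in the following sketch. The sign convention (predicted minus actual) is an assumption, since the slides do not state it, and the cost values are hypothetical; the confidence interval is omitted.

```python
from math import sqrt


def prediction_errors(actual, predicted):
    """Average difference, RMSE, and MAE between predicted and actual costs.

    Differences are taken as predicted minus actual (an assumed convention).
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [p - a for a, p in zip(actual, predicted)]
    n = len(diffs)
    return {
        "average_difference": sum(diffs) / n,
        "rmse": sqrt(sum(d * d for d in diffs) / n),
        "mae": sum(abs(d) for d in diffs) / n,
    }


# Hypothetical costs in US$ for three patients in the testing set.
actual_costs = [100.0, 200.0, 300.0]
predicted_costs = [110.0, 190.0, 330.0]
errors = prediction_errors(actual_costs, predicted_costs)
print(errors)
```

RMSE penalizes large individual misses more heavily than MAE, so reporting both gives a sense of whether errors are evenly spread or driven by a few patients.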
Results: Difference in Predictions
                                 Grouped data (US$)   Non-grouped data (US$)
Average difference[a]            41,524.9             43,113.2
Root mean squared error (RMSE)   45,917.0             48,381.2
Mean absolute error (MAE)        41,789.5             43,639.3
95% confidence interval, lower   41,420.8             43,061.7
95% confidence interval, upper   42,158.2             44,216.9

The 5-year cost prediction without grouping: sample overestimate of US$79,544,508
References
- Onukwugha E, Qi R, Jayasekera J, Zhou S. Cost Prediction Using a Survival Grouping Algorithm: An Application to Incident Prostate Cancer Cases. PharmacoEconomics. 2016 Feb;34(2):207-16.
- Qi R, Zhou S, editors. A Comparative Study of Algorithms for Grouping Cancer Data. Proceedings of the International MultiConference of Engineers and Computer Scientists; 2014.
- Chen D, Xing K, Henson D, Sheng L, Schwartz AM, Cheng X. Developing prognostic systems of cancer patients by ensemble clustering. BioMed Research International. 2009.
- Qi R, Zhou S, editors. Simulated Annealing Partitioning: An Algorithm for Optimizing Grouping in Cancer Data. Data Mining Workshops (ICDMW), 2013 IEEE 13th International Conference on; 2013: IEEE.
- National Cancer Institute - Surveillance, Epidemiology and End Results. Bethesda: NCI 2013. http://seer.cancer.gov/registries. Accessed March 29 2013.