Detecting Anomalous Patterns of Care Using Health Insurance Claims

Similar documents
Data Fusion: Integrating patientreported survey data and EHR data for health outcomes research

OBSERVATIONAL MEDICAL OUTCOMES PARTNERSHIP

Predictive Diagnosis. Clustering to Better Predict Heart Attacks x The Analytics Edge

Lecture 21. RNA-seq: Advanced analysis

Rise of the Machines

Supplementary Online Content

Supplementary Online Content

Jacqueline C. Barrientos, Nicole Meyer, Xue Song, Kanti R. Rai ASH Annual Meeting Abstracts 2015:3301

Supplementary Appendix

arxiv: v2 [stat.ml] 4 Jul 2017

Propensity Score Methods for Estimating Causality in the Absence of Random Assignment: Applications for Child Care Policy Research

Jae Jin An, Ph.D. Michael B. Nichol, Ph.D.

PubH 7405: REGRESSION ANALYSIS. Propensity Score

Supplementary Online Content

NQF-ENDORSED VOLUNTARY CONSENSUS STANDARDS FOR HOSPITAL CARE. Measure Information Form Collected For: CMS Outcome Measures (Claims Based)

Comparison of Medicare Fee-for-Service Beneficiaries Treated in Ambulatory Surgical Centers and Hospital Outpatient Departments

BIOSTATISTICAL METHODS

OHDSI Tutorial: Design and implementation of a comparative cohort study in observational healthcare data

TENNCARE Bundled Payment Initiative: Description of Bundle Risk Adjustment for Wave 2 Episodes

Title: New Perspectives on the Synthetic Control Method. Authors: Eli Ben-Michael, UC Berkeley,

Chapter 1. Introduction

Finland and Sweden and UK GP-HOSP datasets

Supplementary Online Content

Does Machine Learning. In a Learning Health System?

Tuning Epidemiological Study Design Methods for Exploratory Data Analysis in Real World Data

Detection of Informative Disjunctive Patterns in Support of Clinical

Biostatistics 2 nd year Comprehensive Examination. Due: May 31 st, 2013 by 5pm. Instructions:

Instrumental Variables I (cont.)

Basic Biostatistics. Chapter 1. Content

Statistical analysis DIANA SAPLACAN 2017 * SLIDES ADAPTED BASED ON LECTURE NOTES BY ALMA LEORA CULEN

Using the NIH Collaboratory's and PCORnet's distributed data networks for clinical trials and observational research - A preview

Since 1980, obesity has more than doubled worldwide, and in 2008 over 1.5 billion adults aged 20 years were overweight.

Recent advances in non-experimental comparison group designs

The Effects of Maternal Alcohol Use and Smoking on Children s Mental Health: Evidence from the National Longitudinal Survey of Children and Youth

In each hospital-year, we calculated a 30-day unplanned. readmission rate among patients who survived at least 30 days

Mostly Harmless Simulations? On the Internal Validity of Empirical Monte Carlo Studies

HEALTH CARE EXPENDITURES ASSOCIATED WITH PERSISTENT EMERGENCY DEPARTMENT USE: A MULTI-STATE ANALYSIS OF MEDICAID BENEFICIARIES

Table S1: Diagnosis and Procedure Codes Used to Ascertain Incident Hip Fracture

Peter C. Austin Institute for Clinical Evaluative Sciences and University of Toronto

Propensity Score Matching with Limited Overlap. Abstract

A Comparison of Collaborative Filtering Methods for Medication Reconciliation

A Prospective Comparison of Two Commercially Available Hospital Admission Risk Scores

PharmaSUG Paper HA-04 Two Roads Diverged in a Narrow Dataset...When Coarsened Exact Matching is More Appropriate than Propensity Score Matching

The Cost Burden of Worsening Heart Failure in the Medicare Fee For Service Population: An Actuarial Analysis

STATISTICS & PROBABILITY

TOTAL HIP AND KNEE REPLACEMENTS. FISCAL YEAR 2002 DATA July 1, 2001 through June 30, 2002 TECHNICAL NOTES

Supplementary Appendix

Testing Statistical Models to Improve Screening of Lung Cancer

Appendix Identification of Study Cohorts

Review: Logistic regression, Gaussian naïve Bayes, linear regression, and their connections

Goal-setting for a healthier self: evidence from a weight loss challenge

Using Ensemble-Based Methods for Directly Estimating Causal Effects: An Investigation of Tree-Based G-Computation

Table S1. Read and ICD 10 diagnosis codes for polymyalgia rheumatica and giant cell arteritis

Youth Using Behavioral Health Services. Making the Transition from the Child to Adult System

AN INFORMATION VISUALIZATION APPROACH TO CLASSIFICATION AND ASSESSMENT OF DIABETES RISK IN PRIMARY CARE

Research in Real Life. Real Life Effectiveness of Aclidinium Bromide. Version Date: 26 September 2013

SGRQ Questionnaire assessing respiratory disease-specific quality of life. Questionnaire assessing general quality of life

etable 3.1: DIABETES Name Objective/Purpose

Supplementary Online Content

NQF-ENDORSED VOLUNTARY CONSENSUS STANDARD FOR HOSPITAL CARE. Measure Information Form Collected For: CMS Outcome Measures (Claims Based)

Identification of Tissue Independent Cancer Driver Genes

Pro-ana and pro-mia social networks

Baseline Health Data Report: Cambria and Somerset Counties, Pennsylvania

Representing Association Classification Rules Mined from Health Data

The effect of surgeon volume on procedure selection in non-small cell lung cancer surgeries. Dr. Christian Finley MD MPH FRCSC McMaster University

Machine Learning to Inform Breast Cancer Post-Recovery Surveillance

Population Health Metrics. Steven Teutsch, MD, MPH March 19, 2015

CLASSICAL AND. MODERN REGRESSION WITH APPLICATIONS

Motivation Empirical models Data and methodology Results Discussion. University of York. University of York

Supplementary Online Content

Confounding by indication developments in matching, and instrumental variable methods. Richard Grieve London School of Hygiene and Tropical Medicine

Live WebEx meeting agenda

Supplementary Online Content

Discovering New Variables and Improving the Prediction of ALS Progression

Extending Rungie et al. s model of brand image stability to account for heterogeneity

OBSERVATIONAL MEDICAL OUTCOMES PARTNERSHIP

GSK Medicine: Study Number: Title: Rationale: Study Period: Objectives: Indication: Study Investigators/Centers: Research Methods: Data Source

TRIPLL Webinar: Propensity score methods in chronic pain research

RADIATION-EPIDEMIOLOGICAL STUDY OF GROUP OF DIFFERENT DISEASES OF CIRCULATORY SYSTEM AND CONCOMITANT DISEASES AMONG CHERNOBYL LIQUIDATORS

Novel Machine Learning Methods for Public Health and Disease Surveillance

Supplementary Appendix

Visualizing Data for Hypothesis Generation Using Large-Volume Health care Claims Data

SUPPLEMENTARY INFORMATION. Confidence matching in group decision-making. *Correspondence to:

A Pre-Syndromic Surveillance Approach for Early Detection of Novel and Rare Disease Outbreaks

Michael Hallquist, Thomas M. Olino, Paul A. Pilkonis University of Pittsburgh

TITLE: A Data-Driven Approach to Patient Risk Stratification for Acute Respiratory Distress Syndrome (ARDS)

Application of Local Control Strategy in analyses of the effects of Radon on Lung Cancer Mortality for 2,881 US Counties

Recent trends in health care legislation have led to a rise in

Chapter 11: Advanced Remedial Measures. Weighted Least Squares (WLS)

Estimating Medicaid Costs for Cardiovascular Disease: A Claims-based Approach

A review of statistical methods in the analysis of data arising from observer reliability studies (Part 11) *

Supplementary Online Content

Risk Adjustment and Hierarchical Condition Category Coding

Rates and patterns of participation in cardiac rehabilitation in Victoria

The clinical and economic benefits of better treatment of adult Medicaid beneficiaries with diabetes

Part [2.1]: Evaluation of Markers for Treatment Selection Linking Clinical and Statistical Goals

Data Analysis Using Regression and Multilevel/Hierarchical Models

Cardiovascular. Mathew Mercuri PhD(C), Sonia S Anand MD PhD FRCP(C)

Definitions of chronic conditions used to define the number of serious comorbidities in the study.

Mining Low-Support Discriminative Patterns from Dense and High-Dimensional Data. Technical Report

Transcription:

Partially funded by National Science Foundation grants IIS-0916345, IIS-0911032, and IIS-0953330, and funding from Disruptive Health Technology Institute. We are also grateful to Highmark Health for providing data. Detecting Anomalous Patterns of Care Using Health Insurance Claims Sriram Somanchi Edward McFowland III Daniel B. Neill Mendoza College of Business University of Notre Dame Carlson School of Management University of Minnesota H.J. Heinz III College Carnegie Mellon University 1

Agenda Introduction Research Question Motivating Example Literature and Contribution Methods- Anomalous Patterns of Care (APC) Scan Problem Formulation Algorithm Modeling the scoring function Empirical Analysis on Highmark Claims Data Data Results Validation using regression analysis 2

Introduction Challenges the US healthcare system faces 1,2 Instances of over-treatment and under-treatment Inconsistencies in execution of care 1. N.C. Lallemand, Health Policy Brief: Reducing Waste in Health Care, Health Affairs, 13 Dec. 2012. 2. L.T. Kohn, et al., To Err Is Human: Building a Safer Health System, Inst. of Medicine/Nat l Academy Press, 1999. 3

Introduction Huge opportunity to discover novel patterns of care that are potentially effective due to availability of Electronic Health Records Documentation of patient care through health insurance claims Analyze patterns across patients and provide actionable insights 4

Research Question Given health insurance claims data, we wish to identify a treatment and a corresponding sub-population for whom that treatment corresponds to significantly better or worse outcomes. Observational data Multiple treatments Population characteristics varying in multiple dimensions Identify most significant combination of treatment and sub-population. 5

Motivating Example Health Insurance Claims Data Healthcare Analyst Patrick Congestive Heart Failure Patients 1. Males 2. Age above 50 3. Similar co-morbidity (atrial fibrillation, on anticoagulant) Taking Carvidilol is associated with longer stay in hospital Can we automate the process of producing these interesting hypotheses? 6

Literature and Contribution Heterogeneous Treatments Effects with a given treatment Randomized Control Trials Imai and Ratkovic (2013) McFowland et al. (2015) see previous talk in this session Observational Studies Athey and Imbens (2015 arxiv) Wager and Athey (2015 arxiv) Our Contributions Given multiple treatments, identify combination of treatment and sub-population associated with anomalous outcomes. Computationally efficient algorithm instead of evaluating exponentially many sub-populations Observational studies Effectively use observational data to design future randomized control trials 7

Agenda Introduction Research Question Motivating Example Literature and Contribution Methods- Anomalous Patterns of Care (APC) Scan Problem Formulation Algorithm Modeling the scoring function Empirical Analysis on Highmark Claims Data Data Results Validation using regression analysis 8

Problem Formulation Let X = (XX 1, XX 2,, XX NN ) be the set of observed covariates for a patient (demographics, diagnoses, etc.) Let TT 1, TT 2,, TT MM be the set of available treatments Let Y be the scalar outcome of interest (for example, total length of hospital stay in following 12 months). 9

Estimating Potential Outcome Distributions We want to estimate the distribution of potential outcomes for treatment assignments TT jj = 1, for a given sub-population, SS ff jjj,ss = ff yy (1) xx SS) Similarly, we want to estimate ff jjj,ss = ff yy (0) xx SS) 10

Our Goal Identify the combination of treatment and subpopulation for which outcomes are most divergent between treated and untreated groups. max SS max jj DDDDDD ff jjj,ss, ff jjj,ss Age Male Female <30 YY 2MM YY 2FF TT 1 TT 2 30-40 YY 3MM YY 3FF TT 3 40-50 YY 4MM YY 4FF >50 YY 5MM YY 5FF TT MM 11

Anomalous Patterns of Care Scan Age <30 30-40 40-50 >50 Age <30 30-40 40-50 >50 Male Female YY 2MM YY 2FF YY 3MM YY 3FF YY 4MM YY 4FF YY 5MM YY 5FF TT MM 12 =1 Male Female YY 2MM YY 2FF YY 3MM YY 3FF YY 4MM YY 4FF YY 5MM YY 5FF TT 1 MM 2 = 0 ff MMM,SS 11,SS 21,SS ff MMM,SS 10,SS 20,SS FF MM,SS 1,SS 2,SS = DDDDDD TT 1 TT 2 TT 3 TT MM ff 11,SS 21,SS MMM,SS,, ff 10,SS ff 20,SS MMM,SS 1. Start with a random sub-population SS 2. For each TT jj a. Compute the propensity scores b. Reweight outcome distributions c. Compute Divergence FF jj,ss 3. jj = argmax jj FF jj,ss 4. Reweight entire population outcomes based on TT jj 5. Use MD-Scan to identify SS = aaaaaaaaaaxx SS FF jj,ss 6. Set SS = SS and repeat steps 2 to 5 until score stops increasing 7. Repeat steps 1-6 for R times 8. Compute statistical significance by randomization testing Iterative Ascent algorithm between sub-populations and treatments 12

Inverse Propensity Score Weighting We use inverse propensity score weighting to estimate the outcome distribution from observational data ff jjj,ss = ff yy (1) xx SS) xx SS ff(yy,tt jj =1,XX=xx) PP TT jj =1 XX=xx) ff jj0,ss = ff yy (0) xx SS) xx SS ff(yy,tt jj =0,XX=xx) PP TT jj =0 XX=xx) 13

Efficiently Optimizing for Divergence Parametric form Compute the sufficient statistic Expectation-based Subset Scan framework In order to efficiently optimize, the divergence score needs to satisfy the Linear Time Subset Scanning (LTSS) property. If so, each conditional optimization step becomes linear rather than exponential in the arity of that attribute. 14

Multi-Dimensional Scan (MD-Scan) SS = aaaaaaaaaaxx SS FF jj,ss Age <30 Male YY 2MM Female YY 2FF 30-40 40-50 YY 3MM YY 4MM YY 3FF YY 4FF >50 YY 5MM YY 5FF Each step is computationally efficient if divergence function satisfies LTSS property 15

Modeling the Scoring Function We model the scoring function as generalized log-likelihood ratio statistic We assume a parametric distribution for the outcome and compute the sufficient statistics of the expected distribution from the untreated group (TT jj = 0) Expectation Based Poisson Expectation Based Gaussian Exponential family distributions 16

Expectation Based Poisson statistic for potential outcomes HH 00 YY ii (1) XX ii XX ss ~ PPPPPPPPPPPPPP(λλ ss XX ss λλ ss = EE YY (0) XX XX ss HH 11 SS, qq YY ii (1) XXii XX ss ~ PPPPPPPPPPPPPP qq λλ ss XX ss SS HH 11 SS, qq YY ii (1) XXii XX ss ~ PPPPPPPPPPPPPP λλ ss XX ss SS FF SS qq) = log PP DDDDDDDD HH 1 SS, qq ) PP DDDDDDDD HH 0 ) FF SS = max FF SS qq qq SS = max FF SS SS 17

Agenda Introduction Research Question Motivating Example Literature and Contribution Methods- Anomalous Patterns of Care (APC) Scan Problem Formulation Algorithm Modeling the scoring function Empirical Analysis on Highmark Claims Data Data Results Validation using regression analysis 18

Highmark Claims Data Patients with primary or admission diagnosis as diseases of the circulatory system from the year 2008 to 2014 ~125K patients 6 months 6 months Time First Hospitalization (Input covariates XX) Drugs Therapeutic Class (Treatments TT) Number of Hospitalizations (Outcome YY) 19

Highmark Claims Data Covariates (XX) were built based on Demographics Median income at patient s zip code level Diagnosis (primary and secondary) Charlson Comorbidity Index 1 Length of current stay Previous outpatient visits Treatments (TT jj ) Drug Therapeutic Class Outcome (YY) Number of hospitalizations, Total length of stay Bronchial Dilators Glucocorticoids Thyroid Preparations Diabetic Therapy Lipotropics Hypotensives Vasodilators Digitalis Preparations Cardiovascular Preparations Anticoagulants Diuretics 1. Quan et al (2005). Coding algorithms for defining comorbidities in icd-9- cm and icd-10 administrative data. Medical Care, 43(11):1130 1139 20

Descriptive Statistics Characteristics Values Percentage of Patients Entire Population 100% (124,146) Gender Age Hypertensive Diabetic Obese Primary Diagnosis Male Female Below40 40to60 60to80 Above80 Yes No Yes No Yes No Rheumatic (390-398) Hypertensive (401-405) Ischemic (410-414) Pulmonary (415-417) Heart Failure (420-429) Cerebrovascular (430-438) Arteries (440-448) Veins and lymphatics (451-459) 53.0% 47.0% 2.8% 19.8% 43.5% 33.9% 53.9% 46.1% 29.2% 70.8% 11.1% 88.9% 0.5% 3.5% 24.5% 3.7% 33.0% 16.6% 5.0% 13.2% 21

Results We ran our methodology on this dataset to identify patterns of interest We have ranked order of the highest scoring combination of subpopulation and treatments As a case study, here we discuss the highest scoring subpopulation and treatment pair 22

Highest Scoring Subpopulation-Treatment Combination Subpopulation Characteristics Identified Gender Male Medical condition Hypertension Obese or Overweight Age 40 to 80 Primary diagnosis Ischemic Heart disease (ICD9 410 414) Heart Failure (ICD9 420 429) Cerebrovascular heart disease (ICD9 430 439) Secondary diagnosis No respiratory (ICD9 460 519) Endocrine and Immunity disorders (ICD9 240 279) Drug therapeutic class Outcome Glucocorticoids More number of hospitalizations Glucocorticoids Yes No Number of Patients 264 1713 Mean Number of Hospitalizations 0.606 (0.069) 0.280 (0.016) 23

Validation of our results There is huge literature in the medical community on Glucocorticoids and Cardiovascular issues: Association using 10 years of observational data (Heart, 2004) Metabolic and tissue level effects in heart (European Journal of Endocrinology, 2007) Experiments at micro level analysis of glucocorticoids signaling certain receptors in heart for mice (J of Biochemical and Molecular Biology, 2015) 24

Confirming the results using regression analysis We randomly split the data into: 60% for running our APC Scan 40% for running the regression analysis Regression with outcome YY as number of hospitalizations with Glucocorticoids as one of independent variable XX, for The entire population The entire population with a dummy for subpopulation identified by APC Scan The subpopulation identified by APC Scan The complementary subpopulation 25

Regression analysis (Poisson) on a Hold-Out set Glucocorticoids 0.101*** (0.007) Glucocorticoids Subpopulation Number of Hospitalizations (1) (2) 0.099*** (0.007) 0.265*** (0.088) Subpopulation -0.313*** (0.068) Age 0.079*** (0.004) Females 0.116*** (0.008) Hypertensive -0.163*** (0.008) Diabetic 0.286*** (0.008) Obesity 0.007 (0.013) 0.079*** (0.004) 0.113*** (0.008) -0.161*** (0.008) 0.286*** (0.008) 0.020 (0.013)...... Constant -0.773*** (0.044) -0.772*** (0.044) Observations 49,658 49,658 Number of Hospitalizations (3) (4) (2) (3) (4) 0.410*** (0.089) -0.040 (0.079) 0.193*** (0.089) 0.099*** (0.007) 0.080*** (0.004) 0.113*** (0.008) -0.161*** (0.008) 0.287*** (0.008) 0.020 (0.013)...... -1.634*** (0.120) -0.772*** (0.044) 796 48,862 10.6% 50.6% (1) Entire Population (2) Entire Population with dummy for the subpopulation (3) Only with the subpopulation identified by APC- Scan (4) Only with the complementary subpopulation We have included all input characteristics XX for our regression Note: *p<0.1; **p<0.05; ***p<0.01 26

Sensitivity analysis Modifications to the identified subpopulation dramatically reduce the effect. 0.6 Coefficient of Glucocorticoids 0.5 0.4 0.3 0.2 0.1 0 Entire APC-Scan Females Non-HT Non-Obese Coefficient of Glucocorticoids 27

Robustness checks Typical diseases treated using Glucocorticoids Rheumatic Arthritis Chronic Obstructive Pulmonary Disease Cushing s syndrome Ruled out hospital level biases in propensity to treat with Glucocorticoids Overlap coefficient between two groups is 0.78 28

Ongoing work Better estimation of treated and non-treated outcome distributions given sparse data. Moving beyond categorical input attributes and binary treatments incorporate BMI, lab results, etc. Using other scoring functions (both parametric and non-parametric). 29

Summary of our contributions Developed a general framework for detecting combinations of treatment and subpopulation that have large deviations in their observed outcomes Used multidimensional constraints to scan a large number of subpopulation and treatment combinations in a computationally efficient manner Theoretical analysis: Showed that our scoring functions with propensity reweighted outcomes removes the bias from the observed characteristics Empirical evaluation: Generated interesting hypothesis related to heart disease by analyzing large, complex and observational health care claims data 30