Propensity scores and causal inference using machine learning methods

Austin Nichols (Abt) & Linden McBride (Cornell)
July 27, 2017, Stata Conference, Baltimore, MD

Overview
Machine learning methods are dominant for classification and prediction problems. Prediction is useful for causal inference when one is trying to predict propensity scores (the probability of treatment conditional on observables). But sometimes better predictions lead to worse causal inference: greater bias or mean squared error (MSE). Simulation shows that some machine learning methods are more robust to specification error when not all confounders are observed (the real-world case).

Why propensity scores?
With nonrandom selection of treatment status A, we can estimate the average treatment effect b by conditioning on all possible confounders W (if we observe all of them). But even if we observe all of W, do we know the right functional form? Propensity score matching or weighting solves the functional form problem (not the incomplete-observation problem).

Why propensity score weights?
Where potential outcomes are conditionally independent of A given W, they are also conditionally independent of A given the conditional probability of treatment, E(A | W), a.k.a. the propensity score. We can condition on the propensity score to eliminate bias due to confounders (Rosenbaum and Rubin, 1983), and we can improve efficiency by using the estimated E(A | W) to reweight the data (Hirano, Imbens, and Ridder, 2003), even if we know the true propensity score (i.e., we throw away that information).
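To make the weighting step concrete, here is a minimal sketch in R. It assumes a hypothetical data frame dat with binary treatment A, outcome Y, and confounders W1-W4; the variable names and the simple logit specification are illustrative assumptions, not the paper's exact setup.

    # Estimate the propensity score E(A | W) with a logit, then form
    # inverse-probability weights and take a weighted difference in means (ATE).
    ps_fit <- glm(A ~ W1 + W2 + W3 + W4, data = dat, family = binomial)
    ps     <- predict(ps_fit, type = "response")         # estimated E(A | W)
    w_ipw  <- ifelse(dat$A == 1, 1 / ps, 1 / (1 - ps))   # inverse-probability weights
    ate    <- weighted.mean(dat$Y[dat$A == 1], w_ipw[dat$A == 1]) -
              weighted.mean(dat$Y[dat$A == 0], w_ipw[dat$A == 0])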

How should we estimate E(A | W)?
This is a prediction problem, and one at which machine learning algorithms have been shown to excel. A few papers explore the use of machine learning approaches, but assuming the full set of confounders is observed:
Zador, Judkins, and Das (2001): MART for survey nonresponse adjustment
McCaffrey, Ridgeway, and Morral (2004): generalized boosted models for propensity score weighting
Setoguchi et al. (2008): neural networks and CART for propensity score matching
Lee, Lessler, and Stuart (2010): CART, pruned CART, bagged CART, random forests, and boosted CART for propensity score weighting
Diamond and Sekhon (2013): genetic matching, random forests, and boosting for propensity score matching

Simulations
To ascertain finite-sample performance, we run a large set of simulations (each with 10,000 iterations). Each simulation imposes a causal diagram with binary treatment A and 10 potential confounders W, of which only 4 are actual confounders. We vary the functional form of E(A | W): the base case is a logit with no interactions, but we allow nonlinearity and interactions. Estimation defaults follow teffects ipwra (linear outcome model, logit treatment model).
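A self-contained sketch in R of one draw in this spirit; the coefficient values, the linear outcome, and the exact roles assigned to each W are illustrative assumptions, not the design used in the paper's simulations.

    # One illustrative draw: 10 candidate covariates; W1-W4 affect both treatment
    # and outcome (true confounders), W5-W7 affect treatment only (excluded
    # instruments), W8-W10 are noise; base-case logit treatment model.
    set.seed(1)
    n <- 2000
    W <- matrix(rnorm(n * 10), n, 10, dimnames = list(NULL, paste0("W", 1:10)))
    p_A <- plogis(0.4 * W[, 1] - 0.3 * W[, 2] + 0.5 * W[, 3] - 0.2 * W[, 4] +
                  0.3 * W[, 5] - 0.3 * W[, 6] + 0.3 * W[, 7])
    A <- rbinom(n, 1, p_A)                         # binary treatment
    Y <- 1 * A + 0.6 * W[, 1] + 0.4 * W[, 2] -     # true ATE of 1 (illustrative)
         0.5 * W[, 3] + 0.3 * W[, 4] + rnorm(n)
    dat <- data.frame(Y = Y, A = A, W)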

Simulation causal diagram
Simulation structure follows Setoguchi et al. (2008) and Lee, Lessler, and Stuart (2010). [Causal diagram figure: shaded boxes are binary variables; orange boxes are excluded instruments (Z); blue boxes are controls (X).]

Variation across 7 scenarios
The base case has no interactions or nonlinear terms in the true treatment model: additivity means, e.g., that the coefficient on W2 x W4 is zero, and linearity means, e.g., that the coefficient on W2 x W2 is zero. The other scenarios relax additivity and/or linearity. [Table of scenarios omitted.]

Simulations
We can assume we know the true functional forms (unrealistic), or allow some specification error (e.g., unmodeled interactions in a logit). We can assume we observe all confounders W1, W2, W3, W4 (unrealistic) or only a subset. Conditioning on a proper subset of confounders can decrease bias or amplify it: see the useful figures in Steiner and Kim (2016). We compare estimators on mean squared error (MSE) and bias reduction.
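As a sketch in R, with made-up numbers, the comparison metrics are simply:

    # Bias and mean squared error of an estimator across simulation replications.
    # `est` is a vector of ATE estimates; `truth` is the effect imposed by the DGP.
    truth <- 1
    est   <- c(0.92, 1.18, 1.05, 0.81, 1.10)    # illustrative values only
    bias  <- mean(est) - truth
    mse   <- mean((est - truth)^2)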

Not every confounder can be observed
This means we should not assume we can see W1, W2, W3, W4. Where there are omitted variables (i.e., absent full ignorability), including an excluded instrument in propensity score estimation increases the inconsistency (Heckman and Navarro-Lozano 2004; Bhattacharya and Vogt 2007) and the bias (Pearl 2011; Wooldridge 2009) of the estimator. We report results for simulations with and without conditioning on the excluded instruments W5, W6, W7.

Methods to estimate E(A | W)
Logit: generalized linear model with log odds linear in the parameters.
Regression tree (RT): recursively partitions the feature space to minimize the distance between mean and predicted outcomes within each partition; implemented with rpart in R.
Pruned regression tree (PRT): prunes back the RT to prevent overfitting; rpart.
Regression forest (RF): grows many low-bias RTs on random subsets of the data and averages across the trees to reduce the variance of the predictor; randomForest.
Boosted regression tree (BRT): builds trees on the residuals of prior bifurcations of the feature space; gbm.
Least absolute shrinkage and selection operator (LASSO): penalized regression; glmnet.
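A sketch of how each candidate estimator could be fit in R with the packages named above, on a data frame dat with numeric 0/1 treatment A and covariates W1-W10 (for instance, the illustrative dat constructed in the simulation sketch earlier); tuning values here are illustrative assumptions, not the settings used in the simulations.

    library(rpart)
    library(randomForest)
    library(gbm)
    library(glmnet)

    covs <- paste0("W", 1:10)
    f    <- reformulate(covs, response = "A")

    # Logit
    ps_logit <- predict(glm(f, data = dat, family = binomial), type = "response")

    # Regression tree (RT) and pruned regression tree (PRT)
    rt     <- rpart(f, data = dat, method = "anova")
    prt    <- prune(rt, cp = rt$cptable[which.min(rt$cptable[, "xerror"]), "CP"])
    ps_rt  <- predict(rt)
    ps_prt <- predict(prt)

    # Regression forest (RF): numeric 0/1 response gives a regression forest
    rf    <- randomForest(f, data = dat)
    ps_rf <- predict(rf)                      # out-of-bag predictions

    # Boosted regression tree (BRT)
    brt    <- gbm(f, data = dat, distribution = "bernoulli", n.trees = 500,
                  interaction.depth = 2, shrinkage = 0.05)
    ps_brt <- predict(brt, newdata = dat, n.trees = 500, type = "response")

    # LASSO logit
    x        <- as.matrix(dat[, covs])
    lasso    <- cv.glmnet(x, dat$A, family = "binomial")
    ps_lasso <- as.numeric(predict(lasso, newx = x, s = "lambda.min",
                                   type = "response"))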

Scenario a
In scenario a, logit is the right model but it underperforms RF when we improperly condition on the excluded instruments. Lower MSE is better: logit wins when no excluded instruments are in the treatment model, but RF wins, even though logit is the correct model, when the conditioning set includes the excluded instruments W5, W6, W7. [MSE comparison figures omitted.]

RF more robust to specification error
RF is similar to logit in the easy cases, but substantially better than logit when some specification error is introduced.

RF best in class on bias and MSE
The distribution of the RF-based propensity-score-weighting estimator is centered on the truth, with a higher peak and narrower spread, in the worst-case scenario g. [Figure omitted.]

Conclusions
We offer guidance on the selection of variables and methods for estimating propensity scores when measuring the average treatment effect (ATE) or the ATE on the treated (ATT) under different scenarios. Machine learning estimators, especially the regression forest (RF), perform well where the treatment assignment mechanism is unknown and can offer better protection against improper conditioning on excluded instruments when not all confounders are included (the realistic case).

Conclusions
No statistical test can distinguish confounders from excluded instruments. Theory and assumptions (i.e., a good causal diagram) play an outsized role in deciding which variables to include and which estimation approaches to use in which settings. However, propensity score reweighting using a regression forest dominates several alternatives in a realistic class of settings. MSE for the IV estimator is largest in all simulations.

Next steps
There are still more variations on this theme to be explored. But the current generation of machine learning algorithms targets the quality of predictions, not the quality of causal inference made using those predictions. In process: improved stochastic ensemble methods (along the lines of regression forests) as a Stata package.

References
Bhattacharya, J., and W. Vogt (2007). "Do Instrumental Variables Belong in Propensity Scores?" NBER Technical Working Paper 343. Cambridge, MA: National Bureau of Economic Research.
Breiman, L., J. Friedman, C. J. Stone, and R. A. Olshen (1984). Classification and Regression Trees. CRC Press.
Breiman, L. (1996). "Bagging predictors." Machine Learning 24(2): 123-140.
Chernozhukov, V., C. Hansen, and M. Spindler (2015). "Post-Selection and Post-Regularization Inference in Linear Models with Very Many Controls and Instruments." American Economic Review.
Cole, S. R., and E. A. Stuart (2010). "Generalizing evidence from randomized clinical trials to target populations: The ACTG-320 trial." American Journal of Epidemiology 172(1): 107-115.
Diamond, A., and J. Sekhon (2013). "Genetic matching for estimating causal effects: A general multivariate matching method for achieving balance in observational studies." Review of Economics and Statistics 95(3): 932-945.
Efron, B., I. Johnstone, T. Hastie, and R. Tibshirani (2004). "Least angle regression." Annals of Statistics 32(2): 407-499.
Hastie, T., R. Tibshirani, J. Friedman, and J. Franklin (2005). "The elements of statistical learning: data mining, inference and prediction." The Mathematical Intelligencer 27(2): 83-85.
Hirano, K., G. W. Imbens, and G. Ridder (2003). "Efficient estimation of average treatment effects using the estimated propensity score." Econometrica 71(4): 1161-1189.
Lee, B., J. Lessler, and E. A. Stuart (2010). "Improving Propensity Score Weighting Using Machine Learning." Statistics in Medicine 29(3): 337-346.
Liaw, A., and M. Wiener (2002). "Classification and Regression by randomForest." R News 2(3): 18-22.
McCaffrey, D. F., G. Ridgeway, and A. R. Morral (2004). "Propensity Score Estimation with Boosted Regression for Evaluating Causal Effects in Observational Studies." Psychological Methods 9(4): 403-425.
Pearl, J. (2011). "Invited Commentary: Understanding Bias Amplification." American Journal of Epidemiology 174(11): 1223-1227.
Ridgeway, G., with contributions from others (2015). gbm: Generalized Boosted Regression Models. R package version 2.1.1. http://cran.r-project.org/package=gbm
Rosenbaum, P. R., and D. B. Rubin (1983). "The Central Role of the Propensity Score in Observational Studies for Causal Effects." Biometrika 70(1): 41-55.
Setoguchi, S., S. Schneeweiss, M. Brookhart, R. Glynn, and E. F. Cook (2008). "Evaluating uses of data mining techniques in propensity score estimation: a simulation study." Pharmacoepidemiology and Drug Safety 17(6): 546-555.
Steiner, P. M., and Y. Kim (2016). "The Mechanics of Omitted Variable Bias: Bias Amplification and Cancellation of Offsetting Biases." Journal of Causal Inference 4(2).
Stuart, E. A., S. R. Cole, C. P. Bradshaw, and P. J. Leaf (2011). "The use of propensity scores to assess the generalizability of results from randomized trials." Journal of the Royal Statistical Society, Series A 174(2): 369-386.
Therneau, T., B. Atkinson, and B. Ripley (2015). rpart: Recursive Partitioning and Regression Trees. R package version 4.1-10. http://cran.r-project.org/package=rpart
Tibshirani, R. (1996). "Regression shrinkage and selection via the lasso." Journal of the Royal Statistical Society, Series B 58(1): 267-288.
Wooldridge, J. (2009). "Should instrumental variables be used as matching variables?" East Lansing, MI: Michigan State University. http://econ.msu.edu/faculty/wooldridge/docs/treat1r6.pdf
Zador, P., D. Judkins, and B. Das (2001). "Experiments with MART to Automate Model Building in Survey Research: Applications to the National Survey of Parents and Youth." Proceedings of the Section on Survey Research Methods of the American Statistical Association.