Mostly Harmless Simulations? On the Internal Validity of Empirical Monte Carlo Studies


Mostly Harmless Simulations? On the Internal Validity of Empirical Monte Carlo Studies. Arun Advani and Tymon Słoczyński. 13 November 2013.

Background When interested in the small-sample properties of estimators, researchers typically provide evidence from Monte Carlo simulation rather than analytical results. Relatively `stylised' Data Generating Processes (DGPs) are typically used.

Background Huber et al. (2013) have criticised these stylised DGPs, suggesting their external validity may be low (`design dependence'). Similarly, Busso et al. (2013) encourage empirical researchers to `conduct a small-scale simulation study designed to mimic their empirical context'. Both propose instead generating data that mimic the original data.

Motivation The suggestion by both Busso et al. and Huber et al. is based on the premise that a carefully designed, empirically motivated Monte Carlo simulation can inform the empirical researcher about the performance of estimators. The implication is that `the advantage [of an empirical Monte Carlo study] is that it is valid in at least one relevant environment' (Huber et al., 2013), i.e. its internal validity is high by construction.

Contribution We evaluate the recent proposition that `empirical Monte Carlo studies' have high internal validity. We outline some conditions that are necessary for this to be true. We show that these conditions are generically so restrictive that they require the evaluation problem to be non-existent. Using the well-known National Supported Work (LaLonde, 1986) data, we show that in practice these conditions don't hold.

Outline EMCS: what is it? Different designs. When might it work (in theory)? Application: data and estimators. Results. Conclusions.

What is EMCS? Empirical Monte Carlo Studies (EMCS) are studies where: we have an initial dataset of interest; we want to somehow generate samples from the same DGP that created the initial data; we can then test the performance of estimators of a particular statistic relative to the true effect in each sample; and we use the results on performance to inform us about which estimators are most useful in the original data. The key issue will be how to generate these samples from the same DGP.
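To fix ideas, a minimal sketch of this generic EMCS loop follows. Here `generate_sample` (one of the designs described next) and the dictionary of candidate `estimators` are hypothetical placeholders, not objects defined in the talk.

```python
# Hypothetical skeleton: `generate_sample` draws one simulated dataset from
# the mimicked DGP, each estimator maps a sample to a point estimate, and
# `true_effect` is the effect implied by the design (zero in the placebo case).
import numpy as np

def emcs(original_data, generate_sample, estimators, true_effect, n_reps=500, seed=0):
    """Run an EMCS and return each estimator's mean bias across replications."""
    rng = np.random.default_rng(seed)
    biases = {name: [] for name in estimators}
    for _ in range(n_reps):
        sample = generate_sample(original_data, rng)
        for name, estimate in estimators.items():
            biases[name].append(estimate(sample) - true_effect)
    return {name: float(np.mean(b)) for name, b in biases.items()}
```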

EMCS designs Suppose we have an original dataset with outcome Y, covariates X, and treatment status T; N observations, of which N_T are treated and N_C are controls. We want to draw data from the DGP that created this, and estimate, e.g., the ATT. Two approaches are suggested in the literature: the `structured design' (Abadie and Imbens, 2011; Busso et al., 2013) and the `placebo design' (Huber et al., 2013).

Structured design Generate N observations, and assign treatment status such that N_T are treated. Draw covariates X from a distribution which mimics the empirical distribution, conditional on T (on correlated covariates, see the appendix slide below). For a binary variable, match Pr(X^(1) = 1 | T = t) in the generated sample to (Σ_i X_i^(1) · 1(T_i = t)) / (Σ_i 1(T_i = t)). For a continuous variable, draw from a normal/log-normal with the appropriate mean and variance. Estimate a model for the outcome on the original data, and use this to construct fitted values for the new observations. Generate the new outcome as the fitted value plus an error whose variance matches that of the residuals.
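A rough sketch of this uncorrelated structured design is below. It assumes a linear outcome model fitted by OLS, whereas the implementation discussed later is Oaxaca-Blinder LPM based; all function and variable names are illustrative.

```python
# Sketch of the (uncorrelated) structured design. Assumes binary covariates
# in X_bin and continuous covariates in X_cont; the OLS outcome model is a
# simplifying assumption, not the talk's exact specification.
import numpy as np
import statsmodels.api as sm

def structured_sample(Y, X_bin, X_cont, T, rng):
    n = len(T)
    T_new = np.zeros(n, dtype=int)
    T_new[:T.sum()] = 1                       # keep N_T treated, N_C controls
    Xb_new = np.empty_like(X_bin, dtype=float)
    Xc_new = np.empty_like(X_cont, dtype=float)
    for t in (0, 1):
        src, dst = (T == t), (T_new == t)
        # binary: match Pr(X = 1 | T = t) to the empirical proportion
        Xb_new[dst] = rng.binomial(1, X_bin[src].mean(axis=0),
                                   size=(dst.sum(), X_bin.shape[1]))
        # continuous: normal with the conditional mean and variance
        Xc_new[dst] = rng.normal(X_cont[src].mean(axis=0), X_cont[src].std(axis=0),
                                 size=(dst.sum(), X_cont.shape[1]))
    # outcome: fitted value from a model on the original data, plus an error
    # whose variance matches that of the residuals
    W = sm.add_constant(np.column_stack([T, X_bin, X_cont]))
    fit = sm.OLS(Y, W).fit()
    W_new = sm.add_constant(np.column_stack([T_new, Xb_new, Xc_new]))
    Y_new = W_new @ fit.params + rng.normal(0, fit.resid.std(), n)
    return Y_new, Xb_new, Xc_new, T_new
```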

Placebo design In the original data, estimate a treatment status equation: run a logit of T on the relevant part of X and store the fitted values. Draw N observations, with replacement, from the control sample of the original data to create new samples. Assign `placebo' treatment status to observations in this sample: T̃_i = 1(T*_i > 0), where T*_i = α + λ X_i β + ε_i and ε_i is iid logistic. Choose α such that Pr(T̃ = 1) in the sample is the same as in the original data. Choose λ = 1, as in HLW (Huber et al., 2013); calibration of λ is discussed in the appendix slides below. By construction, all treatment effects will be zero.
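A sketch of the placebo design with λ = 1 follows. Choosing α via a quantile of the simulated latent index is one simple way of matching the treated share; it is an assumption here, not necessarily HLW's exact procedure.

```python
# Sketch of the placebo design with lambda = 1. The quantile rule for alpha
# is an assumed mechanism for matching Pr(T = 1) to the original treated share.
import numpy as np
import statsmodels.api as sm

def placebo_sample(Y, X, T, rng, lam=1.0):
    # 1. logit of T on X in the original data; keep the slope coefficients
    beta = sm.Logit(T, sm.add_constant(X)).fit(disp=0).params[1:]
    # 2. resample N observations, with replacement, from the controls only
    controls = np.flatnonzero(T == 0)
    idx = rng.choice(controls, size=len(T), replace=True)
    Y_s, X_s = Y[idx], X[idx]
    # 3. placebo treatment: T_i = 1(alpha + lam * X_i beta + eps_i > 0)
    latent = lam * (X_s @ beta) + rng.logistic(size=len(T))
    alpha = -np.quantile(latent, 1 - T.mean())   # match the treated share
    T_placebo = (alpha + latent > 0).astype(int)
    return Y_s, X_s, T_placebo                   # true effect is zero here
```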

When might we expect EMCS to work? Suppose we observed all the variables determining treatment and the outcome, and knew the functional forms of their relationships with the covariates. Then we could clearly generate data from the distribution... but we would also already know what the treatment effect is, so there would be no need for the simulation.

When might we expect EMCS to work? The treatment effect estimators we consider assume we observe all the relevant covariates, so we can assume this for our DGP as well. This is already a big assumption. The `structured' design additionally makes strong functional form assumptions, with a reasonable likelihood of misspecification; the proposition in the literature is implicitly that EMCS is more informative about the performance of estimators than a stylised DGP would be, even if the estimated structured DGP were misspecified. The `placebo' design avoids functional form assumptions for the outcome, but it only uses a subsample of the data and has a treatment effect of zero by construction. It is not clear when this might work.

Data National Supported Work: a work experience programme in 15 locations in the mid-1970s US. The programme had an experimental control group, so one can recover an experimental estimate of the effect by comparing means. LaLonde (1986) famously used this to test treatment effect estimators using non-experimental control groups drawn from CPS and PSID data. We use a similar idea: a `good' EMCS should be able to replicate the true ranking of estimators, based on their ability to uncover the experimental estimate.

Estimators We use a range of common estimators: standard parametric regression-based estimators; flexible parametric (Oaxaca-Blinder) estimators; kernel-based estimators, i.e. kernel matching and local linear regression; nearest-neighbour matching; and inverse probability weighting (IPW). We want to recover the ATT, so the evaluation problem is only about getting the counterfactual outcome for the treated observations.
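As one concrete example from this list, a normalised IPW estimator of the ATT can be sketched as follows; this is a textbook implementation, not necessarily the exact variant used in the paper.

```python
# Textbook normalised (Hajek-type) IPW estimator of the ATT: controls are
# reweighted by the pscore odds to resemble the treated group.
import numpy as np
import statsmodels.api as sm

def ipw_att(Y, X, T):
    p = sm.Logit(T, sm.add_constant(X)).fit(disp=0).predict()  # pscores
    w = p / (1 - p)                  # odds weights for control observations
    return Y[T == 1].mean() - np.average(Y[T == 0], weights=w[T == 0])
```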

Correlation Key idea: in a good EMCS, the performance (in some dimension) of estimators in the generated samples should be informative about their performance in the original data. Typically one would like to choose estimators with low bias and low variance, so one might want to compare estimators by RMSE. But we do not observe the variance of an estimator in the original data, only its bias.

Correlation We consider the correlation in bias, the correlation in absolute bias, and the ranking by absolute bias. At a minimum we want to reproduce bias correctly. Absolute bias is the criterion a researcher would use when trying to choose which estimator to use.
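Given per-estimator biases in the original data and mean biases across EMCS replications, these three comparisons can be computed as below; using Pearson correlations for the (absolute) biases and a Spearman correlation for the ranking is an assumption about the exact statistics.

```python
# Compare per-estimator biases in the original data with EMCS mean biases.
import numpy as np
from scipy import stats

def emcs_diagnostics(true_bias, mean_bias):
    tb, mb = np.asarray(true_bias), np.asarray(mean_bias)
    return {
        "bias": stats.pearsonr(tb, mb)[0],                     # bias vs. mean bias
        "abs_bias": stats.pearsonr(np.abs(tb), np.abs(mb))[0], # absolute biases
        "rank": stats.spearmanr(np.abs(tb), np.abs(mb))[0],    # ranking by abs. bias
    }
```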

Results Structured The `structured' EMCS can replicate bias: estimates from the original data and from the EMCS samples are positively correlated. It cannot, however, generally replicate absolute bias.

Table: Correlations Between the Biases in the Uncorrelated and Correlated Structured Designs and in the Original NSW-PSID Data Set (true biases)

                                  Uncorrelated           Correlated
                                  (1)        (2)         (1)        (2)
Bias vs. mean bias                0.371**    0.256       0.643***   0.549***
                                  (0.031)    (0.189)     (0.000)    (0.002)
Abs. bias vs. abs. mean bias      0.363**    0.217       0.435***   0.216
                                  (0.035)    (0.267)     (0.009)    (0.260)
Rank vs. rank                     0.357**    0.169       0.380**    0.142
                                  (0.038)    (0.391)     (0.025)    (0.461)
Exclude outliers                  Y          Y           Y          Y
Exclude Oaxaca-Blinder            N          Y           N          Y
Number of estimators              34         28          35         29

NOTE: p-values are in parentheses. We define outliers as estimators whose mean biases are more than three standard deviations away from the average mean bias; here this is unnormalised reweighting with the common support restriction (first columns). *Statistically significant at the 10% level; **at the 5% level; ***at the 1% level.

Results Structured The failure to replicate absolute bias holds when the in-sample bias is compared to the bias in the original data, where the bias in the original data for an estimator is the difference between its estimate and the true (experimental) effect. If we instead compare the in-sample bias to a hypothetical bias, calculated as the difference between the estimate and the predicted value of the model in the original data, performance is much better.

Results Structured PSID

Table: Correlations Between the Biases in the Uncorrelated and Correlated Structured Designs and in the Original NSW-PSID Data Set (hypothetical biases)

                                  Uncorrelated           Correlated
                                  (1)        (2)         (1)        (2)
Bias vs. mean bias                0.371**    0.256       0.643***   0.549***
                                  (0.031)    (0.189)     (0.000)    (0.002)
Abs. bias vs. abs. mean bias      0.408**    0.297       0.698***   0.616***
                                  (0.017)    (0.125)     (0.000)    (0.000)
Rank vs. rank                     0.408**    0.222       0.693***   0.599***
                                  (0.017)    (0.256)     (0.000)    (0.001)
Exclude outliers                  Y          Y           Y          Y
Exclude Oaxaca-Blinder            N          Y           N          Y
Number of estimators              34         28          35         29

NOTE: p-values are in parentheses. We define outliers as estimators whose mean biases are more than three standard deviations away from the average mean bias; here this is unnormalised reweighting with the common support restriction (first column). *Statistically significant at the 10% level; **at the 5% level; ***at the 1% level.

Results Structured In the PSID data, the true effect on unemployment is 11.06pp, but the `predicted value of the model (structured)' implies an effect of 25.68pp. This is because the DGP was based on an Oaxaca-Blinder LPM, which does not perform well here. In the CPS we know that the OB LPM does perform well (estimated effect of 11.74pp), so the `true' absolute bias results should be good.

Results Structured CPS

Table: Correlations Between the Biases in the Uncorrelated and Correlated Structured Designs and in the Original NSW-CPS Data Set (true biases)

                                  Uncorrelated           Correlated
                                  (1)        (2)         (1)        (2)
Bias vs. mean bias                0.390**    0.259       0.530***   0.379**
                                  (0.023)    (0.184)     (0.001)    (0.042)
Abs. bias vs. abs. mean bias      0.458***   0.420**     0.396**    0.333*
                                  (0.007)    (0.026)     (0.019)    (0.078)
Rank vs. rank                     0.484***   0.428**     0.426**    0.334*
                                  (0.004)    (0.023)     (0.011)    (0.077)
Exclude outliers                  Y          Y           Y          Y
Exclude Oaxaca-Blinder            N          Y           N          Y
Number of estimators              34         28          35         29

NOTE: p-values are in parentheses. We define outliers as estimators whose mean biases are more than three standard deviations away from the average mean bias; here this is unnormalised reweighting with the common support restriction (first column). *Statistically significant at the 10% level; **at the 5% level; ***at the 1% level.

Results Placebo In the `placebo' design we always know the true effect. But it is not clear that using only the control data to test for a placebo treatment effect is a relevant comparison to the original data: only a subset of the data is used, and the treatment effect is generally different from the truth. In general we find that it is unable even to replicate biases, let alone absolute biases.

Results Placebo

Table: Correlations Between the Biases in the Uncalibrated and Calibrated Placebo Designs and in the Original NSW-CPS and NSW-PSID Data Sets

                                  Uncalibrated              Calibrated
                                  NSW-PSID   NSW-CPS        NSW-PSID   NSW-CPS
Bias vs. mean bias                0.337**    0.353**        0.403**    0.470***
                                  (0.048)    (0.041)        (0.018)    (0.004)
Abs. bias vs. abs. mean bias      0.022      0.045          0.273      0.015
                                  (0.900)    (0.801)        (0.119)    (0.930)
Rank vs. rank                     0.061      0.187          0.351**    0.178
                                  (0.730)    (0.289)        (0.042)    (0.307)
Exclude outliers                  Y          Y              Y          Y
Number of estimators              35         34             34         35

NOTE: p-values are in parentheses. We define outliers as estimators whose mean biases are more than three standard deviations away from the average mean bias; here these are matching on the propensity score, N = 40 (second column) and bias-adjusted matching on covariates, N = 40 (third column). *Statistically significant at the 10% level; **at the 5% level; ***at the 1% level.

Conclusions A number of recent papers have suggested that some form of EMCS might overcome the design dependence issues common in MCS. We considered two forms of EMCS: `structured' and `placebo'. We find that the structured design is only informative if the treatment effect in the data is the same as that implied by the DGP; this is clearly untestable, and if we knew the true treatment effect then we would stop there. The placebo design appears to be even more problematic. Unfortunately we have only very negative results: we do not find any silver bullet for choosing an estimator in a particular circumstance. For now it is best to continue using multiple approaches.

Structured design (correlated) As before, but now we want to allow the covariates to be correlated in a way that matches the original data. In particular, we want to draw each binary X^(n) from the distribution suggested by the data conditional on T and {X^(1), ..., X^(n-1)}, and to draw the continuous covariates jointly conditional on the discrete covariates, so that we just need the mean, variance and covariance.
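One way to implement the sequential draws for the binary covariates is sketched below; falling back to the marginal probability when a conditioning cell is empty is my assumption, since the slide does not specify how empty cells are handled.

```python
# Sequential draws for correlated binary covariates: X^(j) is drawn from the
# empirical distribution conditional on T = t and the values drawn so far.
import numpy as np

def draw_binary_correlated(X_bin, T, t, n_draws, rng):
    src = X_bin[T == t]                 # original observations with T = t
    k = src.shape[1]
    out = np.empty((n_draws, k), dtype=int)
    out[:, 0] = rng.binomial(1, src[:, 0].mean(), size=n_draws)
    for j in range(1, k):
        for i in range(n_draws):
            # empirical Pr(X^(j) = 1) among rows matching the draw so far
            match = np.all(src[:, :j] == out[i, :j], axis=1)
            p = src[match, j].mean() if match.any() else src[:, j].mean()
            out[i, j] = rng.binomial(1, p)
    return out
```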

Placebo design (calibrated) In the `uncalibrated' placebo design, λ = 1. Huber et al. (2013) suggest this should guarantee `selection [into treatment] that corresponds roughly to the one in our population'. This is only true if the degree of covariate overlap between treated and controls in the original data were the same as the overlap between placebo treated and placebo controls in the sample. There is no reason we should expect this to be true.

Placebo design (calibrated) We can grid-search λ ∈ {0.01, 0.02, ..., 0.99} and find the value of λ that minimises the RMSD between the simulated overlap and the overlap in the data. `Overlap' is defined here as the proportion of placebo treated individuals whose estimated propensity score lies between the minimum and maximum propensity score among the placebo controls.
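A sketch of this calibration follows, assuming a hypothetical user-supplied function `simulate_overlap(lam)` that runs the placebo design at a given λ and returns the overlap in each simulated sample.

```python
# Grid search for lambda minimising the RMSD between simulated overlap and
# the overlap observed in the original data. `simulate_overlap` is assumed.
import numpy as np

def overlap(p_treated, p_control):
    """Share of placebo treated whose pscore lies within the control range."""
    return np.mean((p_treated >= p_control.min()) & (p_treated <= p_control.max()))

def calibrate_lambda(target_overlap, simulate_overlap):
    grid = np.arange(0.01, 1.00, 0.01)          # lambda in {0.01, ..., 0.99}
    rmsd = [np.sqrt(np.mean((np.asarray(simulate_overlap(lam)) - target_overlap) ** 2))
            for lam in grid]
    return grid[int(np.argmin(rmsd))]
```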