Propensity Score Analysis: Its rationale & potential for applied social/behavioral research. Bob Pruzek University at Albany


Aims: First, to introduce key ideas that underpin propensity score (PS) methodology and to provide some history of this subject; several of the main references will be discussed briefly. An example involving analysis of data from a true experiment will be used to develop some of the basic PS logic. Methods for conditioning will be emphasized, as will randomization and some other essentials. Next, counterfactual logic will be introduced and briefly illustrated.

I have prepared a PSA wiki at http://propensityscoreanalysis.pbwiki.com where a number of files may be found that you may find informative about PSA. Feel free to add others, and to see or add links there; the password is PSA. As for software related to PSA, several packages in R provide functions for executing PSA-related computations and graphics. Particularly if you are just beginning to learn the R system, the links and files on my R wiki may help: http://epym08learnr.pbwiki.com/ (password R).

Propensity score analysis, as we shall see, is strongly connected to true experiments; in fact, PSA owes its central rationale to the logic that underpins such studies. For true experiments, units are randomly allocated to the (two) treatment groups at the outset of an experiment, that is, before any treatments begin. This helps to ensure that the units allocated to each treatment group are, in one fundamental sense, comparable to one another, so that when the time for experimental assessment arrives, the investigator can argue that (in principle) it was the treatments that caused the observed differences. If one group scores systematically higher than the other, then thanks to randomized allocation of units to treatments this finding can (with some qualifications) be attributed to the treatments, and not to other factors.

Failure to randomize implies a greater likelihood of selection bias, and this is the central problem toward which propensity score methods are mostly directed, usually (but not always) in the context of observational studies. Even in the best of circumstances, however, I will argue below that true experiments may fail. The problem will be seen as one where, lacking a design soundly based on comprehensive subject-matter knowledge, results of true experiments can be either uninformative or misleading (regardless of sample size). The problem generalizes to PS studies too, so it will be worthy of some discussion.

Three people have written the key articles and books that underpin propensity score methods: William Cochran, his student Donald Rubin, and Rubin's student Paul Rosenbaum. Rubin's most recent book is in fact a compilation of articles he has at least coauthored, Matched Sampling for Causal Effects (Cambridge, 2006). Rosenbaum completed the second edition of his Observational Studies in 2002. A review of one of Cochran's reports, done 40 years ago, is worth brief examination.

Cochran (1968) studied death rates of smokers and non-smokers. Using unstratified data, it had been found that death rates for smokers and non-smokers were nearly identical (evidence that many smokers and manufacturers of tobacco products found greatly to their liking). But Cochran decided to sort both smokers and non-smokers by age; after re-calculating the rates within age-based strata he found death rates that were on average 40 to 50% higher for smokers than for non-smokers -- and this was for very large samples. Results of this kind represent early versions of what can now be seen as propensity score analysis, a term that gave nearly a million hits in a recent Google search!

More will be said about PSA soon enough, but to help ensure that PSA is fully understood, I want to take a few minutes to present an example pertaining to a true experiment; the issues I highlight generalize to virtually all comparisons of treatments, experimental and observational. While the virtues of randomization are often praised, there are reasons not to be sanguine even when randomization has been used. The first figure below displays data as if for an experiment with two independent groups. My aim is to focus thinking on one of the simplest possible true experiments, where I shall assume initially that randomization has been used to assign individuals to groups. Real-life complexities are avoided when the data are simulated (but there is a cost to this simplification, to which I aim to return, if only briefly).

(Let rows = Blocks, columns = Treatments; a balanced 2 x 2 factorial.)

Cell means:
                A1       A2      mean
  B1          11.40     9.08     10.2
  B2           8.36    12.10     10.2
  col. mean    9.88    10.59     10.2

Interpretation (re: graphic): the red dotted line depicts row B1; the blue dotted line depicts row B2.

aov summary table for the 2 x 2 design:
                         Df   Sum Sq   Mean Sq   F value   Pr(>F)
  factor(a)               1    12        12        1.52      0.22
  factor(b)               1     0.01      0.01     0.0012    0.97
  factor(a):factor(b)     1   228       228       28.37      0.00
  Residuals              96   771         8

Clearly, the interaction effect is huge, but no notable main effects are seen for either the treatment factor (A) or the blocking factor (B). Although these are simulated data, as if for a true experiment, this analysis demonstrates that when an analysis ignores a key blocking factor, it can fail to show major features of treatment effects. In nearly any comparable analysis of simple effects (only), whether block effects are unavailable to the analysis or the blocking information was never even collected (!), the experimenter will generally have no idea of what has been missed. And it is a notable point that the proper statistical conclusion in the preceding case, absent the blocks, would have been to infer no treatment effects (e.g., the A factor, N.S.). But if blocking has been used, having identified important individual differences associated with blocks, then, since so many potential blocking variates are generally available, the likelihood may be high of finding (notable?) interaction effects. And prospects for such findings are chiefly dependent on the subject matter and the design, not on statistical considerations.
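To make this concrete, here is a minimal R sketch (not from the original slides) that simulates a balanced 2 x 2 design with cell means patterned after the table above; the cell size of 25 and the error variance of 8 are assumptions chosen to mimic the printed aov table. Fitting the treatment factor alone suggests no effect, while the model that includes the blocking factor reveals the large interaction.

  # Simulated balanced 2 x 2 design: treatment factor a crossed with block factor b
  set.seed(1)
  cells <- expand.grid(a = factor(1:2), b = factor(1:2))
  mu    <- c(11.40, 9.08, 8.36, 12.10)          # cell means patterned after the table above
  dat   <- data.frame(cells[rep(1:4, each = 25), ],
                      y = rep(mu, each = 25) + rnorm(100, sd = sqrt(8)))

  summary(aov(y ~ a, data = dat))      # treatments alone: no apparent effect
  summary(aov(y ~ a * b, data = dat))  # with blocks: a large a:b interaction emerges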

The latter effect would normally be called causal, given the standard caveats associated with inductive inference. There could be other (possibly causal) effects, were other (blocking) factors identified and incorporated in the analysis. Naturally, for a principle this basic it is to be expected that others have examined the issue before. A review of some of the best early experimental design books (e.g., Cox, 1958) shows that design specialists had convinced themselves in their agricultural work that collections of units must be relatively homogeneous (in a certain sense) if simple designs were not often to lead to confusion. Cutting to the essence, a good summary would be to say that when target populations are broadly heterogeneous, blocking is likely to be of central importance, and perhaps essential. This basic point seems to have gotten lost in recent times.

Although the preceding analysis focused on the concept of a true experiment, there are good reasons to consider blocking in virtually all studies of treatment effects, whether the study is experimental or observational. The best blocking variables are those thought likely to be most correlated with the response, and/or likely to interact with the treatments. Some of the most recent work on PSA methods has been directed toward related problems, where attempts are made to make effective use of covariates in both of these ways -- as will be discussed briefly if time permits. But the problems are notably more difficult in observational studies; for true experiments almost everything is easier (although not easy). The current status of applied observational studies is that most have not employed blocking, or at least have not done so in an exemplary way. The PSA analyses presented below are therefore not wholly exemplary, even though they seem to have answered the key questions that were being asked.

Looking back at the brief example where Bill Cochran used a covariate (age) to stratify units before proceeding to his analysis, we can now introduce the basic features of typical propensity score analyses. There are two basic phases in a PSA. In Phase I, pre-treatment covariates are used to construct a single variable, a propensity score, that aims to summarize all key differences among respondents with respect to the two treatments being compared. (Except for some very recent work, nearly all PSAs to date have focused on two-group comparisons; we shall focus exclusively on such studies.) A number of methods have been used to estimate propensity scores, and we shall examine two below. In Phase II, respondents from both groups are sorted on their propensity scores, and the two groups are then compared on one or more outcome measures, conditional on the propensity scores.
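As a hedged illustration of the two phases (not part of the original talk), here is a minimal R sketch using simulated data; all variable names, coefficients, and the choice of quintile strata are assumptions made only for illustration. Phase I estimates the propensity score by logistic regression on two made-up covariates; Phase II stratifies on score quintiles and compares group means within strata.

  set.seed(42)
  n  <- 500
  x1 <- rnorm(n); x2 <- rnorm(n)                   # two pre-treatment covariates
  z  <- rbinom(n, 1, plogis(0.8 * x1 - 0.5 * x2))  # treatment assignment depends on covariates
  y  <- 2 + 1.5 * z + x1 + 0.5 * x2 + rnorm(n)     # outcome with a built-in treatment effect of 1.5

  # Phase I: estimate the propensity score Pr(z = 1 | x)
  ps <- fitted(glm(z ~ x1 + x2, family = binomial))

  # Phase II: stratify on propensity-score quintiles and compare groups within strata
  stratum <- cut(ps, quantile(ps, 0:5 / 5), include.lowest = TRUE, labels = 1:5)
  round(tapply(y, list(stratum, z), mean), 2)      # within-stratum control/treatment means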

Formally, a propensity score (cf. Rosenbaum and Rubin, 1983, Biometrika), P(x), is defined as the conditional probability of receiving the treatment, given scores on the observed covariates x. Symbolically,

  P(x_i) = propensity score (also a balancing score) = Pr(z_i = 1 | x_i) = E(z_i | x_i),

where x_i is the vector of covariate values for the i-th individual and z_i = 1 (Treatment) or 0 (Control). Note that the propensity score can be defined just as well for true experiments: for two groups, when randomization is used, the probability of receiving the treatment (or of being in the control group) is just 1/2, a constant for all units or individuals. The first application of PSA that we illustrate below is based on matching each member of the treatment group with a member of the control group, using covariates.

Data shown in the next slide derive from an observational study by Morton et al. (1982, Amer. Jour. of Epidemiology, p. 549 ff.). Children of parents who had worked in a factory where lead was used in making batteries were matched by age and neighborhood with children whose parents did not work in lead-related industries. Whole blood was assessed for lead content to provide the responses. The results shown compare Exposed with Control children in what can be seen as a paired-samples design. A conventional dependent-samples analysis shows that the 95% C.I. for the population mean difference is far from zero; the mean of the difference scores is 5.78, and the results support the interpretation that parents' lead-related occupations tend to influence how much lead is found in their children's blood. The plot, called a Propensity Score Assessment Plot, is produced by the function granova.ds in the R package granova (Pruzek & Helmreich, 2007). The heavy black line on the diagonal corresponds to X = Y, so if X > Y for a pair, its point falls below the identity diagonal. Parallel projections to the lower-left line segment show the distribution of difference scores corresponding to the pairs; the red dashed line shows the average difference score, and the green line segment shows the 95% C.I.
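For readers who want to try this, here is a hedged sketch of how such a paired assessment plot could be produced with granova.ds; the blood-lead values below are invented stand-ins (the real matched data are discussed in Rosenbaum, 2002), and the exact plotting conventions are documented in ?granova.ds.

  # Paired-samples assessment plot with the granova package (invented data)
  library(granova)
  set.seed(7)
  control <- round(rnorm(33, mean = 16, sd = 4))           # invented control blood-lead values
  exposed <- round(control + rnorm(33, mean = 6, sd = 8))  # invented exposed values, higher on average
  lead    <- data.frame(Exposed = exposed, Control = control)

  granova.ds(lead)                           # dependent-sample assessment plot
  t.test(exposed, control, paired = TRUE)    # conventional dependent-samples C.I.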

The graphic shows more, however. Note the wide dispersion of lead measurements for Exposed children in comparison with their Control counterparts. One interpretation is that the particular characteristics of parents' experiences at work, and at home with their children, need to be taken into account to make the comparison most informative. That is, given the wide variation in blood lead levels across the Exposed group (where those with the lowest levels are quite comparable with their Control counterparts, but those corresponding to points on the far right side are not), it seems reasonable to say that the general hypothesis should be reformulated in terms of specifics: children whose parents worked at the factory in jobs with lower levels of exposure to lead, or whose parents practiced good personal hygiene despite high exposure to lead in their jobs, tended to have lower blood lead levels.

Although it is not certain that Control and Exposed children did not differ in other ways (beyond age and neighborhood of residence), Rosenbaum (2002), who discusses this example in detail, uses a sensitivity analysis to show that hidden bias would have to be extreme to explain away differences this large. The essential question asked in a sensitivity analysis is this: is there at least one covariate (that the investigator can think of), not used in constructing the propensity score, that could explain as large a difference between treatment and control groups as was observed, or not? In other words, after (comprehensive) consideration of such covariates, can the analyst rule out alternative explanations for the observed effect? If no such covariate is plausible, then the PSA effect stands as likely evidence of a causal effect of the treatment; if one is plausible, then the observed difference is too small, relative to the hidden bias that covariate could produce, to make a strong case that the treatment caused it. Sensitivity analyses can be essential to the wrap-up of a PSA study (but they are often not completed, which, alas, is a feature of the illustration that follows).
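Rosenbaum's own analysis of the lead example uses the Wilcoxon signed-rank statistic; as a simpler hedged illustration of the same idea, the sketch below computes sensitivity bounds for the sign test by hand. Under a hidden-bias factor Gamma, the probability that a matched pair shows a positive (exposed minus control) difference lies between 1/(1 + Gamma) and Gamma/(1 + Gamma), and these bound the one-sided p-value. The difference scores here are invented for illustration.

  # Hand-rolled Rosenbaum-style sensitivity bounds for the sign test on matched pairs
  sign_test_bounds <- function(d, Gamma) {
    d      <- d[d != 0]                  # drop tied pairs
    S      <- length(d)                  # number of informative pairs
    T_plus <- sum(d > 0)                 # pairs where exposed > control
    p_hi   <- Gamma / (1 + Gamma)        # max probability of a positive pair under bias Gamma
    p_lo   <- 1 / (1 + Gamma)            # min probability of a positive pair under bias Gamma
    c(Gamma   = Gamma,
      p_upper = pbinom(T_plus - 1, S, p_hi, lower.tail = FALSE),
      p_lower = pbinom(T_plus - 1, S, p_lo, lower.tail = FALSE))
  }

  set.seed(11)
  d <- rnorm(33, mean = 6, sd = 8)       # invented exposed-minus-control differences
  t(sapply(1:5, function(g) sign_test_bounds(d, g)))
  # The smallest Gamma at which p_upper exceeds 0.05 indicates how strong hidden
  # bias would have to be to explain away the observed difference.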

Consider next the estimation of propensity scores for a real medical example, using data on 996 initial Percutaneous Coronary Interventions (PCIs) performed in 1997 at the Lindner Center, Christ Hospital, Cincinnati. Description: data from an observational study of 996 patients receiving a PCI at Ohio Heart Health in 1997 and followed for at least 6 months by the staff of the Lindner Center. This is a landmark dataset in the literature on propensity score adjustment for treatment selection bias due to the practice of evidence-based medicine; patients receiving abciximab tended to be more severely diseased than those who did not receive a cascade blocker. The binary variable abcix indicates control (0) or treatment (1). Logistic regression was used initially to estimate the propensity scores.

The next two slides show two graphics. The first shows densities of the estimated propensity scores for both of the Lindner treatment groups, where the counts were 298 for Control and 698 for Treatment; five strata are also identified by vertical lines in the plot. The loess plot, the second figure, is particularly informative, as it shows the (non-linear) regression lines for predicting costs (log metric) for the two treatments. Interpretation of the loess graphic comes after its presentation. What you need to know at this point is that a form of regression (logistic regression) was used to produce the baseline variable on the next slides, the propensity score (PS); this PS variable is itself a function of the covariates chosen in the regression model. The variables that predict or discriminate best between treatment and control groups are the ones most strongly associated with the PS variable.
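A hedged sketch of how such a loess comparison might be constructed in R, assuming the lindner data as distributed with the PSAgraphics package; the covariate set below (stent, height, female, diabetic, acutemi, ejecfrac, ves1proc) follows that package's documentation and need not match the model used for the original slides.

  # Logistic-regression propensity scores and group-wise loess curves for log cost
  library(PSAgraphics)               # provides the lindner data (assumed)
  data(lindner)

  ps_fit <- glm(abcix ~ stent + height + female + diabetic + acutemi +
                  ejecfrac + ves1proc,
                family = binomial, data = lindner)
  lindner$ps <- fitted(ps_fit)
  logcost    <- log(lindner$cardbill)

  plot(lindner$ps, logcost, col = ifelse(lindner$abcix == 1, "red", "blue"),
       pch = 16, cex = 0.6,
       xlab = "Estimated propensity score", ylab = "log(cardiac cost)")
  for (g in 0:1) {                   # add a loess curve for each group
    idx <- lindner$abcix == g
    fit <- loess(logcost[idx] ~ lindner$ps[idx])
    ord <- order(lindner$ps[idx])
    lines(lindner$ps[idx][ord], fitted(fit)[ord],
          col = c("blue", "red")[g + 1], lwd = 2)
  }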

The preceding graphic shows all 996 data points, colored to correspond with the treatment/control designations. Note that the loess (non-linear) regression curve for abcix = 1 lies above the curve for the control group across nearly the full range of the propensity scores. This indicates that treatment costs (a proxy for health problems) for the abcix group were almost universally higher than for their control counterparts, after adjusting for all covariate differences using the LR-based propensity scores; the 95% C.I. is (0.05, 0.21) in the log metric. Summary results for the strata are shown in the table below, as well as in the graphic on the next page.

  Stratum   counts.0   counts.1   means.0   means.1   diff.means
     1          97        107       9.36      9.48       0.12
     2          73        122       9.37      9.55       0.18
     3          62        138       9.39      9.60       0.21
     4          48        151       9.47      9.62       0.15
     5          18        180       9.60      9.60       0.00
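A hedged sketch of how a stratum summary of this general form could be computed, again assuming the PSAgraphics lindner data and the illustrative covariate set from the previous sketch; since the exact propensity model used for the slides is not shown, the counts and means produced need not reproduce the table above.

  # Quintile strata of the propensity score, with counts and mean log cost per group
  library(PSAgraphics)
  data(lindner)

  ps <- fitted(glm(abcix ~ stent + height + female + diabetic + acutemi +
                     ejecfrac + ves1proc, family = binomial, data = lindner))
  stratum <- cut(ps, quantile(ps, 0:5 / 5), include.lowest = TRUE, labels = 1:5)
  logcost <- log(lindner$cardbill)

  counts <- table(stratum, lindner$abcix)
  means  <- round(tapply(logcost, list(stratum, lindner$abcix), mean), 2)
  data.frame(counts.0 = as.vector(counts[, 1]), counts.1 = as.vector(counts[, 2]),
             means.0 = means[, 1], means.1 = means[, 2],
             diff.means = means[, 2] - means[, 1])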

Summary for the PSA of the Lindner data (only parts of which were seen here). Before matching (n = 996: 698 treated and 298 controls; not shown), 5 out of 7 variables, including the major covariates ejecfrac, ves1proc, acutemi, diabetic and stent, showed statistically significant differences between treatment and control groups. For the LR-based quintile stratification method, five strata were defined. Treatment effects differed somewhat across strata, indicating some interaction between covariates and treatments. Strata 2, 3 and 4 showed the strongest treatment effects (one might aim to characterize the covariate profiles of these strata); stratum 5 showed a change of sign in the mean response difference, but essentially a null effect. The related loess regression plot shows the full range of adjusted effect results. For the classification-tree stratification method (not shown), 6 strata were found and concordance of treatment effects across propensity score levels was observed. DAE was used to summarize treatment effects, and the t test yielded significance, P < 0.05, in favor of the treatment. That there were notable differences in results for the LR and classification-tree methods should not be surprising, given the major algorithmic differences. The graphical-visualization methods for PSA provide especially clear images of the sizes and directions of treatment effects, and help to clarify the differences found for the different PSA methods.

Going beyond these examples, consider comparing two behavioral regimes with one another, say two diets, two exercise plans, or two food-supplement conditions. When medical researchers develop vaccines to prevent disease, drugs to treat illness, or methods to aid victims of injury, they may be said to be designing treatments. When biologists study the effects of clear-cutting forests, or educational psychologists evaluate how class size affects learning, they are studying the consequences of treatments. In short, treatments are ubiquitous and can take virtually any form, but it is treatment comparisons that are often the essence of applied scientific study. If a non-standard treatment is selected, then the particular question of how to define the control group may need special attention. The requirement that observational studies use effective and appropriate pre-treatment covariates is most notable.