Propensity Score Analysis: Its rationale & potential for applied social/behavioral research
Bob Pruzek, University at Albany
Aims: First, to introduce key ideas that underpin propensity score (PS) methodology and to provide some history of this subject; several of the main references will be discussed briefly. An example involving analysis of data from a true experiment will be used to develop some of the basic PS logic. Methods for conditioning will be emphasized, as will randomization and some other essentials. Next, counterfactual logic will be introduced and briefly illustrated.
I have prepared a PSA wiki at http://propensityscoreanalysis.pbwiki.com where a number of files may be found that you may find informative about PSA. Feel free to add others, and to see/add links there. Password: PSA. As for software related to PSA, several of the packages in R provide functions for executing PSA-related computations and graphics. Particularly if you are just beginning to learn the R system, the links and files there may help. See my wiki: http://epym08learnr.pbwiki.com/ Password: R.
Propensity score analysis, as we shall see, is strongly connected to true experiments; in fact, PSA owes its central rationale to the logic that underpins such studies. For true experiments, units are randomly allocated to the (two) treatment groups at the outset of an experiment, that is, before any treatments begin. This helps to ensure that the units allocated to each treatment group are in one fundamental sense comparable to one another, so that when the time for experimental assessment arrives, the investigator can argue that (in principle) it was the treatments that caused the observed differences. If one group scores systematically higher than the other, then thanks to randomized allocation of units to treatments, this finding can (with some qualifications) be attributed to the treatments, and not to some other factors.
Failure to randomize implies a greater likelihood of selection bias, this being the central problem toward which propensity score methods are mostly directed, usually (but not always) in the context of observational studies. Even in the best of circumstances, however, I will argue below that true experiments may fail. The problem will be seen as one where, lacking a design soundly based on comprehensive subject-matter knowledge, results of true experiments can be either uninformative or misleading (regardless of sample size). The problem generalizes to PS studies too, so it will be worthy of some discussion.
Three people have written the key articles and books that underpin propensity score methods: William Cochran, his student Donald Rubin, and then his student Paul Rosenbaum. Rubin's most recent book is in fact a compilation of articles he has at least coauthored, his Matched Sampling for Causal Effects (2006, Cambridge). Rosenbaum completed his Observational Studies (2nd ed.) in 2002. A review of one of Cochran's reports, done 40 years ago, is worth brief examination.
Cochran (1968) studied death rates of smokers and nonsmokers. It had been found, using unstratified data, that death rates for smokers and non-smokers were nearly identical (evidence that many smokers and manufacturers of tobacco products found greatly to their liking). But Cochran decided to sort both smokers and non-smokers by age; after re-calculating rates within age-based strata he found death rates were on average 40 to 50% higher for smokers than non-smokers -- and this was for very large samples. Results of this kind represent early versions of what now can be seen as propensity score analysis, a term that gave nearly a million hits in a recent Google search!
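The effect of such an age adjustment can be sketched with hypothetical numbers (not Cochran's data): because smokers in such samples tend to be younger, crude rates can mask, or even reverse, the stratum-level comparison. A minimal Python sketch:

```python
# Hypothetical numbers (not Cochran's data): smokers here skew young,
# so the crude comparison reverses the within-stratum comparison.
strata = {
    "young": {"smoker": (800, 16), "nonsmoker": (200, 2)},    # (n, deaths)
    "old":   {"smoker": (200, 60), "nonsmoker": (800, 160)},
}

def crude_rate(group):
    # Death rate ignoring age entirely (pooled over strata).
    n = sum(strata[s][group][0] for s in strata)
    d = sum(strata[s][group][1] for s in strata)
    return d / n

def stratum_rate(stratum, group):
    n, d = strata[stratum][group]
    return d / n

# Crude rates make smokers look *better*; within each age stratum
# the smoker death rate is clearly higher.
print(f"crude: smoker {crude_rate('smoker'):.3f} vs nonsmoker {crude_rate('nonsmoker'):.3f}")
for s in strata:
    print(f"{s}: smoker {stratum_rate(s, 'smoker'):.3f} vs nonsmoker {stratum_rate(s, 'nonsmoker'):.3f}")
```

Here the crude smoker rate (0.076) is far below the crude nonsmoker rate (0.162), yet smokers fare worse in both age strata; Cochran's stratification removes exactly this kind of confounding.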
More will be said about PSA soon enough, but to help ensure that PSA might be fully understood, I want to take a few minutes to present an example pertaining to a true experiment; the issues I highlight generalize to virtually all comparisons of treatments, experimental and observational. While the virtues of randomization are often praised, there are reasons not to be sanguine even when randomization has been used. The first figure below displays data as if for an experiment with two independent groups. My aim is to focus thinking on one of the simplest possible true experiments, where I shall assume initially that randomization was used to assign individuals to groups. Real-life complexities are avoided when the data are simulated (but there is a cost to this simplification to which I aim to return, if only briefly).
(Let rows = Blocks, columns = Treatments; a balanced 2 x 2 factorial, $aov.summary.)

Cell means:
                 A=1     A=2    mean
    B=1        11.40    9.08    10.2
    B=2         8.36   12.10    10.2
    col. mean   9.88   10.59    10.2

Interpretation (re: graphic): the red dotted line above depicts row B1; the blue dotted line above depicts row B2.

Summary table for the 2 x 2 design:
                           Df   Sum Sq   Mean Sq   F value   Pr(>F)
    factor(a)               1       12        12      1.52     0.22
    factor(b)               1     0.01      0.01    0.0012     0.97
    factor(a):factor(b)     1      228       228     28.37     0.00
    Residuals              96      771         8
Clearly, the interaction effect is huge, but no notable main effects are seen for either Treatments (A) or Blocks (B). Although these are simulated data, as if for a true experiment, this analysis demonstrates that when an analysis ignores a key blocking factor, it can fail to show major features of treatment effects. In nearly any comparable analysis of simple effects (only), whether block effects were simply unavailable to the analysis or the blocking information was never collected (!), the experimenter will generally have no idea of what has been missed. And it is a notable point that the proper statistical conclusion in the preceding case, had blocks been ignored, would be to infer no treatment effects (e.g., A-Factor, N.S.). But if blocking has been used, having identified important individual differences associated with blocks, then, since so many potential blocking variates are generally available, the likelihood may be high of finding (notable?) interaction effects. And prospects for such findings are chiefly dependent on the subject matter and the design, not on statistical considerations.
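The masking phenomenon is easy to reproduce. The sketch below (my own toy simulation, not the slide's data) builds a crossing Block x Treatment interaction and shows that the marginal treatment comparison is near zero while the within-block simple effects are large and opposite in sign:

```python
import random
random.seed(1)

# Toy simulation (mine, not the slide's data): a crossing Block x Treatment
# interaction makes the marginal treatment effect vanish.
def response(block, trt):
    effect = 2.0 if block == 1 else -2.0      # treatment helps B1, hurts B2
    return 10.0 + (effect if trt == 1 else 0.0) + random.gauss(0, 1)

data = [(b, t, response(b, t)) for b in (1, 2) for t in (0, 1) for _ in range(50)]

def mean(xs):
    return sum(xs) / len(xs)

def cell_mean(block, trt):
    return mean([y for b, t, y in data if b == block and t == trt])

# Marginal (blocks-ignored) effect vs. within-block simple effects
overall = mean([y for _, t, y in data if t == 1]) - mean([y for _, t, y in data if t == 0])
b1_effect = cell_mean(1, 1) - cell_mean(1, 0)
b2_effect = cell_mean(2, 1) - cell_mean(2, 0)
print(f"overall {overall:+.2f}, block 1 {b1_effect:+.2f}, block 2 {b2_effect:+.2f}")
```

With blocks ignored, the overall contrast hovers near zero; within blocks the effects are roughly +2 and -2, mirroring the huge interaction and null main effects in the table above.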
The latter effect would normally be called causal, given standard caveats associated with inductive inference. There could be other (possibly causal) effects, were other (blocking) factors identified and incorporated in the analysis. Naturally, for a principle this basic it is to be expected that others have examined this issue before. A review of some of the best early experimental design books (e.g., Cox, 1958) shows that design specialists had convinced themselves in their agricultural work that collections must be relatively homogeneous (in a certain sense) if simple designs were not often to lead to confusion. Cutting to the essence, a good summary would be to say that when target populations are broadly heterogeneous, blocking is likely to be of central importance, and perhaps essential. This basic point seems to have gotten lost in recent times.
Although the preceding analysis focused on the concept of a true experiment, there are good reasons to consider blocking in virtually all studies of treatment effects, whether the study is experimental or observational. The best blocking variables are ones thought likely to be most correlated with the response, and/or likely to interact with the treatments. Some of the most recent work on PSA methods has been directed toward related problems, where attempts are made to make effective use of covariates in both of these ways -- as will be discussed briefly if time permits. But the problems are notably more difficult in observational studies; for true experiments almost everything is easier (although not easy). The current status of applied observational studies is that most of them have not employed blocking, or at least have not done so in an exemplary way. The PSA analyses that are presented below are therefore not wholly exemplary, even though they seem to have answered the key questions that were being asked.
With a look back at the brief example where Bill Cochran used a covariate (age) to stratify units before proceeding to his analysis, we can now introduce the basic features of typical propensity score analyses. There are two basic phases in a PSA. In Phase I, pre-treatment covariates are used to construct a single variable, a propensity score, that aims to summarize all key differences among respondents with respect to the two treatments being compared. (Except for some very recent work, nearly all PSAs to date have focused on two-group comparisons; we shall focus exclusively on such studies.) A number of methods have been used to estimate propensity scores, and we shall examine two below. In Phase II, respondents (from both groups) are sorted on their propensity scores, and then the two groups are compared on one or more outcome measures, conditional on the P-scores.
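The two phases can be sketched schematically. In the Python sketch below, Phase I is assumed to have produced a propensity score for each unit (any estimator could be plugged in); Phase II cuts the scores into quintiles and combines within-stratum treated-minus-control differences. The function names are mine, not from any package:

```python
# Schematic of the two phases (function names are mine, not from a package).
# Phase I is assumed done: `ps` holds an estimated propensity score per unit.
# Phase II: cut scores into quintiles, then combine within-stratum effects.
def quintile_strata(ps):
    """Assign each unit a stratum label 0-4 by propensity-score quintile."""
    order = sorted(range(len(ps)), key=lambda i: ps[i])
    labels = [0] * len(ps)
    for rank, i in enumerate(order):
        labels[i] = min(4, rank * 5 // len(ps))
    return labels

def stratified_effect(ps, z, y):
    """Count-weighted average of within-stratum treated-minus-control means."""
    labels = quintile_strata(ps)
    total, acc = 0, 0.0
    for s in range(5):
        t = [y[i] for i in range(len(y)) if labels[i] == s and z[i] == 1]
        c = [y[i] for i in range(len(y)) if labels[i] == s and z[i] == 0]
        if t and c:                      # skip strata lacking one group
            n = len(t) + len(c)
            acc += n * (sum(t) / len(t) - sum(c) / len(c))
            total += n
    return acc / total

# Tiny check: outcome = 2*z + ps, so the conditional treatment effect is 2.
ps = [i / 20 for i in range(20)]
z = [i % 2 for i in range(20)]
y = [2 * z[i] + ps[i] for i in range(20)]
print(stratified_effect(ps, z, y))   # close to 2 (small residual ps imbalance)
```

The small deviation from 2 in the check reflects residual propensity-score imbalance within strata; finer stratification or matching would shrink it further.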
Formally, a propensity score (cf. Rosenbaum and Rubin, 1983, Biometrika) P(x) is defined as the conditional probability of receiving a treatment, given scores on the observed covariates x. Symbolically,

    P(x_i) = propensity score (also a balancing score) = Pr(z_i = 1 | x_i) = E(z_i | x_i),

where x_i is the vector of covariate values for the i-th individual, and z_i = 1 (Treatment) or 0 (Control). Note that the propensity score can be defined just as well for true experiments, since for two groups, when randomization is used, the probability of receiving the treatment (or being in the control group) is just 1/2, and is a constant for all units or individuals. The first application of PSA that we illustrate below is based on matching each member of the treatment group with a member of the control group, using covariates.
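Matching on covariates can take many forms; one simple possibility is greedy nearest-neighbor matching, sketched below on a single covariate. The function and data are hypothetical illustrations, not the matching procedure used in the study that follows:

```python
# Hypothetical sketch of greedy nearest-neighbor matching on one covariate
# (illustrative only; not the procedure used in the lead study below).
def greedy_match(treated_ages, control_ages):
    """Pair each treated unit with the closest still-unmatched control."""
    available = list(enumerate(control_ages))
    pairs = []
    for i, age in enumerate(treated_ages):
        j, _ = min(available, key=lambda jc: abs(jc[1] - age))
        pairs.append((i, j))                       # (treated idx, control idx)
        available = [(k, c) for k, c in available if k != j]
    return pairs

print(greedy_match([5, 9, 7], [8, 5, 6, 10]))      # [(0, 1), (1, 0), (2, 2)]
```

Greedy matching is order-dependent; optimal matching algorithms avoid that, at some computational cost.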
Data shown in the next slide derive from an observational study by Morton et al. (1982, Amer. Jour. Epidemiology, p. 549 ff.). Children of parents who had worked in a factory where lead was used in making batteries were matched by age and neighborhood with children whose parents did not work in lead-related industries. Whole blood was assessed for lead content to provide responses. Results shown compare Exposed with Control children in what can be seen as a paired-samples design. A conventional dependent-sample analysis shows that the (95%) C.I. for the population mean difference is far from zero. The mean of the difference scores is 5.78, and the results support the interpretation that parents' lead-related occupations tend to influence how much lead is found in their children's blood. This plot is called a Propensity Score Assessment Plot and is produced by the function granova.ds in the R package granova (Pruzek & Helmreich, 2007). Note that the heavy black line on the diagonal corresponds to X = Y, so if X > Y its point is below the identity diagonal. Parallel projections to the lower-left line segment show the distribution of difference scores corresponding to the pairs; the red dashed line shows the average difference score, and the green line segment shows the 95% C.I.
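The dependent-sample analysis underlying such a C.I. is standard: form the pairwise difference scores and compute d-bar +/- t * sd / sqrt(n). A sketch with made-up difference scores (not the lead data):

```python
from math import sqrt
from statistics import mean, stdev

# Generic paired-samples sketch with made-up difference scores (not the
# lead data): the 95% C.I. is d-bar +/- t * sd / sqrt(n).
diffs = [4.0, 7.0, 3.0, 9.0, 5.0, 6.0, 2.0, 8.0]   # exposed minus control
n = len(diffs)
d_bar = mean(diffs)
se = stdev(diffs) / sqrt(n)                         # sample sd of the diffs
t_crit = 2.365                                      # t_{0.975} for df = 7
lo, hi = d_bar - t_crit * se, d_bar + t_crit * se
print(f"mean diff = {d_bar:.2f}, 95% C.I. = ({lo:.2f}, {hi:.2f})")
```

Because the interval excludes zero, the paired analysis would reject a null mean difference, just as in the lead example.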
The graphic shows more, however. Note the wide dispersion of lead measurements for Exposed children in comparison with their Control counterparts. One interpretation is that it is the particular characteristics of parents' experiences at work or at home with their children that need to be taken into account to make the comparison most informative. That is, given the wide variation in blood levels across the Exposed group, where those with the lowest levels are quite comparable with their Control counterparts, but not those corresponding to points on the far right side, it seems reasonable to say that the general hypothesis should be reformulated in terms of specifics: children with lower blood lead levels tended to be those whose parents worked at the factory in jobs with lower exposure to lead, or whose parents practiced good personal hygiene despite high exposure to lead in their jobs.
Although it is not certain that Control and Exposed children did not differ in other ways (than age and neighborhood of residence), Rosenbaum (2002), who discusses this example in detail, uses a sensitivity analysis to show that hidden bias would have to be extreme to explain away differences this large. The essential question asked in a sensitivity analysis is this: is there at least one covariate (the investigator can think of) that could explain as large a difference between treatment and control groups as was observed, or not? In other words, can the analyst rule out an alternative explanation for the effect that was observed, after (comprehensive) consideration of covariates that were not used in construction of the propensity score vector? If not, then the PSA effect stands as likely evidence of a causal relationship of the treatment; if yes, then the case that the treatment caused the difference is weakened accordingly. Sensitivity analyses can be essential to the wrap-up of a PSA study (but they are often not completed, which, alas, is a feature of our following illustration).
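For matched pairs analyzed with a sign test, one simple Rosenbaum-style bound is directly computable: if hidden bias of magnitude Gamma caps each pair's chance of favoring the treated unit at Gamma/(1+Gamma), the worst-case p-value is an upper binomial tail. A hedged sketch, with illustrative counts rather than the study's actual analysis:

```python
from math import comb

# Sketch of a Rosenbaum-style sensitivity bound for the sign test on
# matched pairs (see Rosenbaum, 2002, for the full treatment).  With
# hidden bias of magnitude Gamma, each pair's chance of favoring the
# treated unit is at most Gamma / (1 + Gamma); the worst-case p-value
# is the corresponding upper binomial tail.  Counts here are illustrative.
def sensitivity_pvalue(n_pairs, n_favoring, gamma):
    p = gamma / (1 + gamma)
    return sum(comb(n_pairs, k) * p**k * (1 - p)**(n_pairs - k)
               for k in range(n_favoring, n_pairs + 1))

for gamma in (1.0, 2.0, 4.0):
    print(f"Gamma = {gamma}: worst-case p = {sensitivity_pvalue(33, 30, gamma):.4g}")
```

Gamma = 1 corresponds to no hidden bias; the Gamma at which the worst-case p-value first exceeds 0.05 measures how extreme a hidden bias would have to be to explain away the observed difference.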
Consider next estimation of propensity scores for a real medical application, using data on 996 initial Percutaneous Coronary Interventions (PCIs) performed in 1997 at the Lindner Center, Christ Hospital, Cincinnati. Description: data from an observational study of 996 patients receiving a PCI at Ohio Heart Health in 1997 and followed for at least 6 months by the staff of the Lindner Center. This is a landmark dataset in the literature on propensity score adjustment for treatment selection bias due to the practice of evidence-based medicine; patients receiving abciximab tended to be more severely diseased than those who did not receive it. The binary variable abcix indicates control (0) or treatment (1). Logistic regression was used initially to estimate propensity scores.
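As a sketch of that Phase I step, the fragment below fits a logistic regression by plain gradient ascent using only the standard library (in practice one would use R's glm(..., family = binomial) or an equivalent); the simulated data and all settings here are assumptions of mine, not the Lindner analysis:

```python
import math
import random
random.seed(7)

# Sketch of a Phase I fit: logistic regression by plain gradient ascent,
# stdlib only.  Data and settings are simulated assumptions of mine.
n = 400
x = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(n)]
# Selection into treatment is driven by the first covariate only.
z = [1 if random.random() < 1 / (1 + math.exp(-xi[0])) else 0 for xi in x]

w = [0.0, 0.0, 0.0]                      # intercept, slope on x1, slope on x2
for _ in range(1000):
    grad = [0.0, 0.0, 0.0]
    for xi, zi in zip(x, z):
        p = 1 / (1 + math.exp(-(w[0] + w[1] * xi[0] + w[2] * xi[1])))
        for j, v in enumerate((1.0, xi[0], xi[1])):
            grad[j] += (zi - p) * v      # score of the Bernoulli likelihood
    w = [wj + 0.05 * g / n for wj, g in zip(w, grad)]

# The fitted probabilities are the estimated propensity scores.
ps = [1 / (1 + math.exp(-(w[0] + w[1] * xi[0] + w[2] * xi[1]))) for xi in x]
print(f"fitted slopes: x1 {w[1]:.2f}, x2 {w[2]:.2f}")
```

The slope on x1 dominates, illustrating the point made below: the covariates that best discriminate treatment from control are the ones most strongly associated with the PS variable.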
The next two slides show two graphics. The first shows densities for both of the Lindner treatment groups, where the counts were 298 for Control and 698 for Treatment. Five strata are also identified by vertical lines in the plot. The loess plot, the second figure, is particularly informative, as it shows the (non-linear) regression lines for predicting costs (log metric) for the two treatments. Interpretation of the loess graphic comes after its presentation. What you need to know at this point is that a form of regression (logistic regression) was used to produce the baseline variable on the next slides (the propensity scores (PS)); this PS variable is itself a function of the covariates chosen in the regression model. The variables that predict or discriminate best between treatment and control groups are the ones most strongly associated with the PS variable.
The preceding graphic shows all 996 data points, colored to correspond with the treatment/control designations. Note that the loess (non-linear) regression curve for abcix = 1 lies above the curve for the control across nearly the full range of the propensity scores. This indicates that the costs of treatment (a proxy for health problems) for abcix were almost universally higher than for their control counterparts, after adjusting for all covariate differences using LR-based P-scores; 95% C.I. = (.05, .21) in the log metric. Summary results for strata are shown in the table below, as well as in the graphic on the next page.

    Stratum   counts.0   counts.1   means.0   means.1   diff.means
       1         97        107       9.36      9.48       0.12
       2         73        122       9.37      9.55       0.18
       3         62        138       9.39      9.60       0.21
       4         48        151       9.47      9.62       0.15
       5         18        180       9.60      9.60       0.00
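As an arithmetic check on the table above, the five within-stratum mean differences can be combined into a single stratified estimate by weighting each stratum by its total count (one common choice; other weightings exist):

```python
# Arithmetic check on the stratum table above: combine the within-stratum
# mean differences into one stratified estimate, weighting each stratum
# by its total count (a common choice; other weightings exist).
strata = [  # (counts.0, counts.1, means.0, means.1)
    (97, 107, 9.36, 9.48),
    (73, 122, 9.37, 9.55),
    (62, 138, 9.39, 9.60),
    (48, 151, 9.47, 9.62),
    (18, 180, 9.60, 9.60),
]
total = sum(n0 + n1 for n0, n1, _, _ in strata)
estimate = sum((n0 + n1) * (m1 - m0) for n0, n1, m0, m1 in strata) / total
print(f"n = {total}, stratified estimate (log cost) = {estimate:.3f}")
```

The count-weighted estimate, about 0.13 in the log metric, falls inside the quoted 95% C.I. of (.05, .21).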
Summary for the PSA of the Lindner data (only parts of which were seen here). Before matching (n=996: 698 treated and 298 controls) (not shown), 5 out of 7 variables, including the major covariates ejecfrac, Ves1pro, acutemi, diabetic and stent, show statistically significant differences between treatment and control groups. For the LR-based quintile stratification method, five strata were defined. Treatment effects differed somewhat across strata, indicating some interaction between covariates and treatments. Strata 2, 3 & 4 showed the strongest treatment effects (one might aim to characterize covariate characteristics for these strata); stratum 5 showed a change of sign in the mean response difference, but essentially a null effect. The related loess regression plot shows the full range of adjusted effect results. For the classification tree stratification method (not shown), 6 strata were found, and concordance of treatment effects across propensity score levels was observed. DAE was used to summarize treatment effects, and the t test yielded significance, P < 0.05, in favor of the treatment. That there were notable differences in results for the LR and classification tree methods should not be surprising, given the major algorithmic differences. The graphical-visualization methods for PSA provide especially clear images of the sizes and direction of treatment effects, and help to clarify the differences found for the different PSA methods.
Going beyond these examples, consider comparing two behavioral regimes with one another, say two diets, two exercise plans, or two food supplement conditions. When medical researchers develop vaccines to prevent disease, drugs to treat illness, or methods to aid victims of injury, they may be said to be designing treatments. When biologists study the effects of clear-cutting forests, or educational psychologists evaluate how class size affects learning, they are studying the consequences of treatments. In short, treatments are ubiquitous and can take virtually any form, but it is treatment comparisons that are often the essence of applied scientific study. If a non-standard treatment is selected, then the particular question of how to define the control group may need special attention. The requirement that observational studies use effective or appropriate pre-treatment covariates is most notable.