Effects of propensity score overlap on the estimates of treatment effects. Yating Zheng & Laura Stapleton

Introduction

Recent years have seen remarkable development in methods for estimating average treatment effects in non-experimental designs. Researchers have developed various approaches, including matching (Rosenbaum, 1989), regression (Hahn, 1998; Heckman et al., 1998), and propensity score methods (Rosenbaum & Rubin, 1983; Hirano et al., 2003). Among these, propensity score methods are popular because 1) compared with matching, they can easily construct matched sets with similar distributions of multiple covariates to facilitate estimation of unbiased treatment effects, and 2) they avoid the assumption violations that can arise in regression (e.g., the functional form may not correctly specify the relation between the covariates and the outcome in regions where no data are observed).

However, propensity score methods are not without limitations. A potential concern is that they require sufficient overlap of the propensity score distributions between the treatment and control groups (Crump et al., 2009), which may not hold in practice. Previous studies have seldom explored what constitutes sufficient overlap and how the degree of overlap influences the estimates of treatment effects. In this study, a simulation is used to explore the effects of propensity score overlap on the point estimates of treatment effects as well as on their sampling variance.

Theoretical Framework

The propensity score is defined as the probability of receiving the treatment given the observed covariates (Rosenbaum & Rubin, 1983).
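In symbols, the definition can be written as below. The notation is an assumption chosen to match the rest of this proposal: T_i is the treatment indicator for subject i, X_i is the vector of observed covariates, and e_i is the propensity score.

```latex
e_i = e(X_i) = \Pr(T_i = 1 \mid X_i)
```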
In general, propensity score methods (matching, weighting, and subclassification) work through five steps: 1) identify baseline confounding covariates that could bias estimates of the treatment effect, 2) estimate the propensity scores using logistic regression (or a nonparametric approach) on the baseline covariates, 3) condition on the propensity scores across the treatment and control groups through matching or reweighting of the data, 4) check the conditioning quality (e.g., balance check) of the matched samples, and 5) estimate the treatment effects (Stuart, 2010).

Reliable estimates of treatment effects require sufficient overlap between the propensity score distributions of the treatment and control groups. A lack of overlap can lead to imprecise estimates of the treatment effects (Crump et al., 2009), as insufficient overlap implies that the treatment and control groups are not balanced on the covariates. Step 3 aims to address this issue of unbalanced covariate distributions. However, the current methods have limitations. A common approach is to discard individuals with propensity scores outside the range of the other group (Grzybowski et al., 2003; Vincent et al., 2002), which may change the population to which the results apply (Crump et al., 2009). Another approach is to change the weight, or contribution, of data from participants in the control group, increasing the weights of those with propensity scores similar to the participants in the treatment group and decreasing the weights of those with propensity scores different from those in the treatment group (Heckman et al., 1998; Dehejia & Wahba, 1999). A potential drawback of the weighting method is that the variance may be high if the weights are extreme (Stuart, 2010).

Few studies have explored the effects of propensity score overlap on the estimates of treatment effects or how imprecise the estimates are at different levels of overlap. In this study, we use an index to quantify the overlap and explore the reliability and validity of the estimates at different overlap levels. In addition, we explore the effects of insufficient overlap on the estimates of treatment effects using different propensity score methods (weighting, matching, and doubly robust methods) to provide guidance on which method performs better at which overlap level.

Methods

Research Design

The data generation process follows a common approach used in prior propensity score simulation studies (Kaplan & Chen, 2011; Craycrot, 2016):

1. Generate confounding covariates X_1, X_2, and X_3 from normal and binomial distributions.
2. Calculate the propensity score (ps) using Eq. (1):

\[ ps = \frac{\exp(\beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3)}{1 + \exp(\beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3)} \tag{1} \]

3. Use a Bernoulli distribution with probability equal to the calculated propensity score to decide treatment assignment.
4. Calculate the outcome value using Eq. (2):

\[ Y = \alpha_1 X_1 + \alpha_2 X_2 + \alpha_3 X_3 + \alpha_4 T \tag{2} \]

where T is an indicator of treatment assignment (control = 0, treatment = 1) and α_4 is the true treatment effect. The values of all parameters are listed in Table 1. (In the full presentation, interactions between predictors and interactions between the treatment assignment and the predictors will be added.) Ten thousand replications are run. For each replication, the sample size is 1,000.

Table 1. Values of the parameters

X_1 ~ Normal(mean1, 1), mean1 ~ Normal(0, 1)
X_2 ~ Normal(mean2, 1), mean2 ~ Normal(0.5, 1)
X_3 ~ Binomial(1,000, 0.5)
β_1 = 0.3, β_2 = 0.4, β_3 = -1
α_1 = 0.4, α_2 = -0.3, α_3 = 0.2, α_4 = 0.15

Note. mean1 and mean2 are both vectors of size 10,000 (one value per replication).
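The four data generation steps in Research Design can be sketched for a single replication as follows. This is a minimal illustration under stated assumptions, not the authors' simulation code: the function name `generate_replication` and the use of NumPy are choices made here, and X_3 is assumed to be a binary (Bernoulli 0.5) covariate drawn for each of the 1,000 subjects, which is one reading of the "Binomial(1,000, 0.5)" entry in Table 1. The outcome follows Eq. (2) as written, with no added error term.

```python
import numpy as np

# Parameter values taken from Table 1.
BETA = np.array([0.3, 0.4, -1.0])   # propensity score model, Eq. (1)
ALPHA = np.array([0.4, -0.3, 0.2])  # covariate effects in Eq. (2)
TRUE_EFFECT = 0.15                  # alpha_4, the true treatment effect


def generate_replication(rng, n=1000):
    """Simulate one replication of size n following steps 1-4."""
    # Step 1: covariates. One covariate mean is drawn per replication.
    mean1 = rng.normal(0.0, 1.0)
    mean2 = rng.normal(0.5, 1.0)
    x1 = rng.normal(mean1, 1.0, size=n)
    x2 = rng.normal(mean2, 1.0, size=n)
    # ASSUMPTION: X_3 is binary, one Bernoulli(0.5) draw per subject.
    x3 = rng.binomial(1, 0.5, size=n)
    X = np.column_stack([x1, x2, x3])

    # Step 2: true propensity score, Eq. (1), written as a logistic function.
    ps = 1.0 / (1.0 + np.exp(-(X @ BETA)))

    # Step 3: Bernoulli treatment assignment with probability ps.
    t = rng.binomial(1, ps)

    # Step 4: outcome from Eq. (2) as written (no error term).
    y = X @ ALPHA + TRUE_EFFECT * t
    return X, t, y, ps
```

Calling `generate_replication(np.random.default_rng(seed))` repeatedly with different seeds would give independent replications; drawing the covariate means anew each time is what makes the empirical overlap vary across replications.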
Propensity score methods are then used to estimate treatment effects. First, a logistic regression model is fit with the generated covariates as predictors and the treatment assignment as the outcome, and the fitted model is used to obtain estimated propensity scores. The next step is to calculate the overlap rate of the propensity score distributions, which equals the intersection area of the density plots of the two groups divided by the area of their union (the sum of the areas of the two density plots, with the intersection counted only once). For example, in Figure 1, the overlap rate is the ratio of area 2 to the sum of areas 1, 2, and 3. Because the overlap rate is empirically defined, we cannot control the number of replications at each overlap rate level. Finally, different propensity score methods are used to estimate the treatment effect. For propensity score weighting, the method of weighting by the odds (WBO) is used, with weights calculated as:

\[ w_i = T_i + (1 - T_i)\frac{e_i}{1 - e_i} \tag{3} \]

where w_i is the weight for subject i, T_i is an indicator of whether subject i received the treatment, and e_i is the estimated propensity score for subject i.

Figure 1. Propensity score distributions of treatment and control groups. Area 2 is the intersection of the two densities; areas 1 and 3 are the non-overlapping regions.

Analysis

The estimated treatment effect is calculated as the difference in group means after matching, weighting, or subclassification. Relative bias (the difference between the estimated and true treatment effects as a proportion of the true effect) and variance (the variance of the estimated treatment effects over replications at a specific overlap level) are used to measure the performance of the estimates.

Preliminary Results

The results from propensity score weighting show that, in general, the variance of the estimates decreases as the overlap rate increases (see Figure 2), which is consistent with the findings of previous studies (Stuart, 2010); we quantify this decrease in variance for different overlap levels. Regarding bias, when the overlap rate is extremely small (<0.2), the relative mean bias is comparatively large (see Table 2), but once the overlap rate exceeds 0.2, the relative mean bias becomes small. This implies that caution is needed when using the propensity score weighting method to estimate treatment effects if the propensity score overlap rate is smaller than 0.2. A possible reason for the comparatively higher bias at low overlap levels is that the WBO method does not exclude control individuals who are very different from those receiving treatment. Although their weights are decreased, including a large number of individuals with very different propensities, which is the case at low overlap rates, may bias the estimates. In the full presentation, other propensity score methods (e.g., matching and doubly robust methods) will be explored, as will the inclusion of interactions in the outcome model (Eq. 2). Results from the different methods will be compared to provide guidelines on which method is recommended at which overlap rate.

Figure 2. Relationship between bias and estimated overlap rate.

Table 2. Average bias and variance of the estimated treatment effect

Overlap level   N      Relative mean bias   Variance*
[0, 0.1)        1700   9.9%                 9713.1
[0.1, 0.2)      2591   1.9%                 2146.8
[0.2, 0.3)      2440   0.3%                 565.7
[0.3, 0.4)      1539   0.2%                 172.6
[0.4, 0.5)      866    0.5%                 54.2
[0.5, 0.6)      498    0.1%                 17.0
[0.6, 0.7)      244    0.4%                 7.2
[0.7, 0.8)      91     0.2%                 2.9
[0.8, 0.9)      29     0.1%                 0.5
[0.9, 1]        2      0.3%                 0.0

Note. N is the number of replications with empirical overlap rates in the category listed. Extremely high overlap rates are difficult to obtain given the generation model in Eq. (1), so the number of replications with high overlap levels is very small. Relative mean bias is the ratio of the mean bias to the true treatment effect, where the true treatment effect is 0.15 in this case.
*Variance has been rescaled by a factor of 100,000.
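The overlap rate, the WBO weights in Eq. (3), and the relative bias measure described in the Methods and Analysis sections can be sketched as below. This is an illustrative sketch rather than the authors' implementation. The function names are hypothetical, and the densities are approximated with equal-width histograms on [0, 1]; the proposal does not state which density estimator was used, so the histogram (and the choice of 50 bins) is an assumption. Estimated propensity scores are taken as inputs, so any logistic regression routine could supply them.

```python
import numpy as np


def overlap_rate(e, t, bins=50):
    """Intersection area divided by union area of the two groups'
    propensity score densities (area 2 / areas 1 + 2 + 3 in Figure 1).

    ASSUMPTION: densities are approximated by histograms on [0, 1].
    """
    edges = np.linspace(0.0, 1.0, bins + 1)
    f_treat, _ = np.histogram(e[t == 1], bins=edges, density=True)
    f_ctrl, _ = np.histogram(e[t == 0], bins=edges, density=True)
    intersection = np.minimum(f_treat, f_ctrl).sum()
    union = np.maximum(f_treat, f_ctrl).sum()
    # The common bin width cancels in the ratio.
    return intersection / union


def wbo_weights(e, t):
    """Weighting by the odds, Eq. (3): 1 for treated, e/(1-e) for controls.
    Undefined when a control unit has e exactly equal to 1."""
    return t + (1 - t) * e / (1 - e)


def wbo_effect(y, t, e):
    """Difference between the weighted group means of the outcome."""
    w = wbo_weights(e, t)
    treated = t == 1
    control = t == 0
    return (np.average(y[treated], weights=w[treated])
            - np.average(y[control], weights=w[control]))


def relative_bias(estimates, true_effect):
    """Mean bias over replications as a proportion of the true effect (signed)."""
    return (np.mean(estimates) - true_effect) / true_effect
```

In a full simulation, one would compute `overlap_rate` and `wbo_effect` for each replication, group the replications into the overlap intervals used in Table 2, and then apply `relative_bias` and `np.var` to the estimates within each interval.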

References

Crump, R. K., Hotz, V. J., Imbens, G. W., & Mitnik, O. A. (2009). Dealing with limited overlap in estimation of average treatment effects. Biometrika, 96(1), 187-199.
Dehejia, R. H., & Wahba, S. (1999). Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs. Journal of the American Statistical Association, 94(448), 1053-1062.
Grzybowski, M., Clements, E. A., Parsons, L., Welch, R., Tintinalli, A. T., & Ross, M. A. (2003). Mortality benefit of immediate revascularization of acute ST-segment elevation myocardial infarction in patients with contraindications to thrombolytic therapy: A propensity analysis. Journal of the American Medical Association, 290, 1891-1898.
Hahn, J. (1998). On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica, 66, 315-331.
Heckman, J., Ichimura, H., & Todd, P. (1998). Matching as an econometric evaluation estimator. The Review of Economic Studies, 65, 261-294.
Hirano, K., Imbens, G. W., & Ridder, G. (2003). Efficient estimation of average treatment effects using the estimated propensity score. Econometrica, 71(4), 1161-1189.
Kaplan, D., & Chen, C. J. (2011). Bayesian propensity score analysis: Simulation and case study. Presentation at the annual conference of the Society for Research on Educational Effectiveness, Washington, D.C.
Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41-55.
Rosenbaum, P. R. (1989). Optimal matching in observational studies. Journal of the American Statistical Association, 84, 1024-1032.
Stuart, E. A. (2010). Matching methods for causal inference: A review and a look forward. Statistical Science, 25(1), 1-21.
Vincent, J. L., Baron, J., Reinhart, K., Gattinoni, L., Thijs, L., Webb, A., Meier-Hellmann, A., Nollet, G., & Peres-Bota, D. (2002). Anemia and blood transfusion in critically ill patients. Journal of the American Medical Association, 288, 1499-1507.