Propensity score analysis with hierarchical data

Section on Statistics in Epidemiology

Fan Li, Alan M. Zaslavsky, Mary Beth Landrum
Department of Health Care Policy, Harvard Medical School
180 Longwood Avenue, Boston, MA 02115
October 9, 2007

Abstract

Propensity score (Rosenbaum and Rubin, 1983) methods are increasingly used as a less parametric alternative to traditional regression methods in medical care and health policy research. Data collected in these disciplines are often clustered or hierarchically structured, in the sense that subjects are grouped together in one or more ways that may be relevant to the analysis. However, propensity score methods were developed and have been applied in settings with unstructured data. In this report, we present and compare several propensity-score-weighted estimators of treatment effect in the context of hierarchically structured data. For the simplest case without covariates, we show the double-robustness of those weighted estimators: when both the true underlying treatment assignment mechanism and the outcome generating mechanism are hierarchically structured, the estimator is consistent as long as the hierarchical structure is taken into account in at least one of the two steps of the propensity score procedure. This result holds for any balancing weight. We obtain the exact form of the bias when clustering is ignored in both steps. We apply these methods to study racial disparity in breast cancer screening among elders who participate in Medicare health plans.

KEY WORDS: double robustness, health policy research, hierarchical data, propensity score, racial disparity, weighting.

1. Introduction

Population-based observational studies are often the best methodology for obtaining generalizable results on access to, patterns of, and outcomes from medical care when large-scale controlled experiments are infeasible. Comparisons between groups can be biased, however, when the groups are unbalanced with respect to measured and unmeasured confounders.
Standard analytic methods adjust for observed differences between treatment groups by stratifying or matching patients on a few observed covariates, or with regression analysis in the case of many observed confounders. But if treatment groups differ greatly in observed characteristics, estimates of treatment effects from regression models rely on model extrapolations, and the resulting conclusions can be very sensitive to model mis-specification (Rubin). Propensity score methods (Rosenbaum and Rubin, 1983, 1984) have been proposed as a less parametric alternative to regression adjustment and are being increasingly used in health policy studies (Connors et al., 1996; D'Agostino, 1998, and references therein). This approach, which involves comparing subjects weighted (or stratified, or matched) according to their propensity to receive treatment (i.e., the propensity score), attempts to balance subjects in treatment groups in terms of observed characteristics, as would occur in a randomized experiment. Propensity score methods permit control of all observed confounding factors that might influence both choice of treatment and outcome using a single composite measure, without requiring specification of the relationships between the control variables and the outcome.

Propensity score methods were developed and have been applied in settings with unstructured data. However, data collected in medical care and health policy studies are typically clustered or hierarchically structured, in the sense that subjects are grouped together in one or more ways that may be relevant to the analysis. For example, subjects may be grouped by geographical area, treatment center (e.g., hospital or physicians), or, in the example we consider in this paper, health plan.
Generally, subjects are assigned to clusters by an unknown mechanism that may be associated with measured subject characteristics that we are interested in (e.g., race, age, clinical characteristics), measured subject characteristics that are not of intrinsic interest and are believed to be unrelated to outcomes except through their effects on assignment to clusters (e.g., location), and unmeasured subject characteristics (e.g., unmeasured severity of disease, aggressiveness in seeking treatment).

When subjects are hierarchically structured, a number of issues appear that are not present with an unstructured collection of subjects. First of all, standard error calculations that ignore the hierarchical structure will be inaccurate, leading to incorrect inferences. A more interesting set of issues arises because there may be both measured and unmeasured factors at the cluster level that create variation among clusters in quality of treatment and hence in outcomes. Hierarchical regression models have been developed to give a more comprehensive description than non-hierarchical models provide for such data (e.g., Gatsonis et al., 1993). However, despite the increasing popularity of propensity score analyses and the vast literature regarding regional and provider variation in medical care and health policy research (e.g., Nattinger et al., 1992; Farrow et al., 1996), to our knowledge the implications of such data structures for propensity score analyses have rarely been studied. Huang et al. (2005) applied propensity score methods to clustered health service data, but their goal was to rank the performance of multiple health service providers (clusters) instead of to estimate an overall treatment effect from data with clustered structure, which is the goal of this paper.

Specifically, we present several propensity score models analogous to many of the commonly used regression models for clustered data in Section 2; investigate the behavior of those estimators, especially the bias when clustering information is ignored in the analysis, in Section 3; and apply the methods to study racial disparities in breast cancer screening among elders in Section 4. Summaries and remarks are provided in Section 5. Our discussion concerns the case where a binary treatment is assigned at the individual level. Also, to illustrate the major point without loss of generality, we focus on data with a two-level hierarchical structure.

2. Estimators

The class of estimands considered in this paper is generally referred to as a treatment effect,

$$\Delta = E_x[E(Y \mid X, Z = 1)] - E_x[E(Y \mid X, Z = 0)], \tag{1}$$

i.e., the average difference in outcome between two treatment groups that have the same distribution of covariates. The propensity score $e$ is defined as the conditional probability of being assigned to a particular treatment $z$ given measured covariates $x$: $e(x) = P(z = 1 \mid x)$. In most observational studies, the propensity score is not known and thus needs to be estimated. Therefore propensity score analysis usually involves two steps. The first step is to estimate the propensity score, typically by a logistic regression. The second step is to estimate the treatment effect by incorporating (e.g., by weighting or matching) the estimated propensity score. Hierarchical structure leads to a range of different modeling choices in both steps. In this section, we introduce several of the most widely used models.

Before going into more detail, we make a note regarding the targeted estimand, the treatment effect defined above, which is slightly different from the causal treatment effect defined using the conventional potential outcomes framework.
The propensity score originated from and has been widely used in causal inference, but its use is certainly not restricted to studying causal effects. For instance, in many health policy studies, the major interest is to compare the difference in the average of a feature (e.g., access to care) between two groups (e.g., races, socioeconomic status), rather than to make a causal statement. Moreover, the treatment is often a non-manipulable variable, e.g., race or gender, which does not give a well-defined causal effect in the sense of Rubin (1978) (more discussion in Section 4). Nevertheless, the propensity score is still a valid and powerful tool to balance the covariate distribution between groups for studies with non-causal purposes. Therefore, we avoid the subtle issue of causality throughout the paper and note that the results obtained here are applicable to studies with more general (non-causal) purposes. For ease of description, we still refer to the estimands discussed as treatment effects even though they are not necessarily causal.

Henceforth, let $m$ denote the total number of clusters; $n_h$ the number of subjects in cluster $h$; $y_{hk}$ the outcome for subject $k$ in cluster $h$ (e.g., a clinical diagnosis); $x_{hk}$ the corresponding covariates (typically vector-valued, e.g., age, stage of detection, comorbidity scores, etc.); $v_h$ the cluster-level covariates (e.g., teaching status or measures of technical capacity of a hospital); $z_{hk}$ the treatment assignment for the subject, $z_{hk} \in \{0, 1\}$; and $e_{hk}$ the propensity score.

2.1 Step 1. Estimating the propensity score

To estimate the propensity score, several logistic regression models are available with various treatments of the hierarchical structure.

2.1.1 Marginal model

As the name suggests, marginal regression models ignore clustering information. A typical marginal propensity score model would be

$$\log\left(\frac{e_{hk}}{1 - e_{hk}}\right) = \beta^{e} x_{hk} + \kappa^{e} v_h, \tag{2}$$

where $e_{hk} = P(z_{hk} = 1 \mid x_{hk}, v_h)$. This model in fact assumes the treatment assignment mechanism is the same across all clusters.
In other words, it assumes that two subjects are exchangeable in terms of treatment propensity if they have the same vector of covariates, whether or not they come from the same cluster. This propensity score model can be thought of as a nonparametric alternative to a regression-based adjustment for individual and cluster covariates. The analogous marginal regression model would be

$$y_{hk} = \gamma z_{hk} + \beta^{y} x_{hk} + \kappa^{y} v_h + \epsilon_{hk}, \tag{3}$$

where $\epsilon_{hk} \sim N(0, \delta_\epsilon^2)$ and $\gamma$ is the treatment effect. As with model (2), estimates derived from this regression model rely on the assumption that the outcome generating mechanism is the same across all clusters. Models (2) and (3) have a manifest similarity of form. A deeper connection is that the sufficient statistics for estimating the treatment effect that are balanced under the propensity score estimator are the same ones that must be balanced under model (3).

2.1.2 Pooled within-cluster model

A pooled within-cluster model for the propensity score conditions on both the covariates and the cluster indicators,

$$\log\left(\frac{e_{hk}}{1 - e_{hk}}\right) = \delta_h^{e} + \beta^{e} x_{hk}, \tag{4}$$

where $\delta_h^{e}$ is a cluster-level main effect, $\delta_h^{e} \sim N(0, \infty)$, and $e_{hk} = P(z_{hk} = 1 \mid x_{hk}, h)$. This model implies that the treatment assignment mechanism differs among clusters, and the difference is controlled by the cluster-level main effect $\delta_h^{e}$. Model (4) involves a more general (weaker) assumption on the treatment assignment mechanism than the marginal model (2), because the cluster-level covariate $v_h$ is a function of the cluster indicator.
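The contrast between the marginal model (2) and the pooled within-cluster model (4) can be illustrated with a minimal sketch. It fits both models by Newton-Raphson on a toy data set with no covariates; the helper names (`fit_logistic`, `solve`, `fitted`) and the data are hypothetical and chosen only for illustration, and in practice a standard logistic regression routine would be used instead. Under these assumptions the marginal fit returns the overall treated proportion for everyone, while the fit with cluster indicators returns each cluster's own treated proportion.

```python
import math

def solve(A, b):
    """Solve a small dense linear system by Gaussian elimination with pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

def fit_logistic(X, z, iters=25):
    """Maximum-likelihood logistic regression via Newton-Raphson (illustrative)."""
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(iters):
        grad = [0.0] * p
        hess = [[0.0] * p for _ in range(p)]
        for row, zi in zip(X, z):
            mu = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, row))))
            for a in range(p):
                grad[a] += (zi - mu) * row[a]
                for c in range(p):
                    hess[a][c] += mu * (1.0 - mu) * row[a] * row[c]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    return beta

def fitted(X, beta):
    """Fitted propensity scores for each row of the design matrix."""
    return [1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, row)))) for row in X]

# Toy data (assumed): cluster 0 has 1 of 4 treated, cluster 1 has 4 of 6 treated.
cluster = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
z = [1, 0, 0, 0, 1, 1, 1, 1, 0, 0]

# Marginal model (2) without covariates: intercept only.
X_marg = [[1.0] for _ in z]
e_marginal = fitted(X_marg, fit_logistic(X_marg, z))

# Pooled within-cluster model (4) without covariates: one indicator per cluster.
X_pool = [[1.0 if h == c else 0.0 for c in (0, 1)] for h in cluster]
e_pooled = fitted(X_pool, fit_logistic(X_pool, z))
```

Covariates can be added as extra columns of the design matrix; the closed-form proportions then no longer apply.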

In the pooled within-cluster model (4), if we instead assume that the cluster-specific main effects $\delta_h^{e}$ follow a distribution, $\delta_h^{e} \sim N(0, \sigma_\delta^2)$, then we have a new propensity score model with random effects,

$$\log\left(\frac{e_{hk}}{1 - e_{hk}}\right) = \delta_h^{e} + \beta^{e} x_{hk} + \kappa^{e} v_h.$$

More generally, $\beta^{e}$ can be allowed to vary across clusters and follow a distribution. In practice, results from the above random effects model are usually similar to those from the pooled within-cluster model when the number of clusters is large.

A corresponding pooled within-cluster outcome model adjusting for cluster-level main effects and covariates is of the form

$$y_{hk} = \gamma z_{hk} + \delta_h^{y} + \beta^{y} x_{hk} + \epsilon_{hk}, \tag{5}$$

where $\delta_h^{y}$ is a cluster-level main effect, $\delta_h^{y} \sim N(0, \infty)$. Under this model, all information is obtained by comparisons within clusters, since the $\delta_h^{y}$ term absorbs all between-cluster information.

2.1.3 Surrogate indicator model

When there are a large number of clusters with large sample sizes, the computational task of fitting the pooled within-cluster model can become demanding for standard software. Alternatively, defining $d_h = \sum_k z_{hk} / n_h$, the cluster-specific proportion of being treated, we can consider the following propensity score model:

$$\log\left(\frac{e_{hk}}{1 - e_{hk}}\right) = \lambda \log\left(\frac{d_h}{1 - d_h}\right) + \beta^{e} x_{hk} + \kappa^{e} v_h. \tag{6}$$

In the simplest situation where there are no covariates, $e_{hk} = d_h$ for any $h, k$. Therefore, comparing models (4) and (6), the logit of $d_h$ may be expected to be a reasonable surrogate for the cluster indicator in the pooled within-cluster model, with the coefficient $\lambda$ being around 1.

Inference is the same as in the marginal model with an additional covariate $\mathrm{logit}(d_h)$. Usually the coefficients of the cluster-level covariates $\kappa^{e}$ are very small, since most of their effects have been absorbed by $\lambda$. The surrogate indicator model reduces the $m$ parameters ($\delta_h$'s) in the pooled within-cluster model to a single parameter $\lambda$, thus greatly reducing the computation required for model fitting. However, this reduction is based on the assumption that the logit of the empirical cluster-specific proportion of being treated, $\mathrm{logit}(d_h)$, is linearly correlated with the logit of the true propensity score. When the underlying truth is far from this assumption, the surrogate indicator model could perform poorly.

The goodness of fit of these models can be checked by conventional diagnostic procedures (e.g., Rosenbaum and Rubin). For example, one can check both the overall and within-cluster balance of the distribution of covariates weighted by the estimated propensity score in different groups.

2.2 Step 2. Estimating the treatment effect

Common approaches to estimating treatment effects using the propensity score involve weighting, matching and stratification. We focus on weighting in this report.

2.2.1 Marginal estimator

Similar to the marginal model in step 1, the marginal estimator ignores clustering. A specific nonparametric estimator is the difference of the weighted overall means of the outcome of the two treatment groups,

$$\hat{\Delta}_{\cdot,marg} = \frac{\sum_{h,k}^{z_{hk}=1} w_{hk} y_{hk}}{\sum_{h,k}^{z_{hk}=1} w_{hk}} - \frac{\sum_{h,k}^{z_{hk}=0} w_{hk} y_{hk}}{\sum_{h,k}^{z_{hk}=0} w_{hk}}, \tag{7}$$

where the weight $w_{hk}$ is a function of the estimated propensity score. The choice of weight is discussed in Section 2.3. Assume $y_{hk}$ is homoscedastic with $\mathrm{var}(y_{hk}) = \sigma^2$; then the large sample variance of the marginal estimator is

$$s^2_{\cdot,marg} = \mathrm{var}(\hat{\Delta}_{\cdot,marg}) = \sigma^2 \frac{\sum_{h,k}^{z_{hk}=1} w_{hk}^2}{\left(\sum_{h,k}^{z_{hk}=1} w_{hk}\right)^2} + \sigma^2 \frac{\sum_{h,k}^{z_{hk}=0} w_{hk}^2}{\left(\sum_{h,k}^{z_{hk}=0} w_{hk}\right)^2}. \tag{8}$$

In practice $\sigma^2$ can be estimated from the sample variance of $y_{hk}$.

2.2.2 Clustered estimator

A second estimator first obtains the cluster-specific weighted difference and then calculates the weighted average of these differences based on the sum of weights in each cluster. That is, for cluster $h$,

$$\hat{\Delta}_h = \frac{\sum_{k}^{z_{hk}=1} w_{hk} y_{hk}}{\sum_{k}^{z_{hk}=1} w_{hk}} - \frac{\sum_{k}^{z_{hk}=0} w_{hk} y_{hk}}{\sum_{k}^{z_{hk}=0} w_{hk}}.$$

The variance of the cluster-specific estimator $\hat{\Delta}_h$ under the assumption that the $y_{hk}$ are independent and homoscedastic within cluster $h$ is

$$s_h^2 = \mathrm{var}(\hat{\Delta}_h) = \sigma_h^2 \frac{\sum_{k}^{z_{hk}=1} w_{hk}^2}{\left(\sum_{k}^{z_{hk}=1} w_{hk}\right)^2} + \sigma_h^2 \frac{\sum_{k}^{z_{hk}=0} w_{hk}^2}{\left(\sum_{k}^{z_{hk}=0} w_{hk}\right)^2}.$$

Similarly, $\sigma_h^2$ can be estimated from its empirical counterpart within each cluster. Let $w_h$ be a function of the weights in cluster $h$, e.g., the sum of weights $w_h = \sum_k w_{hk}$, or the precision of the estimator $\hat{\Delta}_h$, $w_h = 1/s_h^2$.
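The overall clustered estimator is then an average of the $\hat{\Delta}_h$'s weighted by $w_h$,

$$\hat{\Delta}_{\cdot,clu} = \frac{\sum_h w_h \hat{\Delta}_h}{\sum_h w_h}, \tag{9}$$

and the overall variance is

$$s^2_{\cdot,clu} = \mathrm{var}(\hat{\Delta}_{\cdot,clu}) = \frac{\sum_h \left(\sum_k w_{hk}\right)^2 s_h^2}{\left(\sum_{h,k} w_{hk}\right)^2}. \tag{10}$$

Standard errors of the estimators, $s_{\cdot,marg}$ and $s_{\cdot,clu}$, can also be obtained from resampling methods such as the bootstrap.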
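The two nonparametric estimators (7) and (9) can be sketched in a few lines. The function names and the toy data below are hypothetical, the weights are set to 1 purely to keep the arithmetic visible, and the clustered version assumes every cluster contains both treated and control subjects and uses $w_h = \sum_k w_{hk}$.

```python
def weighted_mean(values, weights):
    """Weighted average of values."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)

def marginal_estimator(y, z, w):
    """Equation (7): difference of overall weighted means, ignoring clusters."""
    treated = [(yi, wi) for yi, zi, wi in zip(y, z, w) if zi == 1]
    control = [(yi, wi) for yi, zi, wi in zip(y, z, w) if zi == 0]
    return (weighted_mean([t[0] for t in treated], [t[1] for t in treated])
            - weighted_mean([c[0] for c in control], [c[1] for c in control]))

def clustered_estimator(y, z, w, cluster):
    """Equation (9) with w_h taken as the sum of weights in cluster h."""
    num = den = 0.0
    for h in sorted(set(cluster)):
        idx = [i for i, c in enumerate(cluster) if c == h]
        # Cluster-specific weighted difference (requires both groups present).
        delta_h = marginal_estimator([y[i] for i in idx],
                                     [z[i] for i in idx],
                                     [w[i] for i in idx])
        w_h = sum(w[i] for i in idx)
        num += w_h * delta_h
        den += w_h
    return num / den

# Toy data (assumed): two clusters of four subjects, unit weights.
cluster = [0, 0, 0, 0, 1, 1, 1, 1]
z = [1, 1, 0, 0, 1, 0, 0, 0]
y = [3.0, 5.0, 1.0, 1.0, 10.0, 4.0, 6.0, 8.0]
w = [1.0] * 8

delta_marg = marginal_estimator(y, z, w)           # 6.0 - 4.0
delta_clu = clustered_estimator(y, z, w, cluster)  # average of 3.0 and 4.0
```

The two estimates differ here because the treated proportion and the outcome level both vary by cluster, which is the situation the clustered estimator is meant to handle.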

2.2.3 Doubly-robust estimators

The weighted mean can be regarded as a weighted regression without covariates. Therefore, in step 2 we can replace the nonparametric weighted mean (7) or (9) by a parametric regression (e.g., model (3) or (5)) weighted by the estimated propensity score. The coefficient of the treatment assignment, $\gamma$, is then the targeted estimand of treatment effect. This is essentially the class of doubly-robust estimators proposed by Scharfstein et al. (1999). Doubly-robust estimators allow flexible model choices in both steps, which can be very beneficial in applications. These estimators are termed doubly-robust in the sense that they are proven to be consistent if one, but not necessarily both, of the step 1 and step 2 models is correctly specified under the Horvitz-Thompson weight (see below). Detailed discussion of this property with hierarchical data is presented in the next section.

2.3 Choice of weights

We now consider the choice of weights. We call the class of weights that balance the distribution of covariates between treatment groups balancing weights. The most widely used balancing weight is the Horvitz-Thompson (inverse probability) weight,

$$w_{hk} = \begin{cases} \dfrac{1}{e_{hk}}, & \text{for } z_{hk} = 1, \\[2mm] \dfrac{1}{1 - e_{hk}}, & \text{for } z_{hk} = 0. \end{cases}$$

The H-T weight is a balancing weight because $E\left[\frac{XZ}{e(X)}\right] = E\left[\frac{X(1 - Z)}{1 - e(X)}\right]$. The H-T estimator compares the expected outcome of the subjects placed in $z = 0$ versus that of the subjects placed in $z = 1$, averaging over the distribution of covariates in the combined population. That is,

$$E\left[\frac{YZ}{e(X)} - \frac{Y(1 - Z)}{1 - e(X)}\right] = E[(Y \mid Z = 1) - (Y \mid Z = 0)].$$

In fact, the doubly-robust estimators in Scharfstein et al. (1999) are restricted to using the H-T weight because of this clear causal interpretation. However, the H-T estimator is well known to have excessively large variance when there are subjects with extremely small propensity scores. Nevertheless, the same idea is readily extended to any balancing weight, although alternative weights might define different estimands. For example, we can consider the population-overlap weight,

$$w_{hk} = \begin{cases} 1 - e_{hk}, & \text{for } z_{hk} = 1, \\ e_{hk}, & \text{for } z_{hk} = 0, \end{cases}$$
where each subject is weighted by the probability of being assigned to the other treatment group. It is also a balancing weight because $E[XZ\{1 - e(X)\}] = E[X(1 - Z)e(X)]$. In theory, the population-overlap weight gives the smallest variance under a homoscedastic model for $Y$ given $X$. But it defines a different estimand than the Horvitz-Thompson weight. Specifically, we call this the population-overlap weight because it results in an average treatment effect that is averaged over the distribution of covariates in the population where the two treatment groups overlap:

$$E[YZ\{1 - e(X)\} - Y(1 - Z)e(X)] = E[\{(Y \mid Z = 1) - (Y \mid Z = 0)\}\, e(X)\{1 - e(X)\}].$$

This population-overlap estimator can be calculated with acceptable variance when the H-T estimator cannot be practically estimated because $e(X)$ approaches 0 or 1 for some part of the $x$ space, so that $\frac{1}{e(X)}$ or $\frac{1}{1 - e(X)}$ would become extremely large. In effect, the H-T estimator attempts to estimate a treatment effect for types of cases that are essentially unrepresented in one or the other group, while population-overlap weighting focuses on the types of cases with a more balanced distribution of treatment. In addition to its statistical advantage, the latter analysis may be more scientifically relevant, since it focuses attention on the comparison of outcomes among the kinds of cases in which both treatments are currently observed, for example those in clinical equipoise between treatments.

3. Bias of Estimators

In this section, we investigate the bias of each of the estimators proposed in the previous section. We first look at the simplest case, with a two-level hierarchical structure and no covariates. Let $n_{h1}$ ($n_{h0}$) denote the number of subjects with $z = 1$ ($z = 0$) in cluster $h$; and let $n_{+1} = \sum_h n_{h1}$, $n_{+0} = \sum_h n_{h0}$, $n = n_{+1} + n_{+0}$.
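Before the bias derivation, the balancing property of the two weights from Section 2.3 can be checked numerically. The following minimal sketch assumes a single binary covariate and a propensity score estimated as the treated proportion within each covariate level; the function names and the data are hypothetical. Under these assumptions both weights equalize the weighted covariate mean across the two groups, although the unweighted means differ.

```python
def ht_weight(e, z):
    """Horvitz-Thompson weight: 1/e for treated, 1/(1-e) for controls."""
    return 1.0 / e if z == 1 else 1.0 / (1.0 - e)

def overlap_weight(e, z):
    """Population-overlap weight: probability of the opposite assignment."""
    return 1.0 - e if z == 1 else e

def weighted_group_mean(x, z, w, group):
    """Weighted mean of x among subjects with z == group."""
    num = sum(wi * xi for xi, zi, wi in zip(x, z, w) if zi == group)
    den = sum(wi for zi, wi in zip(z, w) if zi == group)
    return num / den

# Toy data (assumed): 1 of 4 treated when x = 0, 3 of 4 treated when x = 1.
x = [0, 0, 0, 0, 1, 1, 1, 1]
z = [1, 0, 0, 0, 1, 1, 1, 0]

# Propensity score estimated as the treated proportion at each level of x.
e_by_x = {lvl: sum(zi for xi, zi in zip(x, z) if xi == lvl)
               / sum(1 for xi in x if xi == lvl) for lvl in (0, 1)}
e = [e_by_x[xi] for xi in x]

unit = [1.0] * len(x)
w_ht = [ht_weight(ei, zi) for ei, zi in zip(e, z)]
w_ov = [overlap_weight(ei, zi) for ei, zi in zip(e, z)]

# Covariate imbalance (treated mean minus control mean) under each weighting.
raw_gap = weighted_group_mean(x, z, unit, 1) - weighted_group_mean(x, z, unit, 0)
ht_gap = weighted_group_mean(x, z, w_ht, 1) - weighted_group_mean(x, z, w_ht, 0)
ov_gap = weighted_group_mean(x, z, w_ov, 1) - weighted_group_mean(x, z, w_ov, 0)
```

In this example the H-T weights reach 4, while the overlap weights never exceed 1, which hints at why the latter tend to be more stable when some propensity scores are extreme.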
Assume the outcome generating mechanism for a continuous outcome follows a random effects model with cluster-level random intercepts and random treatment effects,

$$y_{hk} = \delta_h + \gamma_h z_{hk} + \alpha d_h + \epsilon_{hk}, \tag{11}$$

where $\delta_h \sim N(0, \sigma_\delta^2)$, $\epsilon_{hk} \sim N(0, \sigma_\epsilon^2)$, $\alpha$ is the effect of the cluster-specific proportion of being treated $d_h$ on the outcome, and the true treatment effect is $\gamma_h$ with $\gamma_h \sim N(\gamma_0, \sigma_\gamma^2)$.

We first look at the situation where clustering information is ignored in both steps. For the marginal model in step 1, it is easy to show that the estimated propensity score is the same for each subject, $\hat{e}_{hk} = \frac{n_{+1}}{n}$. Consequently, the marginal estimator is

$$\hat{\Delta}_{marg,marg} = \frac{\sum_{h,k}^{z_{hk}=1} y_{hk}}{n_{+1}} - \frac{\sum_{h,k}^{z_{hk}=0} y_{hk}}{n_{+0}}$$
$$= \sum_h \frac{n_{h1}}{n_{+1}} \gamma_h + \sum_h \left(\frac{n_{h1}}{n_{+1}} - \frac{n_{h0}}{n_{+0}}\right) \delta_h + \left(\frac{\sum_{h,k}^{z_{hk}=1} \epsilon_{hk}}{n_{+1}} - \frac{\sum_{h,k}^{z_{hk}=0} \epsilon_{hk}}{n_{+0}}\right) + \alpha\, \frac{\frac{n_{+1} n_{+0}}{n} - \sum_h n_h d_h (1 - d_h)}{\frac{n_{+1} n_{+0}}{n}}.$$

Assume the common regularity conditions on the cluster sizes hold (bounding $n_{h1}/n_{+1}$ and $n_{h0}/n_{+0}$ as $n \to \infty$). Then, by the weak law of large numbers for the weighted sum of independent and identically distributed random variables (e.g., Chow and Lai, 1973), $\sum_h \frac{n_{h1}}{n_{+1}} \gamma_h$ converges to $\gamma_0$ as the number of clusters goes to infinity, and $\sum_h \left(\frac{n_{h1}}{n_{+1}} - \frac{n_{h0}}{n_{+0}}\right) \delta_h$ goes to 0, as does the third term in the above formula. In the fourth term, $\frac{n_{+1} n_{+0}}{n}$ is in fact the variance of the total number of treated subjects, $\mathrm{var}(n_{+1})$, if all clusters are exchangeable, i.e., if all subjects regardless of cluster follow the same treatment assignment mechanism, $z \sim \mathrm{Bernoulli}\left(\frac{n_{+1}}{n}\right)$. Furthermore, $\sum_h n_h d_h (1 - d_h)$ is the sum of the variances of the number of treated subjects within each cluster, $\sum_h \mathrm{var}(n_{h1})$, if each cluster separately follows a treatment assignment mechanism, $z_h \sim \mathrm{Bernoulli}\left(\frac{n_{h1}}{n_h}\right)$. Therefore, the bias of the marginal estimator with the propensity score estimated from the marginal model is

$$\mathrm{Bias}(\hat{\Delta}_{marg,marg}) = \alpha \left[\frac{\mathrm{var}(n_{+1}) - \sum_h \mathrm{var}(n_{h1})}{\mathrm{var}(n_{+1})}\right]. \tag{12}$$

The size of the bias is controlled by two factors: (1) the ratio of the variance of the total number of treated subjects under a homogeneous versus a cluster-heterogeneous treatment assignment mechanism; and (2) the effect that the cluster-specific proportion of being treated $d_h$ has on the response, i.e., $\alpha$. This is intuitive because the first factor measures the variation in the treatment assignment mechanism among clusters and the second measures the variation in the outcome generating mechanism, both of which are ignored in the analysis with marginal models in both steps. When either, but not necessarily both, of the two mechanisms is homogeneous across clusters, the marginal estimator $\hat{\Delta}_{marg,marg}$ is also consistent. In reality, however, it is most likely that both mechanisms are heterogeneous among clusters.

We now look at the opposite situation, where clustering information is taken into account in both steps. For the pooled within-cluster model in step 1, it is easy to show that the estimated propensity score is $\hat{e}_{hk} = \frac{n_{h1}}{n_h}$.
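Then the clustered weighted estimator is

$$\hat{\Delta}_{pool,clu} = \frac{1}{m} \sum_h \left(\frac{\sum_{k}^{z_{hk}=1} y_{hk}}{n_{h1}} - \frac{\sum_{k}^{z_{hk}=0} y_{hk}}{n_{h0}}\right) = \frac{\sum_h \gamma_h}{m} + \frac{1}{m} \sum_h \left(\frac{\sum_{k}^{z_{hk}=1} \epsilon_{hk}}{n_{h1}} - \frac{\sum_{k}^{z_{hk}=0} \epsilon_{hk}}{n_{h0}}\right) \xrightarrow{\; n_h, m \to \infty \;} \gamma_0, \tag{13}$$

which is asymptotically unbiased. The result is free of the form of the weight. Simple calculation shows that the clustered weighted estimator combined with the marginal model in step 1, $\hat{\Delta}_{marg,clu}$, is of exactly the same form as (13) and thus also unbiased. Furthermore, the marginal estimator with the propensity score estimated from the pooled within-cluster model, $\hat{\Delta}_{pool,marg}$, follows the same form as (13), but only under the H-T weight and a balanced design (i.e., each cluster has the same number of subjects). Under the H-T weight but an unbalanced design, the estimator is also consistent (assuming $\frac{\sum_h n_h \gamma_h}{n} \to \gamma_0$ as $n_h, m \to \infty$). However, the same estimator under the population-overlap weight is

$$\hat{\Delta}_{pool,marg} = \frac{\sum_h \frac{n_{h0}}{n_h} \sum_{k}^{z_{hk}=1} y_{hk}}{\sum_h \frac{n_{h1} n_{h0}}{n_h}} - \frac{\sum_h \frac{n_{h1}}{n_h} \sum_{k}^{z_{hk}=0} y_{hk}}{\sum_h \frac{n_{h1} n_{h0}}{n_h}} = \frac{\sum_h \frac{n_{h1} n_{h0}}{n_h} \left(\gamma_h + \frac{\sum_{k}^{z_{hk}=1} \epsilon_{hk}}{n_{h1}} - \frac{\sum_{k}^{z_{hk}=0} \epsilon_{hk}}{n_{h0}}\right)}{\sum_h \frac{n_{h1} n_{h0}}{n_h}}.$$

Even though this estimator is also asymptotically unbiased, its small sample behavior can be quite different from that of the estimator under the H-T weight.

Under the homoscedasticity assumption for the outcome, the three H-T estimators that take clustering into account in at least one step, $\hat{\Delta}_{pool,marg}$, $\hat{\Delta}_{marg,clu}$, and $\hat{\Delta}_{pool,clu}$, have the same variance,

$$s^2 = \frac{\sigma_\epsilon^2}{n^2} \sum_h n_h^2 \left(\frac{1}{n_{h1}} + \frac{1}{n_{h0}}\right).$$

As with the discussion of bias, this result is generally not applicable to other types of weights. Specifically, the variance of $\hat{\Delta}_{pool,marg}$ is usually larger than that of $\hat{\Delta}_{marg,clu}$ and $\hat{\Delta}_{pool,clu}$.

When there are no covariates, the surrogate indicator model gives the same estimated propensity score as the pooled within-cluster model. Thus the results obtained above for the pooled within-cluster model automatically hold for the surrogate indicator model. But this is not the case in the general situation with covariates. The proofs are analogous for data with a higher order of hierarchical levels.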
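The results of this section can be checked on a small noise-free example. The sketch below generates outcomes from model (11) with $\delta_h = 0$, $\epsilon_{hk} = 0$ and a constant $\gamma_h = \gamma$, so that any deviation from $\gamma$ is exactly the bias term in (12). The cluster sizes, parameter values and function names are illustrative assumptions; the H-T weight is used throughout.

```python
def diff_of_weighted_means(y, z, w):
    """Weighted mean outcome of the treated minus that of the controls."""
    t_num = sum(wi * yi for yi, zi, wi in zip(y, z, w) if zi == 1)
    t_den = sum(wi for zi, wi in zip(z, w) if zi == 1)
    c_num = sum(wi * yi for yi, zi, wi in zip(y, z, w) if zi == 0)
    c_den = sum(wi for zi, wi in zip(z, w) if zi == 0)
    return t_num / t_den - c_num / c_den

# Assumed design: (n_h, n_h1) per cluster, so d_h = 0.25 and 0.75.
clusters = [(4, 1), (4, 3)]
gamma, alpha = 2.0, 4.0

y, z, h, d = [], [], [], []
for idx, (n_h, n_h1) in enumerate(clusters):
    d_h = n_h1 / n_h
    for k in range(n_h):
        z_hk = 1 if k < n_h1 else 0
        z.append(z_hk)
        h.append(idx)
        d.append(d_h)
        y.append(gamma * z_hk + alpha * d_h)  # model (11) without noise

n = len(z)
n1 = sum(z)
n0 = n - n1

# Marginal model in step 1: the same fitted score n_{+1}/n for everyone.
e_marg = n1 / n
w_marg = [1.0 / e_marg if zi == 1 else 1.0 / (1.0 - e_marg) for zi in z]
delta_marg_marg = diff_of_weighted_means(y, z, w_marg)

# Bias predicted by equation (12).
var_total = n1 * n0 / n
var_within = sum(n_h * (n_h1 / n_h) * (1.0 - n_h1 / n_h) for n_h, n_h1 in clusters)
bias_formula = alpha * (var_total - var_within) / var_total

# Pooled within-cluster model in step 1 (fitted score d_h), marginal estimator.
w_pool = [1.0 / di if zi == 1 else 1.0 / (1.0 - di) for zi, di in zip(z, d)]
delta_pool_marg = diff_of_weighted_means(y, z, w_pool)

# Marginal model in step 1, clustered estimator (9) with w_h = sum of weights.
num = den = 0.0
for idx in range(len(clusters)):
    rows = [i for i in range(n) if h[i] == idx]
    delta_h = diff_of_weighted_means([y[i] for i in rows], [z[i] for i in rows],
                                     [w_marg[i] for i in rows])
    w_h = sum(w_marg[i] for i in rows)
    num += w_h * delta_h
    den += w_h
delta_marg_clu = num / den
```

With these inputs the estimator that ignores clustering in both steps is off by exactly the amount given by (12), while accounting for clustering in either step recovers $\gamma$.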
For the simplest case without covariates, we have shown above the double-robustness of those propensity score estimators: when both the true underlying treatment assignment mechanism and the outcome generating mechanism are hierarchically structured, the estimator using a balancing weight is consistent as long as the hierarchical structure is taken into account in at least one of the two steps of the propensity score procedure. This can be viewed as both a special case and an extension of the double-robustness property of the estimator in Scharfstein et al. (1999). The extension lies in the fact that our conclusion is free of the form of the weight.

In the more general cases with covariates, there is usually no closed-form solution to the logistic models for estimating the propensity score. Consequently, there is no closed form for the bias of those estimators as above. Nevertheless, this situation can be explored either by large-scale simulations or by adopting a probit (instead of logistic) link for estimating the propensity score. Intuitively, the double-robustness property still holds. But the bias of the marginal estimator $\hat{\Delta}_{marg,marg}$ is expected to also be affected by the size of the true treatment effect $\gamma$ (negatively correlated) and by the ratio of between-cluster to within-cluster variance, $g = \frac{\sigma_\delta^2}{\sigma_\epsilon^2}$ (positively correlated), in addition to $\alpha$ and $\frac{\mathrm{var}(n_{+1}) - \sum_h \mathrm{var}(n_{h1})}{\mathrm{var}(n_{+1})}$ in (12). A comprehensive discussion is beyond the scope of this report and is a subject for further research.

4. Application

We now apply the above methods to study racial disparity in health services. Disparity refers to racial differences in care attributed to the operation of the health care system. Our application concerns the HEDIS® measures of health care provided in Medicare health plans. Each of these measures is an estimate of the rate at which a guideline-recommended clinical service is provided within the appropriate population. We obtained individual-level data from the Centers for Medicare and Medicaid Services (CMS) on breast cancer screening of women in Medicare managed care health plans (Schneider et al., 2002). Our main interest is the disparity between whites and blacks, so we exclude subjects of other races, for whom racial identification is unreliable in this dataset. We focus on plans with at least 5 whites and 5 blacks, leaving 64 plans with a total sample size of [...]. For practical reasons, we drew a random subsample of size 3000 from each of the three large plans with more than 3000 subjects, leaving a total sample size of [...].

All the covariates considered in the analysis are binary. The individual-level covariates $x_{hk}$ include two indicators of age category (70-80, >80, with the reference group being 60-70); eligibility for Medicaid (1 = yes); and a neighborhood status indicator (1 = poor). The plan-level covariates $v_h$ include nine geographical code indicators; non/for-profit status (1 = for-profit); and the practice model of providers (1 = staff-group model; 0 = network-independent practice model). The outcome $y$ is a binary variable equal to 1 if the enrollee underwent breast cancer screening and equal to 0 otherwise, and the treatment $z$ here is race (1 = black, 0 = white).
We want to estimate the difference in the proportion undergoing breast cancer screening between whites and blacks. As mentioned before, race is not a valid treatment in the conventional sense of causal inference, because it is not manipulable (Holland, 1986). However, in this particular application, our goal is not to study the causal pathway between race and health service utilization, but simply to estimate the magnitude of disparity under balanced distributions of covariates between the two races. Hence, the propensity score in this application is merely an analytical tool to achieve this goal, and it should not be taken as having the explicit meaning of the probability of being black.

We first estimate the propensity score using the three models introduced in Section 2.1 with all the above covariates included. Details of the fitted models are omitted here, since the focus is the fitted values (the estimated propensity scores). All models suggest that living in a poor neighborhood, being eligible for Medicaid, and enrollment in a for-profit insurance plan are significantly associated with black race. Figures 1 and 2 show histograms of the estimated propensity scores for whites and blacks. Different models clearly give quite different estimates of the propensity score in these data, with the marginal model departing most from the other two models. The variance of the estimated propensity scores of blacks is much bigger than that of whites, regardless of the model.

We checked the weighted distributions of covariates. Each model leads to good balance of the overall weighted covariate distributions between groups. However, the marginal model in general does poorly in balancing covariates between races within each cluster, while the surrogate indicator model does better, and the pooled within-cluster model does best.
This suggests that there is important between-cluster variation.

[Figure 1: Histograms of propensity scores estimated from different models for whites. Panels: marginal model; pooled within-cluster model; surrogate indicator model.]

[Figure 2: Histograms of propensity scores estimated from different models for blacks. Panels: marginal model; pooled within-cluster model; surrogate indicator model.]

Using the estimated propensity scores, we estimate racial disparity in breast cancer screening among the elderly women participating in Medicare health plans by the estimators proposed in Section 2.2. Although the outcome is binary in this case, the probabilities of the outcome are in a range where the linear probability model is an acceptable fit. Hence, for the doubly-robust estimators, we adopt the combinations of the three propensity score models (2), (4) and (6) in step 1 and the two outcome models (3) and (5) in step 2. Table 1 shows the estimates using the H-T weight. Each row represents one step 1 model, and each column represents one type of step 2 model/estimator. Analogous results using the population-overlap weight are given in Table 2.

Table 1: Difference in the proportion getting breast cancer screening between blacks and whites using the Horvitz-Thompson weight

                        weighted                  doubly-robust
                        marginal    clustered     marginal    pooled within
marginal                … (0.008)   … (0.008)     … (0.004)   … (0.004)
pooled within           … (0.009)   … (0.008)     … (0.004)   … (0.004)
surrogate indicator     … (0.009)   … (0.008)     … (0.004)   … (0.004)

Table 2: Difference in the proportion getting breast cancer screening between blacks and whites using the population-overlap weight

                        weighted                  doubly-robust
                        marginal    clustered     marginal    pooled within
marginal                … (0.007)   … (0.008)     … (0.004)   … (0.004)
pooled within           … (0.007)   … (0.008)     … (0.004)   … (0.004)
surrogate indicator     … (0.007)   … (0.008)     … (0.004)   … (0.004)

All models show that the proportion receiving breast cancer screening is significantly lower among blacks than among whites with similar characteristics. The estimates are similar except for the analyses that ignore clustering in both steps, which overestimate the treatment effect. This pattern matches the double-robustness property. Results from the surrogate indicator model in step 1 are slightly different from the others, suggesting that the cluster-specific proportion of being treated $d_h$ is correlated with certain covariates. The doubly-robust estimates have smaller standard errors because the extra variation is explained by covariates in step 2. Not surprisingly, the estimates using the H-T weight have much larger variances than those using the population-overlap weight. We also notice that the estimates incorporating clustering in step 2 have less variation than those doing so in step 1. This observation suggests that, in application, modeling the hierarchical structure of the outcome generating mechanism leads to more stable estimates, even though in theory correct model specification in either step is equivalent in terms of its effect on consistency. A possible explanation is that the impact of misspecifying the propensity score is attenuated through weighting, because the ultimate estimand is a function of the outcome rather than of the propensity score.

Even though we do not know the underlying truth, the similarity of the various estimators suggests that our analyses capture the main information regarding disparity in these data. That is, among the elders who participate in Medicare health plans, blacks on average have a significantly lower chance of receiving breast cancer screening than whites, after adjusting for age, geographical region, socioeconomic status and health plan characteristics.

5. Summary and Remarks

Since they were first proposed twenty-five years ago, propensity score methods have gained increasing popularity in observational studies in multiple disciplines. One example is health care policy research, where data with hierarchical structure are nowadays the rule rather than the exception. However, despite the wide appreciation of the propensity score among both statisticians and health policy researchers, there is very limited literature regarding the methodological issues of propensity score methods in the context of hierarchical data, which motivates our exploration in this paper. Specifically, we present three typical models for estimating the propensity score and two types of nonparametric estimators of treatment effect, weighted by the estimated propensity score, for hierarchically structured data. Furthermore, for the simplest (conceptual) case without covariates, we show the double-robustness of those weighted estimators: when both the true underlying treatment assignment mechanism and the outcome generating mechanism are hierarchically structured, the estimator is consistent as long as the hierarchical structure is taken into account in at least one of the two steps of the propensity score procedure. We also quantify the bias of the estimator when clustering is ignored in both steps.
We have focused in this paper on the case of treatment being assigned at the individual level. Treatment assigned at the cluster level (e.g., hospital, health care provider) is also common in medical care and health policy studies, where several new challenging issues can arise. First, the number of clusters is often relatively small despite a large total sample size. This could lead to poorly estimated propensity scores with excessively large standard errors. Second, the cluster-level propensity score only balances the cluster-level covariates and the average individual-level covariates. What are the consequences of the possible imbalance in the overall distributions of individual-level covariates? This also has a strong connection to the ecological inference commonly encountered in political science (e.g., King, 1997), where the estimand has an interpretation as an average effect on individual outcomes. Third, none of the nonparametric weighted estimators discussed in this paper makes use of the individual-level covariates, which often contain crucial information. The doubly-robust estimators with a flexible choice of regression model in the second step appear to be preferable in this case, but which specific regression model to choose depends greatly on the specific data. Fourth, and most interestingly, the foundational stable-unit-treatment-value assumption (SUTVA), that the observation on one unit should be unaffected by the particular assignment of treatments to the other units (Cox 1958, §2.4), often no longer holds under clustered treatment assignment, especially in studies with, for instance, behavioral outcomes and infectious disease. In that case, correct modeling of the interference among subjects is crucial for valid analysis.

These issues are among a range of open questions that remain to be explored on this topic. Further systematic research efforts are needed to shed light on the methodological issues and to provide guidelines for practical applications.

REFERENCES

Connors, A., Speroff, T., Dawson, N., et al. (1996). The effectiveness of right heart catheterization in the initial care of critically ill patients. Journal of the American Medical Association 276.

Cox, C.P. (1958). The Analysis of Latin Square Designs with Individual Curvatures in one Direction. Journal of the Royal Statistical Society, Series B 20(1).

Chow, Y.S. and Lai, T.L. (1973). Limiting behavior of weighted sums of independent random variables. The Annals of Probability 1(5).

D'Agostino, R. (1998). Tutorial in biostatistics: propensity score methods for bias reduction in the comparisons of a treatment to a non-randomized control. Statistics in Medicine 17.

Farrow, D., Samet, J. and Hunt, W. (1996). Regional variation in survival following the diagnosis of cancer. Journal of Clinical Epidemiology 49.

Gatsonis, C., Normand, S., Liu, C. and Morris, C. (1993). Geographic variation of procedure utilization: a hierarchical model approach. Medical Care 31, YS54-YS59.

King, G. (1997). A Solution to the Ecological Inference Problem: Reconstructing Individual Behavior from Aggregate Data. Princeton University Press.

Holland, P.W. (1986). Statistics and causal inference (with discussion). Journal of the American Statistical Association 81.

Huang, I.C., Frangakis, C.E., Dominici, F., Diette, G. and Wu, A.W. (2005). Application of a propensity score approach for risk adjustment in profiling multiple physician groups on asthma care. Health Services Research 40.

Nattinger, A., Gottlieb, M., Veum, J., et al. (1992). Geographic variation in the use of breast-conserving treatment for breast cancer. New England Journal of Medicine 326.

Rosenbaum, P.R. and Rubin, D.B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika 70(1).

Rosenbaum, P.R. and Rubin, D.B. (1984). Reducing bias in observational studies using subclassification on the propensity score. Journal of the American Statistical Association 79.

Rubin, D.B. (1978). Bayesian inference for causal effects: the role of randomization. Annals of Statistics 6.

Rubin, D.B. (1979). Using multivariate matched sampling and regression adjustment to control bias in observational studies. Journal of the American Statistical Association 74.

Scharfstein, D.O., Rotnitzky, A. and Robins, J.M. (1999). Adjusting for nonignorable drop-out using semiparametric nonresponse models (with discussion). Journal of the American Statistical Association 94.

Schneider, E.C., Zaslavsky, A.M. and Epstein, A.M. (2002). Racial disparities in the quality of care for enrollees in Medicare managed care. Journal of the American Medical Association 287(10).
