Propensity score analysis with hierarchical data


Section on Statistics in Epidemiology

Fan Li, Alan M. Zaslavsky, Mary Beth Landrum
Department of Health Care Policy, Harvard Medical School
180 Longwood Avenue, Boston, MA 02115
October 9, 2007

Abstract

Propensity score (Rosenbaum and Rubin, 1983) methods are being increasingly used as a less parametric alternative to traditional regression methods in medical care and health policy research. Data collected in these disciplines are often clustered or hierarchically structured, in the sense that subjects are grouped together in one or more ways that may be relevant to the analysis. However, propensity score methods were developed and have been applied in settings with unstructured data. In this report, we present and compare several propensity-score-weighted estimators of treatment effect in the context of hierarchically structured data. For the simplest case without covariates, we show the double robustness of those weighted estimators: when both the true underlying treatment assignment mechanism and the outcome generating mechanism are hierarchically structured, the estimator is consistent as long as the hierarchical structure is taken into account in at least one of the two steps of the propensity score procedure. This result holds for any balancing weight. We obtain the exact form of the bias when clustering is ignored in both steps. We apply those methods to study racial disparity in breast cancer screening among elders who participate in Medicare health plans.

KEY WORDS: double robustness, health policy research, hierarchical data, propensity score, racial disparity, weighting.

1. Introduction

Population-based observational studies often are the best methodology for obtaining generalizable results on access to, patterns of, and outcomes from medical care when large-scale controlled experiments are infeasible. Comparisons between groups can be biased, however, when the groups are unbalanced with respect to measured and unmeasured confounders.
Standard analytic methods adjust for observed differences between treatment groups by stratifying or matching patients on a few observed covariates, or with regression analysis in the case of many observed confounders. But if treatment groups differ greatly in observed characteristics, estimates of treatment effects from regression models rely on model extrapolations, and the resulting conclusions can be very sensitive to model misspecification (Rubin, 1979). Propensity score methods (Rosenbaum and Rubin, 1983, 1984) have been proposed as a less parametric alternative to regression adjustment and are being increasingly used in health policy studies (Connors et al., 1996; D'Agostino, 1998, and references therein). This approach, which involves comparing subjects weighted (or stratified, or matched) according to their propensity to receive treatment (i.e., the propensity score), attempts to balance subjects in treatment groups in terms of observed characteristics, as would occur in a randomized experiment. Propensity score methods permit control of all observed confounding factors that might influence both choice of treatment and outcome using a single composite measure, without requiring specification of the relationships between the control variables and the outcome.

Propensity score methods were developed and have been applied in settings with unstructured data. However, data collected in medical care and health policy studies are typically clustered or hierarchically structured, in the sense that subjects are grouped together in one or more ways that may be relevant to the analysis. For example, subjects may be grouped by geographical area, treatment center (e.g., hospital or physician), or, in the example we consider in this paper, health plan.
Generally, subjects are assigned to clusters by an unknown mechanism that may be associated with measured subject characteristics that we are interested in (e.g., race, age, clinical characteristics), measured subject characteristics that are not of intrinsic interest and are believed to be unrelated to outcomes except through their effects on assignment to clusters (e.g., location), and unmeasured subject characteristics (e.g., unmeasured severity of disease, aggressiveness in seeking treatment).

When subjects are hierarchically structured, a number of issues appear that are not present with an unstructured collection of subjects. First of all, standard error calculations that ignore the hierarchical structure will be inaccurate, leading to incorrect inferences. A more interesting set of issues arises because there may be both measured and unmeasured factors at the cluster level that create variation among clusters in quality of treatment and hence in outcomes. Hierarchical regression models have been developed to give a more comprehensive description than non-hierarchical models provide for such data (e.g., Gatsonis et al., 1993). Despite the increasing popularity of propensity score analyses and the vast literature regarding regional and provider variation in medical care and health policy research (e.g., Nattinger et al., 1992; Farrow et al., 1996), to our knowledge the implications of such data structures for propensity score analyses have rarely been studied. Huang et al. (2005) applied propensity score methods to clustered health service data, but their goal was to rank the performance of multiple health service providers (clusters) instead of estimating an overall treatment effect from data with clustered structure, which is the goal of this paper.

Specifically, we present several propensity score models analogous to many of the commonly used regression models for clustered data in Section 2; investigate the behavior of those estimators, especially the bias when clustering information is ignored in the analysis, in Section 3; and apply the methods to study racial disparities in breast cancer screening among elders in Section 4. Summaries and remarks are provided in Section 5. Our discussion concerns the case where a binary treatment is assigned at the individual level. Also, to illustrate the major point without loss of generality, we focus on data with a two-level hierarchical structure.

2. Estimators

The class of estimands considered in this paper is generally referred to as a treatment effect,

$$\Delta = E_x[E(Y \mid X, Z = 1)] - E_x[E(Y \mid X, Z = 0)], \qquad (1)$$

i.e., the average difference in outcome between two treatment groups that have the same distribution of covariates. The propensity score $e$ is defined as the conditional probability of being assigned to a particular treatment $z$ given measured covariates $x$: $e(x) = P(z = 1 \mid x)$. In most observational studies, the propensity score is not known and thus needs to be estimated. Therefore propensity score analysis usually involves two steps. The first step is to estimate the propensity score, typically by a logistic regression. The second step is to estimate the treatment effect by incorporating (e.g., by weighting or matching) the estimated propensity score. Hierarchical structure leads to a range of different modeling choices in both steps. In this section, we introduce several of the most widely used models.

Before going into more detail, we make a note regarding the targeted estimand "treatment effect" defined above, which is slightly different from the causal treatment effect defined using the conventional potential outcomes framework.
The propensity score originated from and has been widely used in causal inference, but its use is certainly not restricted to studying causal effects. For instance, in many health policy studies, the major interest is to compare the difference in the average of a feature (e.g., access to care) between two groups (e.g., races, socioeconomic status), rather than to make a causal statement. Moreover, the treatment is often a non-manipulable variable, e.g., race or gender, which does not give a well-defined causal effect in the sense of Rubin (1978) (more discussion in Section 4). Nevertheless, the propensity score is still a valid and powerful tool to balance the covariate distributions between groups for studies with non-causal purposes. Therefore, we avoid the subtle issue of causality throughout the paper and note that the results obtained here are applicable to studies with more general (non-causal) purposes. For ease of description, we still refer to our estimands as treatment effects even though they are not necessarily causal.

Henceforth, let $m$ denote the total number of clusters; $n_h$ the number of subjects in cluster $h$; $y_{hk}$ the outcome for subject $k$ in cluster $h$ (e.g., a clinical diagnosis); $x_{hk}$ the corresponding covariates (typically vector-valued, e.g., age, stage of detection, comorbidity scores, etc.); $v_h$ the cluster-level covariates (e.g., teaching status or measures of technical capacity of a hospital); $z_{hk}$ the treatment assignment for the subject, $z_{hk} \in \{0, 1\}$; and $e_{hk}$ the propensity score.

2.1 Step 1. Estimating the propensity score

To estimate the propensity score, several logistic regression models are available, with various treatments of the hierarchical structure.

2.1.1 Marginal model

As the name suggests, marginal regression models ignore clustering information. A typical marginal propensity score model would be

$$\log\left(\frac{e_{hk}}{1 - e_{hk}}\right) = \beta^e x_{hk} + \kappa^e v_h, \qquad (2)$$

where $e_{hk} = P(z_{hk} = 1 \mid x_{hk}, v_h)$. This model in fact assumes the treatment assignment mechanism is the same across all clusters.
In other words, it assumes that two subjects are exchangeable in terms of treatment propensity if they have the same vector of covariates, whether or not they come from the same cluster. This propensity score model can be thought of as a nonparametric alternative to a regression-based adjustment for individual and cluster covariates. The analogous marginal regression model would be

$$y_{hk} = \gamma z_{hk} + \beta^y x_{hk} + \kappa^y v_h + \epsilon_{hk}, \qquad (3)$$

where $\epsilon_{hk} \sim N(0, \sigma_\epsilon^2)$ and $\gamma$ is the treatment effect. As with model (2), estimates derived from this regression model rely on the assumption that the outcome generating mechanism is the same across all clusters. Models (2) and (3) have a manifest similarity of form. A deeper connection is that the sufficient statistics for estimating the treatment effect that are balanced under the propensity score estimator are the same ones that must be balanced under model (3).

2.1.2 Pooled within-cluster model

A pooled within-cluster model for the propensity score conditions on both the covariates and the cluster indicators,

$$\log\left(\frac{e_{hk}}{1 - e_{hk}}\right) = \delta_h^e + \beta^e x_{hk}, \qquad (4)$$

where $\delta_h^e$ is a cluster-level main effect, $\delta_h^e \sim N(0, \infty)$ (that is, an unrestricted fixed effect), and $e_{hk} = P(z_{hk} = 1 \mid x_{hk}, h)$. This model implies that the treatment assignment mechanism differs among clusters, and the difference is controlled by the cluster-level main effect $\delta_h^e$. Model (4) involves a more general (weaker) assumption on the treatment assignment mechanism than the marginal model (2), because the cluster-level covariate $v_h$ is a function of the cluster indicator.
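The contrast between the marginal model (2) and the pooled within-cluster model (4) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data set is invented, there are no cluster-level covariates, and `fit_logistic`, `predict` and `residual_by_cluster` are hypothetical helpers that fit the logistic regression by plain gradient ascent rather than by any particular statistical package.

```python
import math

# Toy data, invented for illustration: (cluster h, binary covariate x, treatment z).
data = [
    (0, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1), (0, 1, 1), (0, 0, 0), (0, 1, 0),
    (1, 0, 1), (1, 0, 1), (1, 0, 0), (1, 1, 1), (1, 1, 1), (1, 1, 0), (1, 1, 1), (1, 0, 1),
    (2, 0, 0), (2, 0, 0), (2, 0, 0), (2, 1, 1), (2, 1, 0), (2, 0, 1), (2, 1, 0), (2, 0, 0),
]
m = 3  # number of clusters


def predict(beta, x):
    """Inverse logit of the linear predictor beta'x."""
    eta = sum(b * xj for b, xj in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-eta))


def fit_logistic(rows, z, steps=5000, lr=1.0):
    """Hypothetical helper: maximise the mean log-likelihood by gradient ascent."""
    beta = [0.0] * len(rows[0])
    n = len(rows)
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for x, zi in zip(rows, z):
            resid = zi - predict(beta, x)
            for j, xj in enumerate(x):
                grad[j] += resid * xj
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta


z = [zi for _, _, zi in data]

# Marginal model (2): intercept and x only, cluster membership ignored.
X_marg = [[1.0, float(x)] for _, x, _ in data]
# Pooled within-cluster model (4): one indicator per cluster plus x.
X_pool = [[1.0 if h == c else 0.0 for c in range(m)] + [float(x)] for h, x, _ in data]

beta_marg = fit_logistic(X_marg, z)
beta_pool = fit_logistic(X_pool, z)
e_marg = [predict(beta_marg, row) for row in X_marg]
e_pool = [predict(beta_pool, row) for row in X_pool]


def residual_by_cluster(e_hat):
    """Sum of (z - estimated propensity) within each cluster."""
    out = [0.0] * m
    for (h, _, zi), e in zip(data, e_hat):
        out[h] += zi - e
    return out


print("marginal:", residual_by_cluster(e_marg))
print("pooled:  ", residual_by_cluster(e_pool))
```

In this toy example the pooled fit reproduces the number of treated subjects inside every cluster, while the marginal fit only reproduces the overall total, which is one way to see that model (2) imposes a common assignment mechanism on all clusters.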

In model (4), if we assume the cluster-specific main effects $\delta_h^e$ follow a distribution, $\delta_h^e \sim N(0, \sigma_\delta^2)$, then we have a new propensity score model with random effects,

$$\log\left(\frac{e_{hk}}{1 - e_{hk}}\right) = \delta_h^e + \beta^e x_{hk} + \kappa^e v_h.$$

More generally, $\beta^e$ can be allowed to vary across clusters and follow a distribution. In practice, results from the above random effects model are usually similar to those from the pooled within-cluster model when the number of clusters is large.

A corresponding pooled within-cluster outcome model adjusting for cluster-level main effects and covariates is of the form

$$y_{hk} = \gamma z_{hk} + \delta_h^y + \beta^y x_{hk} + \epsilon_{hk}, \qquad (5)$$

where $\delta_h^y$ is a cluster-level main effect, $\delta_h^y \sim N(0, \infty)$. Under this model, all information is obtained by comparisons within clusters, since the $\delta_h^y$ term absorbs all between-cluster information.

2.1.3 Surrogate indicator model

When there are a large number of clusters with large sample sizes, the computational task of fitting the pooled within-cluster model can become demanding for standard software. Alternatively, defining $d_h = \sum_k z_{hk} / n_h$, the cluster-specific proportion of being treated, we can consider the following propensity score model:

$$\log\left(\frac{e_{hk}}{1 - e_{hk}}\right) = \lambda \log\left(\frac{d_h}{1 - d_h}\right) + \beta^e x_{hk} + \kappa^e v_h. \qquad (6)$$

In the simplest situation where there are no covariates, $e_{hk} = d_h$ for any $h, k$. Therefore, comparing models (4) and (6), the logit of $d_h$ may be expected to be a reasonable surrogate for the cluster indicator in the pooled within-cluster model, with the coefficient $\lambda$ being around 1. The inference is the same as in the marginal model with an additional covariate $\mathrm{logit}(d_h)$. Usually the coefficients of the cluster-level covariates, $\kappa^e$, are very small, since most of their effects have been absorbed by $\lambda$. The surrogate indicator model reduces the $m$ parameters (the $\delta_h^e$'s) in the pooled within-cluster model to a single parameter $\lambda$, thus greatly reducing the computation required for model fitting. However, this reduction is based on the assumption that the logit of the empirical cluster-specific proportion of being treated, $\mathrm{logit}(d_h)$, is linearly correlated with the logit of the true propensity score. When the underlying truth is far from this assumption, the surrogate indicator model could perform poorly.

The goodness of fit of these models can be checked by conventional diagnostic procedures (e.g., Rosenbaum and Rubin, 1984). For example, one can check both the overall and the within-cluster balance of the distribution of covariates weighted by the estimated propensity score in different groups.

2.2 Step 2. Estimating the treatment effect

Common approaches to estimating treatment effects using the propensity score involve weighting, matching and stratification. We focus on weighting in this report.

2.2.1 Marginal estimator

Similar to the marginal model in step 1, the marginal estimator ignores clustering. A specific nonparametric estimator is the difference of the weighted overall means of the outcome of the two treatment groups,

$$\hat{\Delta}_{\cdot,\mathrm{marg}} = \frac{\sum_{h,k:\, z_{hk}=1} w_{hk}\, y_{hk}}{\sum_{h,k:\, z_{hk}=1} w_{hk}} - \frac{\sum_{h,k:\, z_{hk}=0} w_{hk}\, y_{hk}}{\sum_{h,k:\, z_{hk}=0} w_{hk}}, \qquad (7)$$

where the weight $w_{hk}$ is a function of the estimated propensity score. The choice of weight is discussed in Section 2.3. Assuming $y_{hk}$ is homoscedastic with $\mathrm{var}(y_{hk}) = \sigma^2$, the large-sample variance of the marginal estimator is

$$s^2_{\cdot,\mathrm{marg}} = \mathrm{var}(\hat{\Delta}_{\cdot,\mathrm{marg}}) = \sigma^2 \frac{\sum_{h,k:\, z_{hk}=1} w_{hk}^2}{\left(\sum_{h,k:\, z_{hk}=1} w_{hk}\right)^2} + \sigma^2 \frac{\sum_{h,k:\, z_{hk}=0} w_{hk}^2}{\left(\sum_{h,k:\, z_{hk}=0} w_{hk}\right)^2}. \qquad (8)$$

In practice $\sigma^2$ can be estimated from the sample variance of $y_{hk}$.

2.2.2 Clustered estimator

A second estimator first obtains the cluster-specific weighted difference and then calculates the weighted average of these differences based on the sum of weights in each cluster. That is, for cluster $h$,

$$\hat{\Delta}_h = \frac{\sum_{k:\, z_{hk}=1} w_{hk}\, y_{hk}}{\sum_{k:\, z_{hk}=1} w_{hk}} - \frac{\sum_{k:\, z_{hk}=0} w_{hk}\, y_{hk}}{\sum_{k:\, z_{hk}=0} w_{hk}}.$$

The variance of the cluster-specific estimator $\hat{\Delta}_h$ under the assumption that the $y_{hk}$ are independent and homoscedastic within cluster $h$ is

$$s_h^2 = \mathrm{var}(\hat{\Delta}_h) = \sigma_h^2 \frac{\sum_{k:\, z_{hk}=1} w_{hk}^2}{\left(\sum_{k:\, z_{hk}=1} w_{hk}\right)^2} + \sigma_h^2 \frac{\sum_{k:\, z_{hk}=0} w_{hk}^2}{\left(\sum_{k:\, z_{hk}=0} w_{hk}\right)^2}.$$

Similarly, $\sigma_h^2$ can be estimated from its empirical counterpart within each cluster. Let $w_h$ be a function of the weights in cluster $h$, e.g., the sum of weights, $w_h = \sum_k w_{hk}$, or the precision of the estimator $\hat{\Delta}_h$, $w_h = s_h^{-2}$.
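The overall clustered estimator is then an average of the $\hat{\Delta}_h$'s weighted by $w_h$,

$$\hat{\Delta}_{\cdot,\mathrm{clu}} = \frac{\sum_h w_h \hat{\Delta}_h}{\sum_h w_h}, \qquad (9)$$

and, with $w_h = \sum_k w_{hk}$, the overall variance is

$$s^2_{\cdot,\mathrm{clu}} = \mathrm{var}(\hat{\Delta}_{\cdot,\mathrm{clu}}) = \frac{\sum_h \left(\sum_k w_{hk}\right)^2 s_h^2}{\left(\sum_{h,k} w_{hk}\right)^2}. \qquad (10)$$

The standard errors $s_{\cdot,\mathrm{marg}}$ and $s_{\cdot,\mathrm{clu}}$ can also be obtained from resampling methods such as the bootstrap.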
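The two point estimators, (7) and (9), can be sketched as follows. The sketch takes the weights as given inputs, uses the sum of weights in a cluster as the cluster weight, and assumes every cluster contains at least one treated and one control subject; the function names and the toy records are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def weighted_diff(records):
    """Weighted mean outcome of the treated minus that of the controls.

    records: iterable of (z, y, w) with z in {0, 1}.
    """
    num = {0: 0.0, 1: 0.0}
    den = {0: 0.0, 1: 0.0}
    for z, y, w in records:
        num[z] += w * y
        den[z] += w
    return num[1] / den[1] - num[0] / den[0]


def marginal_estimator(data):
    """Equation (7): pool all subjects and ignore cluster labels.

    data: list of (h, z, y, w).
    """
    return weighted_diff((z, y, w) for _, z, y, w in data)


def clustered_estimator(data):
    """Equation (9) with cluster weight w_h equal to the sum of weights in h."""
    by_cluster = defaultdict(list)
    for h, z, y, w in data:
        by_cluster[h].append((z, y, w))
    total, weight_sum = 0.0, 0.0
    for recs in by_cluster.values():
        w_h = sum(w for _, _, w in recs)
        total += w_h * weighted_diff(recs)
        weight_sum += w_h
    return total / weight_sum


# Invented example with unit weights: (cluster, z, y, w).
toy = [
    ("A", 1, 3.0, 1.0), ("A", 1, 3.0, 1.0), ("A", 0, 1.0, 1.0),
    ("B", 1, 10.0, 1.0), ("B", 0, 6.0, 1.0), ("B", 0, 6.0, 1.0), ("B", 0, 6.0, 1.0),
]
delta_marg = marginal_estimator(toy)
delta_clu = clustered_estimator(toy)
print(delta_marg, delta_clu)
```

In the toy data the within-cluster differences are 2 and 4, so the clustered estimate lies between them, whereas the marginal estimate is pulled well below both because the treated and control groups are drawn from the two clusters in different proportions.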

2.2.3 Doubly-robust estimators

The weighted mean can be regarded as a weighted regression without covariates. Therefore, in step 2, we can replace the nonparametric weighted mean (7) or (9) by a parametric regression (e.g., model (3) or (5)) weighted by the estimated propensity score; the coefficient of the treatment assignment, $\gamma$, is then the targeted estimand of treatment effect. This is essentially the class of doubly-robust estimators proposed by Scharfstein et al. (1999). Doubly-robust estimators allow flexible model choices in both steps, which can be very beneficial in applications. These estimators are called doubly robust in the sense that they are proven to be consistent if one, but not necessarily both, of the step 1 and step 2 models is correctly specified under the Horvitz-Thompson weight (see below). Detailed discussion of this property with hierarchical data is presented in the next section.

2.3 Choice of weights

We now consider the choice of weights. We call the class of weights that balance the distribution of covariates between treatment groups balancing weights. The most widely used balancing weight is the Horvitz-Thompson (inverse probability) weight,

$$w_{hk} = \begin{cases} \dfrac{1}{e_{hk}}, & \text{for } z_{hk} = 1, \\[1ex] \dfrac{1}{1 - e_{hk}}, & \text{for } z_{hk} = 0. \end{cases}$$

The H-T weight is a balancing weight because

$$E\left[\frac{XZ}{e(X)}\right] = E\left[\frac{X(1 - Z)}{1 - e(X)}\right].$$

The H-T estimator compares the expected outcome of the subjects placed in $z = 0$ with that of the subjects placed in $z = 1$, averaging over the distribution of covariates in the combined population. That is,

$$E\left[\frac{YZ}{e(X)} - \frac{Y(1 - Z)}{1 - e(X)}\right] = E[(Y \mid Z = 1) - (Y \mid Z = 0)].$$

In fact, the doubly-robust estimators in Scharfstein et al. (1999) are restricted to using the H-T weight because of this clear causal interpretation. However, the H-T estimator is well known to have excessively large variance when there are subjects with extremely small propensity scores. Nevertheless, the same idea is readily extended to any balancing weight, although alternative weights might define different estimands. For example, we can consider the population-overlap weight,

$$w_{hk} = \begin{cases} 1 - e_{hk}, & \text{for } z_{hk} = 1, \\ e_{hk}, & \text{for } z_{hk} = 0, \end{cases}$$
where each subject is weighted by the probability of being assigned to the other treatment group. It is also a balancing weight because $E[XZ\{1 - e(X)\}] = E[X(1 - Z)e(X)]$. In theory, the population-overlap weight gives the smallest variance under a homoscedastic model for $Y$ given $X$. But it defines a different estimand than the Horvitz-Thompson weight. Specifically, we call this the population-overlap weight because it results in an average treatment effect that is averaged over the distribution of covariates in the population where the two treatment groups overlap:

$$E[YZ\{1 - e(X)\} - Y(1 - Z)e(X)] = E[\{(Y \mid Z = 1) - (Y \mid Z = 0)\}\, e(X)\{1 - e(X)\}].$$

This population-overlap estimator can be calculated with acceptable variance when the H-T estimator cannot be practically estimated because $e(X)$ can approach 0 or 1 for some part of the $x$ space, such that $1/e(x)$ or $1/(1 - e(x))$ would become extremely large. In effect, the H-T estimator attempts to estimate a treatment effect for types of cases that are essentially unrepresented in one or the other group, while the population-overlap weighting focuses on the types of cases with a more balanced distribution of treatment. In addition to its statistical advantage, the latter analysis may be more scientifically relevant, since it focuses attention on comparison of outcomes among the kinds of cases for which both treatments are currently observed, for example those in clinical equipoise between treatments.

3. Bias of Estimators

In this section, we investigate the bias of each of the estimators proposed in the previous section. We first look at the simplest case, with a two-level hierarchical structure and no covariates. Let $n_{h1}$ ($n_{h0}$) denote the number of subjects with $z = 1$ ($z = 0$) in cluster $h$; and let $n_{+1} = \sum_h n_{h1}$, $n_{+0} = \sum_h n_{h0}$, and $n_{++} = n_{+1} + n_{+0}$.
Assume the outcome generating mechanism for a continuous outcome follows a random effects model with cluster-level random intercepts and random treatment effects,

$$y_{hk} = \delta_h + \gamma_h z_{hk} + \alpha d_h + \epsilon_{hk}, \qquad (11)$$

where $\delta_h \sim N(0, \sigma_\delta^2)$, $\epsilon_{hk} \sim N(0, \sigma_\epsilon^2)$, $\alpha$ is the effect of the cluster-specific proportion of being treated, $d_h$, on the outcome, and the true treatment effect is $\gamma_h$ with $\gamma_h \sim N(\gamma_0, \sigma_\gamma^2)$.

We first look at the situation where clustering information is ignored in both steps. For the marginal model in step 1, it is easy to show that the estimated propensity score is the same for each subject, $\hat{e}_{hk} = n_{+1}/n_{++}$. Consequently, the marginal estimator is

$$\begin{aligned}
\hat{\Delta}_{\mathrm{marg},\mathrm{marg}} &= \frac{\sum_{h,k:\, z_{hk}=1} y_{hk}}{n_{+1}} - \frac{\sum_{h,k:\, z_{hk}=0} y_{hk}}{n_{+0}} \\
&= \sum_h \frac{n_{h1}}{n_{+1}} \gamma_h + \sum_h \left(\frac{n_{h1}}{n_{+1}} - \frac{n_{h0}}{n_{+0}}\right) \delta_h + \left(\frac{\sum_{h,k:\, z_{hk}=1} \epsilon_{hk}}{n_{+1}} - \frac{\sum_{h,k:\, z_{hk}=0} \epsilon_{hk}}{n_{+0}}\right) \\
&\quad + \alpha\, \frac{\dfrac{n_{+1} n_{+0}}{n_{++}} - \sum_h n_h d_h (1 - d_h)}{\dfrac{n_{+1} n_{+0}}{n_{++}}}.
\end{aligned}$$

Assume the common regularity conditions on the ratios $n_{h1}/n_{+1}$, $n_{h0}/n_{+0}$ and $n_h/n_{++}$ hold as $n_{++} \to \infty$. Then, by the weak law of large numbers for weighted sums of independent and identically distributed random variables (e.g., Chow and Lai, 1973), $\sum_h \frac{n_{h1}}{n_{+1}} \gamma_h$ converges to $\gamma_0$ as the number of clusters goes to infinity, $\sum_h \left(\frac{n_{h1}}{n_{+1}} - \frac{n_{h0}}{n_{+0}}\right) \delta_h$ goes to 0, and so does the third term in the above formula. In the fourth term, $n_{+1} n_{+0} / n_{++}$ is in fact the variance of the total number of treated subjects, $\mathrm{var}(n_{+1})$, if all clusters are exchangeable, i.e., if all subjects regardless of cluster follow the same treatment assignment mechanism, $z \sim \mathrm{Bernoulli}(n_{+1}/n_{++})$. Furthermore, $\sum_h n_h d_h (1 - d_h)$ is the sum of the variances of the number of treated subjects within each cluster, $\sum_h \mathrm{var}(n_{h1})$, if each cluster separately follows a treatment assignment mechanism $z_{hk} \sim \mathrm{Bernoulli}(n_{h1}/n_h)$. Therefore, the bias of the marginal estimator with propensity score estimated from the marginal model is

$$\mathrm{Bias}(\hat{\Delta}_{\mathrm{marg},\mathrm{marg}}) = \alpha \left[\frac{\mathrm{var}(n_{+1}) - \sum_h \mathrm{var}(n_{h1})}{\mathrm{var}(n_{+1})}\right]. \qquad (12)$$

The size of the bias is controlled by two factors: (1) the ratio of the variance of the total number of treated subjects under a homogeneous versus a cluster-heterogeneous treatment assignment mechanism; and (2) the effect that the cluster-specific proportion of being treated, $d_h$, has on the response, i.e., $\alpha$. This is intuitive because the first factor measures the variation in the treatment assignment mechanism among clusters and the second measures the variation in the outcome generating mechanism, both of which are ignored in the analysis with marginal models in both steps. When either one (not necessarily both) of the two mechanisms is homogeneous across clusters, the marginal estimator $\hat{\Delta}_{\mathrm{marg},\mathrm{marg}}$ is also consistent. In reality, however, it is most likely that both mechanisms are heterogeneous among clusters.

We now look at the opposite situation, where clustering information is taken into account in both steps. For the pooled within-cluster model in step 1, it is easy to show that the estimated propensity score is $\hat{e}_{hk} = n_{h1}/n_h$.
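Then the clustered weighted estimator is

$$\begin{aligned}
\hat{\Delta}_{\mathrm{pool},\mathrm{clu}} &= \frac{1}{m} \sum_h \left(\frac{\sum_{k:\, z_{hk}=1} y_{hk}}{n_{h1}} - \frac{\sum_{k:\, z_{hk}=0} y_{hk}}{n_{h0}}\right) \\
&= \frac{\sum_h \gamma_h}{m} + \frac{1}{m} \sum_h \left(\frac{\sum_{k:\, z_{hk}=1} \epsilon_{hk}}{n_{h1}} - \frac{\sum_{k:\, z_{hk}=0} \epsilon_{hk}}{n_{h0}}\right) \;\longrightarrow\; \gamma_0 \quad \text{as } n_h, m \to \infty, \qquad (13)
\end{aligned}$$

which is asymptotically unbiased. The result is free of the form of weight. Simple calculation shows that the clustered weighted estimator combined with the marginal model in step 1, $\hat{\Delta}_{\mathrm{marg},\mathrm{clu}}$, is of exactly the same form as (13) and thus also unbiased. Furthermore, the marginal estimator with propensity score estimated from the pooled within-cluster model, $\hat{\Delta}_{\mathrm{pool},\mathrm{marg}}$, follows the same form as (13), but only under the H-T weight and a balanced design (i.e., each cluster has the same number of subjects). Under the H-T weight but an unbalanced design, the estimator is also consistent (assuming $\sum_h \frac{n_h}{n_{++}} \gamma_h \to \gamma_0$ as $n_h, m \to \infty$). However, the same estimator under the population-overlap weight is

$$\begin{aligned}
\hat{\Delta}_{\mathrm{pool},\mathrm{marg}} &= \frac{\sum_h \left(\dfrac{n_{h0}}{n_h} \sum_{k:\, z_{hk}=1} y_{hk} - \dfrac{n_{h1}}{n_h} \sum_{k:\, z_{hk}=0} y_{hk}\right)}{\sum_h \dfrac{n_{h1} n_{h0}}{n_h}} \\
&= \frac{\sum_h \dfrac{n_{h1} n_{h0}}{n_h} \left(\gamma_h + \dfrac{\sum_{k:\, z_{hk}=1} \epsilon_{hk}}{n_{h1}} - \dfrac{\sum_{k:\, z_{hk}=0} \epsilon_{hk}}{n_{h0}}\right)}{\sum_h \dfrac{n_{h1} n_{h0}}{n_h}} \;\longrightarrow\; \gamma_0 \quad \text{as } n_h, m \to \infty.
\end{aligned}$$

Even though this estimator is also asymptotically unbiased, its small-sample behavior can be quite different from that of the estimator under the H-T weight.

Under the homoscedasticity assumption on the outcome, the three H-T estimators $\hat{\Delta}_{\mathrm{pool},\mathrm{marg}}$, $\hat{\Delta}_{\mathrm{marg},\mathrm{clu}}$ and $\hat{\Delta}_{\mathrm{pool},\mathrm{clu}}$, which take clustering into account in at least one step, have the same variance,

$$s^2 = \sigma_\epsilon^2 \sum_h \frac{n_h^2}{n_{++}^2} \left(\frac{1}{n_{h1}} + \frac{1}{n_{h0}}\right).$$

As with the discussion of bias, this result is generally not applicable to other types of weights. Specifically, the variance of $\hat{\Delta}_{\mathrm{pool},\mathrm{marg}}$ is usually larger than that of $\hat{\Delta}_{\mathrm{marg},\mathrm{clu}}$ and $\hat{\Delta}_{\mathrm{pool},\mathrm{clu}}$.

When there are no covariates, the surrogate indicator model gives the same estimated propensity score as the pooled within-cluster model. Thus the results obtained above regarding the pooled within-cluster model automatically hold for the surrogate indicator model. But this is not the case for the general situation with covariates. The proofs are analogous for data with a higher number of hierarchical levels.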
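The no-covariate results can be checked numerically in a deliberately simplified setting. The sketch below assumes a noise-free version of model (11), with $\delta_h = 0$, $\epsilon_{hk} = 0$ and a constant effect $\gamma_h = \gamma$, so that the asymptotic statements hold exactly; the cluster sizes, the values of `gamma` and `alpha`, and all function names are hypothetical choices made for illustration.

```python
def ht_weight(z, e):
    """Horvitz-Thompson (inverse probability) weight."""
    return 1.0 / e if z == 1 else 1.0 / (1.0 - e)


def overlap_weight(z, e):
    """Population-overlap weight: probability of the other treatment group."""
    return 1.0 - e if z == 1 else e


def weighted_diff(records):
    """Weighted mean outcome of the treated minus that of the controls."""
    num = {0: 0.0, 1: 0.0}
    den = {0: 0.0, 1: 0.0}
    for z, y, w in records:
        num[z] += w * y
        den[z] += w
    return num[1] / den[1] - num[0] / den[0]


gamma, alpha = 2.0, 5.0               # assumed true effect and effect of d_h
clusters = [(8, 2), (5, 5), (1, 9)]   # assumed (n_h1, n_h0) per cluster

n1 = sum(a for a, _ in clusters)
n0 = sum(b for _, b in clusters)
d = [a / (a + b) for a, b in clusters]  # cluster-specific treated proportions

# Noise-free outcomes: y = gamma * z + alpha * d_h.
subjects = []
for h, (a, b) in enumerate(clusters):
    subjects += [(h, 1, gamma + alpha * d[h])] * a
    subjects += [(h, 0, alpha * d[h])] * b


def marginal_estimate(weight_fn, propensity):
    """Marginal estimator (7) with weights built from the given propensities."""
    return weighted_diff((z, y, weight_fn(z, propensity(h))) for h, z, y in subjects)


e_marg = n1 / (n1 + n0)               # marginal model: same score for everyone
marg_marg = marginal_estimate(ht_weight, lambda h: e_marg)
pool_marg_ht = marginal_estimate(ht_weight, lambda h: d[h])
pool_marg_ov = marginal_estimate(overlap_weight, lambda h: d[h])

# Bias formula (12): alpha * (var(n_+1) - sum_h var(n_h1)) / var(n_+1).
var_total = n1 * n0 / (n1 + n0)
var_within = sum((a + b) * dh * (1.0 - dh) for (a, b), dh in zip(clusters, d))
bias_12 = alpha * (var_total - var_within) / var_total

print(marg_marg - gamma, bias_12)     # the two numbers agree
print(pool_marg_ht, pool_marg_ov)     # both recover gamma
```

Ignoring clustering in both steps reproduces the bias in (12) exactly, while using the cluster-specific propensity score $n_{h1}/n_h$ in step 1 recovers $\gamma$ under either weight, because without noise or effect heterogeneity the choice of cluster weighting no longer matters.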
For the simplest case without covariates, we have shown above the double-robustness of those propensity score estimators: when both the true treatment assignment mechanism and the true outcome generating mechanism are hierarchically structured, the estimator using a balancing weight is consistent as long as the hierarchical structure is taken into account in at least one of the two steps of the propensity score procedure. This can be viewed as both a special case and an extension of the double-robustness property of the estimator in Scharfstein et al. (1999). The extension is that our conclusion does not depend on the form of the weight.

In the more general case with covariates, there is usually no closed-form solution to the logistic models for estimating the propensity score. Consequently, there is no closed-form expression for the bias of those estimators as above. Nevertheless, this situation can be explored either by large-scale simulations or by adopting a probit (instead of logistic) link for estimating the propensity score. Intuitively, the double-robustness property still holds. But the bias of the marginal estimator $\hat{\Delta}_{marg,marg}$ is expected to also be affected by the size of the true treatment effect $\gamma$ (negatively correlated) and by the ratio of between-cluster to within-cluster variance $g = \sigma_\delta^2 / \sigma_\epsilon^2$ (positively correlated), in addition to $\alpha$ and the variance terms of the cluster-level group sizes that appear in the bias expression derived earlier. A comprehensive discussion is beyond the scope of this report and is left for further research.

4. Application

We now apply the above methods to study racial disparity in health services. Disparity refers to racial differences in care attributed to the operations of the health care system. Our application concerns the HEDIS® measures of health care provided in Medicare health plans. Each of these measures is an estimate of the rate at which a guideline-recommended clinical service is provided within the appropriate population. We obtained individual-level data from the Centers for Medicare and Medicaid Services (CMS) on breast cancer screening of women in Medicare managed care health plans (Schneider et al., 2002). Our main interest is the disparity between whites and blacks, so we exclude subjects of other races, for whom racial identification is unreliable in this dataset. We focus on plans with at least 5 whites and 5 blacks, leaving 64 plans with a total sample size of 7501. For practical reasons, we drew a random subsample of size 3000 from each of the three large plans with more than 3000 subjects, leaving a total sample size of 56480.

All the covariates considered in the analysis are binary. The individual-level covariates $x_{hk}$ include two indicators of age category (70-80, >80, with reference group 60-70); eligibility for Medicaid (1 = yes); and a neighborhood status indicator (1 = poor). The plan-level covariates $v_h$ include nine geographical code indicators; non/for-profit status (1 = for-profit); and the practice model of providers (1 = staff-group model; 0 = network-independent practice model). The outcome $y$ is a binary variable equal to 1 if the enrollee underwent breast cancer screening and 0 otherwise, and the treatment $z$ here is race (1 = black, 0 = white).
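The subsampling of the large plans described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the function name `cap_cluster_size`, the seed, and the representation of plan membership as an integer array are all hypothetical.

```python
import numpy as np

def cap_cluster_size(cluster, cap=3000, seed=0):
    """Return sorted row indices that keep at most `cap` randomly chosen rows per cluster."""
    rng = np.random.default_rng(seed)
    cluster = np.asarray(cluster)
    keep = []
    for h in np.unique(cluster):
        idx = np.flatnonzero(cluster == h)
        if idx.size > cap:
            # Only clusters above the cap are subsampled; smaller ones are kept whole.
            idx = rng.choice(idx, size=cap, replace=False)
        keep.append(idx)
    return np.sort(np.concatenate(keep))

if __name__ == "__main__":
    # Toy example: three "plans" with 10, 3 and 7 enrollees, capped at 5 per plan.
    plans = np.array([0] * 10 + [1] * 3 + [2] * 7)
    rows = cap_cluster_size(plans, cap=5)
    print(np.bincount(plans[rows]))
```

Because only the oversized clusters are thinned, the within-cluster composition of the remaining plans is untouched; any analysis that weights clusters by size would, however, see the capped plans as smaller than they are.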
We want to estimate the difference between whites and blacks in the proportion undergoing breast cancer screening. As mentioned before, race is not a valid treatment in the conventional sense of causal inference, because it is not manipulable (Holland, 1986). However, in this particular application, our goal is not to study the causal pathway between race and health service utilization, but simply to estimate the magnitude of disparity under balanced distributions of covariates between the two races. Hence, the propensity score in this application is merely an analytical tool to achieve this goal, and it should not be taken as having the explicit meaning of the probability of being black.

We first estimate the propensity score using the three models introduced in Section 2.1 with all the above covariates included. Details of the fitted models are omitted here since the focus is the fitted values (the estimated propensity scores). All models suggest that living in a poor neighborhood, being eligible for Medicaid, and enrollment in a for-profit insurance plan are significantly associated with black race. Figures 1 and 2 show histograms of the estimated propensity scores for whites and blacks. Different models clearly give quite different estimates of the propensity score in these data, with the marginal model departing most from the other two. The variance of the estimated propensity scores of blacks is much bigger than that of whites, regardless of the model.

We checked the weighted distributions of covariates. Each model leads to good balance of the overall weighted covariate distributions between groups. However, the marginal model in general does poorly in balancing covariates between races within each cluster, while the surrogate indicator model does better and the pooled within-cluster model does best. This suggests that there is important between-cluster variation.

[Figure 1: Histogram of propensity score estimated from different models for whites. Panels: marginal model, pooled within-cluster model, surrogate indicator model; horizontal axis from 0.0 to 1.0.]

[Figure 2: Histogram of propensity score estimated from different models for blacks. Panels: marginal model, pooled within-cluster model, surrogate indicator model; horizontal axis from 0.0 to 1.0.]

Using the estimated propensity scores, we estimate racial disparity in breast cancer screening among elderly women participating in Medicare health plans with the estimators proposed in Section 2.2. Although the outcome is binary in this case, the outcome probabilities are in a range where the linear probability model is an acceptable fit. Hence, for the doubly-robust estimators, we adopt the combinations of the three propensity score models (2), (4) and (6) in step 1 and the two outcome models (3) and (5) in step 2. Table 1 shows the estimates using the H-T weight. Each row represents one step 1 model, and each column represents one type of step 2 model/estimator. Analogous results using the population-overlap weight are given in Table 2.

                             weighted                 doubly-robust
                       marginal    clustered     marginal    pooled within
  marginal             -0.050      -0.00         -0.04       -0.01
                       (0.008)     (0.008)       (0.004)     (0.004)
  pooled within        -0.04       -0.01         -0.018      -0.0
                       (0.009)     (0.008)       (0.004)     (0.004)
  surrogate indicator  -0.017      -0.015        -0.01       -0.015
                       (0.009)     (0.008)       (0.004)     (0.004)

Table 1: Difference in the proportion of getting breast cancer screening between blacks and whites using Horvitz-Thompson weight

                             weighted                 doubly-robust
                       marginal    clustered     marginal    pooled within
  marginal             -0.043      -0.030        -0.043      -0.03
                       (0.007)     (0.008)       (0.004)     (0.004)
  pooled within        -0.030      -0.031        -0.031      -0.031
                       (0.007)     (0.008)       (0.004)     (0.004)
  surrogate indicator  -0.035      -0.030        -0.031      -0.030
                       (0.007)     (0.008)       (0.004)     (0.004)

Table 2: Difference in the proportion of getting breast cancer screening between blacks and whites using population-overlap weight

All models show that the proportion receiving breast cancer screening is significantly lower among blacks than among whites with similar characteristics. The estimates are similar except for the analyses that ignore clustering in both steps, which overestimate the treatment effect. This pattern matches the double-robustness property. Results from the surrogate indicator model in step 1 are slightly different from the others, suggesting that the cluster-specific proportion treated $d_h$ is correlated with certain covariates. The doubly-robust estimates have smaller standard errors because the extra variation is explained by covariates in step 2. Not surprisingly, the estimates using the H-T weight have much larger variances than those using the population-overlap weight. We also notice that the estimates incorporating clustering in step 2 have less variation than those doing so in step 1. This observation suggests that, in application, modeling the hierarchical structure of the outcome generating mechanism leads to more stable estimates, even though in theory correct model specification in either step has an equivalent effect on consistency. A possible explanation is that the impact of misspecifying the propensity score is attenuated through weighting, because the ultimate estimand is a function of the outcome rather than of the propensity score. Even though we do not know the underlying truth, the similarity of the various estimators suggests that our analyses capture the main information regarding disparity in these data. That is, among the elderly who participate in Medicare health plans, blacks on average have a significantly lower chance of receiving breast cancer screening than whites, after adjusting for age, geographical region, socioeconomic status and health plan characteristics.

5. Summary and Remarks

Since they were first proposed twenty-five years ago, propensity score methods have gained increasing popularity in observational studies in multiple disciplines. One example is health care policy research, where data with hierarchical structure are now the rule rather than the exception. However, despite the wide appreciation of the propensity score among both statisticians and health policy researchers, there is very limited literature on the methodological issues of propensity score methods in the context of hierarchical data, which motivates our exploration in this paper. Specifically, we present three typical models for estimating the propensity score and two types of nonparametric estimators of treatment effect, weighted by the estimated propensity score, for hierarchically structured data.
Furthermore, for the simplest (conceptual) case without covariates, we show the double-robustness of those weighted estimators: when both the true treatment assignment mechanism and the true outcome generating mechanism are hierarchically structured, the estimator is consistent as long as the hierarchical structure is taken into account in at least one of the two steps of the propensity score procedure. We also quantify the bias of the estimator when clustering is ignored in both steps.

We have focused in this paper on the case of treatment assigned at the individual level. Treatment assigned at the cluster level (e.g., hospital, health care provider) is also common in medical care and health policy studies, where several new challenging issues can arise. First, the number of clusters is often relatively small despite a large total sample size. This could lead to poorly estimated propensity scores with excessively large standard errors. Second, the cluster-level propensity score only balances the cluster-level covariates and the average individual-level covariates. What are the consequences of the possible imbalance in the overall distributions of individual-level covariates? This also has a strong connection to the ecological inference commonly encountered in political science (e.g., King, 1997), where the estimand is interpreted as an average effect on individual outcomes. Third, none of the nonparametric weighted estimators discussed in this paper makes use of the individual-level covariates, which often contain crucial information. The doubly-robust estimators with a flexible choice of regression model in the second step appear to be preferable in this case. But which specific regression model to choose depends greatly on the specific data.
Fourth, and most interestingly, the foundational stable unit treatment value assumption (SUTVA), that the observation on one unit should be unaffected by the particular assignment of treatments to the other units (Cox 1958, §2.4), often no longer holds under clustered treatment assignment, especially in studies with, for instance, behavioral outcomes or infectious diseases. In that case, correct modeling of the interference among subjects is crucial for valid analysis. These issues are among a range of open questions that remain to be explored on this topic. Further systematic research is needed to shed light on the methodological issues and to provide guidelines for practical applications.

REFERENCES

Connors, A., Speroff, T., Dawson, N., et al. (1996). The effectiveness of right heart catheterization in the initial care of critically ill patients. Journal of the American Medical Association 276, 889-897.

Cox, C.P. (1958). The Analysis of Latin Square Designs with Individual Curvatures in one Direction. Journal of the Royal Statistical Society, Series B 20(1), 193-204.

Chow, Y.S. and Lai, T.L. (1973). Limiting behavior of weighted sums of independent random variables. The Annals of Probability 1(5), 810-824.

D'Agostino, R. (1998). Tutorial in biostatistics: propensity score methods for bias reduction in the comparisons of a treatment to a non-randomized control. Statistics in Medicine 17, 2265-2281.

Farrow, D., Samet, J. and Hunt, W. (1996). Regional variation in survival following the diagnosis of cancer. Journal of Clinical Epidemiology 49, 843-847.

Gatsonis, C., Normand, S., Liu, C. and Morris, C. (1993). Geographic variation of procedure utilization: a hierarchical model approach. Medical Care 31, YS54-YS59.

King, G. (1997). A Solution to the Ecological Inference Problem: Reconstructing Individual Behavior from Aggregate Data. Princeton University Press.

Holland, P.W. (1986). Statistics and causal inference (with discussion). Journal of the American Statistical Association 81, 945-970.

Huang, I.C., Frangakis, C.E., Dominici, F., Diette, G. and Wu, A.W. (2005). Application of a propensity score approach for risk adjustment in profiling multiple physician groups on asthma care. Health Services Research 40, 253-278.

Nattinger, A., Gottlieb, M., Veum, J., et al. (1992). Geographic variation in the use of breast-conserving treatment for breast cancer. New England Journal of Medicine 326, 1102-1107.

Rosenbaum, P.R. and Rubin, D.B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika 70(1), 41-55.

Rosenbaum, P.R. and Rubin, D.B. (1984). Reducing bias in observational studies using subclassification on the propensity score. Journal of the American Statistical Association 79, 516-524.

Rubin, D.B. (1978). Bayesian inference for causal effects: the role of randomization. Annals of Statistics 6, 34-58.

Rubin, D.B. (1979). Using multivariate matched sampling and regression adjustment to control bias in observational studies. Journal of the American Statistical Association 74, 318-328.

Scharfstein, D.O., Rotnitzky, A. and Robins, J.M. (1999). Adjusting for nonignorable drop-out using semiparametric nonresponse models (with discussion). Journal of the American Statistical Association 94, 1096-1146.

Schneider, E.C., Zaslavsky, A.M. and Epstein, A.M. (2002). Racial disparities in the quality of care for enrollees in Medicare managed care. Journal of the American Medical Association 287(10), 1288-1294.