
24 Equating Groups

Stephen G. West and Felix Thoemmes

EQUATING GROUPS

One of the most central tasks of both basic and applied behavioral science is to estimate the size of treatment effects. The basic procedure is conceptually very straightforward. The researcher identifies a treatment (T) of interest, such as a new drug treatment or a new cognitive approach to psychotherapy. In our illustration, T is designed as a possible means of reducing depression in a clinical population. The researcher then identifies a comparison (C) condition to which the treatment is to be compared. In the case of the new drug treatment, the researcher might choose a placebo, which has no pharmaceutical effect on depression, or the current standard drug prescribed to help relieve depression. Similarly, in the case of the new psychotherapy, the researcher might choose no psychotherapy, psychotherapy without the new cognitive elements, or the standard psychotherapeutic treatment that is commonly delivered (standard of practice). Each patient's level of depression is then measured following treatment. The difference between the mean levels of depression in the treatment and control groups, $\bar{Y}_T - \bar{Y}_C$, is then taken as the estimate of the treatment effect.

However, the estimate of the treatment effect will be valid if and only if the two groups have been successfully equated prior to the implementation of the treatment. Otherwise stated, only if the groups are equated will $\bar{Y}_T - \bar{Y}_C$ be an unbiased estimate of the causal effect of the treatment. This chapter examines some major methods of equating groups. We draw on insights from statistics (Holland, 1986; Rosenbaum, 2002; Rubin, 1974, 1978, 2005), psychology (Reichardt, 2006; Shadish et al., 2002; West et al., 2000), public health (Little & Rubin, 2000), and sociology and econometrics (Winship & Morgan, 1999). We focus on comparisons of a treatment and comparison group in two commonly used research designs: the randomized experiment and the observational study (i.e. the nonequivalent control group design). The key feature that distinguishes these two designs is the process through which units are assigned to the T and C groups (Judd & Kenny, 1981). The randomized experiment uses some random process (e.g. flipping a coin, a random number generator) to determine assignment of the units to the T and C groups.

The units are typically individual participants, but they may be larger aggregations such as schools or entire communities. This process implies that the expected mean of the units in the T group will equal the expected mean of the units in the C group on any conceivable measured or unmeasured baseline variable, so that $\bar{Y}_T - \bar{Y}_C$ may be taken as an unbiased estimate of the treatment effect. In contrast, the observational study uses an unknown process to assign participants to the groups. Participants may choose to receive T versus C, or participants may receive the treatment because they are located in a single community, school, hospital, or other larger unit that has agreed to participate in the study. Because the process through which participants end up in the T versus C groups is unknown, researchers should expect potential mean differences on background variables between the T and C groups at baseline, even before treatment commences. Now $\bar{Y}_T - \bar{Y}_C$ no longer represents an unbiased estimate of the causal effect of the treatment, but rather a confounded estimate reflecting some combination of the true causal effect of treatment and preexisting differences between the groups on measured or unmeasured variables at baseline (Reichardt, 2006). Only by carefully assessing critical participant characteristics at baseline and developing methods to equate the T and C groups prior to the beginning of treatment can the researcher even approximately estimate the desired effect of the treatment.

We begin this chapter by briefly reviewing the randomized experiment. The randomized experiment is often described as the gold standard design, and it serves as an important benchmark for the observational study. We identify some ways in which even randomized experiments can be enhanced through the use of additional procedures designed to more closely equate the groups at baseline. We then briefly review studies comparing the treatment effect estimates from randomized experiments to those of observational studies of similar treatments, to provide information about the conditions under which these two designs may lead to different estimates of the treatment effect. We then introduce modern methods of adjusting treatment effects in observational studies for measured differences at baseline. These methods can substantially reduce any bias in the estimate of the treatment effect. Other approaches attempt to bracket the size of the treatment effect so that it represents a reasonable estimate even if there are variations on important unmeasured differences at baseline. Finally, we consider design enhancements that help rule out likely effects of unmeasured variables that may provide alternative explanations for the observed effect of treatment.

RANDOMIZED EXPERIMENTS

Randomization approximately equates the T and C groups at baseline. More formally, randomization produces two important results (Holland, 1986; West et al., 2000). First, as we observed above, the expected mean on any participant characteristic at baseline will be equal in the T and C groups, $E(\bar{Y}_{T,\text{baseline}}) = E(\bar{Y}_{C,\text{baseline}})$, where $E(\cdot)$ is the expected value of the variable in parentheses. Second, the binary variable $X$ ($1 = T$; $0 = C$) indicating the treatment condition is expected to be unrelated to all possible participant characteristics at baseline, $E(r_{XY_\text{baseline}}) = 0$.
These two results imply that $\bar{Y}_T - \bar{Y}_C$ at posttest will be an unbiased estimate of the treatment effect, so that no adjustment of this effect is needed. Note, however, that these results are expectations. They will hold exactly only given very large sample sizes or across a large number of exact replications of the same experiment conducted on a single population. In any single experiment using more modest sample sizes, unfortunate randomization, in which the T and C groups differ at baseline on some subset of important background variables, can be expected to occur with some regularity. For this reason many journals in the public health area formally require that means of the T and C groups on important baseline measures be reported as a check on the success of the randomization in the experiment.
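These expectations are easy to check by simulation. Below is a minimal sketch (ours, not the chapter's; the covariate distribution and sample sizes are arbitrary) showing that the baseline mean difference is zero on average across many randomizations, while any single small-sample randomization can still be noticeably imbalanced.

```python
import numpy as np

rng = np.random.default_rng(42)


def baseline_difference(n_per_group):
    """Randomly assign 2*n units to T and C; return the baseline mean difference."""
    baseline = rng.normal(100, 15, size=2 * n_per_group)  # e.g. an IQ-like covariate
    assignment = rng.permutation(2 * n_per_group) < n_per_group  # True = T
    return baseline[assignment].mean() - baseline[~assignment].mean()


diffs = np.array([baseline_difference(n_per_group=10) for _ in range(10_000)])
print(f"mean difference over replications: {diffs.mean():.3f}")  # approximately 0
print(f"share of randomizations off by > 5 points: {(np.abs(diffs) > 5).mean():.2%}")
```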

Following our presentation of additional requirements for randomized field experiments, we discuss procedures that use these baseline measures to equate groups more adequately prior to treatment in order to provide more statistically powerful tests of the treatment effects.

Additional requirements

Randomized experiments involve additional requirements that must be met for valid estimation of the treatment effect (see Chapter 8). These requirements are routinely met in most laboratory experiments, but can easily be violated in community settings. Failure to meet these requirements may necessitate the use of special procedures, the inclusion of additional design features, or the use of special analysis procedures that adjust for the potential bias (Barnard et al., 1998). Four requirements over which the experimenter may have only limited control are of particular importance in randomized field experiments.¹

1 Proper Randomization. The randomization process must be properly carried out and adhered to. Treatment providers must not be permitted to alter the assignment of participants to the T and C conditions. Kopans (1994) presents evidence that reassignment of high-risk women to the treatment condition apparently occurred in a large national randomized trial evaluating the effectiveness of screening mammography. Connor (1977) provides other examples of experiments in which randomization failed or was not maintained by treatment providers, and suggests procedures that potentially minimize the likelihood of such randomization failures. Robins (1989) and Hernán et al. (2001) present methods of adjusting treatment effect estimates in complex longitudinal studies, for example, when participants are reassigned to another treatment, as in certain medical studies in which the patient does not respond to the assigned treatment.

2 Treatment Compliance. The participants must receive the intended treatment. In randomized experiments studying mammography screening, some participants have refused screening (T). Other participants in the C group have sought out mammography screening outside the experiment (Baker, 1998). West and Sagarin (2000; see also Angrist et al., 1996; Jo, 2002) review statistical procedures that can provide proper estimates of the treatment effect when there is treatment noncompliance.

3 Absence of Attrition. All participants who are assigned to the T and C conditions must be measured on the outcome variable. Even though randomization serves to equate participants on average at baseline, this equating is potentially lost if some participants are not measured at posttest. Of most concern is differential attrition, in which participants with different characteristics drop out of the two groups. For example, in an experiment investigating a new method of mathematics instruction, less mathematically talented students might find the new course too challenging and withdraw prior to the collection of the outcome measure. $\bar{Y}_T$ would then be based only on the scores of the more talented students assigned to the T condition, leading to an overestimate of the effectiveness of the course. Modern missing data techniques (Little & Rubin, 2002; Schafer & Graham, 2002) can improve the estimation of the treatment effect, particularly if variables that are highly related to the outcome (e.g. baseline measures on the outcomes of interest), to missingness, or ideally both are measured at baseline.
Full information maximum likelihood estimation (FIML), now available in several statistical packages (e.g. Mplus), combines all of the observed data to produce optimal estimates and standard errors for the treatment effect and other parameters of interest in the statistical model. Multiple imputation (MI), also available in several statistical packages (e.g. SAS), makes multiple copies of the dataset. In each copy, the optimal predicted value for each missing datum is calculated, and then random error matching that in the complete data is added. The step of adding random error ensures that the original variability of the observed data is retained in the imputed values. The statistical model testing the treatment effect is then estimated in each copy of the dataset. Finally, the estimates of the treatment effect (and other parameters of interest) from the copies are recombined, as sketched below. FIML and MI will both produce unbiased estimates of the treatment effect with proper standard errors if missingness is related to measured variables in the dataset, but not if there are aspects of the missing values that are not captured by other variables in the dataset.
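As a concrete illustration of the MI steps just described (impute with added noise, analyze each copy, recombine), here is a minimal sketch using scikit-learn's IterativeImputer to draw the copies and Rubin's rules to pool the estimates. The tiny dataset, the variable layout, and the choice of m = 20 copies are our illustrative assumptions, not part of the chapter.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Columns: treatment X (0/1), baseline covariate, outcome Y (NaN = missing).
data = np.array([
    [1, 52.0, 14.0],
    [1, 47.0, np.nan],
    [0, 50.0, 18.0],
    [0, 44.0, 19.0],
    [1, 49.0, 12.0],
    [0, 55.0, np.nan],
])

m = 20
estimates, variances = [], []
for seed in range(m):
    # sample_posterior=True adds random draws, preserving variability in imputed values
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    completed = imputer.fit_transform(data)
    X = np.column_stack([np.ones(len(completed)), completed[:, 0], completed[:, 1]])
    y = completed[:, 2]
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    # OLS variance of the treatment coefficient in this completed dataset
    sigma2 = ((y - X @ beta) ** 2).sum() / (len(y) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    estimates.append(beta[1])       # treatment effect estimate
    variances.append(cov[1, 1])     # its squared standard error

# Rubin's rules: pooled estimate, within- and between-imputation variance
q_bar = np.mean(estimates)
w_bar = np.mean(variances)
b = np.var(estimates, ddof=1)
total_var = w_bar + (1 + 1 / m) * b
print(f"pooled treatment effect: {q_bar:.2f} (SE {np.sqrt(total_var):.2f})")
```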

Consider two potential reasons why participants might be missing from a measurement session in a study of health outcomes in a large company. In the first case, each participant's baseline measure of health (e.g. number of days of illness the previous year) is the only variable that systematically predicts whether the participant will be present for the session. In the second case, several of the participants in a division of the company are missing because they are suffering health problems from working day and night on an intensive new project. In the first case, both FIML and MI will produce unbiased estimates, because the source of missingness was measured at baseline and is present in the dataset. In the second case, both FIML and MI will produce biased estimates of the treatment effect unless information about project participation and the current project-related health problems is present in the dataset. Suppose, however, that the researchers had used available substantive theory and research to select an extensive set of baseline variables that were expected to be related to the outcome variables, missingness, or both, while information about project participation and project-related health problems is, once again, not available in the dataset. In this case, the use of FIML or MI will typically lead to estimates of the treatment effect that are less biased, perhaps substantially so, than methods that ignore missing data or that use traditional approaches such as listwise deletion, pairwise deletion, and mean imputation.

4 Stable-Unit-Treatment-Value Assumption. The response of each participant should not be affected by the treatments that other participants receive (or by the participant's knowledge thereof). This condition is known as the stable-unit-treatment-value assumption (SUTVA); its purpose is to ensure that each participant can have only one true response in each treatment condition (see Rubin, 1978, 1980). Otherwise, the outcomes of the participants in the C group are likely to be atypical. For example, if cancer patients learn that other participants have been assigned to a more promising treatment condition, they may give up hope and stop performing their normal health-supportive practices (e.g. proper diet), so that they will have worse outcomes than they would have had in the absence of this knowledge.

Some effects of improving group comparability at baseline

Randomization combined with meeting the four requirements outlined above assures that the estimate of the treatment effect is unbiased. Unfortunately, this only means that the treatment effect will be correct on average; there is no guarantee that unfortunate randomization will not occur in a particular experiment. If the T and C groups can be closely equated at baseline on variables thought to be important predictors of the outcome, then the likelihood of unfortunate randomization can be substantially reduced. Equating procedures thus reduce the potential for an incorrect estimate of the treatment effect in a specific experiment. Equating procedures can also have the benefit of increasing the statistical power of the test, the probability that a true treatment effect of a specified size can be detected. Finally, they may help reduce some of the uncertainty associated with statistical methods of correcting treatment estimates when the four additional requirements are not met.
The use of equating procedures is particularly important when the number of units to be assigned is small, when the units are not homogeneous, or when the treatment effect is not constant but differs in magnitude as a function of the variable(s) on which equating is based. Consider the following example, which captures the importance of equating with a small number of non-homogeneous units. Suppose a randomized experiment is conducted in which the units are six different US cities. Each city either receives an intensive mass media campaign of anti-smoking public service announcements (T) or receives no smoking-related messages in the media (C). The cities chosen for study are from three groups: (a) large cities: Chicago, IL and Los Angeles, CA; (b) medium-sized cities: Baltimore, MD and Portland, OR; and (c) small cities: Terre Haute, IN and San Angelo, TX. Three cities are to be assigned to T and three cities to C. Assume that the size of the city is known to be strongly related to the effectiveness of mass media campaigns in health.

Following Cochran and Cox (1957), when there are equal numbers ($n$) of units in the T and C groups, there are $\binom{2n}{n}$ possible randomizations. In the present example, there are $\binom{6}{3} = \frac{6 \times 5 \times 4 \times 3 \times 2 \times 1}{(3 \times 2 \times 1)(3 \times 2 \times 1)} = 20$ possible randomizations. A randomization that compared Chicago, Baltimore, and Terre Haute to Los Angeles, Portland, and San Angelo would be desirable. In contrast, a randomization that compared Chicago, Los Angeles, and Baltimore to Portland, Terre Haute, and San Angelo would be unfortunate. To avoid this problem, the researcher could match the two large cities, the two medium cities, and the two small cities. Within each matched pair, one city would be randomly assigned to T and one to C, leading to a randomization in which the T and C groups will be more adequately balanced, particularly on the critical baseline variable of city size.

This procedure of pair matching followed by randomization is very general. For example, in a randomized experiment evaluating a new math instruction program, students could be assessed on a baseline measure of math ability that is expected to be highly related to the outcome variable, here math achievement. The students could be ranked based on their scores and pairs formed (the two highest; the next two highest; ... down to the two lowest). Once again, within each pair students would be randomly assigned to the T and C groups, as in the sketch below. This procedure ensures that the T and C groups will be closely equated on the important baseline variable of pretest math ability, preventing any possibility of unfortunate randomization with respect to this critical variable.
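A minimal sketch of this rank-pair-randomize procedure (our illustration; the scores are invented):

```python
import numpy as np

rng = np.random.default_rng(7)

# Pretest math ability scores for an even number of students
pretest = np.array([88, 73, 95, 61, 79, 84, 90, 67])

# Rank students from highest to lowest and form adjacent pairs
order = np.argsort(pretest)[::-1]
pairs = order.reshape(-1, 2)  # each row: two students with adjacent ranks

treatment, control = [], []
for a, b in pairs:
    # Randomly assign one member of each pair to T and the other to C
    if rng.random() < 0.5:
        a, b = b, a
    treatment.append(a)
    control.append(b)

print("T group pretest mean:", pretest[treatment].mean())
print("C group pretest mean:", pretest[control].mean())  # closely equated by design
```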
A second advantage of this procedure is that it can lead to far more statistically powerful tests of the treatment effect.² For example, Student (1931) showed that an early randomized experiment on 10,000 children studying the effects of pasteurized (T) versus raw (C) milk on height and weight gains could have achieved the same level of statistical power with 50 pairs of identical twins. Matching followed by randomization may also lead to a third benefit, providing a stronger foundation for addressing failures to meet adequately the additional requirements of randomized experiments (presented above). For example, the existence of well-matched pairs may provide a stronger basis for modeling the effects of treatment non-compliance and attrition, particularly in experiments in which sample sizes are moderate rather than extremely large and the size of the treatment effect is not constant, but rather depends on the level of the baseline variable (i.e. a baseline x treatment condition interaction). Conceptually, matching followed by randomization may also have other potential advantages in certain contexts, as it implicitly identifies a specific comparison participant with whom each treatment recipient may be compared. For example, many clinicians would ideally like to understand the effects of treatments on single cases rather than the average effect of the treatment on patients in general. The matching and randomization procedure permits a closer approximation of this ideal than simple randomization.

When many measures are collected at baseline, matching becomes more difficult. In some cases the multiple measures can be combined a priori into a single composite variable on which matching can occur. For example, in research related to breast cancer, a set of measures including age at menarche, number of first-degree relatives (mother, sister) with breast cancer, number of previous breast biopsies, and age are combined into a single risk score using a formula based on prior epidemiological research (Gail et al., 1989). Alternatively, measures can be collected on the entire sample prior to randomization. The researcher can generate several thousand different possible randomizations and calculate Hotelling's $T^2$ for each randomization using the key variables measured at baseline. Hotelling's $T^2$ describes the magnitude of the multivariate difference between the groups, here on the baseline variables. The randomizations are sorted from low to high in terms of their values of Hotelling's $T^2$. From the 5% or 10% of the randomizations with the lowest values of Hotelling's $T^2$, a randomization is chosen, thereby minimizing potential problems of unfortunate randomization.
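The rerandomization procedure is equally easy to sketch. In the toy example below (our own assumptions: two baseline variables, 20 units, 5,000 candidate assignments), we keep only assignments whose Hotelling's $T^2$ falls in the lowest 5% and pick one at random:

```python
import numpy as np

rng = np.random.default_rng(11)

n = 20                                  # units, half to T and half to C
baseline = rng.normal(size=(n, 2))      # two key baseline variables


def hotelling_t2(x_t, x_c):
    """Two-sample Hotelling's T^2 for the multivariate mean difference."""
    n_t, n_c = len(x_t), len(x_c)
    diff = x_t.mean(axis=0) - x_c.mean(axis=0)
    s_pooled = ((n_t - 1) * np.cov(x_t, rowvar=False)
                + (n_c - 1) * np.cov(x_c, rowvar=False)) / (n_t + n_c - 2)
    return (n_t * n_c) / (n_t + n_c) * diff @ np.linalg.solve(s_pooled, diff)


candidates = []
for _ in range(5_000):
    assign = rng.permutation(n) < n // 2        # True = T
    t2 = hotelling_t2(baseline[assign], baseline[~assign])
    candidates.append((t2, assign))

candidates.sort(key=lambda pair: pair[0])
best_5_percent = candidates[: len(candidates) // 20]
t2_chosen, assignment = best_5_percent[rng.integers(len(best_5_percent))]
print(f"chosen randomization has T^2 = {t2_chosen:.3f}")
```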

More complicated blocking and randomization procedures to achieve these same goals in other specialized experimental contexts (e.g. trickle-flow randomization, in which participants are recruited over an extended period of time) are described in Friedman et al. (1998) and Matthews (2000).

RESEARCH COMPARING THE RESULTS OF RANDOMIZED EXPERIMENTS AND OBSERVATIONAL STUDIES

As a starting point for studying methods to improve the results of observational studies, it is useful to review the literature comparing the results of randomized experiments with those of observational studies. Properly implemented randomized experiments serve as the gold standard: they typically provide the best, unbiased estimates of the magnitude of the treatment effect. In contrast, the unknown rules through which participants in observational studies are assigned to the T or C conditions lead to far greater uncertainty about the treatment effect estimate. The researcher would like to claim that some aspect of the treatment caused the observed results; however, a failure to successfully equate the groups at the beginning of the study may provide a strong alternative explanation (Reichardt, 2006). Even when adjustments to the treatment effect can be made on the basis of measures collected at baseline, there may be less than complete certainty that the T and C groups have been properly equated.

Even though statistical theory clearly identifies failure to equate the T and C groups on important variables at baseline as an important, plausible problem that may occur in observational studies, it provides little guidance as to the likely frequency of this problem in practice, nor as to the contexts in which estimates of treatment effects are most likely to be biased. To gain some insight into this issue, below we briefly review the literature comparing the results of randomized experiments with observational studies that employed similar treatments. We then turn to an examination of modern statistical and design solutions that attempt to address these issues. Two types of comparisons have been made: (a) single investigations of parallel randomized experiments and observational studies using similar (possibly identical) treatments; and (b) extensive meta-analyses of research areas investigating the effect of a treatment. Of note, exact agreement of the estimates of treatment effects in randomized experiments and observational studies should not be expected: given sampling error, even exact replications of a randomized experiment using the same population would not be expected to produce identical treatment effects. In addition, other differences between the studies representing the two designs may exist. For example, the populations sampled in the two designs, the treatment delivery, the research setting, or other methodological features (e.g. a less adequate control condition constructed in the observational study) may differ in addition to the focal difference of randomized versus non-randomized design (Cook et al., 2006; Reichardt, 2006; West et al., 2006).

Single comparative studies

Studies comparing treatment effect estimates from randomized experiments and observational studies have produced diverse results. A classic example is Meier's (1972) large-scale evaluation of the effectiveness of the Salk polio vaccine in the US. In some states, a randomized experiment was used; in others, an observational study. Even though both designs led to the conclusion that the Salk vaccine was effective, the effect size in the randomized experiment was substantially larger. Gilbert et al.
(1975) suggested that the difference in effect sizes primarily resulted from the different populations on which the polio rates were based in the C conditions. In the randomized experiment, the comparison group included only children who had permission to be vaccinated, in contrast to the observational study, in which the full population was represented.

Cook et al. (2006) reviewed a unique subset of investigations in which a single randomized treatment group was compared with both a randomized control group (randomized experiment) and a second non-randomized comparison group (yoked observational study). Those observational studies that created a high-quality comparison group produced results comparable to those of the yoked randomized experiment. Investigations with a poorly selected comparison group, poor statistical adjustment for baseline differences, or other procedural or design differences between the observational study and the yoked randomized experiment often produced discrepant findings.

Meta-analyses

Across diverse substantive research areas, such as skill training, organizational development, psychotherapy, and medical interventions, meta-analyses have produced heterogeneous outcomes in which randomized experiments have shown larger, smaller, and no differences in treatment effect estimates relative to observational studies. An early influential meta-analytic investigation by Sacks et al. (1983) identified six medical therapies that had been studied using both randomized experiments and observational studies. Sacks et al. concluded that observational studies produced biased results in comparison to randomized controlled trials; attempts to adjust treatment effects in observational studies for available prognostic factors did not remove this bias. More recently, Ioannidis et al. (2001) conducted meta-analyses of 45 medical interventions (e.g. vaccines for meningitis; local vs. general anesthesia) involving a total of 240 randomized trials and 168 observational studies. Overall, there was no consistent pattern of over- or under-estimation of treatment effects by the observational studies relative to the randomized experiments. Significant differences between the randomized experiments and observational studies were found in only a small proportion of the meta-analyses. Ioannidis et al. provided evidence of smaller between-study variance in the randomized experiments than in the observational studies, an important finding suggesting that the effect size estimates of observational studies may be associated with more uncertainty than those of randomized experiments. Reviews of other areas also suggest that the direction of mean bias is by no means certain. Lipsey and Wilson (1993) analyzed 74 meta-analyses of behavioral and educational interventions, finding no difference in the mean effect sizes of randomized experiments and observational studies. Heinsman and Shadish (1996) analyzed four meta-analyses in the areas of drug-use prevention, psychosocial interventions for surgery, coaching for the SAT, and ability grouping in secondary schools. They found a larger effect size for randomized experiments than for observational studies. Taken together, the meta-analytic results suggest that the magnitude of bias resulting from the use of an observational study rather than a randomized experiment is typically not large and that its direction is uncertain. They also suggest that area-specific choices of samples and methodological features (e.g. type of comparison group) may be important determinants of any bias that is observed.

Methodological features

Heinsman and Shadish (1996) coded methodological features that might potentially account for the observed difference in effect sizes between randomized experiments and observational studies in four behavioral science research areas (e.g. SAT coaching, drug-use prevention). Of importance, they found in a regression analysis that not allowing self-selection into T vs.
C conditions in observational studies, using a control group from the same population as the treatment group, minimizing the baseline effect size difference between the T and C groups, and minimizing both overall attrition and differential attrition made the treatment effect estimates more comparable in the two designs. Shadish and Ragsdale (1996) found similar results in a meta-analysis of randomized experiments and observational studies of marital or family psychotherapy.

Consistent with these findings, Heckman and Robb (1986) also point to conceptual and statistical reasons why allowing participants to self-select into T and C groups is particularly likely to lead to biased estimates. These results suggest that it may be possible to improve estimates of treatment effects in observational studies through the careful use of design and analysis strategies.

Adjustment strategies for equating groups at baseline

Matching

Matching is used in observational studies to identify a set of participants in the T and C groups who are comparable. To illustrate, consider two small school classrooms, labeled A and B, one of which implements an innovative new math curriculum, whereas the other implements a standard 6th-grade math curriculum. Table 24.1 illustrates the basic process of simple 1:1 matching. All students in both classrooms are given an IQ test at the beginning of the school year. For each student in classroom A, an attempt is made to identify a student in classroom B who is closely equated on IQ. This matching process diminishes the mean difference in baseline IQ between the two groups in our example from $M_A - M_B = 5$ in the full unmatched sample to $M_A - M_B = 0.5$ in the reduced, matched sample.

Table 24.1 Illustration of simple matching of two small classrooms on baseline IQ scores

    Pair    Classroom A    Classroom B
    --          130             --
    1           125            124
    2           120            120
    3           119            119
    4           119            118
    5           117            116
    6           115            115
    7           109            109
    8           107            107
    9           107            106
    10          104            102
    11          101            101
    12           96             96
    --           --             90
    --           --             89

Note: Scores were ordered within classrooms and represent pretest IQ scores of participants. Scores on the same line represent matched pairs. One person in Classroom A and two persons in Classroom B have no matched pair. The mean IQ score for all participants in Classroom A is 113; the mean IQ score for all participants in Classroom B is 108. The mean difference ($\bar{Y}_A - \bar{Y}_B$) for the full unmatched sample is 5. The mean for the matched pairs is 111.6 in Classroom A and 111.1 in Classroom B, yielding a mean difference of 0.5. $n_A = 13$ and $n_B = 14$ for the full sample.

A variety of computer algorithms are available that match T and C participants so as to produce the minimum discrepancy on the pretest variable (see Ming & Rosenbaum, 2001; Rosenbaum, 2002). These computer algorithms are particularly useful when the T and C groups are large, are of dramatically different sizes, or both. For example, observational studies of initial trials of innovative programs (T) may involve a relatively small number of participants, whereas a substantially larger number of participants are in the standard program (C) that serves as the comparison. In such cases, the algorithm will select a variable number of optimal matches (e.g. up to 5) for each participant.³ These variable matching procedures lead to more adequate equating of the groups on the matching variable and to greater statistical power for the T vs. C comparison, given the larger sample size (Ming & Rosenbaum, 2000).
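The following sketch implements the simple 1:1 nearest-neighbor matching idea behind Table 24.1. It uses a greedy match within a caliper; the caliper value and the greedy strategy are our illustrative choices, not the optimal algorithms of Ming and Rosenbaum.

```python
import numpy as np

iq_a = np.array([130, 125, 120, 119, 119, 117, 115, 109, 107, 107, 104, 101, 96])
iq_b = np.array([124, 120, 119, 118, 116, 115, 109, 107, 106, 102, 101, 96, 90, 89])


def greedy_match(scores_t, scores_c, caliper=2):
    """Pair each T unit with the closest unused C unit within the caliper."""
    unused = list(range(len(scores_c)))
    pairs = []
    for i, score in enumerate(scores_t):
        if not unused:
            break
        j = min(unused, key=lambda k: abs(scores_c[k] - score))
        if abs(scores_c[j] - score) <= caliper:
            pairs.append((i, j))
            unused.remove(j)
    return pairs


pairs = greedy_match(iq_a, iq_b)
matched_a = iq_a[[i for i, _ in pairs]]
matched_b = iq_b[[j for _, j in pairs]]
print(f"unmatched difference: {iq_a.mean() - iq_b.mean():.1f}")    # 5.0
print(f"matched difference:   {matched_a.mean() - matched_b.mean():.1f}")  # 0.5
```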
Researchers are encouraged to measure many variables at baseline, particularly those that may be related to treatment group assignment or to the outcome variable. Substantive theory and prior research can provide guidance in the selection of a set of measures that will capture as fully as possible potential baseline differences between the T and C groups. However, the availability of a large number of baseline variables makes matching far more complex. In rare cases, a composite variable can be created (e.g. the Gail score for breast cancer risk described earlier). More commonly, propensity scores are used. Propensity scores provide an estimate of the probability that a participant will be assigned to the treatment group (Rosenbaum, 2002; Rosenbaum & Rubin, 1983, 1984; Rubin, 1997; Shadish et al., 2006; Smith, 1997).

The researcher uses all baseline variables (or a subset containing the most important ones, if this number is very large) and predicts the probability that the participant will be in the T group. This predicted probability is known as the propensity score. There are two major issues in the creation of propensity scores. The first is to make sure that subject matter expertise, in the form of prior research and theory, has been used to select baseline measures that will capture as fully as possible important baseline differences between the T and C groups. The second is to choose a statistical model that adequately represents the form of the relationship between the baseline variables and each participant's propensity score. Rosenbaum and Rubin (1983) used simple linear logistic regression to produce these estimates. Dehejia and Wahba (1999) used more complicated logistic regression models involving the specification of interactions and curvilinear effects of baseline variables. McCaffrey et al. (2004) used automated stepwise nonparametric regression tree methods to model possible complex relationships between the baseline variables and the propensity score. In each case the goal is to achieve T and C groups that are balanced on all important baseline variables and for which the error of prediction in the sample has been minimized (Shadish et al., 2006). As an important check on the success of this procedure, the data are divided into five strata based on the propensity score and the balance of the baseline variables within each stratum is compared. When balance is achieved, there is a strong basis for comparing the groups. If balance is not achieved within one (or more) of the strata, the comparison of the treatment and control groups is carried out only over those strata on which balance has been achieved.

Each participant's propensity score may then be taken as the best summary of the baseline information, and the propensity score is used as the basis for equating the groups. The groups may be equated using the standard 1-to-1 or variable many-to-1 matching procedures described above. Alternatively, analysis of covariance or blocking on the strata may be used (but see footnote 3). As an illustration of the matching strategy, Wu et al. (2006) constructed propensity scores for retention in first grade from a large set of baseline variables measured early in the school year. In the full sample ($n = 769$) of children at risk for grade retention, there were large differences between students on the Woodcock-Johnson reading score at baseline. Students who were later retained in first grade had substantially lower scores than students who were later promoted to second grade, $\bar{Y}_{\text{baseline, retained}} = 420$ versus $\bar{Y}_{\text{baseline, promoted}} = 438$. Optimal 1-to-1 matching on propensity scores yielded 97 matched pairs with $\bar{Y}_\text{baseline} = 422.4$ for the retained students and $\bar{Y}_\text{baseline} = 423.4$ for the promoted students. Similar reductions in baseline differences were achieved for other variables measured at baseline. Theoretically, propensity scores will provide a proper adjustment for the unknown assignment rule if all important baseline variables have been included and the form of the propensity model has been correctly specified. Matching has substantial strengths: it does not require specification of the form of the relationship between the baseline and outcome variables; it clearly delimits the range of the baseline variables over which T and C can be appropriately compared; and it leads to efficient estimates of the treatment effect because of the small number of parameter estimates involved.
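A minimal sketch of the simple logistic-regression route to propensity scores, with the five-stratum balance check described above (simulated data and statsmodels; all specifics are our illustrative assumptions, not the chapter's):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 1_000

# Simulated baseline covariates; assignment to T depends on them (self-selection)
covariates = rng.normal(size=(n, 3))
treated = rng.random(n) < 1 / (1 + np.exp(-(covariates @ [0.8, -0.5, 0.3])))

# Estimate propensity scores with a simple linear logistic regression
logit = sm.Logit(treated.astype(float), sm.add_constant(covariates)).fit(disp=0)
pscore = logit.predict(sm.add_constant(covariates))

# Divide into five strata on the propensity score and check covariate balance
strata = np.digitize(pscore, np.quantile(pscore, [0.2, 0.4, 0.6, 0.8]))
for s in range(5):
    in_stratum = strata == s
    t, c = in_stratum & treated, in_stratum & ~treated
    if t.sum() and c.sum():
        gap = np.abs(covariates[t].mean(axis=0) - covariates[c].mean(axis=0)).max()
        print(f"stratum {s}: largest baseline mean difference = {gap:.2f}")
```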
Hypothesized treatment group x baseline level interactions can also be examined within the matched propensity score framework. There are two primary limitations of the matched propensity score framework. First, it does not adjust the treatment effect for measurement error in the baseline variables, giving rise to potential regression-to-the-mean effects if very reliable and stable measures of important baseline variables are not available. Second, it does not adjust for other important variables (hidden variables) that are not measured at pretest, again emphasizing the importance of selecting the full range of potential baseline variables based on subject matter expertise.

Statistical adjustment strategies based on measured baseline differences

A variety of statistical models may be developed that attempt to adjust for baseline differences in measured variables. Perhaps the simplest is analysis of covariance (ANCOVA; Huitema, 1980; Reichardt, 1979), which is used to provide an adjustment of the treatment effect for one or more baseline variables. Typically, a simple linear model is used, $\hat{Y} = b_0 + b_1 \text{COV} + b_2 X$, where $Y$ is the outcome variable, COV is the covariate measured at baseline, and $X$ is the binary treatment indicator. This model can be extended to include multiple covariates, other parametric relationships (e.g. the addition of a $b_3 \text{COV}^2$ term to represent a quadratic relationship between COV and $Y$), and treatment x covariate interactions (Cohen et al., 2003; Huitema, 1980; Reichardt, 1979). Nonparametric methods can be used to model more complex relationships between the covariates and $Y$ (see Little et al., 2000). The primary limitation of ANCOVA methods is that their success in equating the T and C groups depends heavily on the correct specification of the adjustment model. For example, if the relationship between COV and $Y$ is nonlinear and a simple linear ANCOVA model is used, the treatment effect estimate will be biased.
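A compact sketch of the basic ANCOVA adjustment, $\hat{Y} = b_0 + b_1\text{COV} + b_2 X$, fit by ordinary least squares on simulated data (ours; the data-generating values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200

cov = rng.normal(50, 10, n)                           # baseline covariate
x = (rng.random(n) < 0.5).astype(float)               # binary treatment indicator
y = 2.0 + 0.6 * cov + 3.0 * x + rng.normal(0, 4, n)   # true treatment effect = 3

# Design matrix [1, COV, X]; b2 is the covariate-adjusted treatment effect
design = np.column_stack([np.ones(n), cov, x])
b, *_ = np.linalg.lstsq(design, y, rcond=None)
print(f"adjusted treatment effect b2 = {b[2]:.2f}")   # close to 3

# Naive unadjusted difference in means for comparison
print(f"unadjusted difference = {y[x == 1].mean() - y[x == 0].mean():.2f}")
```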
The basic ANCOVA approach shares with matching the limitation that baseline variables may be measured with less than perfect reliability. This problem is most serious when the T and C groups are selected from different populations, so that regression to the mean will occur (see Campbell & Kenny, 1999; Shadish et al., 2002). Even if the statistical adjustment model is otherwise correctly specified, measurement error will typically lead to under-adjustment of the treatment effect for baseline differences. Huitema (1980) provides an introduction to, and Fuller (1987) a more advanced treatment of, methods for correcting for measurement error in the context of ANCOVA. Alternatively, when multiple indicators are available for each important construct measured at pretest, structural equation models can be used to provide measurement-error-free estimates of the treatment effect. Aiken et al. (1994) provide a good discussion of this approach and apply it to the evaluation of a drug treatment program. One limitation of the structural equation modeling approach is that the models to date have specified a linear relationship between the baseline measures and the outcome. Lee et al. (2004), Marsh et al. (2004), and Wall and Amemiya (2007) describe extensions of structural equation models that may account for curvilinear and interactive effects.

Correction for measurement error can also be desirable when treatment participants are selected on the basis of a variable that is unstable over time. For example, if T participants are selected based on high scores on a measure of depression (or because they are seeking treatment during a severe depressive episode), it is likely that some of the participants are in a temporary state of high depression and would return to their typical level of depression in the absence of any treatment, simply given the passage of time. Reliability correction methods that adjust the estimate of the treatment effect for the test-retest reliability over the time interval between the baseline and outcome measures in the absence of treatment can improve the estimate of the treatment effect. If repeated measures are collected on multiple indicators of the outcome variable at baseline and at multiple other time points, special structural equation models can be used that partition the variance at each time point into state (temporary) and trait (true score) components (cf. Khoo et al., 2006; Steyer et al., 1992).

Adjusting for unmeasured baseline differences (hidden variables)

The matching and statistical adjustment strategies described above can provide appropriate correction of the estimate of the treatment effect for variables measured at baseline. However, it is also possible that variables that are not measured at baseline could account for all or part of the estimated treatment effect. Three general strategies exist for addressing this problem. First, a variety of methods have been proposed for conducting sensitivity analyses of treatment effect estimates (Marcus, 1997; McCaffrey et al., 2004; Rosenbaum, 2002).

As an illustration of one simple method, imagine a researcher has found a 0.8 standard deviation difference (a large effect size) between the T and C groups on the outcome variable. The researcher would then identify the largest standardized difference between the T and C groups on the set of variables measured at baseline. Suppose the largest baseline difference were $d = 0.5$ standard deviations. The researcher then identifies the maximum correlation between any of the baseline measures and the posttest measure of the outcome of interest. Suppose the maximum correlation were $r = 0.6$. The product of these two quantities,

$$\text{adjustment} = \frac{\bar{Y}_{\text{baseline},T} - \bar{Y}_{\text{baseline},C}}{SD} \times r_{\text{baseline,outcome}},$$

here $\text{adjustment} = 0.5 \times 0.6 = 0.3$, provides a rough estimate of the maximum extent to which this estimate of the standardized treatment effect would need to be reduced under what is a worst-case scenario for an important hidden variable. If the standardized treatment effect were reduced by this amount, to $0.8 - 0.3 = 0.5$ in our example, we would have a plausible estimate of its lower bound. If this value were still statistically significant, it would provide evidence that the treatment effect is robust. Note that there is no theoretical reason why the actual adjustment required for hidden variables could not exceed this value. In practice, however, if a number of variables are measured at baseline and they can be presumed to be representative of important hidden variables, this adjustment will nearly always overestimate the adjustment needed.
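This worst-case bound is simple arithmetic; a sketch using the chapter's numbers:

```python
def sensitivity_lower_bound(effect_d, max_baseline_d, max_r_baseline_outcome):
    """Rough lower bound on a standardized treatment effect for a worst-case
    hidden variable: subtract (largest baseline d) x (largest baseline-outcome r)."""
    adjustment = max_baseline_d * max_r_baseline_outcome
    return effect_d - adjustment


# Chapter example: observed d = 0.8, largest baseline d = 0.5, largest r = 0.6
print(sensitivity_lower_bound(0.8, 0.5, 0.6))  # 0.5 -> effect looks robust
```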
Econometric approaches (e.g. Barnow et al., 1980; Heckman, 1979, 1989, 1990; Muthén & Jöreskog, 1983) have been proposed that adjust for the effects of both measured and unmeasured variables at baseline. Two separate equations are used in these models. The first (the selection model) uses measured baseline variables to predict the assignment of the participant to the treatment or control group. The second uses this selection probability, an indicator variable (T = 1; C = 0) for treatment condition, and potentially other covariates to estimate the outcome. A key feature of this approach is the requirement of an instrumental variable,⁴ a variable that strongly predicts treatment assignment in the first equation but has no separate relationship to the outcome (see Figure 24.1). In essence, the instrument can be thought of as a naturally occurring randomization (Heckman, 1996). The instrumental variable can affect the outcome only indirectly, through its effect on treatment assignment, an assumption known as the exclusion restriction. If the assumptions of this approach are met, the treatment effect estimate will include proper adjustment for both measured and unmeasured baseline variables.

[Figure 24.1 Illustration of the econometric selection bias model: the instrumental variable points to the treatment indicator, which points to the outcome via the path $B_{OT}$; Residual 1 attaches to the treatment indicator and Residual 2 to the outcome. Note: The instrumental variable directly affects only the treatment indicator (T = 1; C = 0); this condition is known as the exclusion restriction. Residual 1 is the error of prediction of the treatment indicator, including error produced by hidden variables. The hidden variables may also be associated with the residual of the outcome (Residual 2). If the model is correctly specified, an adjustment of the regression coefficient $B_{OT}$ will yield an unbiased estimate of the treatment effect controlling for the hidden variables. If the assumptions of the model are violated (notably the exclusion restriction), the estimate of the treatment effect may be severely biased.]

However, in practice, this method is extremely sensitive to violations of its underlying assumptions, particularly the exclusion restriction (Heckman, 1997; Stolzenberg & Relles, 1990; Winship & Mare, 1992). When the assumptions are violated, the treatment effect estimates of econometric models can be far more biased than those based on simpler approaches like ANCOVA or matching. In addition, even if the assumptions of the approach are met, the standard errors of the estimate of the treatment effect can be extremely large if the instrument is not very strongly related to treatment assignment. Finally, the econometric approach assumes that the treatment effect is constant across all participants.
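The full selection-model machinery is beyond a short sketch, but the core instrumental variable logic can be illustrated with two-stage least squares, a simpler estimator in the same family. In our simulated example (all specifics invented), a hidden confounder biases OLS, while an instrument that satisfies the exclusion restriction recovers the true effect:

```python
import numpy as np

rng = np.random.default_rng(13)
n = 5_000

hidden = rng.normal(size=n)                       # unmeasured confounder
z = rng.normal(size=n)                            # instrument: affects T only
t = (0.9 * z + hidden + rng.normal(size=n)) > 0   # treatment assignment
y = 2.0 * t + 1.5 * hidden + rng.normal(size=n)   # true treatment effect = 2

X = np.column_stack([np.ones(n), t.astype(float)])

# Naive OLS: biased upward because `hidden` drives both t and y
b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Two-stage least squares: stage 1 predicts t from z; stage 2 regresses y
# on the predicted (exogenous) part of t
Z = np.column_stack([np.ones(n), z])
t_hat = Z @ np.linalg.lstsq(Z, t.astype(float), rcond=None)[0]
X2 = np.column_stack([np.ones(n), t_hat])
b_2sls, *_ = np.linalg.lstsq(X2, y, rcond=None)

print(f"OLS estimate:  {b_ols[1]:.2f}  (biased)")
print(f"2SLS estimate: {b_2sls[1]:.2f} (close to 2)")
```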

A third approach, suggested by Manski (1994), Manski and Nagin (1998), and Manski and Pepper (2000), explores the effects of making weaker assumptions about instrumental variables in econometric selection models. This approach results in the estimation of a plausible range of values for the treatment effect within upper and lower bounds. However, in some cases the bounds may be very wide, so that little information is conveyed about the size of the treatment effect.

Adjusting for growth

A final issue occurs when participants show different rates of natural growth (e.g. young children in math skills) or decline (e.g. Alzheimer's patients in memory) on the outcome variable of interest. With observations taken only at baseline, no measure of the natural growth rate in the absence of treatment is available for the participants. Change score analysis (Judd & Kenny, 1981) can be used to estimate the treatment effect. Participants are assessed on the same measure at baseline and at outcome. These baseline and outcome measures are then transformed so that their variances are equated (see Huitema, 1980). The mean change in the T group is then compared with the mean change in the C group to provide an estimate of the treatment effect. This approach adequately models special situations in which growth is occurring at a constant rate across all participants or is of the fan-spread variety, in which growth occurs at a rate proportional to the participant's baseline score (e.g. those advantaged at baseline gain more). Treatment effects for other forms of growth are not well represented by this approach. More adequate modeling of growth requires the collection of additional data at multiple time points, ideally both before and after the treatment (Shadish et al., 2002; West et al., 2000). If sufficient additional time points are collected, the natural pattern of growth prior to treatment can be estimated; this pattern can then be compared to the pattern of growth following the introduction of treatment in the T group. Singer and Willett (2002) describe multilevel modeling methods that estimate the treatment effect while allowing for differences between participants in growth rates.

Design enhancements

In line with the topic of this chapter, we have focused on methods of equating the T and C groups at baseline. However, we would be remiss if we did not remind readers of an important alternative strategy emphasized by Shadish and Cook (1999) and Shadish et al. (2002). This strategy involves adding design features that address specific threats to validity that arise in observational studies. Shadish and Cook (1999) argue that the use of design enhancements will often be preferable to the use of statistical adjustment strategies. We present three methods of enhancing the design of the basic observational study here (see Shadish & Cook, 1999, for an extensive list).
Multiple control groups

When a treatment and a control group are selected in an observational study, they will be similar at baseline in some respects and different in others. This feature gives rise to the possibility that some hidden variable may account for the result. If multiple control groups can be identified and the estimates of the treatment effect are similar when the different control groups are used, the researcher's confidence that the treatment effect is not biased is increased. For example, using a large database, Roos et al. (1978) compared children receiving tonsillectomies (T) with two different comparison groups: (a) children having a matched history of