
Running head: USING REPLICATION TO INFORM DECISIONS ABOUT SCALE-UP

Using Replication to Help Inform Decisions about Scale-up: Three Quasi-experiments on a Middle School Unit on Motion and Forces

Bill Watson
Curtis Pyke
Sharon Lynch
Rob Ochsendorf

The George Washington University

This work was conducted by SCALE-uP, a collaboration between George Washington University and Montgomery County Public Schools (MD); Sharon Lynch, Joel Kuipers, Curtis Pyke, and Bonnie Hansen-Grafton, principal investigators. Funding for SCALE-uP was provided by the National Science Foundation, the U.S. Department of Education, and the National Institutes of Health (REC-0228447). Any opinions, findings, conclusions, or recommendations are those of the authors and do not necessarily reflect the position or policy of the funding agencies, and no endorsement should be inferred.

Using Replication to Help Inform Decisions about Scale-up: Three Quasi-experiments on a Middle School Unit on Motion and Forces

Research programs that include experiments are becoming increasingly important in science education as a means through which to develop a sound and convincing empirical basis for understanding the effects of interventions and making evidence-based decisions about their scale-up in diverse settings. True experiments, which are characterized by the random assignment of members of a population to a treatment or a control group, are considered the gold standard in education research because they reduce the differences between groups to only random variation and the presence (or absence) of the treatment (Subotnik & Walberg, 2006). For researchers, these conditions increase the likelihood that two samples drawn from the same population are comparable to each other and to the population, thereby increasing confidence in causal inferences about effectiveness (Cook & Campbell, 1979). For practitioners, those making decisions about curriculum and instruction in schools, the Institute of Education Sciences at the U.S. Department of Education (USDOE) suggests that only studies with randomization be considered strong evidence or possible evidence of an intervention's effectiveness (Institute of Education Sciences, 2006).

Quasi-experiments are also a practical and valid means for evaluating interventions when a true experiment is impractical due to the presence of natural groups, such as classes and schools, within which students are clustered (Subotnik & Walberg, 2006). In these circumstances, a quasi-experiment that includes careful sampling (e.g., random selection of schools), a priori assignment of matched pairs to a treatment or control group, and/or a pretest used to control for any remaining group differences can often come close to providing the rigor of a true experiment (Subotnik & Walberg, 2006). However, there are inherent threats to internal validity in quasi-experimental designs that the researcher must take care to address with supplemental data. Systematic variation introduced through the clustering of subjects that occurs in quasi-experiments can compete with the intervention studied as a cause of the differences observed.

Replications of quasi-experiments can provide opportunities to adjust procedures to address some threats to the internal validity of quasi-experiments, and they can study new samples to address external validity concerns. Replications can take many forms and serve a multitude of purposes (e.g., Hendrick, 1990; Kline, 2003). Intuitively, a thoughtful choice of replication of a quasi-experimental design can produce new and improved results or increase the confidence researchers have in the presence of a treatment effect found in an initial study. Therefore, replication can be important in establishing the effectiveness of an intervention when it fosters a sense of robustness in results or enhances the generalizability of findings from stand-alone studies (Cohen, 1994; Robinson & Levin, 1997). This paper presents data to show the utility of combining a high-quality quasi-experimental design with multiple replications in school-based scale-up research.
Scale-up research is research charged with producing evidence to inform scale-up decisions: decisions regarding which innovations can be expected to be effective for all students in a range of school contexts and settings; "what works best, for whom, and under what conditions" (Brown, McDonald, & Schneider, 2006, p. 1). Scaling up, by definition, is the introduction of interventions whose efficacy has been established in one context into new settings, with the goal of producing similarly positive impacts in larger, frequently more diverse, populations (Brown et al., 2006).

Our work shows that a good first step in scaling up an intervention is a series of experiments or quasi-experiments at small scale.

Replication in Educational Research

Quasi-experiments are often the most practical research design for an educational field study, including scale-up studies used to evaluate whether or not an intervention is worth taking to scale. However, because they are not true experiments and therefore do not achieve true randomization, the possibility of systematic error is always present, and, with it, the risk of threats to the internal and external validity of the study. For the purposes of this discussion, we consider internal validity to be "the validity with which statements can be made about whether there is a causal relationship from one variable to another in the form in which the variables were manipulated or measured" (Cook & Campbell, 1979, p. 38). External validity refers to the approximate validity with which conclusions are drawn about the generalizability of a causal relationship to and across populations of persons, settings, and times (Cook & Campbell, 1979). Unlike replications of experimental designs, which almost always add to the evidentiary weight of a sound result, the replication of a quasi-experiment may not have inherent value if the potential threats to validity found in the initial study are not addressed.

Replication: Frameworks

In social science research, replication has traditionally been understood to be a process in which different researchers repeat a study's methods independently, with different subjects, in different sites, and at different times, with the goal of achieving the same results and increasing the generalizability of findings (Meline & Paradiso, 2003; Thompson, 1996). However, the process of replication in social science research in field settings is considerably more nuanced than this definition might suggest. In field settings, both the intervention and the experimental procedures can be influenced by the local context and sample in ways that change the nature of the intervention, the experiment, or both from one experiment to another. Before conducting a replication, an astute researcher must therefore ask: In what context, with what kinds of subjects, and by which researchers will the replication be conducted? (Rosenthal, 1990). The purpose of the replication must also be considered: Is the researcher interested in making adjustments to the study procedures or intervention to increase the internal validity of findings, or will the sampling be adjusted to enhance the external validity of initial results? A broader view of replication of field-based quasi-experiments might enable classification of different types according to the multiple purposes for replication when conducting research in schools.

Hendrick (1990) proposed four kinds of replication that take into account the procedural variables associated with a study and contextual variables (e.g., subject characteristics, physical setting). Hendrick's taxonomy proposes that an exact replication adheres as closely as possible to the original variables and processes in order to replicate results. A partial replication varies some aspects of either the contextual or procedural variables, and a conceptual replication radically departs from one or more of the procedural variables.
Hendrick argued for a fourth type of replication, systematic replication, which includes first a strict replication and then either a partial or conceptual replication to isolate the original effect and explore the intervention when new variables are considered. Rosenthal (1990) referred to such a succession of replications as a replication battery: "The simplest form of replication battery requires two replications of the original study: one of these replications is as similar as we can make it to the original study, the other is at least moderately dissimilar to the original study" (p. 6).

Rosenthal (1990) argued that if the same results were obtained with similar but not exact quasi-experimental procedures, internal validity would be increased because differences between groups could more likely be attributed to the intervention of interest and not to experimental procedures. Further, even if one of the replications is of poorer quality than the others, Rosenthal argued for its consideration in determining the overall effect of the intervention, albeit with less weight than more rigorous (presumably internally valid) replications. More recently, Kline (2003) also distinguished among several types of replication according to the different research purposes they address. For example, Kline's operational replications are like Hendrick's (1990) exact replications: the sampling and experimental methods of the original study are repeated to test whether results can be duplicated. Balanced replications are akin to partial and conceptual replications in that they appear to address the limitations of quasi-experiments by manipulating additional variables to rule out competing explanations for results. In a recent call for replication of studies in educational research, Schneider (2004) also suggested a degree of flexibility in replication, describing the process as "conducting an investigation repeatedly with comparable subjects and conditions" (p. 1473) while also suggesting that it might include making "controllable changes" to an intervention as part of its replication. Schneider's (2004) notion of controllable changes, Kline's (2003) description of balanced replication, Hendrick's (1990) systematic replication, and Rosenthal's (1990) argument in favor of the replication battery all suggest that a series of replications taken together can provide important information about an intervention's effectiveness beyond a single quasi-experiment.

Replication: Addressing Threats to Internal Validity

When multiple quasi-experiments (i.e., replications) are conducted with adjustments, the threats to internal validity inherent in quasi-experimentation might be more fully addressed (Cook & Campbell, 1979). Although changing quasi-experiments in the process of replicating them might decrease confidence in the external validity of an initial study finding, when a replication battery is considered, a set of studies might provide externally valid data to contribute to decision making within and beyond a particular school district. The particular threats to internal validity germane to the studies reported in this paper are those associated with the untreated control group design with pretest and posttest (Cook & Campbell, 1979). This classic and widely implemented quasi-experimental design features an observation of participants in two non-randomly assigned groups before and after one of the groups receives treatment with an intervention of interest. The internal validity of a study or set of studies ultimately depends on the confidence the researcher has that differences between groups are caused by the intervention of interest (Cook & Campbell, 1979). Cook and Campbell (1979) provided considerable detail about threats to internal validity in quasi-experimentation that could reduce confidence in claims of causality (pp. 37-94).
However, they concluded that the untreated control group design with pretest and posttest usually controls for all but four threats to internal validity: selection-maturation, instrumentation, differential regression to the mean, and local history. Table 1 briefly describes each of these threats. In addition, these threats are not mutually exclusive. In a study of the effectiveness of curriculum materials, for example, the extent to which researchers are confident that differential regression to the mean is not a threat relies upon their confidence that sampling methods have produced two samples similar on performance and demographic variables (selection-maturation) and that the assessment instrument has similar characteristics for all subjects (instrumentation).

Cook and Campbell (1979) suggest that replication plays a role in establishing external validity by presenting the simplest case: an exact replication (Hendrick, 1990) of a quasi-experiment in which results are corroborated and confidence in internal validity is high. However, we argue that the relationship between replication and validity is more complex, given the multiple combinations of outcomes that are possible when different kinds of replications are conducted. Two dimensions of replication seem particularly important. The first is the consistency of results across replications. The second is whether a replication addresses internal validity threats that were not addressed in a previous study (i.e., it improves upon the study) or informs the interpretation of the presence or absence of threats in a prior study (i.e., it enhances interpretation of the study).

In an exact replication, results can either be the same as or different from results in the original quasi-experiment. If results are different, it seems reasonable to suggest that some element of the local history (perhaps schools, teachers, or a cohort of students) could have an effect on the outcomes, in addition to (or instead of) the effect of an intervention. A partial replication therefore seems warranted to adjust the quasi-experimental procedures to address the threats. A partial replication would also be appropriate if the results are the same, but the researchers do not have confidence that threats to internal validity have been adequately addressed. Indeed, conducting partial replications in either of these scenarios is consistent with the recommendation of Hendrick (1990) to consider results from a set of replications when attempting to determine the effectiveness of an intervention.

Addressing threats to validity with partial replication is, in turn, not a straightforward process. What if results of a partial replication of a quasi-experiment are not the same as those found in either the original quasi-experiment or its exact replication? If the partial replication addresses a threat to internal validity where the original quasi-experiment or its exact replication did not, then the partial replication improves upon the study, and its results might be considered the most robust. If threats to internal validity are still not adequately addressed in the partial replication, the researcher must explore relationships between all combinations of the quasi-experiments. Alternatively, if the partial replication provides data that help to address threats to the internal validity of the original quasi-experiment or its exact replication, then the partial replication enhances interpretation of the original study, and its results might be considered together with the results of the previous study.

Figure 1 provides a possible decision tree for researchers faced with data from a quasi-experiment and an exact replication. Because multiple replications of quasi-experiments in educational research are rare, Figure 1 is more an exercise in logic than a decision matrix supported by data produced in a series of actual replication batteries. However, the procedures and results described in this paper provide data generated from a series of quasi-experiments with practical consequences for the scale-up of a set of curriculum materials in a large, suburban school district.
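
Figure 1 itself is not reproduced here, but the branching logic just described can be made concrete. The sketch below (Python; a rough encoding of our reading of the text rather than of the figure itself, with illustrative names and return values) captures the two dimensions discussed above: consistency of results across replications and whether threats to internal validity have been addressed.

    def replication_decision(results_consistent, threats_addressed):
        """Rough encoding of the replication decision logic described in
        the text; not a reproduction of Figure 1."""
        if results_consistent and threats_addressed:
            return "results corroborated; confidence in internal validity is high"
        if results_consistent:
            # Same results, but threats remain: a partial replication is
            # appropriate to address them.
            return "conduct a partial replication to address remaining threats"
        # Different results suggest local history (schools, teachers, or a
        # student cohort) acting in addition to, or instead of, the
        # intervention; adjust procedures in a partial replication.
        return "conduct a partial replication with adjusted procedures"
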
We hope to support the logic of Figure 1 by applying it to the example to which we now turn.

Replication in Practice: The SCALE-uP Studies

The Scaling-up Curriculum for Achievement, Learning, and Equity Project (SCALE-uP) conducted quasi-experiments with replication to study the effectiveness of three sets of middle school science curriculum materials as the first step in studying how they scale up in a large suburban school district.

The SCALE-uP research program called for the selection of three sets of curriculum materials with a favorable rating according to the instructional analysis of the American Association for the Advancement of Science (AAAS) Project 2061 curriculum analysis protocol (Kesidou & Roseman, 2002). First, the original research questions for each quasi-experiment sought to determine the effectiveness of the materials:

1) Does use of curriculum materials that meet a majority of criteria in the AAAS Project 2061 instructional analysis produce higher mean scores on a test of concept understanding when compared to a district's regular curriculum offerings?

2) Does disaggregating the outcome data reveal differences in achievement for different subgroups of students not observed in the reports on aggregate mean scores?

Next, replications were conducted to verify the initial findings and provide evidence of the reliability of the procedures. In retrospect, the intent was closest to an operational (Kline, 2003) or an exact (Hendrick, 1990) replication. Finally, the evidence obtained from the quasi-experiments and their replications was to be used to inform a decision to scale up each unit within the school district.

In the first quasi-experiment, SCALE-uP found that the first unit studied, Chemistry That Applies (State of Michigan, 1993), was effective for all students when compared to curriculum materials used in the control group. Replication with a new cohort of students in the same schools yielded similar results, providing support for a decision to scale up the unit to all schools in the district (Lynch, Kuipers, Pyke, & Szesze, 2005). Data collected during the initial study and replication of the second unit, The Real Reasons for the Seasons (Lawrence Hall of Science, 2000), suggested that the unit was not as effective as the materials used in the control group for any subgroup of students (Pyke, Lynch, Kuipers, Szesze, & Watson, 2005a). The second unit was therefore not recommended for scale-up in the district. The results for the study of the third unit, Exploring Motion and Forces: Speed, Acceleration, and Friction (Harvard-Smithsonian Center for Astrophysics, 2001), were not as straightforward. They motivated an investigation of the internal validity of the studies and the role that replication can play in identifying suitable data for informing decisions about scale-up in a field setting.

SCALE-uP Studies of Motion and Forces

Motion and Forces (Harvard-Smithsonian Center for Astrophysics, 2001) is a six-week physical science curriculum unit designed for use with students in fifth through eighth grades that received an acceptable rating according to the SCALE-uP application of the Project 2061 curriculum analysis (Ochsendorf et al., 2001). Its 18 Explorations are inquiry-centered and activity-based, with an emphasis on students' direct experience with phenomena. The curriculum materials consist of a Teacher Manual and a student Science Journal, but no traditional student textbook. Materials required to conduct the Explorations (e.g., sliding disks, ramps, rolling carts) are often constructed and assembled by the students. The unit's target concepts are closely associated with the following target concept from Benchmarks for Science Literacy (AAAS, 1993): "Changes in speed or direction of motion are caused by forces. The greater the force is, the greater the change in motion will be."
"The more massive an object is, the less effect a given force will have" (4F, 3-5, #1; p. 89). The unit also addresses the related ideas that an object at rest stays that way unless acted on by a force, and that an object in motion will continue to move unabated unless acted on by a force (p. 90).

The first SCALE-uP quasi-experiment conducted on Motion and Forces suggested that the unit was effective for some subgroups of students but not for others. The replication of the quasi-experiment confirmed this unusual result. Because scale-up (and SCALE-uP) is concerned with the effectiveness of the intervention for all students, neither the researchers nor the school district administrators were comfortable interpreting the data as positive evidence for scaling up the unit. Given the investment that the school district had made in the materials and the potential for the unit to be effective, a balanced, or partial, replication (Hendrick, 1990; Kline, 2003) of the study (this time, with greater attention to potential threats to internal validity) seemed warranted. By conducting the second replication, SCALE-uP was in a unique position to identify and address threats to validity across all three quasi-experiments and to consider data from the set of quasi-experiments in making a decision about the scale-up of the unit.

General Design, Population, and Sampling for the Motion & Forces Studies

Each study of the effectiveness of Motion & Forces employed an untreated control group design with pretest and posttest (Cook & Campbell, 1979), intended to test for differences in outcomes between equivalent groups. Three quasi-experiments were conducted in three consecutive school years. Quasi-experiment 1 and Quasi-experiment 2 (an exact replication) were conducted in the same set of schools randomly selected from the sampling frame (see below), while Quasi-experiment 3 (a partial replication) was conducted in a different set of schools, also randomly selected from the sampling frame.

Population and Sample

The population under investigation was sixth-grade students in Montgomery County Public Schools (MCPS). MCPS is a large Maryland school district (approximately 136,000 students total, 32,000 in grades 6-8) located in the Washington, DC, metropolitan area, with a student population that is rapidly becoming more diverse culturally, linguistically, and socioeconomically. MCPS consistently occupies a position among the top-performing school districts in the State of Maryland.

The study sample for each quasi-experiment consisted of sixth-grade students from MCPS middle schools. Schools were used as the sampling unit, with schools randomly selected from a sampling frame consisting of five School Profile Categories (SPCs). The SPCs were developed by SCALE-uP researchers and MCPS administrators to identify five groups of schools, each containing schools that fit a similar demographic and achievement profile. Inclusion in an SPC was determined by a proxy for socioeconomic status, the percentage of students in attendance eligible for the Free and Reduced Price Meals System (FARMS), and math and reading achievement data from fifth-grade nationally norm-referenced tests. Two schools were randomly selected from each SPC. One school from each pair was randomly selected as a treatment school; the other was assigned to the comparison condition. This sampling method was developed to produce two samples representative of the study population, with enough students to provide power for significance tests on data for subgroups disaggregated by FARMS, ethnicity, and eligibility for English for Speakers of Other Languages (ESOL) and special education services. (The selection-and-assignment procedure is sketched below.)
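
As a rough illustration of the selection and assignment procedure just described, the sketch below (Python) draws two schools at random from each SPC and randomly assigns one school of each pair to the treatment condition. The sampling frame, school identifiers, and seed are hypothetical placeholders, not MCPS data.

    import random

    # Hypothetical sampling frame: five School Profile Categories (SPCs),
    # each listing schools with a similar demographic/achievement profile.
    sampling_frame = {
        "SPC1": ["school_01", "school_02", "school_03", "school_04"],
        "SPC2": ["school_05", "school_06", "school_07"],
        "SPC3": ["school_08", "school_09", "school_10"],
        "SPC4": ["school_11", "school_12", "school_13"],
        "SPC5": ["school_14", "school_15", "school_16"],
    }

    def draw_sample(frame, seed=1):
        """Select two schools per SPC, then randomly assign one of each
        pair to treatment and the other to comparison."""
        rng = random.Random(seed)
        treatment, comparison = [], []
        for spc in sorted(frame):
            pair = rng.sample(frame[spc], 2)  # two schools from this SPC
            rng.shuffle(pair)                 # random assignment within pair
            treatment.append(pair[0])
            comparison.append(pair[1])
        return treatment, comparison

    treatment_schools, comparison_schools = draw_sample(sampling_frame)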

The Motion and Forces Assessment

Student understanding of the target idea was determined by a score on the Motion and Forces Assessment (MFA), a curriculum-independent posttest given by teachers at the end of instruction. An analysis and development procedure created in collaboration with AAAS Project 2061 (DeBoer, 2005; Stern & Ahlgren, 2002) was used to develop the assessment. The MFA consists of 10 items (6 constructed response and 4 selected response) that present students with four different physical phenomena about which they answer questions about motion and force. Raters coded student responses according to a rating guide that categorized responses according to their alignment with a scientifically appropriate understanding of the target benchmark (average inter-rater reliability = 0.82). A weighting scheme that distributes the contribution of students' ideas about different parts of the benchmark and balances the contribution of selected-response and constructed-response items and the difficulty and discrimination of each item was applied to raw scores to calculate a scale score for student understanding. A standard-setting process (Plake & Hambleton, 2001; Pyke & Hanson, 2005) established cut scores that distinguish among four levels of understanding of the target ideas: 0-20 = no understanding; 21-50 = context-limited understanding; 51-70 = some fluency with ideas; 71-100 = flexible understanding.

General SCALE-uP Data Analysis Procedures

The statistical analyses to test the research hypotheses used Analysis of Variance (ANOVA) and Analysis of Covariance (ANCOVA) techniques, with pretest scores as the covariate. Data were analyzed for overall differences in the mean posttest MFA score between the treatment and comparison conditions. They were also disaggregated according to gender, FARMS, ethnicity, and eligibility for ESOL and special education services. Assumptions for these analyses were generally acceptable (see Pyke, Lynch, Kuipers, Szesze, & Watson, 2004, 2005b, 2006, for complete analyses) according to the guidelines established in the ANOVA/ANCOVA literature (cf. Tabachnick & Fidell, 1989). For simplicity in presenting results in this paper, only data disaggregated by Free and Reduced-price Meals Status (FARMS) subgroups are considered. (Never FARMS represents the subgroup that had never been eligible for services; Prior FARMS, the subgroup that was previously but not currently eligible for services; and Now FARMS, the subgroup eligible for services at the time of the study.) In addition to the ANOVA/ANCOVA analyses, when significant main effects and interactions were found, exploratory follow-up analyses for simple main effects were conducted to explain the effects. Finally, effect size was calculated for each subgroup for all dependent variables by subtracting the adjusted comparison mean from the adjusted treatment mean and dividing the difference by the study sample standard deviation. All analyses were performed using SPSS for Windows, Version 12.
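
To make the cut scores and the effect-size computation described above concrete, a minimal sketch follows (Python; the function names are ours). Applied to the Quasi-experiment 1 aggregate results reported below (adjusted means of 56.98 and 54.72, with group standard deviations near 22.4), effect_size returns a value near the reported .10.

    def understanding_level(scale_score):
        """Map an MFA scale score (0-100) to the four standard-set levels."""
        if scale_score <= 20:
            return "no understanding"
        if scale_score <= 50:
            return "context-limited understanding"
        if scale_score <= 70:
            return "some fluency with ideas"
        return "flexible understanding"

    def effect_size(adj_treatment_mean, adj_comparison_mean, sample_sd):
        """Adjusted treatment mean minus adjusted comparison mean, divided
        by the study sample standard deviation."""
        return (adj_treatment_mean - adj_comparison_mean) / sample_sd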

Quasi-experiment 1

Demographics of the Quasi-experiment 1 Sample

The study sample for Quasi-experiment 1 was selected according to the procedure described above. Comparison of the students in the treatment and comparison conditions suggested that the random selection of schools from five SPCs and the random assignment of schools to the treatment condition resulted in two samples of students probabilistically similar in demographic and prior performance variables (see Tables 1 and 2).

Pretest Procedures

Copies of the MFA were shipped to schools by the MCPS program evaluation team. The MFA was administered to students in the treatment and comparison conditions by classroom teachers, on or before the date that instruction with the target benchmark began, according to instructions provided by SCALE-uP. All assessments were collected at each school by the coordinator and sent to the program evaluation team through the MCPS mail delivery system, where they were picked up by SCALE-uP researchers. Pretests were rated by trained science graduate students and MCPS eighth-grade teachers.

Intervention Procedures: Instructional Attributes

Materials used to teach Motion & Forces were distributed to teachers in the treatment condition from the MCPS central materials center before the fourth quarter of the 2002-2003 school year. Teachers in the treatment condition were instructed to teach Motion & Forces according to a set of fidelity guidelines developed by SCALE-uP in conjunction with MCPS administrators and teachers (O'Donnell, Lynch, Watson, & Rethinam, 2007). Students in the treatment condition received photocopies of pages of the Student Journal for each lesson. Teachers in the comparison condition chose curriculum materials from a list of materials considered acceptable in MCPS to teach the target benchmark, as their usual practice dictated.

Posttest Procedures

Several weeks before the end of the quarter, copies of the MFA were shipped to schools by the MCPS program evaluation team. The MFA was administered to students in the treatment and comparison conditions by classroom teachers, on or immediately after the date that instruction ended, according to instructions provided by SCALE-uP. All assessments were collected at each school by the coordinator and sent to the program evaluation team through the MCPS mail delivery system, where they were picked up by SCALE-uP researchers. The same raters who rated the pretests rated the posttests.

Results

Pretest. A 1 x 2 between-groups Analysis of Variance (ANOVA) indicated no statistically significant difference in pretest score between the treatment and comparison conditions.

Posttest. A 1 x 2 between-groups Analysis of Covariance (ANCOVA) indicated a statistically significant main effect in favor of the treatment condition, F(1, 2169) = 6.44, p < .05, Cohen's d = .10. The adjusted mean scores for the treatment condition (M = 56.98, SD = 22.17) and the comparison condition (M = 54.72, SD = 22.64) were both within the same level of understanding, some fluency with ideas. A 2 x 3 between-groups ANCOVA indicated a statistically significant interaction between curriculum condition and FARMS, F(2, 2165) = 8.094, p < .05. Follow-up tests were conducted to determine the nature of the interaction. These tests revealed that differences between the Never FARMS subgroup and the Prior FARMS and Now FARMS subgroups were the same in both conditions, but that only the Never FARMS subgroup mean was significantly higher in the treatment condition than in the comparison condition. There was no statistically significant difference in the means for the other two subgroups between the treatment and comparison conditions (see Figure 2).
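
The analyses above were run in SPSS, but the general modeling approach can be sketched in Python with statsmodels. Everything below is illustrative: the data are synthetic stand-ins, and the column names (posttest, pretest, condition, farms) are our assumptions about layout, not the actual SCALE-uP data set.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    # Synthetic stand-in data: one row per student.
    rng = np.random.default_rng(0)
    n = 400
    df = pd.DataFrame({
        "condition": rng.choice(["treatment", "comparison"], n),
        "farms": rng.choice(["Never", "Prior", "Now"], n),
        "pretest": rng.normal(35, 15, n).clip(0, 100),
    })
    df["posttest"] = (df["pretest"] * 0.6 + rng.normal(20, 10, n)).clip(0, 100)

    # 1 x 2 ANCOVA: posttest by condition, pretest as the covariate.
    aggregate = smf.ols("posttest ~ C(condition) + pretest", data=df).fit()
    print(sm.stats.anova_lm(aggregate, typ=2))

    # 2 x 3 ANCOVA: condition x FARMS interaction, pretest as the covariate.
    interaction = smf.ols(
        "posttest ~ C(condition) * C(farms) + pretest", data=df
    ).fit()
    print(sm.stats.anova_lm(interaction, typ=2))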

Potential Threats to Internal Validity

Selection-Maturation. Table 1 indicates that the treatment and comparison groups were similar in terms of their standardized test scores prior to the study. However, the average prior science grade point average (GPA) during the current school year was significantly higher in the comparison condition, perhaps suggesting a difference in prior knowledge. Pretest scores indicated no significant differences between students with similar levels of FARMS eligibility.

Instrumentation. Reliability data and the skewness of score distributions suggest a possible instrumentation threat. The MFA was shown to have acceptable validity (Pyke & Ochsendorf, 2006), but only modest reliability (Cronbach's alpha = .54). Although in the aggregate the tests of assumptions for parametric statistics were not violated, examination of pretest scores indicated that pretest scores for the Now FARMS subgroup in both conditions were negatively skewed (i.e., skew statistics were more than two times the standard error of skewness; cf. Tabachnick & Fidell, 1989). The skewness of the pretest scores for this subgroup, combined with the low mean score (M = 34.19, SD = 17.19), suggests that the MFA did not reliably detect differences in performance for students with the lowest scores. Because the distributions were similarly skewed, comparisons between the Now FARMS subgroups do not appear to be jeopardized, but pretest main effects for FARMS are more difficult to interpret. Further, the use of skewed pretest score distributions as covariates in the ANCOVA raises concerns about the interpretation of adjusted posttest scores. Scores for the Now FARMS subgroup in both conditions could be underadjusted, leading to more conservative estimates of posttest scores.

Differential Regression to the Mean. The skewed pretest and posttest distributions for the Now FARMS subgroups in the treatment and comparison conditions resulted in an unexpected proportion of scores from this subgroup below 20, the lowest cut-point on the MFA scale. Practically, differences between scores below 20 are not distinguishable, so MFA scores are likely to underestimate student understanding and to include more error at pretest than at posttest for the Now FARMS subgroup. This subgroup is likely to have experienced gain due to statistical regression to the mean rather than (or in addition to) the treatment unit. However, this effect is more relevant to within-group comparisons between levels of FARMS than to differential regression to the mean, because the floor effect is present in both conditions.
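
The skewness rule of thumb used above (a skew statistic more than two times its standard error; cf. Tabachnick & Fidell, 1989) can be written as a short check. The sqrt(6/n) approximation for the standard error of skewness is our assumption; the original analyses may have used the exact small-sample formula reported by SPSS.

    from scipy.stats import skew

    def flagged_as_skewed(scores):
        """Flag a distribution when |skew| exceeds twice its standard
        error, using the common sqrt(6/n) approximation for that error."""
        n = len(scores)
        se_skewness = (6.0 / n) ** 0.5
        return abs(skew(scores)) > 2 * se_skewness
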
Local History. There are four potential local history concerns in Quasi-experiment 1 that could jeopardize the validity of the data, three of which concern differences in the implementation of Motion & Forces among the treatment schools. First, little information beyond retrospective reflection was available about the curriculum materials used in the comparison condition and the fidelity with which Motion & Forces was implemented. (Fidelity of implementation is used in this context to mean whether or not the intervention was implemented in accordance with the original program design. For a full discussion of fidelity of implementation in the Motion & Forces studies, please see O'Donnell et al., 2007.) Therefore, it is unknown whether or not there was any diffusion of the intervention or results for students in the treatment condition due to the implementation of the unit.

Second, according to MCPS staff and teachers who implemented the unit, the entire unit was completed in only 5 of the 54 classrooms in which it was implemented; most classrooms completed between 9 and 13 of the unit's 18 Explorations. MCPS personnel attributed this variation to a combination of two factors: the unit took teachers longer than the suggested 6-8 weeks to implement, and, because it was taught in the fourth quarter of the school year, many teachers were unable to finish it before the end of the year. This situation introduces a threat to the fidelity with which the unit was taught, in that it could not have been implemented with full fidelity if some of it was not implemented at all.

It also introduces a potential difference between the instruction provided in the treatment and comparison conditions, with students in the comparison condition potentially receiving broader instruction toward mastery of the target benchmark.

Third, students in the treatment condition did not receive individual, bound, published student journals as suggested by the developer of the unit. This is a broader local history threat in that neither the district nor SCALE-uP had the resources to purchase student journals for all students. It was recommended that teachers make copies of individual pages for students to use.

The fourth threat is a potential threat in classrooms and schools within and across conditions. The pretest was distributed to all students by their classroom teacher. Pretest conditions were not standardized, suggesting that perhaps some students had an opportunity to learn from the pretest. In addition, teachers had access to the pretest, suggesting that they could have gleaned information about the target benchmark and its assessment that could have affected teaching. Cook and Campbell (1979) suggested that diffusion of treatment could occur when teachers have the opportunity to talk with each other about interventions or are made aware of the salient features of the intervention. We consider a possible pretest effect to be a contamination effect related to diffusion.

Quasi-experiment 2: Exact Replication

Demographics of the Quasi-experiment 2 Sample

The study sample for Quasi-experiment 2 was selected according to the procedure described above. For Quasi-experiment 2, retrospectively considered an exact replication of Quasi-experiment 1, the study sample consisted of students from the same treatment and comparison schools. The two samples of students were probabilistically similar in prior performance variables (see Tables 1 and 2).

Changes to Procedures

There were no differences in pretest procedures between Quasi-experiment 1 and Quasi-experiment 2. Despite the local history threat due to teachers potentially changing instruction according to the pretest or students learning from the pretest, SCALE-uP did not have the resources to alter pretest procedures for Quasi-experiment 2. The implementation procedures changed in that materials used to teach Motion & Forces were distributed to teachers in the treatment condition from the MCPS central materials center before the third quarter of the 2003-2004 school year. The change from fourth-quarter to third-quarter implementation was made to address concerns about the unit not being finished in most treatment classrooms. Other instructional attributes in Quasi-experiment 2 were the same as those in Quasi-experiment 1, and posttest procedures for Quasi-experiment 2 were the same as those for Quasi-experiment 1.

Results

Pretest. A 1 x 2 between-groups Analysis of Variance (ANOVA) indicated a statistically significant difference in pretest score between the treatment and comparison conditions. Follow-up analyses suggested that the aggregate mean difference at pretest might be attributed to significantly greater pretest scores for the Now FARMS subgroup in the treatment condition than in the comparison condition, F(1, 567) = 6.270, p < .05.

Posttest. A 1 x 2 between-groups Analysis of Covariance (ANCOVA) indicated no statistically significant main effect for quasi-experimental condition, F(1, 2251) = 2.546, p = .11. The adjusted mean scores for the treatment condition (M = 51.18, SD = 22.54) and the comparison condition (M = 52.53, SD = 21.68) were both within the same level of understanding, some fluency with ideas. A 2 x 3 between-groups ANCOVA indicated a statistically significant interaction between curriculum condition and FARMS, F(2, 2247) = 10.313, p < .05. Follow-up tests were conducted to determine the nature of the interaction. These tests revealed that differences between the Never FARMS subgroup and the Prior FARMS and Now FARMS subgroups were the same in both conditions. There was no significant posttest difference in mean scores for the Never FARMS subgroup, but the Prior FARMS and Now FARMS subgroup means in the comparison condition were greater than those in the treatment condition (see Figure 3). The 95% confidence intervals for the means for the Prior FARMS and Now FARMS subgroups in the comparison condition and for the Prior FARMS subgroup in the treatment condition suggest that the group means could be in either the second or third level of understanding.

Potential Threats to Internal Validity

Selection-Maturation. There was a statistically significant difference in prior science GPA in Quasi-experiment 2, just as there had been in Quasi-experiment 1. Further, the pretest difference observed suggested another potential selection threat. Finally, additional analysis of the pretest difference on the MFA indicated that scores for the Now FARMS subgroup were greater in the treatment condition than in the comparison condition. This suggests a possible selection-maturation effect for Now FARMS students in the treatment condition that would be only partially mitigated by adjusting for pretest scores in the ANCOVA analysis.

Instrumentation. Reliability data and the skewness of score distributions suggest a possible instrumentation threat in Quasi-experiment 2 similar to the threat described for Quasi-experiment 1. Additionally, there was a possible differential posttest floor effect because the Now FARMS subgroup distribution was positively skewed in the treatment condition but not the comparison condition. Although the MFA was shown to have acceptable validity (Pyke & Ochsendorf, 2006), its reliability in this sample was, again, modest (Cronbach's alpha = .60).
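
The reliability figures reported for the MFA (.54 and .60 so far, and .52 below) are Cronbach's alpha values. For reference, a generic computation over a students-by-items score matrix looks like the following sketch; this is the textbook formula, not the study's own code.

    import numpy as np

    def cronbach_alpha(scores):
        """Cronbach's alpha for a students-by-items score matrix
        (e.g., one column per MFA item)."""
        scores = np.asarray(scores, dtype=float)
        k = scores.shape[1]  # number of items
        item_variances = scores.var(axis=0, ddof=1).sum()
        total_variance = scores.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - item_variances / total_variance)
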
Differential Regression to the Mean. The implications of the possible floor effects for regression to the mean were similar to those described for Quasi-experiment 1. The effect was again considered to be more relevant to within-group comparisons between levels of FARMS than to differential regression to the mean, because the floor effect was present in both conditions.

Local History. Data collected during Quasi-experiment 2 do suggest that some local history threats present in Quasi-experiment 1 were addressed. First, teachers in the comparison condition were interviewed to determine the curriculum materials they used. The interviews suggested that the curriculum materials were sufficiently different from the treatment materials to attenuate concerns about intervention diffusion (Lynch et al., 2006). Second, because Motion & Forces was implemented during the third quarter, all teachers reported having completed all 18 of the Explorations (Lynch et al., 2006). Third, conducting an exact replication in the same schools reduced the threat that the results of Quasi-experiment 1 were due to the particular cohort of students in sixth grade. However, there could be effects of individual schools or teachers that affect the results. Such effects would not be distinguishable in an exact replication conducted in the same schools.

Quasi-experiment 3: Partial Replication

Demographics of the Quasi-experiment 3 Sample

Quasi-experiment 3 was conducted as a partial replication of Quasi-experiments 1 and 2 in order to address the limitations of the latter by manipulating additional variables to rule out competing explanations for results (Hendrick, 1990; Kline, 2003). A new sample of schools, randomly selected from within the sampling frame and randomly assigned to the treatment or comparison condition, was selected for Quasi-experiment 3. Comparison of the students in the treatment and comparison conditions suggested that the sampling procedure resulted in two samples of students probabilistically similar in demographic and prior performance variables (see Tables 1 and 2).

Changes to Procedures

To address potential local history threats to internal validity due to potential student learning from the pretest, a study-within-a-study was conducted to empirically identify pretest effects (Ochsendorf & Pyke, 2007). The pretest for Quasi-experiment 3 was distributed to a randomly selected sub-sample of students representative of the population (n = 295). The pretests were prepared by SCALE-uP researchers and sent to the MCPS program evaluation team. To address teacher learning from the pretest, representatives of the evaluation team distributed the pretests to the sub-sample of students according to a standardized procedure. All assessments were returned to the MCPS program evaluation team office by the representatives, where they were picked up by SCALE-uP researchers. Posttest procedures for Quasi-experiment 3 were the same as those for Quasi-experiments 1 and 2.

To address the potential local history threats of intervention diffusion or lack of implementation of the intervention, fidelity of implementation data were collected by observing one full lesson taught in a sample of classrooms (n = 60) during implementation of instruction in the target benchmark. Trained raters collected data on teachers' and students' adherence to the structure of the curriculum materials (in the treatment condition) and use of instructional strategies consistent with the treatment unit (in the treatment and comparison conditions) (O'Donnell et al., 2007). In addition, all students in the treatment condition were provided with a bound, printed Student Journal, as recommended by the developers of Motion and Forces.

Results

Pretest. A 1 x 2 between-groups Analysis of Variance (ANOVA) indicated no statistically significant difference in pretest score between the sub-samples of students pretested in the treatment and comparison conditions.

Posttest. A 1 x 2 between-groups Analysis of Variance (ANOVA) indicated a statistically significant main effect in favor of the treatment condition, F(1, 1759) = 24.26, p < .05, Cohen's d = .23. The mean scores for the treatment condition (M = 56.09, SD = 22.47) and the comparison condition (M = 50.85, SD = 22.18) were both within the third level of understanding, some fluency with ideas. However, the 95% confidence interval around the mean for the comparison condition suggested that the true mean could be in either the second or third level of understanding. ANOVA indicated a statistically significant interaction between curriculum condition and FARMS, F(2, 1755) = 3.80, p < .05. Follow-up tests conducted to determine simple main effects for the interaction indicated that the interaction could be explained by a lack of a statistically significant difference between the treatment and comparison conditions for the Prior FARMS subgroup. The main effect for curriculum condition was consistent for the Never FARMS and Now FARMS subgroups (see Figure 4).
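
The interpretation above turns on whether a group mean's 95% confidence interval straddles the cut between context-limited understanding (21-50) and some fluency with ideas (51-70). A simple normal-approximation sketch follows; it is illustrative only, since the study's exact interval computation is not reported here.

    import math

    def ci95(mean, sd, n):
        """Approximate 95% confidence interval for a group mean under a
        normal approximation (illustrative, not the study's computation)."""
        half_width = 1.96 * sd / math.sqrt(n)
        return mean - half_width, mean + half_width

    # A comparison-condition mean of 50.85 with an interval reaching below
    # 50 and above 51 would sit in either the second or third level of
    # understanding, as described above.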

Potential Threats to Internal Validity

Selection-Maturation. Tables 1 and 2 indicate that the treatment and comparison groups were similar demographically and in terms of their standardized test scores and prior GPA before the study. In the aggregate, pretest scores also indicated no significant differences between students with similar demographics, and a significant difference between the scores from only one matched pair of schools.

Instrumentation. The MFA again showed only modest reliability for this sample (Cronbach's alpha = .52). In the aggregate, the tests of assumptions for parametric statistics were not violated, but examination of posttest scores indicated that posttest scores for the Now FARMS subgroup in the comparison condition were positively skewed. The skewness of the posttest scores for the Now FARMS subgroup makes comparisons across conditions and within the treatment condition more difficult to interpret.

Differential Regression to the Mean. Differential regression to the mean cannot be determined for Quasi-experiment 3 because pretest scores are unknown for the majority of the sample.

Local History. The potential local history threat of teachers learning from the pretest was addressed by administering the pretest to a limited sample of students. Further, comparison of results from the sub-sample of pretested students suggested that the threat of students learning from the pretest was not present in Quasi-experiment 3. Local history threats of diffusion of the intervention or low fidelity to the intervention in treatment classrooms were addressed with data collected during observations of treatment and comparison classrooms and during interviews with teachers. These data indicated that Motion & Forces was implemented with sufficient fidelity to the structure of the unit (e.g., activities, procedures, sequence) to justify a conclusion that the treatment was implemented, and that Motion & Forces was not used in comparison classrooms, thereby reducing the threat of diffusion. Finally, all students in the treatment condition received and used journals as recommended by the Motion & Forces curriculum materials (Harvard-Smithsonian Center for Astrophysics, 2001).

The collection of fidelity data provides evidence that threats due to differential implementation in different locations are somewhat alleviated when considering the data from Quasi-experiment 3. On the other hand, the presence of observers could have had a different effect on different teachers and classrooms in different schools. Differences in outcomes between Quasi-experiments 1 and 2 and Quasi-experiment 3, particularly for the Prior FARMS and Now FARMS subgroups, suggest the possible threat that factors at the school level not detected by our sampling procedures could have influenced outcomes. However, demographic similarities at the student level, at which data were analyzed, partially alleviate this concern.

Analysis: Considering the Results of Three Quasi-experiments

The analyses of data in Quasi-experiments 1, 2, and 3 paint different pictures of the effectiveness of the Motion and Forces curriculum unit.
The data from Quasi-experiments 1 and 2 suggest that the unit is either effective or does no harm for students from the highest socioeconomic level (i.e., Never FARMS) but could actually be worse than the standard fare for students of lower socioeconomic levels.

The data from Quasi-experiment 3, however, suggest that the unit is more effective than the materials used in the comparison condition in the aggregate and for the Never FARMS and Now FARMS subgroups, while it is no better and no worse for the Prior FARMS subgroup. Faced with these data, it seems important to analyze the threats to internal validity to suggest which of the data are most valid and therefore suitable for making decisions about scaling up the unit.

Addressing Threats to Internal Validity through Replication

Table 3 summarizes the threats to internal validity present in each of the quasi-experiments on the effectiveness of Motion & Forces conducted by SCALE-uP when aggregate data are considered. Table 4 summarizes the same threats when data disaggregated by FARMS status are considered. If data indicated that a threat did not exist, a "no" was placed in the appropriate cell of the table. When data suggested that a threat did exist, a qualitative description of the threat level as low, moderate, or high, agreed upon through discussion among the researchers, was assigned to the appropriate cell. The reader should remember that in some cases a threat in one area interacts with a threat posed in another. For example, the level of threat posed by differential regression to the mean between experimental conditions depends in part on a pretest difference between groups.

The first three categories of threats (selection-maturation, instrumentation, and differential regression to the mean) are generally independent across quasi-experiments. That is, the existence of differences between groups, floor or ceiling effects, and the subsequent differential regression to the mean are affected primarily by the specific sample studied in a given quasi-experiment. Overall, selection-maturation threats are generally low, and instrumentation and differential regression to the mean threats appear to be low to moderate. The contrast between the description of threats for the aggregate data and those for the disaggregated data is noteworthy. When data are not disaggregated, the threat levels for floor and ceiling effects and for differential regression to the mean are low. The potential threats to validity when data are disaggregated, however, suggest that perhaps the instrument is not as effective for the Now FARMS subgroup (i.e., the lowest SES subgroup) and that gain registered for this subgroup might instead be interpreted as regression to the mean. In any case, the threats to the validity of the data when data are disaggregated underscore the importance of disaggregation when making decisions about data that inform the scale-up of an intervention to all students in a population.

Potential threats due to local history are not as consistent across quasi-experiments, although they are consistent across aggregate and disaggregated data. Threats are lower in the partial replication for all but one of the local history variables. This is not surprising, given the purpose of the partial replication: to address the limitations of quasi-experiments by manipulating additional variables to rule out competing explanations for results (Kline, 2003).
In Quasi-experiment 3, any reduction in confidence in the data caused by not administering a pretest to all students (and therefore raising potential concerns about selection-maturation) appears to be addressed not only by the results from the pretested sub-sample, but also by the added confidence resulting from the elimination of the threat of a pretest effect for students or teachers. Further, we might conclude that the absence of a pretest effect on students in the study-within-a-study conducted in Quasi-experiment 3 (Ochsendorf & Pyke, 2007) suggests that it was not present in Quasi-experiments 1 and 2, thereby reducing the potential threat in those quasi-experiments. While not eliminating the threat, we are inclined to conclude that the threat of a