Improved Randomization Tests for a Class of Single-Case Intervention Designs


Journal of Modern Applied Statistical Methods, Volume 13, Issue 2, Article 2.

Improved Randomization Tests for a Class of Single-Case Intervention Designs

Joel R. Levin, University of Arizona, jrlevin@u.arizona.edu
John M. Ferron, University of South Florida, ferron@usf.edu
Boris S. Gafurov, George Mason University, bgafurov@gmu.edu

Recommended Citation: Levin, Joel R.; Ferron, John M.; and Gafurov, Boris S. (2014). "Improved Randomization Tests for a Class of Single-Case Intervention Designs," Journal of Modern Applied Statistical Methods: Vol. 13, Iss. 2, Article 2. DOI: /jmasm/

Erratum

In the original published version of this article, Panels B and C of Figure 15, p. 38, were reversed, and references to Figure 11 on p. 32 should have referred to Figure 12. This has been corrected.

Journal of Modern Applied Statistical Methods, November 2014, Vol. 13, No. 2. Copyright 2014 JMASM, Inc.

Invited Article:
Improved Randomization Tests for a Class of Single-Case Intervention Designs

Joel R. Levin, University of Arizona, Tucson, AZ
John M. Ferron, University of South Florida, Tampa, FL
Boris S. Gafurov, George Mason University, Fairfax, VA

Forty years ago, Eugene Edgington developed a single-case AB intervention design-and-analysis procedure based on a random determination of the point at which the B phase would start. In the present simulation studies encompassing a variety of AB-type contexts, it is demonstrated that by also randomizing the order in which the A and B phases are administered, a researcher can markedly increase the procedure's statistical power.

Keywords: Single-case intervention research, design and statistical analysis, randomization tests, statistical power, internal validity, scientific credibility

Joel R. Levin is a Professor Emeritus in the Department of Educational Psychology. Email him at jrlevin@u.arizona.edu. John M. Ferron is a Professor in the Department of Educational and Psychological Studies in the College of Education. Email him at ferron@usf.edu. Boris S. Gafurov is an Assistant Professor in the College of Education and Human Development. Email him at bgafurov@gmu.edu.

Erratum: In the original published version of this article, Panels B and C of Figure 15 were reversed. This error is now corrected.

Introduction

Single-case designs that focus on behavioral and academic interventions are prevalent in a variety of clinical and educational fields (see, for example, Kratochwill & Levin, 2014). In contrast to conventional group intervention designs, single-case designs typically include only one or a few units (e.g., individuals, small groups, classrooms) to whom the intervention is administered. In addition, single-case intervention designs are intensive and implemented over longer periods of time, with more numerous assessments of the outcome measures (Horner & Odom, 2014; Kratochwill et al., 2010). Single-case intervention designs that currently incorporate formal criteria to enhance their scientific

credibility (Levin, 1994) include ABAB designs, alternating treatment designs, and multiple-baseline designs (Kratochwill et al., 2013). As the methodological rigor of single-case intervention designs has evolved over the years (Kratochwill & Levin, 2010), so too have the formal statistical-analysis procedures that accompany them (see, for example, Kratochwill & Levin, 2014; and Manolov, Evans, Gast, & Perdices, 2014). Although various visual/graphical approaches remain an analytic staple of single-case data (e.g., Auerbach & Zeitlin, 2014; Kratochwill, Levin, Horner, & Swoboda, 2014; Parker, Vannest, & Davis, 2014), improved statistical methods have increasingly been considered as viable supplements to visual analysis. These improved statistical methods include econometric time-series analyses (e.g., McCleary & Welsh, 1992), adapted regression- and hierarchical linear modeling procedures (e.g., Maggin et al., 2011; Manolov & Solanas, 2013; Moeyaert, Ferron, Beretvas, & Van den Noortgate, 2014; Shadish, Kyse, & Rindskopf, 2013), and nonparametric permutation and randomization tests (e.g., Edgington & Onghena, 2007; Ferron & Levin, 2014; Heyvaert & Onghena, 2014). The last of these statistical approaches is the focus of the present study.

Overview of the Present Study

The motivation for single-case researchers to adopt a randomization test as one component of their analytic armament is that randomization tests provide strict control of the Type I error rate (i.e., the probability of concluding that phase-to-phase differences in level, trend, variability, etc. are present when those differences are simply chance fluctuations) as long as: (1) the design includes randomization; (2) the accompanying statistical test is conducted in a manner that is consistent with the design frame; and (3) the test statistic is chosen without knowledge of the results (Edgington, 1980; Ferron & Levin, 2014). In contrast, demonstration of Type I error control has been elusive in studies of visual analysis (e.g., Ferron & Jones, 2006; Fisch, 2001; Stocks & Williams, 1995). Moreover, with regression and hierarchical models, Type I error control hinges on a relatively strong set of assumptions (Ferron, Moeyaert, Van den Noortgate, & Beretvas, 2014). The modeling assumptions include: (1) the error distribution is correctly specified (e.g., normally distributed, homogeneous variances across phases, and a first-order autoregressive function); (2) the baseline trajectory is correctly specified; (3) the baseline trajectory can be extrapolated (i.e., had the intervention not been implemented, the baseline trajectory would have continued, implying that there were no confounding effects of external events on the time

series); and (4) the treatment phase trajectory is correctly specified. Accordingly, a single-case researcher may plan a multicomponent analysis in which visual analysis serves as the primary analysis tool, a randomization test is employed to ensure that the Type I error rate is controlled, and a regression-based or hierarchical linear model is examined to summarize and estimate the size of the effect(s).

A concern with the addition of randomization tests to the analytic plan is that such tests require the researcher to introduce randomization into the design, and if the randomization is not carefully planned it can lead to a design that falls short of single-case design standards (e.g., Ferron & Levin, 2014; Kazdin, 1980; Kratochwill et al., 2010). As a consequence, researchers are encouraged to reflect carefully on the practical constraints of the context in which the study is conducted, on the desired design features (e.g., minimum phase lengths), and then tailor the randomization strategy to meet these constraints. Restricted randomization schemes have been developed to ensure that: (1) the desired number of phases and minimum phase lengths are included in reversal designs (Onghena, 1992); (2) the treatment alternates quickly enough in an alternating treatment design (Onghena & Edgington, 1994); (3) the baseline series stabilizes prior to commencement of the intervention phase (Ferron & Ware, 1994); (4) the intervention start points are staggered by a minimum amount of time in multiple-baseline designs (Koehler & Levin, 1998); and (5) researchers are able to obtain visually acceptable patterns by extending phases in multiple-baseline designs (Ferron & Jones, 2006) and reversal designs (Ferron & Levin, 2014).

The present Monte Carlo simulation study employs nonparametric randomization tests in the company of a recently proposed methodological addition that greatly enhances the internal validity of AB and ABAB single-case intervention designs (Ferron & Levin, 2014; Levin, Evmenova, & Gafurov, 2014). In these designs, A typically represents a baseline, control, or standard treatment phase containing repeated outcome measurements and B represents an intervention, experimental, or new treatment phase also containing repeated outcome measurements. Here we examine the methodological addition's effect on the statistical conclusion validity (manifested by both Type I error control and increased statistical power) of randomization tests in single-case AB and ABAB designs, in both their single-case (N = 1) and multiple-case (N > 1) forms. In the following section, we first describe the methodological addition that enhances the internal validity (scientific credibility) of single-case intervention research and then outline how the addition is incorporated into a randomization test to improve the test's statistical conclusion validity. Our decision to start our investigations

with a single-participant (N = 1) AB design was not because we are advocating for the use of such a design, but because it provides the simplest point to begin study of the impact of the methodological addition. Once we have established the effects on statistical conclusion validity in the simplest situation, we will progressively add complexities to strengthen the design, building to the multiple-participant (N > 1) ABAB design.

Edgington's (1975) Random Intervention Start-Point Model

Of four different types of randomization that can be incorporated into single-case AB experimental studies (specifically, within-case phase randomization, between-case intervention randomization, case randomization, and intervention start-point randomization; see Ferron & Levin, 2014), the last, highly creative, type was originally developed by Edgington (1975) and requires that the researcher: (1) randomly select an intervention start point from two or more that had been previously deemed acceptable; and then (2) assign to the case the start point that was actually selected. Although not applied in the conventional treatment randomization manner, this unique form of randomization increases a single-case study's internal validity and, when accompanied by the statistical test described in the following paragraph, it can increase the study's statistical conclusion validity as well. Moreover, this randomized intervention start-point approach can function to provide a true (i.e., scientifically credible) experimental comparison of two or more intervention (or intervention and control) conditions based on either one case or multiple cases per condition (for examples and discussion, see Ferron & Levin, 2014; Koehler & Levin, 1998; Levin, Lall, & Kratochwill, 2011; Levin & Wampold, 1999; and Marascuilo & Busk, 1988).

With the randomized intervention start-point model, a randomization statistical test is conducted on the difference between the means of all B and all A series outcomes for each of the intervention start-point divisions (or transitions) that could have resulted from the random-selection process (see also Edgington & Onghena, 2007). [Moreover, any other summary measure of relevance to the researcher's hypothesis about the nature of change from Phase A to Phase B (e.g., change in the series medians, slopes, variances) can also be the focus of a randomization-test analysis.] With the resulting set of mean differences yielding a randomization distribution, the mean difference associated with the actual intervention start point is examined to see where it falls within the set. The probability of obtaining a

mean difference as extreme as or more extreme than the actual mean difference represents the unlikelihood of the outcome. Either signed or unsigned mean differences are considered for one- and two-tailed hypothesis tests, respectively. For example, for an AB design with one case, 25 outcome-assessment periods, and 20 potential intervention start points, if the actual start point were found to produce the largest mean difference (in the predicted direction) between the B and A series outcomes, then the one-tailed significance probability associated with that event would be given by p = 1/20 = .05. For a two-tailed test, as or more extreme opposite-sign mean differences would also need to be taken into account. For instance, if there were a mean difference equal in magnitude but opposite in sign to the one just indicated for the actual intervention start point, then the two-tailed significance probability would be 2/20 = .10.

In Edgington's (1975) random intervention start-point model for a one-case AB design, it is assumed that the A phase consists of a baseline series, the B phase consists of an intervention series, and that the former logically precedes the latter. With those assumptions, the number of possible outcomes (B − A mean differences) in the randomization distribution is k, the number of potential intervention start points. Accordingly, with one case, 30 total observations, and k = 10 potential intervention start points, if the actual B − A mean difference produced were the largest of the 10 and in the predicted direction, then the one-tailed significance probability of that outcome would be p = 1/10 = .10. In order to achieve statistical significance at a traditional α = .05 level (one-tailed), one would need to include at least k = 20 potential intervention start points in the randomization distribution (i.e., so that if the most extreme mean difference in the predicted direction were obtained, then p would equal 1/20 = .05). To achieve statistical significance with α = .05 via a two-tailed test, a longer series with a minimum of k = 40 potential intervention start points would be required (i.e., so that p = 2/40 = .05 is possible).

Randomized Order (Dual Randomization) Addition to the Edgington Model

Edgington (1975) proposed his random intervention start-point design-and-analysis procedure 40 years ago. It has been incorporated into a variety of single-case intervention designs (e.g., Koehler & Levin, 1998; Levin & Wampold, 1999; Marascuilo & Busk, 1988; Onghena, 1992) and is being implemented in its original form to this day. However, it will be shown here that an addition to the procedure (referred to here as a modified procedure), which enhances its internal

validity by eliminating bias due to AB phase-order effects, is possible and one that is applicable in a number of single-case intervention investigations. To illustrate, suppose that instead of A representing a baseline or control phase, it represents one type of experimental intervention, say, a behavioral intervention for combatting a particular phobia. In contrast, B might represent a cognitive intervention targeting the same phobia. Within that context, the case receives both interventions. To have a legitimate (unconfounded) comparison of Intervention A and Intervention B, it is imperative that the order in which the two interventions are administered to the case is randomly (rather than arbitrarily) determined. The preceding statement applies whether the investigation includes only one case or multiple cases (although in multiple-case situations, systematic counterbalancing of intervention orders across cases might be implemented to achieve the same goal).

In addition, it is worth noting that A and B need not refer only to two competing interventions. Rather, suppose that A represents a baseline, standard, or control condition and B an intervention condition. As has been suggested previously (e.g., Kratochwill & Levin, 2010), further suppose that prior to the commencement of the actual experiment, a mandatory baseline (or adaptation/warm-up) phase (A') is required of all cases. With A' included, it would then be possible, appropriate, and presumably acceptable to researchers to begin the experiment proper by randomizing each case's subsequent A and B phases (i.e., an A randomly selected to be first means that the case remains in the baseline condition, followed by the B intervention condition; and a B randomly selected to be first means that the case begins with the intervention condition, followed by the A baseline condition). Accordingly, the modified order-randomization procedure is applicable in either one- or two-intervention AB designs, with the prospect of improving both design (internal validity) and analysis (statistical-conclusion validity) of two-phase single-case intervention studies.

With intervention-order randomization built into the just-discussed one-case example based on 30 total observations and 10 potential intervention start points, in addition to the intervention start points associated with the conventional AB order of intervention administration, one would also need to consider the possibility that Intervention B had been randomly selected to be administered first. If that had happened, there would be a corresponding 10 potential intervention start points for the BA order of intervention administration, resulting in a total of k = 20 potential start-point outcomes that would be included in the complete randomization distribution.
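To make the mechanics concrete, the following Python sketch (written for this discussion; it is not the authors' software, and the function names and numpy dependency are assumptions of the sketch) builds the k-outcome randomization distribution for Edgington's fixed-order test and the 2k-outcome distribution for the order-randomized (dual) test for a single case.

```python
# Illustrative sketch: randomization tests for a one-case AB design with a randomly
# selected intervention start point. `start_points` holds the candidate indices s at
# which the second phase could begin, so y[:s] is the first phase and y[s:] the second.
import numpy as np

def single_randomization_p(y, start_points, actual_start):
    """Edgington (1975): k outcomes, fixed A-then-B order, one-tailed (B > A)."""
    y = np.asarray(y, dtype=float)
    diffs = np.array([y[s:].mean() - y[:s].mean() for s in start_points])
    obs = y[actual_start:].mean() - y[:actual_start].mean()
    return np.mean(diffs >= obs)          # proportion of the k outcomes >= observed

def dual_randomization_p(y, start_points, actual_start, actual_order="AB"):
    """Adds AB/BA order randomization: 2k outcomes instead of k."""
    y = np.asarray(y, dtype=float)
    ab = np.array([y[s:].mean() - y[:s].mean() for s in start_points])
    diffs = np.concatenate([ab, -ab])     # a BA order flips the sign of B - A
    obs = y[actual_start:].mean() - y[:actual_start].mean()
    if actual_order == "BA":              # the B phase came first, so B - A reverses
        obs = -obs
    return np.mean(diffs >= obs)

# With 30 observations and k = 20 potential start points, the smallest attainable
# one-tailed p is 1/20 = .05 for the single test and 1/40 = .025 for the dual test.
```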

Multiple-Case Extension of the Modified Edgington Model

As we will show, the order-randomization procedure applies to multiple-case (replicated) AB situations as well, increasing the total number of possible randomization-distribution outcomes by a factor of 2^N, where N represents the number of cases. Specifically, with N cases and one of k_i potential intervention start points randomly selected for each case, with Marascuilo and Busk's (1988) multiple-case extension of Edgington's (1975) single fixed-order intervention start-point model, a total of ∏_{i=1}^{N} k_i randomization-distribution outcomes are possible, and in the special case for which all k_i are equal to k, this quantity reduces to k^N. With the addition of an order-randomization process to create the present dual randomization model, the total number of possible randomization-distribution outcomes increases to 2^N ∏_{i=1}^{N} k_i and k^N × 2^N = (2k)^N for the general and special-case situations, respectively.

Hypothetical example

We illustrate the present random-order randomization-test procedure for a replicated single-case AB design by means of a hypothetical example. Suppose that a language researcher wishes to improve the baseline vocalization output (A phase) of two low word-producing children through some type of positive-reinforcement intervention (B phase). For the random-order version of the present example we assume that a mandatory A' baseline (warm-up) phase was initially administered, followed by a random determination of whether the first phase of the actual study would be a baseline (A) or an intervention (B) phase, thereby producing either an A'AB or A'BA design. Although in comparison to a traditional fixed-order AB design, this type of randomized AB design is more scientifically credible (especially when replicated across cases), the latter design was not considered in the current What Works Clearinghouse (WWC) single-case intervention design Standards (Kratochwill et al., 2010). Our hypothetical study is presented simply to illustrate both the original (Edgington, 1975) fixed-order and the present random-order randomization-test procedures, without taking into account the study's internal-validity characteristics. Consideration of internal-validity issues is included later in the Discussion section.

In this hypothetical study, the number of single-word vocalizations by each child during a 5-minute play period is recorded, with Child 1 observed in each of 25 daily sessions and Child 2 observed in each of 15 daily sessions, and where both children must be observed in at least 3 A sessions and 3 B sessions (thereby resulting in 20 and 10 potential intervention transition points for Child 1 and

Child 2, respectively). In addition, because the researcher wishes to randomize the intervention order (AB or BA) for each child, three preliminary five-minute A' warm-up sessions are provided prior to the start of the children's actual experimental sessions. An initial coin toss determines that Child 1 will be administered an AB intervention order, with the 20 potential intervention transition points specified from between the 4th and 23rd sessions inclusive and the randomly selected actual intervention transition point occurring just prior to Session 10. For Child 2, a BA intervention order results from a second coin flip, with the 10 potential intervention transitions specified from between the 4th and 13th sessions inclusive and an actual randomly selected intervention transition point just prior to the 7th observation. The A- and B-phase observations are presented in Table 1.

Given the present random-order AB intervention start-point randomization model, the data were analyzed with Gafurov and Levin's (2014) single-case ExPRT (Excel Package of Randomization Tests) package; see Levin et al. (2014) for complete information about ExPRT. In Table 2 are presented the B − A mean differences associated with each of the potential intervention transition points for the two children. The first Table 2 entry of 2.41 for Child 1, which corresponds to an A-to-B intervention transition point just prior to Observation 4, was calculated by taking the average of Child 1's Observations 4 through 25 (mean B phase = 6.41) minus the average of that child's Observations 1 through 3 (mean A phase = 4.00). The same process was followed for each of the subsequent 19 potential intervention points for Child 1, which ends with the average of that child's Observations 23 through 25 (mean B phase = 8.00) minus the average of that child's Observations 1 through 22 (mean A phase = 5.86), resulting in Child 1's final mean difference of 2.14 in Table 2.

Next, and as indicated in Table 2's Footnote a, 20 additional mean differences were calculated for Child 1 under the assumption that instead of an AB intervention order, the reverse BA order had been selected. Under that assumption, the first mean difference for Child 1 would be 4.00 − 6.41 = −2.41, which is exactly the same numerically but opposite in sign to that child's previously calculated first value in Table 2. The same is true for all of Child 1's calculated reverse-order values, including the 20th one, which is now −2.14. The same process applied to Child 2's data yields the 10 actual B − A mean differences presented in Table 2 (i.e., 6.00 − 4.92 = 1.08 for the first one), as well as 10 reverse-order and opposite-sign A − B mean differences.

Table 1. Hypothetical data for Child 1's 25-observation series, with a randomly selected AB intervention order, 20 potential intervention transition points (between Observations 4 and 23 inclusive), and a randomly selected actual intervention transition point just prior to Observation 10; and for Child 2's 15-observation series, with a randomly selected BA intervention order, 10 potential intervention transition points (between Observations 4 and 13 inclusive), and a randomly selected actual intervention transition point just prior to Observation 7

              Child 1                              Child 2
Observation   Phase   Vocalizations   Observation   Phase   Vocalizations
     1          A          4               1          B          6
     2          A          3               2          B          5
     3          A          5               3          B          7
     4          A          5               4          B          5
     5          A          2               5          B          6
     6          A          5               6          B          5
     7          A          3               7*         A          4
     8          A          4               8          A          5
     9          A          4               9          A          3
    10*         B          5              10          A          5
    11          B          6              11          A          4
    12          B          7              12          A          5
    13          B          6              13          A          6
    14          B          7              14          A          5
    15          B          8              15          A          6
    16          B          7
    17          B          9
    18          B          8
    19          B          6
    20          B          8
    21          B          9
    22          B          8
    23          B          7
    24          B          9
    25          B          8

*Actual intervention transition point.
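The mean differences reported in Table 2 below, and the joint randomization p-value discussed in the accompanying text, can be reproduced from the Table 1 data with a short script. The following Python sketch (an illustration written for this discussion, not the ExPRT implementation; numpy and itertools are assumed) applies the dual-randomization logic to both children and averages the two children's outcomes.

```python
# Sketch reproducing the Table 2 mean differences and the joint randomization p-value
# for the two-child hypothetical example (illustrative only; not ExPRT).
import numpy as np
from itertools import product

child1 = np.array([4,3,5,5,2,5,3,4,4,5,6,7,6,7,8,7,9,8,6,8,9,8,7,9,8])  # AB order
child2 = np.array([6,5,7,5,6,5,4,5,3,5,4,5,6,5,6])                      # BA order

def b_minus_a(y, start, order):
    """B - A mean difference when the phase change occurs just before observation
    `start` (1-based) and the phases follow the given administration order."""
    first, second = y[:start - 1], y[start - 1:]
    return (second.mean() - first.mean()) if order == "AB" else (first.mean() - second.mean())

def dual_outcomes(y, starts, order):
    """Selected-order outcomes plus their reverse-order (sign-flipped) complements."""
    d = np.array([b_minus_a(y, s, order) for s in starts])
    return np.concatenate([d, -d])

out1 = dual_outcomes(child1, range(4, 24), "AB")   # 20 + 20 = 40 outcomes
out2 = dual_outcomes(child2, range(4, 14), "BA")   # 10 + 10 = 20 outcomes

# Joint distribution: every Child 1 outcome averaged with every Child 2 outcome.
joint = np.array([(a + b) / 2 for a, b in product(out1, out2)])   # 40 * 20 = 800
obs = (b_minus_a(child1, 10, "AB") + b_minus_a(child2, 7, "BA")) / 2   # about 2.19
p_one_tailed = np.mean(joint >= obs)   # reported in the text as 10/800 = .0125
```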

Table 2. The B − A mean difference associated with: (1) each of Child 1's 20 potential intervention transition points (O4-O23) for a randomly selected AB intervention order; and (2) each of Child 2's 10 potential intervention transition points (O4-O13) for a randomly selected BA intervention order

                                Child 1                    Child 2
Potential Intervention Point    B − A Mean Difference a    B − A Mean Difference b
O4                              2.41                        1.08
O5                              2.23                         .84
O6                              2.90                        1.00
O7                              2.79                         .89*
O8                              3.14                         .55
O9                              3.30                         .52
O10                             3.49*                       -.06
O11                             3.53                        -.10
O12                             3.46                        -.50
O13                             3.28                        -.67
O14                             3.29
O15                             3.19
O16                             2.97
O17                             2.94
O18                             2.58
O19                             2.41
O20                             2.69
O21                             2.60
O22                             2.24
O23                             2.14

*Mean difference associated with the actual intervention transition point.
a. The 20 A − B mean differences are also calculated and added to these to form a 40-outcome randomization distribution; all of the A − B mean differences are the same as the corresponding B − A mean differences given here but opposite in sign.
b. The 10 A − B mean differences are also calculated and added to these to form a 20-outcome randomization distribution; all of the A − B mean differences are the same as the corresponding B − A mean differences given here but opposite in sign.

The resulting joint randomization distribution therefore contains 40 mean differences for Child 1 combined with 20 mean differences for Child 2, for a total of 40 × 20 = 800 averaged mean differences (i.e., Child 1's 1st mean difference averaged with Child 2's 1st mean difference, Child 1's 1st mean difference averaged with Child 2's 2nd mean difference, all the way up to and including Child 1's 40th mean difference averaged with Child 2's 20th mean difference). When that is done by the ExPRT program, it is found that the actual joint mean

difference that was obtained in the study is 2.19, which is Child 1's mean difference associated with that child's actual intervention transition point of O10 (3.49) averaged with Child 2's actual intervention transition-point mean difference of O7 (.89). Of the 800 outcomes in the joint randomization distribution, a value of 2.19 is the 10th highest, which results in a one-tailed significance probability of p = 10/800 = .0125. For this example, had a one-tailed Type I error probability (α) of .05 been selected, it could be concluded that the positive-reinforcement intervention (B) distribution values differed statistically from those in the baseline distribution (A), with the additional inference that the former distribution's values were higher.

We note that both here and in the various simulations conducted in the present series of investigations, one-tailed tests are conducted because it is assumed that [especially in single-case A (baseline) B (intervention) research] the researcher has a clear and defensible rationale for the direction of change that is associated with the intervention. Insofar as randomization tests are not tailored to test for the equality of two populations' specific parameters, all that can be tested for is the equality of the two population distributions per se. For the present randomization test, the test statistic involves sample-mean differences and, because that is the test that produced a statistically significant result here (favoring the intervention phase over the baseline phase), a reasonable inference is that there was an A- to B-phase upward shift in the children's level of responding.

Advantages of the Order Randomization Modification

The present order-randomization approach enhances the internal validity of a single-case AB design by virtue of its removing bias stemming from intervention-order effects. As an important byproduct, the approach also elevates the status of the basic AB single-case intervention design from a WWC Standards acceptable design standpoint (Kratochwill et al., 2010), particularly when replicated across independent participants at different points in time. According to the WWC Standards, two-phase A (Baseline) B (Intervention) designs are not scientifically credible (and therefore unacceptable) because they suffer from too many potential sources of internal invalidity. For extended discussion of acceptable designs, see Kratochwill et al. (2010, 2013).

Including outcomes from both intervention-administration orders in the randomization distribution also provides fundamental pragmatic advantages for single-case intervention researchers. First, with the original Edgington (1975) model, a researcher would need to designate 20 potential intervention start points

(based on at least 21 total observations) to produce a randomization test that is capable of detecting an intervention effect with a one-tailed Type I error probability less than or equal to .05. With the present procedure, a researcher would need to designate only half as many potential intervention start points (here, 10, based on a total of 11 observations, resulting in 20 possible outcomes) to detect an intervention effect. A related reason why the present procedure has practical importance for single-case intervention researchers is that (and as will be demonstrated here) relative to the original Edgington (1975) model, the modified approach may produce statistical-power advantages as well. Thus, for no more expense than a coin to flip, a researcher might reap both methodological and statistical benefits by adopting the present dual-randomization procedure rather than either the original single-randomization Edgington model or Marascuilo and Busk's (1988) multiple-case extension of it.

Relationship to Traditional Experimental Designs and Statistical Analyses

Although unrecognized at the time that the present order-randomization approach was initially conceptualized, its logic maps directly onto a statistical procedure in the traditional group randomized treatment-design literature. In particular, consider a randomized two-treatment correlated-samples (or within-subjects) design based on N participants, to which a nonparametric randomization test is applied as an appropriate alternative in (especially small-sample) situations where the normality assumption of a correlated-samples t test (or a one-sample repeated-measures analysis) is questionable.

To illustrate that situation, we revisit an example that was recently presented by Ferron and Levin (2014, p. 174). Suppose that in a sample of N = 8 adults, each participant is administered two different fear-reducing treatments, A (a behavioral treatment) and B (a cognitive intervention), with the former posited to be more effective than the latter. It is determined in advance that the equal-effectiveness hypothesis will be tested with a randomization test based on a one-tailed α of .05. To produce a scientifically credible experiment, the order in which the two treatments are administered is again randomly determined on a case-by-case basis by means of coin flips: say, heads represents an AB order and tails a BA order. On the basis of that process, let us suppose that 5 participants ended up in the AB condition and 3 in the BA condition. Following the administration of each treatment, participants' fear responses are assessed on a 7-point Likert scale, with higher numbers indicating greater fear. With the measure of interest defined

as the difference between each participant's B and A ratings (i.e., B − A), the following outcomes were obtained for the 8 participants: The observed test statistic is given by the average of these differences, which is equal to +17/8 = 2.125. A randomization distribution is created from the 2^N = 2^8 = 256 possible ways in which + and − signs could be attached to these 8 numerical values. For example, the first outcome in the randomization distribution (with all + signs) would be: yielding a mean difference of +24/8 = 3.000, and the last (with all minus signs) would be: yielding a mean difference of −24/8 = −3.000. The remaining 254 possible outcomes would fall somewhere between these two extremes.

The actually obtained mean difference of 2.125 appears to be on the higher side of this distribution. In fact, it turns out to be among the 9 highest possible outcomes (specifically, an outcome that is exceeded by only 5 outcomes and that is tied with 3 others). Accordingly, a one-tailed test of the hypothesis that the A and B treatments have equal distributions would be associated with a p-value (consistent with the alternative hypothesis that Treatment B is producing higher fear ratings than Treatment A) that is equal to 9/256 = .035. Because this value is less than the predetermined α of .05, it would be concluded that the actually obtained mean difference of 2.125 is statistically significant.

Note that for this conventional-group design and associated randomization test, the all-possible assignment of + and − signs to the 8 absolute B − A differences corresponds exactly to the logic and operationalization of the single-case AB order-randomization procedure to be investigated here. In particular, the procedure incorporates two separate forms of randomization for each of the N participating cases, Edgington's intervention start-point randomization and AB order randomization. In the simplest situation where there is only one potential intervention start point for each case (as in the just-presented N = 8 example), the total number of possible start-point randomizations is equal to k^N = 1^8 = 1. The

present order-randomization procedure involves each of the 8 participants contributing two differences (i.e., B − A and A − B) to the randomization distribution, resulting in 2^N = 2^8 = 256 joint randomization outcomes, and which, according to the previously given special-case dual-randomization formula, k^N × 2^N, yields a total of 1 × 256 = 256 possible randomization outcomes. This total is identical to the number of possible randomization-distribution outcomes associated with the just-presented example. It is instructive to note that the total number of possible randomization outcomes associated with order randomization can be alternatively expressed as the sum of binomial coefficients ∑_{x=0}^{N} C(N, x), where N = the number of cases and x = the number of positive B − A differences that could be associated with the actual outcomes. For the present example, this expression is equal to ∑_{x=0}^{8} C(8, x), or 256.

Thus, when there is only one potential intervention point for each case and the AB design includes multiple observations, the present randomized-order test based on the difference between the A- and B-phase means maintains the same correspondence with a conventional-group correlated-samples randomization test as was shown here. Implicit in the conventional correlated-samples test is that with random assignment to treatment conditions, outcomes representing both orders of treatment administration need to be considered in the randomization test distribution. As such, the present order-randomization procedure is not really a special case at all, but rather the single-case analog of a correlated-samples randomization t test.
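A minimal sketch of the correlated-samples randomization test just described follows (Python; numpy and itertools assumed; written for this discussion rather than taken from the original example). Because the eight individual B − A differences are not reproduced above, the sketch shows only the mechanics: for the published example, the function would be called with those eight signed differences and, per the text, would return 9/256 = .035.

```python
# Sketch of the conventional-group correlated-samples randomization test: each of
# the 2^N ways of attaching + and - signs to the N absolute (B - A) differences
# contributes one mean to the randomization distribution.
import numpy as np
from itertools import product

def sign_flip_p(signed_diffs, one_tailed=True):
    """One- or two-tailed randomization p-value for N within-pair differences."""
    d = np.asarray(signed_diffs, dtype=float)
    obs = d.mean()                                     # observed mean difference
    means = np.array([(np.abs(d) * np.array(signs)).mean()
                      for signs in product([1, -1], repeat=d.size)])
    if one_tailed:                                     # H1: B ratings exceed A ratings
        return np.mean(means >= obs)
    return np.mean(np.abs(means) >= abs(obs))

# For the N = 8 example above, sign_flip_p(...) would be applied to the eight signed
# B - A differences; the text reports an observed mean of 17/8 = 2.125 and p = 9/256.
```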

Focus of the Present Investigations

The focus of our series of simulation investigations was to examine the Type I error and statistical power characteristics of the dual-randomization modification (intervention start-point plus intervention order) relative to those of Edgington's (1975) and Marascuilo and Busk's (1988) original single-randomization (intervention start-point) test procedures. In this study we present randomized intervention-order findings not just for a basic two-phase AB design, but also for a randomized pairs variation of that design (Levin & Wampold, 1999), a single-case adaptation of the conventional-group crossover design, and Onghena's (1992) four-phase ABAB design.

Investigations 1-3: Randomized Intervention Order for the Basic AB Design

Investigation 1

Method

In Investigation 1, the focus was on 30-observation designs for a single participant (i.e., N = 1), where the intervention start point was randomly selected from the middle 20 observations. The series length of 30 was chosen for initial examination because: (1) 20 start points is the minimum number needed to obtain a statistically significant result with a one-tailed α of .05 for an AB randomized start-point design with one case; and (2) the WWC Standards require a minimum of five observations in each phase (Kratochwill et al., 2010, 2013).

Data were generated using SAS IML (SAS, 2013), where the time-series data were obtained by adding an error vector to an effect vector. The error vector was created such that it was distributed normally and had an autocorrelation of 0 or .3 by using SAS's autoregressive moving-average simulation function (ARMASIM). The autocorrelation values of 0 and .3 were motivated by a survey of actual single-case studies where it was reported that the average autocorrelation was .2, after adjusting for bias in the estimates (Shadish & Sullivan, 2011). To obtain simulated errors based on an autocorrelation of .3, the autoregressive parameter matrix was set to {1 -.3}, the moving-average parameter matrix was set to {1 0}, and the standard deviation of the independent portion of the error was set to 1.0 (for details on the simulation algorithm see Woodfield, 1988). The effect vector was coded to have values of 0 for all baseline observations and values of d for all intervention-phase observations; thus d corresponds to the mean shift between intervention and baseline observations in standard deviation units, (μ_B − μ_A)/σ (see Busk & Serlin, 2005), where the standard deviation is based on the independent portion of the within-case error term (see, for example, Levin, Ferron, & Kratochwill, 2012); for an alternative operationalization of d that corresponds mathematically to a conventional groups effect-size measure, see Shadish et al. (2014). The value of d was varied to examine the one-tailed Type I error probability for d = 0 and the powers for ds ranging from .5 to 5 in increments of .5.
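The simulation logic just described can be sketched in a few lines. The following Python fragment is an approximation for illustration only (the published simulations used SAS/IML with ARMASIM), and it relies on the single_randomization_p and dual_randomization_p functions defined in the earlier sketch: AR(1) errors are added to a level-shift effect vector, and the rejection rate of the one-tailed test is estimated over replicates.

```python
# Sketch of an Investigation 1-style Monte Carlo loop (illustrative only).
import numpy as np

rng = np.random.default_rng(2014)

def ar1_errors(n, rho, burn_in=50):
    """Normal AR(1) errors whose independent (innovation) SD is 1.0."""
    e = np.zeros(n + burn_in)
    for t in range(1, n + burn_in):
        e[t] = rho * e[t - 1] + rng.standard_normal()
    return e[burn_in:]

def rejection_rate(d=2.0, rho=0.3, n_obs=30, dual=True, alpha=0.05, reps=10_000):
    starts = list(range(5, 25))                  # B phase may begin at obs 6..25
    rejections = 0
    for _ in range(reps):
        s = int(rng.choice(starts))              # randomly selected start point
        order = "AB" if (not dual or rng.random() < 0.5) else "BA"
        effect = np.zeros(n_obs)
        if order == "AB":
            effect[s:] = d                       # intervention (B) phase shifted by d
        else:
            effect[:s] = d                       # BA order: B phase comes first
        y = effect + ar1_errors(n_obs, rho)
        p = (dual_randomization_p(y, starts, s, order) if dual
             else single_randomization_p(y, starts, s))
        rejections += (p <= alpha)
    return rejections / reps

# rejection_rate(d=0.0) estimates the Type I error rate; rejection_rate(d=2.0)
# estimates power for a two-SD level shift (cf. Figure 1).
```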

For reference, if the d used for the present data generation is estimated for each of the 200 Phase A-to-Phase B contrasts examined in the survey of single-case interventions reported by Parker and Vannest (2009), the empirically observed values of d (assuming no autocorrelation for simplicity) for the 10th, 50th, and 90th percentile ranks are estimated to be 0.46, 1.70, and 3.88, respectively.

By crossing each design (single, dual) with each level of autocorrelation (r = 0, .3) and each effect size (d = 0 to 5, in increments of .5), 2 × 2 × 11 = 44 conditions were obtained, and for each of these conditions the data for 10,000 studies were simulated. The data for each simulated data set were analyzed using a randomization test in which the obtained test statistic (M_B − M_A) was compared to the complete randomization distribution. The proportion of simulated studies in which the randomization test led to a one-tailed p-value of .05 or less was determined to estimate the rejection rate (Type I error or power) of the randomization test for each of the 44 experimental conditions.

Figure 1. Investigation 1: Comparison (α = .05, one-tailed) of randomization tests for a one-case (N = 1) AB randomized intervention start-point design (Single) and the randomized intervention start-point plus randomized intervention-order design (Dual), where the start point was randomly selected between the 6th through the 25th observations inclusive in a 30-observation study. The rejection rate of the null hypothesis is shown as a function of the effect size and level of autocorrelation.

Results

Results are shown in Figure 1 for Edgington's (1975) original procedure (single) and for the present randomized-order modification (dual). As may be seen in that figure, when the effect size is 0, all situations are associated with empirical powers (which, for d = 0, are equivalent to Type I error probabilities) that correspond to their nominal .05 values. Not surprisingly, based on previous findings (e.g., Ferron & Sentovich, 2002; Ferron & Ware, 1995; Levin et al., 2011), it may also be seen that for ds > 0 power is uniformly higher for r = 0 than for r = .3. As the effect size increases, so does power, although more rapidly for the dual-randomization procedure than for its single-randomization counterpart. The largest power differences, favoring the former, reach .21 in the r = 0 situation for ds of 1.5 and 2.0; and in the r = .3 situation the largest power difference is .18 for a d of 2.5.

Investigation 2

Method

In Investigation 2, series length (i.e., the number of observations) was systematically varied for a single-participant (N = 1) design, while holding the effect size constant at d = 2. A d of 2 was chosen because it is a large enough effect to typically be of interest to a single-case researcher. Yet, a d of 2 is small enough that it is not readily detectable (power < .80) in a single-participant 30-observation design when there is a moderate autocorrelation of .30 and applying either the single- or dual-randomization approach (as may be seen in Figure 1, where powers are .50 and .67, respectively). The simulation methods paralleled those of the initial investigation (including a one-tailed α of .05), but d was held constant at 2.0 for all conditions and series length was varied from 20 to 150 in increments of 10. The number of potential intervention start points was always the series length minus 10, to ensure at least five observations in the baseline and intervention phases.

Results

Results for this set of simulations are provided in Figure 2, where with an autocorrelation of .30, power of at least .80 is attained for the dual-randomization approach with 60 observations (power = .81), in contrast to the single-randomization design where .80 power is not quite attained even with 150 observations (power = .79). For 30 to 100 observations, the power difference between the two randomization schemes (favoring dual) ranges from .13 to .31 when the autocorrelation is 0 and from .17 to .30 when the autocorrelation is .3.

Figure 2. Investigation 2: Comparison (α = .05, one-tailed) of randomization tests for a one-case (N = 1) AB randomized intervention start-point design (Single) and the randomized intervention start-point plus randomized intervention-order design (Dual). The rejection rate of the null hypothesis is shown as a function of series length and level of autocorrelation. The effect size is 2.0 and the number of potential intervention start points (x) is equal to the series length minus 10 and encompasses the middle x observations.

It should be noted that the power is 0 for the single-randomization scheme with 20 observations because there are only 10 possible intervention start points and thus statistical significance cannot be obtained at the one-tailed .05 level. In addition, the undulation in the power curves for the single-randomization approach makes sense when one recognizes that: (1) for a series length of 30, statistical significance with α = .05 can be attained only for the most extreme of the 20 permutations; and (2) with a series length of 40, statistical significance can again be attained only for the most extreme permutation, but now there are 30 permutations and so the most extreme is somewhat more difficult to achieve. Although power drops for the 40-observation series, with a series length of 50, statistical significance can be attained for either of the two most extreme permutations and thus power jumps back up again.
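The discreteness argument can be checked directly: for a given series length, only those permutation ranks r with r/k ≤ .05 can reach significance under the single-randomization scheme. A short illustrative Python sketch:

```python
# For each series length, k = length - 10 potential start points; count how many of
# the most extreme single-randomization outcomes still give a one-tailed p = r/k <= .05.
# This stair-step pattern produces the undulating power curves in Figure 2.
for length in range(20, 151, 10):
    k = length - 10
    usable = sum(1 for r in range(1, k + 1) if r / k <= 0.05)
    min_p = f"{1 / k:.3f}" if usable else "unattainable"
    print(f"length {length:>3}: k = {k:>3}, significant ranks = {usable}, minimum p = {min_p}")
```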

Investigation 3a

Method

In Investigation 3a, the effect of multiple-case replications (i.e., N > 1) on the power of the single- and dual-randomization procedures was examined. More specifically, a design with 15 observations and 5 potential intervention start points, randomly selected from Observations 6 through 10, was examined with 2, 3, 4, 5, and 6 participants based on a one-tailed α of .05. For the single-randomization approach, 7 and 8 participants were also included. These numbers of participants seemed reasonable given the survey by Shadish and Sullivan (2011), in which it was found that the number of cases in single-case studies averaged 3.64, with a range of 1 to 13. In the present study, effect sizes varied from 0 to 3 in increments of .5 and the autocorrelation was set either to 0 or .3.

Figure 3. Investigation 3a: Comparison (α = .05, one-tailed) of randomization tests for the Single and Dual basic AB randomized designs replicated across N cases. The rejection rate of the null hypothesis is shown as a function of effect size and N, for a 15-observation design with 5 potential intervention start points designated from between the 6th and 10th observations inclusive and an autocorrelation of 0.
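It is instructive to see how quickly the two randomization distributions grow with N in this design. Using the formulas given earlier, with k = 5 potential start points per case the single scheme yields k^N possible outcomes and the dual scheme (2k)^N, so the smallest attainable one-tailed p-value shrinks much faster under dual randomization. A short illustrative Python sketch:

```python
# Number of equally likely outcomes in the joint randomization distribution for N
# replicated cases with k = 5 potential start points each (as in Investigation 3a),
# and the smallest one-tailed p-value each scheme can produce.
k = 5
for n in range(2, 9):
    single, dual = k ** n, (2 * k) ** n            # k^N versus (2k)^N outcomes
    print(f"N = {n}: single {single:>7} (min p = {1 / single:.6f}); "
          f"dual {dual:>9} (min p = {1 / dual:.8f})")
```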

Results

Results from simulations where the autocorrelation is 0 are shown in Figure 3, whereas those for an autocorrelation of .3 are shown in Figure 4. In both figures, it may be seen that for all sample sizes the empirical Type I error probabilities are well controlled at .05 for both the single- and dual-randomization approaches. The important thing to note is that in both figures, for all effect sizes the dual approach based on as few as N = 3 participants has associated power that is greater than or equivalent to the single approach based on N = 8 participants. For example, in Figure 4 it may be seen that with an autocorrelation of .3, N = 3 dual- and N = 8 single-randomization powers are .66 and .61, respectively, for an effect size of 1.0; and they are .90 and .89, respectively, for an effect size of 1.5.

Figure 4. Investigation 3a: Comparison (α = .05, one-tailed) of randomization tests for the Single and Dual basic AB randomized designs replicated across N cases. The rejection rate of the null hypothesis is shown as a function of effect size and N, for a 15-observation design with 5 potential intervention start points designated from between the 6th and 10th observations inclusive and an autocorrelation of .3.

Investigation 3b

Method

In this investigation, the simulations of Investigation 3a were replicated, with the sole difference being that a two-tailed test with α = .05 was conducted, as opposed to a one-tailed test.

Results

The results are summarized in Figure 5 for an autocorrelation of 0 and in Figure 6 for an autocorrelation of .3. Again, it may be seen that all of the empirical Type I errors are at the expected .05 level for both autocorrelation values. Although the Investigation 3a results (i.e., the equivalence of dual-randomization N = 3 and single-randomization N = 8) were not identical here, the general pattern was. In this case, however, the appropriate power equivalence involves dual N = 4 and single N = 8. Specifically, in Figure 6 it may be seen that with an autocorrelation of .3, the former and latter powers are .65 and .61, respectively, for an effect size of 1.0; and they are .93 and .89, respectively, for an effect size of 1.5.

Figure 5. Investigation 3b: Comparison (α = .05, two-tailed) of randomization tests for the Single and Dual basic AB randomized designs replicated across N cases. The rejection rate of the null hypothesis is shown as a function of effect size and N, for a 15-observation design with 5 potential intervention start points designated from between the 6th and 10th observations inclusive and an autocorrelation of 0.

Figure 6. Investigation 3b: Comparison (α = .05, two-tailed) of randomization tests for the Single and Dual basic AB randomized designs replicated across N cases. The rejection rate of the null hypothesis is shown as a function of effect size and N, for a 15-observation design with 5 potential intervention start points designated from between the 6th and 10th observations inclusive and an autocorrelation of .3.

Thus, in the present investigation we observe that for two-tailed tests the dual-randomization power benefits (relative to single randomization) are comparable to those reported for Investigation 3a's one-tailed tests. It is important to point out, however, that the situations examined here were all based on multiple-case (N > 1) designs. It turns out that for the special-case N = 1 situation, although the dual- over single-randomization power advantage is evident when one-tailed tests are conducted (as was true in Investigations 1 and 2), the dual- and single-randomization schemes yield equivalent power results with two-tailed tests. Because the two-tailed test is based on randomization-distribution absolute-value outcomes, the dual-randomization distribution contains every outcome of the single-randomization distribution as well as its opposite-order complementary outcome, thereby yielding exactly the same p-value for each test. (To illustrate these notions, see Child 1's hypothetical data, including Footnote a in Table 2. The 40 unsigned mean differences (i.e., 20 |B − A| plus 20 |A − B|) would constitute the dual-randomization distribution for a two-tailed test.) Because there are across-case combinations when N > 1, there is no longer a one-to-one

correspondence between the single- and dual-randomization distributions and so their powers will generally differ, with the latter being greater (as was observed in Figures 5 and 6).

Investigation 4: Randomized Intervention Order and/or Randomized Intervention Assignment in Levin and Wampold's (1999) AB Pairs Design

Another type of dual-randomization strategy is possible when a case consists of a pair of participants, as in Levin and Wampold's (1999) simultaneous intervention start-point model. With the Levin-Wampold model, N participant (or other unit) pairs are created and the members of each pair are randomly assigned to two different intervention conditions (or to an intervention and control condition), X and Y. With this model, Levin and Wampold presented two hypotheses that would be of interest to researchers: (1) a general intervention effectiveness hypothesis, namely that averaged across the two intervention conditions, there is no difference between Phase A and Phase B performance (analogous to the time main effect in a conventional two-treatment pretest-posttest design); and (2) a comparative intervention effectiveness hypothesis, namely that the change in participants' performance from Phase A to Phase B is the same in the two intervention conditions (analogous to the treatment-by-time interaction in a conventional two-treatment pretest-posttest design).

Unrecognized by Levin and Wampold at the time, the randomization test of each of these hypotheses could potentially benefit from an additional randomization component. For the general intervention effectiveness hypothesis, that component is AB order randomization of the kind that we have considered in Investigations 1-3, either with or without a mandatory A' baseline phase; and for the comparative intervention hypothesis, that component consists of within-pair intervention randomization, wherein pair members are randomly assigned to the two intervention conditions. Implementing either of these randomization types increases the total number of possible outcomes from ∏_{i=1}^{N} k_i for Levin and Wampold's (1999) original single randomization-test procedure (i.e., the number of potential intervention start points for each pair) to 2^N ∏_{i=1}^{N} k_i for the present dual approach (i.e., either the number of possible random assignments of AB orders or the number of possible random assignments of interventions to pair members, times the number of potential intervention start points for each pair). In Investigation 4, we examine the statistical power consequences associated with the dual approach's additional


More information

Issues and advances in the systematic review of single-case research: A commentary on the exemplars

Issues and advances in the systematic review of single-case research: A commentary on the exemplars 1 Issues and advances in the systematic review of single-case research: A commentary on the exemplars Rumen Manolov, Georgina Guilera, and Antonio Solanas Department of Social Psychology and Quantitative

More information

Unit 1 Exploring and Understanding Data

Unit 1 Exploring and Understanding Data Unit 1 Exploring and Understanding Data Area Principle Bar Chart Boxplot Conditional Distribution Dotplot Empirical Rule Five Number Summary Frequency Distribution Frequency Polygon Histogram Interquartile

More information

Tests for the Visual Analysis of Response- Guided Multiple-Baseline Data

Tests for the Visual Analysis of Response- Guided Multiple-Baseline Data The Journal of Experimental Education, 2006, 75(1), 66 81 Copyright 2006 Heldref Publications Tests for the Visual Analysis of Response- Guided Multiple-Baseline Data JOHN FERRON PEGGY K. JONES University

More information

SPRING GROVE AREA SCHOOL DISTRICT. Course Description. Instructional Strategies, Learning Practices, Activities, and Experiences.

SPRING GROVE AREA SCHOOL DISTRICT. Course Description. Instructional Strategies, Learning Practices, Activities, and Experiences. SPRING GROVE AREA SCHOOL DISTRICT PLANNED COURSE OVERVIEW Course Title: Basic Introductory Statistics Grade Level(s): 11-12 Units of Credit: 1 Classification: Elective Length of Course: 30 cycles Periods

More information

Citation for published version (APA): Ebbes, P. (2004). Latent instrumental variables: a new approach to solve for endogeneity s.n.

Citation for published version (APA): Ebbes, P. (2004). Latent instrumental variables: a new approach to solve for endogeneity s.n. University of Groningen Latent instrumental variables Ebbes, P. IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please check the document

More information

A Brief Introduction to Bayesian Statistics

A Brief Introduction to Bayesian Statistics A Brief Introduction to Statistics David Kaplan Department of Educational Psychology Methods for Social Policy Research and, Washington, DC 2017 1 / 37 The Reverend Thomas Bayes, 1701 1761 2 / 37 Pierre-Simon

More information

Comparison of the Null Distributions of

Comparison of the Null Distributions of Comparison of the Null Distributions of Weighted Kappa and the C Ordinal Statistic Domenic V. Cicchetti West Haven VA Hospital and Yale University Joseph L. Fleiss Columbia University It frequently occurs

More information

Six Sigma Glossary Lean 6 Society

Six Sigma Glossary Lean 6 Society Six Sigma Glossary Lean 6 Society ABSCISSA ACCEPTANCE REGION ALPHA RISK ALTERNATIVE HYPOTHESIS ASSIGNABLE CAUSE ASSIGNABLE VARIATIONS The horizontal axis of a graph The region of values for which the null

More information

Psychology Research Process

Psychology Research Process Psychology Research Process Logical Processes Induction Observation/Association/Using Correlation Trying to assess, through observation of a large group/sample, what is associated with what? Examples:

More information

Research Questions, Variables, and Hypotheses: Part 2. Review. Hypotheses RCS /7/04. What are research questions? What are variables?

Research Questions, Variables, and Hypotheses: Part 2. Review. Hypotheses RCS /7/04. What are research questions? What are variables? Research Questions, Variables, and Hypotheses: Part 2 RCS 6740 6/7/04 1 Review What are research questions? What are variables? Definition Function Measurement Scale 2 Hypotheses OK, now that we know how

More information

Sheila Barron Statistics Outreach Center 2/8/2011

Sheila Barron Statistics Outreach Center 2/8/2011 Sheila Barron Statistics Outreach Center 2/8/2011 What is Power? When conducting a research study using a statistical hypothesis test, power is the probability of getting statistical significance when

More information

Conditional Distributions and the Bivariate Normal Distribution. James H. Steiger

Conditional Distributions and the Bivariate Normal Distribution. James H. Steiger Conditional Distributions and the Bivariate Normal Distribution James H. Steiger Overview In this module, we have several goals: Introduce several technical terms Bivariate frequency distribution Marginal

More information

Bayesian and Frequentist Approaches

Bayesian and Frequentist Approaches Bayesian and Frequentist Approaches G. Jogesh Babu Penn State University http://sites.stat.psu.edu/ babu http://astrostatistics.psu.edu All models are wrong But some are useful George E. P. Box (son-in-law

More information

CHAPTER 8 EXPERIMENTAL DESIGN

CHAPTER 8 EXPERIMENTAL DESIGN CHAPTER 8 1 EXPERIMENTAL DESIGN LEARNING OBJECTIVES 2 Define confounding variable, and describe how confounding variables are related to internal validity Describe the posttest-only design and the pretestposttest

More information

MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES OBJECTIVES

MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES OBJECTIVES 24 MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES In the previous chapter, simple linear regression was used when you have one independent variable and one dependent variable. This chapter

More information

MS&E 226: Small Data

MS&E 226: Small Data MS&E 226: Small Data Lecture 10: Introduction to inference (v2) Ramesh Johari ramesh.johari@stanford.edu 1 / 17 What is inference? 2 / 17 Where did our data come from? Recall our sample is: Y, the vector

More information

Hypothesis Testing. Richard S. Balkin, Ph.D., LPC-S, NCC

Hypothesis Testing. Richard S. Balkin, Ph.D., LPC-S, NCC Hypothesis Testing Richard S. Balkin, Ph.D., LPC-S, NCC Overview When we have questions about the effect of a treatment or intervention or wish to compare groups, we use hypothesis testing Parametric statistics

More information

Statistical inference provides methods for drawing conclusions about a population from sample data.

Statistical inference provides methods for drawing conclusions about a population from sample data. Chapter 14 Tests of Significance Statistical inference provides methods for drawing conclusions about a population from sample data. Two of the most common types of statistical inference: 1) Confidence

More information

Monte Carlo Analysis of Univariate Statistical Outlier Techniques Mark W. Lukens

Monte Carlo Analysis of Univariate Statistical Outlier Techniques Mark W. Lukens Monte Carlo Analysis of Univariate Statistical Outlier Techniques Mark W. Lukens This paper examines three techniques for univariate outlier identification: Extreme Studentized Deviate ESD), the Hampel

More information

Checking the counterarguments confirms that publication bias contaminated studies relating social class and unethical behavior

Checking the counterarguments confirms that publication bias contaminated studies relating social class and unethical behavior 1 Checking the counterarguments confirms that publication bias contaminated studies relating social class and unethical behavior Gregory Francis Department of Psychological Sciences Purdue University gfrancis@purdue.edu

More information

Data and Statistics 101: Key Concepts in the Collection, Analysis, and Application of Child Welfare Data

Data and Statistics 101: Key Concepts in the Collection, Analysis, and Application of Child Welfare Data TECHNICAL REPORT Data and Statistics 101: Key Concepts in the Collection, Analysis, and Application of Child Welfare Data CONTENTS Executive Summary...1 Introduction...2 Overview of Data Analysis Concepts...2

More information

Chapter 1: Exploring Data

Chapter 1: Exploring Data Chapter 1: Exploring Data Key Vocabulary:! individual! variable! frequency table! relative frequency table! distribution! pie chart! bar graph! two-way table! marginal distributions! conditional distributions!

More information

Still important ideas

Still important ideas Readings: OpenStax - Chapters 1 13 & Appendix D & E (online) Plous Chapters 17 & 18 - Chapter 17: Social Influences - Chapter 18: Group Judgments and Decisions Still important ideas Contrast the measurement

More information

Chapter Three: Sampling Methods

Chapter Three: Sampling Methods Chapter Three: Sampling Methods The idea of this chapter is to make sure that you address sampling issues - even though you may be conducting an action research project and your sample is "defined" by

More information

Communication Research Practice Questions

Communication Research Practice Questions Communication Research Practice Questions For each of the following questions, select the best answer from the given alternative choices. Additional instructions are given as necessary. Read each question

More information

The SAGE Encyclopedia of Educational Research, Measurement, and Evaluation Multivariate Analysis of Variance

The SAGE Encyclopedia of Educational Research, Measurement, and Evaluation Multivariate Analysis of Variance The SAGE Encyclopedia of Educational Research, Measurement, Multivariate Analysis of Variance Contributors: David W. Stockburger Edited by: Bruce B. Frey Book Title: Chapter Title: "Multivariate Analysis

More information

Inferences: What inferences about the hypotheses and questions can be made based on the results?

Inferences: What inferences about the hypotheses and questions can be made based on the results? QALMRI INSTRUCTIONS QALMRI is an acronym that stands for: Question: (a) What was the broad question being asked by this research project? (b) What was the specific question being asked by this research

More information

Understandable Statistics

Understandable Statistics Understandable Statistics correlated to the Advanced Placement Program Course Description for Statistics Prepared for Alabama CC2 6/2003 2003 Understandable Statistics 2003 correlated to the Advanced Placement

More information

BIOL 458 BIOMETRY Lab 7 Multi-Factor ANOVA

BIOL 458 BIOMETRY Lab 7 Multi-Factor ANOVA BIOL 458 BIOMETRY Lab 7 Multi-Factor ANOVA PART 1: Introduction to Factorial ANOVA ingle factor or One - Way Analysis of Variance can be used to test the null hypothesis that k or more treatment or group

More information

Psychology Research Process

Psychology Research Process Psychology Research Process Logical Processes Induction Observation/Association/Using Correlation Trying to assess, through observation of a large group/sample, what is associated with what? Examples:

More information

A Case Study: Two-sample categorical data

A Case Study: Two-sample categorical data A Case Study: Two-sample categorical data Patrick Breheny January 31 Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 1/43 Introduction Model specification Continuous vs. mixture priors Choice

More information

Please cite this article as: Heyvaert, M., Wendt, O., Van den Noortgate, W., & Onghena, P. (2015). Randomization and data-analysis items in quality

Please cite this article as: Heyvaert, M., Wendt, O., Van den Noortgate, W., & Onghena, P. (2015). Randomization and data-analysis items in quality Please cite this article as: Heyvaert, M., Wendt, O., Van den Noortgate, W., & Onghena, P. (2015). Randomization and data-analysis items in quality standards for single-case experimental studies. Journal

More information

CHAPTER NINE DATA ANALYSIS / EVALUATING QUALITY (VALIDITY) OF BETWEEN GROUP EXPERIMENTS

CHAPTER NINE DATA ANALYSIS / EVALUATING QUALITY (VALIDITY) OF BETWEEN GROUP EXPERIMENTS CHAPTER NINE DATA ANALYSIS / EVALUATING QUALITY (VALIDITY) OF BETWEEN GROUP EXPERIMENTS Chapter Objectives: Understand Null Hypothesis Significance Testing (NHST) Understand statistical significance and

More information

Lecture 4: Research Approaches

Lecture 4: Research Approaches Lecture 4: Research Approaches Lecture Objectives Theories in research Research design approaches ú Experimental vs. non-experimental ú Cross-sectional and longitudinal ú Descriptive approaches How to

More information

Testing Means. Related-Samples t Test With Confidence Intervals. 6. Compute a related-samples t test and interpret the results.

Testing Means. Related-Samples t Test With Confidence Intervals. 6. Compute a related-samples t test and interpret the results. 10 Learning Objectives Testing Means After reading this chapter, you should be able to: Related-Samples t Test With Confidence Intervals 1. Describe two types of research designs used when we select related

More information

MCAS Equating Research Report: An Investigation of FCIP-1, FCIP-2, and Stocking and. Lord Equating Methods 1,2

MCAS Equating Research Report: An Investigation of FCIP-1, FCIP-2, and Stocking and. Lord Equating Methods 1,2 MCAS Equating Research Report: An Investigation of FCIP-1, FCIP-2, and Stocking and Lord Equating Methods 1,2 Lisa A. Keller, Ronald K. Hambleton, Pauline Parker, Jenna Copella University of Massachusetts

More information

Lecture Notes Module 2

Lecture Notes Module 2 Lecture Notes Module 2 Two-group Experimental Designs The goal of most research is to assess a possible causal relation between the response variable and another variable called the independent variable.

More information

Examining differences between two sets of scores

Examining differences between two sets of scores 6 Examining differences between two sets of scores In this chapter you will learn about tests which tell us if there is a statistically significant difference between two sets of scores. In so doing you

More information

Empirical Formula for Creating Error Bars for the Method of Paired Comparison

Empirical Formula for Creating Error Bars for the Method of Paired Comparison Empirical Formula for Creating Error Bars for the Method of Paired Comparison Ethan D. Montag Rochester Institute of Technology Munsell Color Science Laboratory Chester F. Carlson Center for Imaging Science

More information

A Comparison of Shape and Scale Estimators of the Two-Parameter Weibull Distribution

A Comparison of Shape and Scale Estimators of the Two-Parameter Weibull Distribution Journal of Modern Applied Statistical Methods Volume 13 Issue 1 Article 3 5-1-2014 A Comparison of Shape and Scale Estimators of the Two-Parameter Weibull Distribution Florence George Florida International

More information

Introduction & Basics

Introduction & Basics CHAPTER 1 Introduction & Basics 1.1 Statistics the Field... 1 1.2 Probability Distributions... 4 1.3 Study Design Features... 9 1.4 Descriptive Statistics... 13 1.5 Inferential Statistics... 16 1.6 Summary...

More information

Conducting Research in the Social Sciences. Rick Balkin, Ph.D., LPC-S, NCC

Conducting Research in the Social Sciences. Rick Balkin, Ph.D., LPC-S, NCC Conducting Research in the Social Sciences Rick Balkin, Ph.D., LPC-S, NCC 1 Why we do research Improvement Description Explanation Prediction R. S. Balkin, 2008 2 Theory Explanation of an observed phenomena

More information

PLS 506 Mark T. Imperial, Ph.D. Lecture Notes: Reliability & Validity

PLS 506 Mark T. Imperial, Ph.D. Lecture Notes: Reliability & Validity PLS 506 Mark T. Imperial, Ph.D. Lecture Notes: Reliability & Validity Measurement & Variables - Initial step is to conceptualize and clarify the concepts embedded in a hypothesis or research question with

More information

ELEMENTS OF PSYCHOPHYSICS Sections VII and XVI. Gustav Theodor Fechner (1860/1912)

ELEMENTS OF PSYCHOPHYSICS Sections VII and XVI. Gustav Theodor Fechner (1860/1912) ELEMENTS OF PSYCHOPHYSICS Sections VII and XVI Gustav Theodor Fechner (1860/1912) Translated by Herbert Sidney Langfeld (1912) [Classics Editor's note: This translation of these passages from Fechner's

More information

DEVELOPING THE RESEARCH FRAMEWORK Dr. Noly M. Mascariñas

DEVELOPING THE RESEARCH FRAMEWORK Dr. Noly M. Mascariñas DEVELOPING THE RESEARCH FRAMEWORK Dr. Noly M. Mascariñas Director, BU-CHED Zonal Research Center Bicol University Research and Development Center Legazpi City Research Proposal Preparation Seminar-Writeshop

More information

INTRODUCTION TO ECONOMETRICS (EC212)

INTRODUCTION TO ECONOMETRICS (EC212) INTRODUCTION TO ECONOMETRICS (EC212) Course duration: 54 hours lecture and class time (Over three weeks) LSE Teaching Department: Department of Economics Lead Faculty (session two): Dr Taisuke Otsu and

More information

Business Statistics Probability

Business Statistics Probability Business Statistics The following was provided by Dr. Suzanne Delaney, and is a comprehensive review of Business Statistics. The workshop instructor will provide relevant examples during the Skills Assessment

More information

MBA SEMESTER III. MB0050 Research Methodology- 4 Credits. (Book ID: B1206 ) Assignment Set- 1 (60 Marks)

MBA SEMESTER III. MB0050 Research Methodology- 4 Credits. (Book ID: B1206 ) Assignment Set- 1 (60 Marks) MBA SEMESTER III MB0050 Research Methodology- 4 Credits (Book ID: B1206 ) Assignment Set- 1 (60 Marks) Note: Each question carries 10 Marks. Answer all the questions Q1. a. Differentiate between nominal,

More information

Dr. Kelly Bradley Final Exam Summer {2 points} Name

Dr. Kelly Bradley Final Exam Summer {2 points} Name {2 points} Name You MUST work alone no tutors; no help from classmates. Email me or see me with questions. You will receive a score of 0 if this rule is violated. This exam is being scored out of 00 points.

More information

Still important ideas

Still important ideas Readings: OpenStax - Chapters 1 11 + 13 & Appendix D & E (online) Plous - Chapters 2, 3, and 4 Chapter 2: Cognitive Dissonance, Chapter 3: Memory and Hindsight Bias, Chapter 4: Context Dependence Still

More information

Empowered by Psychometrics The Fundamentals of Psychometrics. Jim Wollack University of Wisconsin Madison

Empowered by Psychometrics The Fundamentals of Psychometrics. Jim Wollack University of Wisconsin Madison Empowered by Psychometrics The Fundamentals of Psychometrics Jim Wollack University of Wisconsin Madison Psycho-what? Psychometrics is the field of study concerned with the measurement of mental and psychological

More information

CHAPTER 3 DATA ANALYSIS: DESCRIBING DATA

CHAPTER 3 DATA ANALYSIS: DESCRIBING DATA Data Analysis: Describing Data CHAPTER 3 DATA ANALYSIS: DESCRIBING DATA In the analysis process, the researcher tries to evaluate the data collected both from written documents and from other sources such

More information

BIOSTATISTICS. Dr. Hamza Aduraidi

BIOSTATISTICS. Dr. Hamza Aduraidi BIOSTATISTICS Dr. Hamza Aduraidi Unit One INTRODUCTION Biostatistics It can be defined as the application of the mathematical tools used in statistics to the fields of biological sciences and medicine.

More information

Applied Statistical Analysis EDUC 6050 Week 4

Applied Statistical Analysis EDUC 6050 Week 4 Applied Statistical Analysis EDUC 6050 Week 4 Finding clarity using data Today 1. Hypothesis Testing with Z Scores (continued) 2. Chapters 6 and 7 in Book 2 Review! = $ & '! = $ & ' * ) 1. Which formula

More information

Experimental Psychology

Experimental Psychology Title Experimental Psychology Type Individual Document Map Authors Aristea Theodoropoulos, Patricia Sikorski Subject Social Studies Course None Selected Grade(s) 11, 12 Location Roxbury High School Curriculum

More information

Statistical Techniques. Masoud Mansoury and Anas Abulfaraj

Statistical Techniques. Masoud Mansoury and Anas Abulfaraj Statistical Techniques Masoud Mansoury and Anas Abulfaraj What is Statistics? https://www.youtube.com/watch?v=lmmzj7599pw The definition of Statistics The practice or science of collecting and analyzing

More information

Chapter 5: Producing Data

Chapter 5: Producing Data Chapter 5: Producing Data Key Vocabulary: observational study vs. experiment confounded variables population vs. sample sampling vs. census sample design voluntary response sampling convenience sampling

More information

Abstract Title Page Not included in page count. Title: Analyzing Empirical Evaluations of Non-experimental Methods in Field Settings

Abstract Title Page Not included in page count. Title: Analyzing Empirical Evaluations of Non-experimental Methods in Field Settings Abstract Title Page Not included in page count. Title: Analyzing Empirical Evaluations of Non-experimental Methods in Field Settings Authors and Affiliations: Peter M. Steiner, University of Wisconsin-Madison

More information

Readings: Textbook readings: OpenStax - Chapters 1 13 (emphasis on Chapter 12) Online readings: Appendix D, E & F

Readings: Textbook readings: OpenStax - Chapters 1 13 (emphasis on Chapter 12) Online readings: Appendix D, E & F Readings: Textbook readings: OpenStax - Chapters 1 13 (emphasis on Chapter 12) Online readings: Appendix D, E & F Plous Chapters 17 & 18 Chapter 17: Social Influences Chapter 18: Group Judgments and Decisions

More information

Lesson 9: Two Factor ANOVAS

Lesson 9: Two Factor ANOVAS Published on Agron 513 (https://courses.agron.iastate.edu/agron513) Home > Lesson 9 Lesson 9: Two Factor ANOVAS Developed by: Ron Mowers, Marin Harbur, and Ken Moore Completion Time: 1 week Introduction

More information

CASE STUDY 2: VOCATIONAL TRAINING FOR DISADVANTAGED YOUTH

CASE STUDY 2: VOCATIONAL TRAINING FOR DISADVANTAGED YOUTH CASE STUDY 2: VOCATIONAL TRAINING FOR DISADVANTAGED YOUTH Why Randomize? This case study is based on Training Disadvantaged Youth in Latin America: Evidence from a Randomized Trial by Orazio Attanasio,

More information

WDHS Curriculum Map Probability and Statistics. What is Statistics and how does it relate to you?

WDHS Curriculum Map Probability and Statistics. What is Statistics and how does it relate to you? WDHS Curriculum Map Probability and Statistics Time Interval/ Unit 1: Introduction to Statistics 1.1-1.3 2 weeks S-IC-1: Understand statistics as a process for making inferences about population parameters

More information

Describe what is meant by a placebo Contrast the double-blind procedure with the single-blind procedure Review the structure for organizing a memo

Describe what is meant by a placebo Contrast the double-blind procedure with the single-blind procedure Review the structure for organizing a memo Business Statistics The following was provided by Dr. Suzanne Delaney, and is a comprehensive review of Business Statistics. The workshop instructor will provide relevant examples during the Skills Assessment

More information

investigate. educate. inform.

investigate. educate. inform. investigate. educate. inform. Research Design What drives your research design? The battle between Qualitative and Quantitative is over Think before you leap What SHOULD drive your research design. Advanced

More information

Review of Veterinary Epidemiologic Research by Dohoo, Martin, and Stryhn

Review of Veterinary Epidemiologic Research by Dohoo, Martin, and Stryhn The Stata Journal (2004) 4, Number 1, pp. 89 92 Review of Veterinary Epidemiologic Research by Dohoo, Martin, and Stryhn Laurent Audigé AO Foundation laurent.audige@aofoundation.org Abstract. The new book

More information

NEW METHODS FOR SENSITIVITY TESTS OF EXPLOSIVE DEVICES

NEW METHODS FOR SENSITIVITY TESTS OF EXPLOSIVE DEVICES NEW METHODS FOR SENSITIVITY TESTS OF EXPLOSIVE DEVICES Amit Teller 1, David M. Steinberg 2, Lina Teper 1, Rotem Rozenblum 2, Liran Mendel 2, and Mordechai Jaeger 2 1 RAFAEL, POB 2250, Haifa, 3102102, Israel

More information

INVESTIGATING FIT WITH THE RASCH MODEL. Benjamin Wright and Ronald Mead (1979?) Most disturbances in the measurement process can be considered a form

INVESTIGATING FIT WITH THE RASCH MODEL. Benjamin Wright and Ronald Mead (1979?) Most disturbances in the measurement process can be considered a form INVESTIGATING FIT WITH THE RASCH MODEL Benjamin Wright and Ronald Mead (1979?) Most disturbances in the measurement process can be considered a form of multidimensionality. The settings in which measurement

More information

The essential focus of an experiment is to show that variance can be produced in a DV by manipulation of an IV.

The essential focus of an experiment is to show that variance can be produced in a DV by manipulation of an IV. EXPERIMENTAL DESIGNS I: Between-Groups Designs There are many experimental designs. We begin this week with the most basic, where there is a single IV and where participants are divided into two or more

More information

Readings: Textbook readings: OpenStax - Chapters 1 11 Online readings: Appendix D, E & F Plous Chapters 10, 11, 12 and 14

Readings: Textbook readings: OpenStax - Chapters 1 11 Online readings: Appendix D, E & F Plous Chapters 10, 11, 12 and 14 Readings: Textbook readings: OpenStax - Chapters 1 11 Online readings: Appendix D, E & F Plous Chapters 10, 11, 12 and 14 Still important ideas Contrast the measurement of observable actions (and/or characteristics)

More information

Group Assignment #1: Concept Explication. For each concept, ask and answer the questions before your literature search.

Group Assignment #1: Concept Explication. For each concept, ask and answer the questions before your literature search. Group Assignment #1: Concept Explication 1. Preliminary identification of the concept. Identify and name each concept your group is interested in examining. Questions to asked and answered: Is each concept

More information

Appendix B Statistical Methods

Appendix B Statistical Methods Appendix B Statistical Methods Figure B. Graphing data. (a) The raw data are tallied into a frequency distribution. (b) The same data are portrayed in a bar graph called a histogram. (c) A frequency polygon

More information

CHAPTER ONE CORRELATION

CHAPTER ONE CORRELATION CHAPTER ONE CORRELATION 1.0 Introduction The first chapter focuses on the nature of statistical data of correlation. The aim of the series of exercises is to ensure the students are able to use SPSS to

More information

Chapter 1: Explaining Behavior

Chapter 1: Explaining Behavior Chapter 1: Explaining Behavior GOAL OF SCIENCE is to generate explanations for various puzzling natural phenomenon. - Generate general laws of behavior (psychology) RESEARCH: principle method for acquiring

More information

PSYCHOLOGY 300B (A01) One-sample t test. n = d = ρ 1 ρ 0 δ = d (n 1) d

PSYCHOLOGY 300B (A01) One-sample t test. n = d = ρ 1 ρ 0 δ = d (n 1) d PSYCHOLOGY 300B (A01) Assignment 3 January 4, 019 σ M = σ N z = M µ σ M d = M 1 M s p d = µ 1 µ 0 σ M = µ +σ M (z) Independent-samples t test One-sample t test n = δ δ = d n d d = µ 1 µ σ δ = d n n = δ

More information

Effects of Sequential Context on Judgments and Decisions in the Prisoner s Dilemma Game

Effects of Sequential Context on Judgments and Decisions in the Prisoner s Dilemma Game Effects of Sequential Context on Judgments and Decisions in the Prisoner s Dilemma Game Ivaylo Vlaev (ivaylo.vlaev@psy.ox.ac.uk) Department of Experimental Psychology, University of Oxford, Oxford, OX1

More information

Biostatistics 3. Developed by Pfizer. March 2018

Biostatistics 3. Developed by Pfizer. March 2018 BROUGHT TO YOU BY Biostatistics 3 Developed by Pfizer March 2018 This learning module is intended for UK healthcare professionals only. Job bag: PP-GEP-GBR-0986 Date of preparation March 2018. Agenda I.

More information