Working Paper: Assessing Correspondence between Experimental and Non-Experimental Results in Within-Study-Comparisons


EdPolicyWorks Working Paper: Assessing Correspondence between Experimental and Non-Experimental Results in Within-Study-Comparisons

Peter M. Steiner¹ & Vivian C. Wong²

In within-study comparison (WSC) designs, treatment effects from a non-experimental design, such as an observational study or a regression-discontinuity design, are compared to results obtained from a well-designed randomized control trial (RCT) with the same target population. The goal of the WSC is to assess whether the non-experimental and experimental designs yield the same results in field settings, and the contexts and conditions under which non-experimental methods replicate benchmark RCT estimates. A common analytic challenge with WSCs, however, is identifying appropriate criteria for determining whether non-experimental and benchmark results replicate. This paper examines methods for assessing correspondence between benchmark RCT and non-experimental results in WSC designs. We examine measures for assessing correspondence in results, and the relative advantages and limitations of these approaches. We identify two classes of measures: conclusion- and distance-based correspondence approaches. Conclusion-based measures indicate correspondence in results if a researcher or policy-maker draws the same conclusion from the experiment and the non-experiment. Distance-based measures investigate whether the difference in estimates is small enough to claim correspondence between methods. We use a simulation study to examine the statistical properties of correspondence measures, and recommend a new approach that combines traditional significance testing and equivalence testing in the same framework. The paper concludes with practical advice on assessing and interpreting results in WSC contexts.

¹ University of Wisconsin-Madison
² University of Virginia

Updated April 2016

EdPolicyWorks, University of Virginia, PO Box, Charlottesville, VA. EdPolicyWorks working papers are available for comment and discussion only. They have not been peer-reviewed. Do not cite or quote without author permission.
Working paper retrieved from: Acknowledgements: This research was supported by a collaborative NSF grant #. EdPolicyWorks Working Paper Series No. 46, April 2016. Curry School of Education | Frank Batten School of Leadership and Public Policy | University of Virginia. 2016 Rector and Visitors of the University of Virginia. For more information please visit or contact EdPolicyWorks@virginia.edu.

Analysis of Within-Study Comparisons

ASSESSING CORRESPONDENCE BETWEEN EXPERIMENTAL AND NON-EXPERIMENTAL RESULTS IN WITHIN-STUDY-COMPARISONS

Peter M. Steiner & Vivian C. Wong

INTRODUCTION

In 1986, LaLonde introduced a research design called the within-study comparison (WSC) to empirically evaluate the performance of non-experimental methods in field settings. In his approach, he used data from a randomized control trial (RCT) of the National Supported Work Demonstration (NSW) to estimate a causal benchmark effect, and compared the result to those obtained from an observational study with the same target population. The goal was to determine whether observational methods, such as cross-sectional regression and difference-in-differences models, replicated RCT benchmark results in field settings. LaLonde concluded that the observational methods failed to replicate results comparable to those obtained from the RCT benchmark. His conclusions were based on mostly heuristic approaches. He compared the direction of treatment effects from the experiment and non-experiment, the magnitude of treatment effects obtained from each study design, and the statistical significance patterns of treatment effects. Moreover, LaLonde looked at the size of the difference in non-experimental and experimental results, judging a treatment effect difference of $1,600 in earnings as being substantially large, and a difference of $600 as being less consequential (p. 617). Other WSCs soon followed, including studies by Fraker and Maynard (1987), Friedlander and Robins (1995), Bell, Orr, Bloomquist, and Cain (1995), and Heckman, Ichimura, Smith, and Todd (1998), among others. Many of these studies took advantage of data from RCT job training evaluations and extant datasets that shared similar outcome measures of earnings.
However, the criteria for assessing correspondence in benchmark and non-experimental results varied from study to study, with some looking at correspondence in the size of treatment effects and statistical significance patterns in the RCT and non-experiment, while others constructed bias measures for the non-experiment, or directly tested statistical differences in benchmark and non-experimental results. Given the different correspondence measures used for evaluating non-experimental performance, interpreting WSC results across studies proved challenging. In an early meta-analysis of WSC results, Glazerman, Levy, and Myers (2003) limited their analysis to include only studies that used job training data with earnings outcomes. They established a correspondence criterion of a $1,000

difference in benchmark and non-experimental results because a difference of $1,000 or more "can make a dramatic difference in the policy recommendation" (p. 74). But is $1,000 an appropriate threshold for evaluating the success of non-experimental methods in all job training field settings? What about WSC cases where the outcome is not in earnings, but in test scores (Wilde & Hollister, 2007; Aiken, West, Schwalm, Carroll, & Hsuing, 1998), voter participation rates (Arceneaux, Gerber, & Green, 2010), or criminal recidivism (Berk, Barnes, Ahlman, & Kurtz, 2010)? And how does one assess correspondence in WSC designs when even two perfectly implemented study designs will produce different effects because of sampling and measurement error? At issue are valid and consistent criteria for evaluating the performance of non-experimental methods in within-study comparison designs.

Since LaLonde's (1986) study, more than 60 WSCs have emerged evaluating the performance of non-experimental methods in field settings. These studies have spanned the disciplines of international development, job training, education, political science, environmental policy, health policy, and criminology. They have examined the performance of non-experimental and quasi-experimental approaches, including regression-adjustment, regression-discontinuity designs, matching methods, and repeated measures approaches. Results from these studies have had a profound influence on funding policy and research methodology in program and policy evaluation. For example, the Department of Education (Paige, 2005), the Office of Management and Budget (2004), and the What Works Clearinghouse (Institute for Economic Studies, 2005) have cited results from WSC studies for guidance on methodology choice in program evaluation. Because of strong methodological and policy interests in WSC results, recent research has focused on improving the design of these empirical tests of methods.
Wong and Steiner (under review) formalize different types of WSC designs and identify the assumptions needed for WSCs to provide valid, causally interpretable results. Less clear, however, are appropriate methods for analyzing WSC designs. Thus far, WSC analysts have relied on a constellation of ad hoc approaches for judging correspondence in benchmark and non-experimental results. But this has resulted in tremendous variation in how researchers and policy-makers have interpreted WSC results, subjecting the validity of these conclusions to unconscious, or conscious, researcher bias. Moreover, there is no methodological guidance on the contexts and conditions under which different correspondence metrics should be adopted in the analysis of WSC designs. When is it desirable to compare just the direction and magnitude of benchmark and non-experimental results, as opposed to looking at the difference in benchmark and non-experimental estimates? When should statistical tests of difference

in benchmark and non-experimental results be used as the criterion for making judgments? And how do the size of the true treatment effect, the direction of selection bias in the observational study, and the statistical power of the WSC affect the performance of the different correspondence criteria?

This paper has three aims for improving the analysis and interpretation of within-study comparison designs for evaluating non-experimental methods. The first is to describe the multiple methods for assessing correspondence in WSC studies, including the purposes of each method as well as the advantages and disadvantages of each approach. The second is to present results from a Monte Carlo simulation that demonstrates the contexts and conditions under which different correspondence criteria produce interpretable results. The third is to propose a new correspondence test for evaluating whether non-experimental methods succeed in replicating benchmark results in field settings. The paper argues for the use of a correspondence measure that incorporates statistical tests of difference and equivalence within the same framework (Tryon, 2001; Tryon & Lewis, 2008). The paper concludes by offering WSC researchers practical recommendations for the analysis and interpretation of WSCs.

Finally, in this paper, we distinguish between two types of WSC designs. The first approach is the independent WSC design, where units are randomly assigned to a randomized experiment or a non-experimental condition (Shadish, Clark, & Steiner, 2008). In the benchmark arm, units are randomly assigned again to treatment or control conditions, while in the non-experiment, units select into a preferred treatment or control condition. Correspondence is assessed by comparing the average treatment effect from each design. The second approach is called the dependent WSC design. This approach was implemented by LaLonde (1986), where some portion of the benchmark sample is shared between the RCT and non-experimental arms of the WSC.
For example, units may select into a randomized experiment, and within the RCT, units are randomly assigned to treatment and control conditions. The non-experiment is constructed by matching units who did not enter the benchmark RCT, but share the same outcome measures and were not exposed to the treatment. Figure 1 summarizes the independent and dependent designs, and Wong and Steiner (under review) describe these designs in further detail.

CORRESPONDENCE MEASURES IN WSC DESIGNS

Although the existing WSC literature lacks consensus on what constitutes correspondence in benchmark and non-experimental estimates, multiple WSC researchers have noted the challenge of interpreting results. Wilde and Hollister (2007) introduce an early discussion of correspondence

measures in their paper titled "How close is close enough?" In the paper, the authors use RCT data from the Tennessee Class Size experiment to examine the performance of propensity score methods. To judge how close is close enough between benchmark and non-experimental results, Wilde and Hollister compared the direction, size, and statistical significance patterns of treatment effects obtained from the RCT and non-experimental study, and also conducted direct statistical tests of difference in benchmark and non-experimental results. However, Wilde and Hollister also advocated for substantive thresholds to assess correspondence in benchmark and non-experimental results. Using results from Krueger's (1999) cost-benefit analysis of the Tennessee Class Size Experiment, Wilde and Hollister defined an impact threshold of 5.4 percentiles (or .20 standard deviations) as a treatment effect large enough to warrant a policy change. They then evaluated the non-experiment's performance by examining whether both the RCT and propensity score approaches produced treatment effects as large as 5.4 percentiles, and thus the same policy decision about adopting the intervention.

Cook, Shadish, and Wong (2008) also urged researchers to attend to correspondence measures in the analysis of WSC designs. They argued that for WSC designs to produce interpretable results, the approach requires consistent and clear standards for judging when non-experimental methods succeed in replicating benchmark estimates. Although their guidance did not specify which approaches WSC analysts should use for evaluating non-experimental methods, they suggested that the criteria should include a statistical threshold, that is, a formal statistical test of benchmark and non-experimental results to account for chance differences from sampling error.

In this paper, we argue that the types of correspondence measures used, and the criteria for determining correspondence, depend on the purpose of the WSC design itself.
In general, WSC designs address two types of research questions. The first set of WSC questions examines the policy issue of whether the RCT and non-experiment produce comparable results in field settings. Here, the goal is to assess whether the policy-maker would draw the same conclusions from the RCT and the non-experiment. To address this question, conclusion-based correspondence measures are useful. These measures include looking at the direction and magnitude of effects, as well as statistical significance patterns of treatment effects in the experiment and non-experiment. These measures require both substantive and statistical criteria for evaluating non-experimental performance. Other WSC research questions are methodological in nature: they test whether the non-experimental method produces unbiased results in field settings. Appropriate measures for

addressing this research question estimate the size of non-experimental bias in field settings.¹ For example, the researcher may compute the difference in non-experimental and experimental effect estimates, or the percentage of bias reduced by the non-experimental method. We distinguish this class of measures as distance-based correspondence measures. For these measures, substantive criteria are often helpful for interpreting whether the magnitude of the bias estimates is of practical relevance for the researcher. To account for sampling error in distance-based measures, the researcher may also use statistical tests of difference or equivalence between non-experimental and experimental results. Below, we formalize approaches for implementing conclusion-based and distance-based measures in WSC designs. We also discuss the relative advantages and limitations of each approach, and demonstrate their performance through a Monte Carlo experiment in the following section.

Conclusion-based Correspondence Measures for Addressing Policy Relevant Questions

Direction of effects. For this measure, the WSC researcher assesses correspondence by looking at whether the experiment and non-experiment produce treatment effects in the same direction, that is, whether the treatment effects have the same sign. The presumption here is that even the direction of an effect influences policy-makers' impressions about the efficacy of a program. Thus, correspondence between the experimental and non-experimental estimates, τ̂_E and τ̂_NE, is achieved if the signs of the treatment effects are identical: sgn(τ̂_E) = sgn(τ̂_NE), where sgn(x) is the sign function, with sgn(x) = 1 if x > 0, sgn(x) = -1 if x < 0, and sgn(x) = 0 if x = 0. The conclusion-based (C in the subscript) correspondence in directions (D) is defined as:

C_D = 1[sgn(τ̂_E) = sgn(τ̂_NE)],   (1)

where the indicator function 1[.] returns 1 and indicates correspondence if the logical comparison is true, and 0 if false.
Thus, C_D = 1 if the experimental and non-experimental effect estimates have the same sign, and C_D = 0 if the signs of the effects differ (or one of the signs is exactly zero).

Magnitude of effects. Researchers may also assess correspondence by comparing the size of the effect estimates produced by the experiment and non-experiment. However, one limitation of this approach is the subjectiveness of assessing how close is close enough. In this approach, the researcher decides in advance a threshold value (λ) that must be exceeded for the treatment effect to be meaningful. Here, "meaningful" may be based on results from a cost-benefit analysis (Wilde & Hollister, 2007), or an effect size that has been determined to be scientifically or programmatically

¹ Wong & Steiner (under review) identify conditions under which estimates of non-experimental bias are valid.
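As a minimal sketch of the direction measure in Equation (1), the indicator can be computed as follows (Python; the function names are ours, not the paper's):

```python
def sgn(x: float) -> int:
    """Sign function: 1 if x > 0, -1 if x < 0, 0 if x == 0."""
    return (x > 0) - (x < 0)

def corr_direction(tau_e: float, tau_ne: float) -> int:
    """Conclusion-based correspondence in direction, Eq. (1):
    1 if the experimental and non-experimental estimates share a sign."""
    return int(sgn(tau_e) == sgn(tau_ne))
```

Consistent with the definition above, a zero estimate on one side only counts as non-correspondence, since its sign (0) differs from the other estimate's sign.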

relevant for the context under investigation. Correspondence is achieved if both the experimental and non-experimental treatment effect estimates are either above or below a substantively meaningful threshold λ. Thus, correspondence in magnitude (M) is given when:

C_M = 1[(τ̂_E ≥ λ & τ̂_NE ≥ λ) or (τ̂_E < λ & τ̂_NE < λ)].   (2)

If both estimates are above a positive threshold λ, then correspondence according to C_M implies correspondence in signs, such that for λ > 0 and (τ̂_E ≥ λ & τ̂_NE ≥ λ) we have (C_M = 1) ⇒ (C_D = 1). Correspondence in signs is not implied if both estimates fall below a positive threshold. Similarly, for a negative threshold, if λ < 0 and (τ̂_E ≤ λ & τ̂_NE ≤ λ), we have (C_M = 1) ⇒ (C_D = 1).

Statistical significance patterns. Comparing the significance patterns of estimates is one of the most popular methods for assessing correspondence between experimental and non-experimental results. Here, the decision criterion about a program's efficacy is based on statistical significance rather than a threshold value for a meaningful impact. The experimental and non-experimental estimates have identical significance patterns with regard to the nil hypothesis H₀: τ = 0 if (a) both estimates have the same sign and are statistically significant, or (b) both estimates are insignificant according to a two-tailed hypothesis test. Formally, correspondence in significance (S) is measured as:

C_S = 1[{sgn(τ̂_E) = sgn(τ̂_NE) & p_E ≤ α & p_NE ≤ α} or {p_E > α & p_NE > α}],   (3)

where p_E and p_NE are the two-tailed p-values of the significance test (typically a t-test) for the effect estimates of the experiment and non-experiment, respectively, and α is the Type-I error rate. Thus, correspondence in significance depends on the size and direction of the point estimates, their standard errors, the Type-I error rate for rejecting the null hypothesis, and each study's power for detecting the unknown true treatment effect. The power for detecting a minimum effect size is irrelevant here; what matters is the power for detecting the unknown true effect.
The correspondence measure C_S may be adapted to reflect a one-tailed hypothesis test, which allows us to drop the sign condition. However, since almost all significance tests are conducted as two-tailed tests in practice, we formulated correspondence in significance in terms of a two-tailed test. The correspondence measure C_S also works for null hypotheses other than the nil hypothesis. For instance, we may use the threshold value λ from the magnitude-related measure and test whether both effects, τ_E and τ_NE, are significantly different from λ (H₀: τ = λ). In order
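The magnitude and significance measures in Equations (2) and (3) can be sketched the same way (Python; the function names, the default α = .05, and the example values are ours):

```python
def sgn(x: float) -> int:
    """Sign function: 1 if x > 0, -1 if x < 0, 0 if x == 0."""
    return (x > 0) - (x < 0)

def corr_magnitude(tau_e: float, tau_ne: float, lam: float) -> int:
    """Eq. (2): 1 if both estimates fall on the same side of threshold lam."""
    return int((tau_e >= lam and tau_ne >= lam) or (tau_e < lam and tau_ne < lam))

def corr_significance(tau_e: float, p_e: float,
                      tau_ne: float, p_ne: float, alpha: float = 0.05) -> int:
    """Eq. (3): 1 if both estimates are significant with the same sign,
    or both are insignificant (two-tailed p-values against H0: tau = 0)."""
    both_sig = sgn(tau_e) == sgn(tau_ne) and p_e <= alpha and p_ne <= alpha
    both_insig = p_e > alpha and p_ne > alpha
    return int(both_sig or both_insig)
```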

to accommodate this more general case of hypothesis testing, we only need to change the sign condition in C_S to sgn(τ̂_E − λ) = sgn(τ̂_NE − λ). Then, correspondence in C_S due to two significant effect estimates implies correspondence in magnitudes and signs: (C_S = 1) ⇒ (C_M = 1) & (C_D = 1). However, correspondence in C_S due to two insignificant estimates implies neither correspondence in magnitudes nor in signs.

Combined measures. In some WSC contexts, it may be useful to combine multiple conclusion-based correspondence measures. For instance, consider the combination of magnitude- and significance-related measures, for example C_MS = 1[C_M = 1 & C_S = 1], which requires that both C_M and C_S indicate correspondence. However, this measure has the disadvantage that it may signal correspondence even in ambiguous situations, such as when both estimates exceed the threshold (suggesting a positive effect), but they are not significantly different from zero (H₀: τ = 0). A more useful correspondence measure may be constructed if we use only parts of the C_M and C_S criteria:

C_MS' = 1[{(τ̂_E ≥ λ & τ̂_NE ≥ λ) & (p_E ≤ α & p_NE ≤ α)} or {(τ̂_E < λ & τ̂_NE < λ) & (p_E > α & p_NE > α)}]   (with H₀: τ = 0)

Correspondence is achieved if either both estimates are significantly different from zero and exceed the threshold, or both estimates are insignificant and fall below the threshold.²

One challenge with the conclusion-based correspondence measures discussed above is that they may indicate a lack of correspondence even in cases where estimates are identical or very similar. For example, if the experimental effect estimate is slightly greater than zero and the non-experimental estimate is slightly less than zero, then the measure C_D may indicate a lack of correspondence in the direction of the effects, even though the point estimates themselves may be considered equivalent. In another example, the experimental and non-experimental point estimates may be exactly identical, but the benchmark result is statistically insignificant while the non-experimental result is significant.
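The combined measure C_MS' can likewise be sketched as a single indicator (Python; the function name, the default α, and the example values are ours):

```python
def corr_combined(tau_e: float, p_e: float, tau_ne: float, p_ne: float,
                  lam: float, alpha: float = 0.05) -> int:
    """C_MS': 1 if both estimates are significant (H0: tau = 0) and at or
    above the threshold lam, or both are insignificant and below lam."""
    both_above_and_sig = (tau_e >= lam and tau_ne >= lam
                          and p_e <= alpha and p_ne <= alpha)
    both_below_and_insig = (tau_e < lam and tau_ne < lam
                            and p_e > alpha and p_ne > alpha)
    return int(both_above_and_sig or both_below_and_insig)
```

Note that mixed cases (e.g., both estimates above the threshold but insignificant) yield 0, which is exactly the ambiguity the text flags for the naive combination C_MS.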
Here, C_S will indicate a lack of correspondence, which could be due to insufficient power (sample sizes) in the benchmark design, not because of poor performance of the non-experimental method itself. Although conclusion-based measures inform researchers

² It is also possible to construct asymmetric correspondence measures, but we caution against using such measures because they may result in a misleading comparison of experimental and non-experimental methods.

about whether a policy-maker would arrive at the same decision from a benchmark and non-experimental design, these measures may be less useful for assessing the performance of the non-experimental method itself. In these cases, distance-based measures may help interpret WSC results.

Distance-based Correspondence Measures for Addressing Methodological Questions

In this section, we describe correspondence measures for evaluating methodological questions about non-experimental performance in field settings. These measures all involve some variant of a distance-based metric that calculates the size of the difference in non-experimental and experimental results. We begin by introducing the most frequently used distance measures, followed by a discussion of descriptive and inferential correspondence measures.

Distance measures. A common metric for examining the discrepancy between a WSC's experimental and non-experimental estimates is the difference in effect estimates: M = τ̂_NE − τ̂_E. Given a valid benchmark estimate from the randomized experiment, the difference is interpreted as bias in the non-experimental estimate (Wong & Steiner, under review). When the treatment group is shared between the experimental and non-experimental arms of a dependent WSC design, the analyst may estimate the difference in outcomes for the non-experimental comparison and benchmark control groups: M = τ̂_NE − τ̂_E = (Ȳ_t − Ȳ_c,NE) − (Ȳ_t − Ȳ_c,E) = Ȳ_c,E − Ȳ_c,NE. Here, Ȳ_c,NE and Ȳ_c,E are the mean outcomes of the non-experimental comparison and experimental control groups, respectively, and Ȳ_t is the mean outcome of the shared experimental treatment group.

As more applications of WSCs have emerged, other distance measures that standardize the raw distance M have been employed. Below, we list the most important and common standardized measures of non-experimental bias.
(a) Standardized difference in effect estimates, D = (τ̂_NE − τ̂_E)/s, where s is the sample standard deviation of the outcome in the experimental control group or the pooled standard deviation of the experimental treatment and control groups (Shadish et al., 2011; Hallberg et al., in progress).³ Because the standardized difference, D, is in effect size units, researchers may interpret the distance on a common metric across multiple WSC studies. This is particularly useful for

³ In estimating the standard deviation, only data from the randomized experiment should be used because the standard deviations of the non-experimental treatment and comparison groups often depend on the strength of the selection mechanisms. This is because strong selection processes in the observational study may result in relatively more homogeneous treatment and comparison groups than in a randomized experiment. Thus, the standard deviation of the non-experimental control group is usually not representative of the population standard deviation of the potential control outcome.

qualitative and quantitative syntheses, or in fields such as education, where the effect size is a common metric for interpreting results.

(b) Percentage of remaining bias, RB = 100 · (τ̂_NE − τ̂_E)/(τ̂_pf − τ̂_E), represents the bias remaining after the non-experimental adjustment as a percentage of the initial bias, where τ̂_pf is the prima facie effect from the non-experimental study (i.e., the unadjusted initial effect estimate; Shadish et al., 2008). The denominator represents the initial bias (with reference to the experimental estimate) and the numerator the remaining bias after the non-experimental adjustment. A negative percentage indicates an over-adjustment relative to the benchmark estimate. If the WSC analyst is not interested in over- or under-adjustments from the non-experimental method, then she may use the absolute value of the remaining bias percentage, |RB|. The WSC analyst may also compute the percentage of bias reduced by the non-experimental adjustment, such that: 100 − RB. For individual WSC studies, RB is a useful measure for examining how strongly the non-experimental method reduces or increases initial bias relative to the experimental estimate. Across WSC studies, however, RB is challenging to interpret given the heterogeneity of the initial bias across different WSCs. That is, RB can take on extreme positive or negative values when the prima facie effect of the non-experiment is close to the experimental estimate (τ̂_pf − τ̂_E ≈ 0). This happens if treatment selection in the non-experiment is weak, or if selection introduces positive and negative biases that approximately offset each other. Therefore, this distance measure should be used only when the initial bias is sufficiently large.

(c) Percent difference, P = 100 · (τ̂_NE − τ̂_E)/|τ̂_E|, expresses the difference in experimental and non-experimental effect estimates as a percentage of the absolute value of the experimental estimate (Wilde & Hollister, 2007).
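The raw difference M and the standardized measures (a)–(c) translate directly into code (Python; the function names and the illustrative values are ours):

```python
def raw_difference(tau_ne: float, tau_e: float) -> float:
    """M: difference in effect estimates, interpreted as non-experimental bias."""
    return tau_ne - tau_e

def standardized_difference(tau_ne: float, tau_e: float, sd_exp: float) -> float:
    """D: raw difference scaled by the experimental standard deviation."""
    return (tau_ne - tau_e) / sd_exp

def remaining_bias_pct(tau_ne: float, tau_e: float, tau_pf: float) -> float:
    """RB: remaining bias as a percentage of the initial (prima facie) bias."""
    return 100 * (tau_ne - tau_e) / (tau_pf - tau_e)

def percent_difference(tau_ne: float, tau_e: float) -> float:
    """P: difference as a percentage of the absolute experimental estimate."""
    return 100 * (tau_ne - tau_e) / abs(tau_e)
```

As the text notes, RB and P become unstable when their denominators approach zero, so a caller would need to guard against weak initial bias (for RB) or a near-zero experimental estimate (for P).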
As with the RB measure, the challenge with P is that the distances can be extremely large if the experimental estimate is close to zero (τ̂_E ≈ 0).

While all distance measures assess the discrepancy between non-experimental and experimental estimates, how close should these results be for a WSC researcher to judge that the two estimates replicate? Even in cases where the benchmark and non-experiment perform equally well (i.e., both designs produce unbiased estimates in the given context), one would not expect the two estimates to be identical, given sampling and measurement error. Thus, without some threshold

values for describing meaningful differences in experimental and non-experimental results, it is often difficult to assess correspondence using distance measures alone. One solution is to define a meaningful threshold on the distance measure; another is to test the difference or equivalence of the experimental and non-experimental estimates.

Magnitude of difference. With respect to the distance measures discussed above, the WSC researcher may define a threshold δ and claim correspondence in experimental and non-experimental estimates if the absolute difference is less than or equal to the threshold, such that:

C_MD = 1[|Δ| ≤ δ_Δ].   (4)

Here, Δ is one of the distance measures discussed above (M, D, RB, P), and δ_Δ is the threshold with respect to distance measure Δ. For example, for the standardized distance measure D, one could use a threshold of δ_D = .1 and claim correspondence as long as the absolute standardized mean difference does not exceed .1. Although this approach is intuitively straightforward, it fails to account for the uncertainty due to sampling or measurement error in the outcome.

Insignificance of difference. Null hypothesis significance testing (NHST) helps in assessing whether the observed difference in effect estimates is due to systematic differences in methods or to random sampling error. For an independent WSC design where units are randomly assigned into experimental and non-experimental conditions (Wong & Steiner, under review), a researcher may assess correspondence through a two-sample t-test of the difference in the experimental and non-experimental treatment effect estimates. The standard null hypothesis is that there is no difference between experimental and non-experimental treatment effects, H₀: τ_E − τ_NE = 0. The test statistic is t = (τ̂_NE − τ̂_E)/s, where
s = √(s.e._E² + s.e._NE²) is the standard deviation of the difference, s.e._NE² is the variance (squared standard error) of the non-experimental treatment effect, and s.e._E² is the variance of the experimental treatment effect. The significance-related correspondence measure may be defined as:

C_SD = 1[p > α].   (5)

Here, p is the two-tailed p-value from the corresponding t-test. Correspondence is achieved if the p-value is greater than the Type-I error rate α, such that the null hypothesis cannot be rejected. Since the correspondence measure is a function of the p-value and the Type-I error rate, the measure depends on the magnitude of the difference in effect estimates, the variance of the two treatment
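A large-sample version of this test of difference, and the correspondence indicator in Eq. (5), can be sketched with a normal approximation to the t-test (Python standard library; the function name and the default α = .05 are ours):

```python
from math import sqrt
from statistics import NormalDist

def difference_test(tau_e: float, se_e: float,
                    tau_ne: float, se_ne: float, alpha: float = 0.05):
    """Test H0: tau_E = tau_NE for an independent WSC design.
    Returns (p_value, C_SD), where C_SD = 1[p > alpha] per Eq. (5)."""
    se_diff = sqrt(se_e**2 + se_ne**2)      # standard deviation of the difference
    z = (tau_ne - tau_e) / se_diff
    p = 2 * (1 - NormalDist().cdf(abs(z)))  # two-tailed p-value
    return p, int(p > alpha)
```

With identical estimates the p-value is 1 and correspondence is indicated; with a difference that is large relative to its standard error, the null is rejected and C_SD = 0.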

effects (which themselves depend on the sample sizes), and the Type-I error rate. Importantly, C_SD does not depend on the experiment's and non-experiment's power to detect the true but unknown treatment effect (the true effect is differenced out by taking the difference of the experimental and non-experimental estimates). The difference in effect estimates may be significant even if both the non-experimental and experimental estimates are not significantly different from zero; that is, the conclusion-based correspondence C_S does not imply the distance-based correspondence C_SD (and vice versa).

For dependent WSC designs where the non-experimental arm shares some portion of the sample with the experimental arm (Wong & Steiner, under review), the statistical tests should account for the correlation of the two treatment effects. We have the same test statistic, t = (τ̂_NE − τ̂_E)/s, but with standard deviation s = √(s.e._E² + s.e._NE² − 2r·s.e._E·s.e._NE), where r denotes the correlation between the experimental and non-experimental effect estimates. The third term of the standard deviation formula represents the covariance between the effect estimates and occurs because the experiment and non-experiment share the same treatment group or parts of the sample. Since the correlation or covariance between the two effect estimators is usually unknown to researchers, this test cannot be implemented directly without making some assumptions about the correlation r (i.e., the correlation is positive but unlikely to be close to one). One way of dealing with this issue is to bootstrap the t-statistic. Another possibility for dependent WSC designs that share a treatment group is to test whether the mean outcomes in the benchmark control and non-experimental comparison groups differ. That is, we employ the t-test statistic t = (Ȳ_c,E − Ȳ_c,NE)/s, where s = √(s.e.(Ȳ_c,E)² + s.e.(Ȳ_c,NE)²) is the standard deviation of the mean difference, and s.e.(Ȳ_c,E) and s.e.(Ȳ_c,NE) are the
Y Y standard errors of the experimental control and non-experimental comparison groups mean outcomes, spectively. null hypothesis We may also generalize the significance-lated corspondence measu to the mo general H : τ τ δ, that is, we test whether the absolute diffence in the two effect 0 estimators exceeds a specific thshold δ. However, we do not discuss such significance tests further because (a) they a overly conservative in the sense that the test indicates corspondence though the estimates might actually be quite diffent and (b) they a raly used in practice. dpolicyworks Working Paper eries No. 46. April urry chool of ducation Frank Batten chool of Leadership and Public Policy University of Virginia 11
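The difference test and the resulting significance-related correspondence measure can be sketched as follows. This is a minimal illustration, not the authors' code: the function name and the use of a normal approximation in place of an exact t-distribution are our assumptions, and the single `r` argument covers both the independent case (r = 0) and the dependent case (r > 0).

```python
import math

def diff_test(tau_e, se_e, tau_ne, se_ne, alpha=0.05, r=0.0):
    """Two-tailed test of H0: tau_E - tau_NE = 0 (normal approximation).

    r is the correlation between the two effect estimates; r = 0 gives
    the independent-design test, r > 0 the dependent-design correction.
    Returns (t, p, D_S) where D_S = 1 if p > alpha (correspondence).
    """
    # standard deviation of the difference, with a covariance term for
    # dependent designs: s_D^2 = se_E^2 + se_NE^2 - 2*r*se_E*se_NE
    s_d = math.sqrt(se_e**2 + se_ne**2 - 2 * r * se_e * se_ne)
    t = (tau_e - tau_ne) / s_d
    p = math.erfc(abs(t) / math.sqrt(2))  # two-tailed normal p-value
    return t, p, int(p > alpha)
```

For example, estimates of .30 and .35 with standard errors around .08 yield an insignificant difference (D_S = 1), whereas .30 versus .80 with small standard errors yields D_S = 0.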

Although significance tests of the difference in benchmark and non-experimental results are common in the WSC literature, a more careful consideration of the typical WSC research question suggests serious weaknesses with this approach. In the standard NHST framework, the researcher does not reject the null hypothesis if the difference in effects is insignificant. Traditionally, NHST protects against Type I errors (where the researcher concludes that there is a difference when there actually is no difference), and there is less concern about Type II errors (where the researcher concludes that there is no evidence for a difference when a difference actually exists). As a result, under the NHST framework, WSC results may be inconclusive when tests of difference are used to assess whether two effect estimates are the same. In cases where the experimental and non-experimental treatment effects are extremely different and the WSC is adequately powered, there may be evidence to reject the null of equivalent means. But when treatment effect estimates are similar, or when power is low in the WSC, the researcher may be uncertain how to interpret the lack of evidence to reject the null hypothesis. It could mean that there is no difference in treatment effect estimates between the experiment and non-experiment, or it could mean that the WSC is underpowered for detecting a difference. Thus, an insignificant result does not imply that the null hypothesis of a zero difference is true (even if one increases the Type-I error rate to .1 or .2). Moreover, standard NHST of a zero difference fails to address the conceptual problem of testing equivalence (as opposed to testing a difference). To overcome this conceptual problem, the WSC researcher may employ equivalence tests that use the same equivalence hypothesis as the general significance-related correspondence measure, H_0: |τ_E − τ_NE| ≥ δ, but formulate the equivalence as the alternative hypothesis and the difference as the null hypothesis.

Significance of Equivalence. Although statistical tests of equivalence have been used in public health (Barker, Luman, McCauley, & Chu, 2002), psychology (Tryon, 2001), and medicine (Munk, Hwang, & Brown, 2000), these tests are rarely applied in the analysis of WSCs. The only exceptions are Berk et al. (2010) and, more recently, Dong and Lipsey (2016), who used equivalence tests in their WSC analyses. Equivalence tests are useful for contexts where a researcher wishes to assess whether a new or alternative approach (such as a non-experiment) performs as well as the gold-standard experimental approach. For a regular t-test in the standard hypothesis framework, to conclude that there is a substantial difference between two estimates, the WSC researcher must observe a difference large enough to rule out sampling error as an alternative explanation for the difference. In tests of equivalence, a similar approach applies, but to conclude that there is no or only a negligibly small difference in estimates, one must observe a small enough difference to reject the competing explanation that the observed equivalence or closeness is merely due to sampling error.

Equivalence testing first requires the determination of a tolerance threshold δ, which is considered a trivial or nonconsequential difference between the two estimates. For example, a difference of δ = .1 between the experimental and non-experimental estimates might be considered tolerable. Then, the difference in estimates is considered equivalent if the composite null hypothesis that the absolute difference in effects is larger than the threshold, H_0: |τ_E − τ_NE| ≥ δ, is rejected. The alternative hypothesis states the equivalence of the estimates, H_A: |τ_E − τ_NE| < δ; that is, the absolute difference in estimates is smaller than the threshold. The composite null hypothesis of inequivalence may be formulated as two one-sided null hypotheses, H_01: τ_E − τ_NE ≥ δ and H_02: τ_E − τ_NE ≤ −δ, which are directly testable (Schuirmann, 1987; Tryon & Lewis, 2008). If both one-sided null hypotheses are rejected, the data do not provide enough evidence for a difference that exceeds the threshold δ, and the WSC analyst concludes that the two methods are equivalent (within the tolerance threshold δ). Each of the two one-tailed tests is conducted with a nominal Type-I error rate of α. Thus, correspondence in equivalence is defined as:

D_E = 1[p_1 ≤ α & p_2 ≤ α],

where p_1 and p_2 are the one-tailed p-values with respect to the two one-sided null hypotheses. The p-values are typically obtained from two separate t-tests.

One challenge with equivalence tests, and perhaps the reason why researchers have avoided these tests for assessing correspondence, is that the tolerance threshold must be defined in advance. However, the use of statistical equivalence tests may be one way to operationalize Wilde and Hollister's early advice to define meaningfully large differences in experimental and non-experimental results before a WSC analysis is conducted.
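The two one-sided tests (TOST) described above can be sketched as follows. This is a minimal illustration under our own assumptions: a normal approximation stands in for the two separate t-tests, and the function name and defaults are not from the paper.

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2))

def tost(tau_e, tau_ne, se_e, se_ne, delta, alpha=0.05):
    """Two one-sided tests of the composite null H0: |tau_E - tau_NE| >= delta.

    Rejecting both one-sided nulls (p1 <= alpha and p2 <= alpha) gives
    D_E = 1: the estimates are equivalent within the tolerance delta.
    """
    s_d = math.sqrt(se_e**2 + se_ne**2)
    d = tau_e - tau_ne
    # H01: d >= +delta, rejected for sufficiently small observed d
    p1 = norm_cdf((d - delta) / s_d)
    # H02: d <= -delta, rejected for sufficiently large observed d
    p2 = 1 - norm_cdf((d + delta) / s_d)
    return int(p1 <= alpha and p2 <= alpha), p1, p2
```

With identical estimates and small standard errors, both one-sided nulls are rejected (D_E = 1); with large standard errors the test is underpowered and returns D_E = 0 even for identical estimates, mirroring the power issue discussed below.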
Here, the substantive threshold is built into the testing framework for benchmark and non-experimental results. Another limitation of the approach is that small tolerance thresholds of .1 or .2 require large samples from both the randomized experiment and the non-experiment. Otherwise, the equivalence test will not be sufficiently powered (see the simulation results below). Since a lack of evidence for equivalence does not imply that the two methods perform differently, it is useful to consider not only a dichotomous test outcome (equivalence vs. non-equivalence) but four possible

test outcomes of a composite correspondence test: equivalence, difference, trivial difference, and indeterminacy, which we discuss below.

Correspondence Test: Equivalence, Difference, Trivial Difference, and Indeterminacy. Tryon (2001) and Tryon and Lewis (2008) suggest a testing procedure that combines the outcomes of an equivalence test and a standard significance test for a difference in parameters. Here, we define a combined correspondence measure, C_T, from the significance in equivalence, D_E, and the significance in difference, D_S, in a single framework:

C_T = Equivalence         if D_E = 1 & D_S = 1
      Difference          if D_E = 0 & D_S = 0
      Trivial Difference  if D_E = 1 & D_S = 0
      Indeterminacy       if D_E = 0 & D_S = 1

For this measure, if correspondence is achieved in both metrics, D_E = 1 & D_S = 1, the WSC analyst concludes that the two methods are equivalent. That is, the equivalence test rejects the null hypothesis of a difference larger than the threshold, and the significance test for the difference does not reject the nil hypothesis. If both correspondence metrics indicate non-correspondence, D_E = 0 & D_S = 0, then the analyst concludes that the two estimates differ. A trivial difference is given if the estimates differ significantly (i.e., non-correspondence in difference, D_S = 0), but the equivalence test suggests correspondence (D_E = 1). Finally, if the data do not provide enough evidence for a significant equivalence (D_E = 0) or a significant difference (D_S = 1), then the WSC analyst concludes that correspondence is statistically indeterminate. Since indeterminacy results from failing to reject the null hypotheses of both tests, an indeterminate outcome is most likely when sample sizes are small (underpowered tests). We refer to the correspondence measure C_T as a correspondence test because both equivalence and difference tests are used for drawing a conclusion about the methods' correspondence.
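The four-outcome logic above can be sketched as a small classifier; the function name and outcome labels are ours, combining the indicators produced by the difference and equivalence tests.

```python
def correspondence_test(d_e, d_s):
    """Combine the equivalence indicator (d_e) and the difference-test
    indicator (d_s, 1 = no significant difference) into the
    four-outcome correspondence test C_T."""
    if d_e == 1 and d_s == 1:
        return "equivalence"         # equivalent within delta, no significant difference
    if d_e == 0 and d_s == 0:
        return "difference"          # significant difference, no equivalence
    if d_e == 1 and d_s == 0:
        return "trivial difference"  # significant difference, but within delta
    return "indeterminacy"           # neither test rejects its null
```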
In comparison to the correspondence in equivalence (D_E) and difference (D_S) alone, the correspondence test (C_T) has the advantage that equivalence is indicated only if the equivalence test is sufficiently powered and the difference in methods is insignificant. Since an adequately powered equivalence test (with a reasonable threshold) also implies a highly powered difference test, correspondence is achieved only if the experiment and non-experiment result in approximately the

same effect estimates. With small sample sizes, it is unlikely to achieve equivalence even if the estimates are nearly identical. Thus, when the benchmark and non-experimental designs are underpowered, the correspondence test will likely yield a result of statistical indeterminacy (the data do not provide sufficient evidence for a significant equivalence or a significant difference). However, with large samples, or a large threshold value δ, it is possible to observe a significant difference in effect estimates and still achieve correspondence in the equivalence test. In this case, the WSC analyst would conclude that a trivial difference between the methods exists. However, in the WSC context, even trivial differences should count as non-correspondence between methods unless a very small threshold value is used.

The advantages of the correspondence test, C_T, are apparent. First, it provides the researcher with a statistical test that explicitly addresses the question of interest in most WSCs: does the non-experimental method perform as well as the experiment in field settings? Second, it requires that the researcher address in advance what constitutes a meaningful difference in experimental and non-experimental results. Third, it allows the researcher to conduct both statistical tests of difference and equivalence. Fourth, it provides a clear decision criterion for when experimental and non-experimental results are equivalent, different, or inconclusive. Though we did not explicitly discuss the equivalence and correspondence tests for dependent WSCs, the very same criteria can be applied to the difference in the mean outcomes of the independent control and comparison groups. Alternatively, one can also bootstrap the difference in the average treatment effects of the dependent experimental and non-experimental study arms, and then compute the correspondence measures.

SIMULATION STUDY

In this section, we examine the performance of conclusion- and distance-based correspondence measures under different WSC conditions for an independent design. For both the randomized experiment and the non-experiment, we randomly sampled cases from a large joint target population; we then randomly assigned the benchmark cases to the treatment or control condition, and systematically selected non-experimental cases into treatment conditions based on two covariates. To account for selection bias, we estimated the treatment effect in the non-experiment via inverse-propensity weighting and an additional covariance adjustment.

Simulation Design

To simulate the WSC design, we began by generating a target population of one million cases with two baseline covariates, X1 and X2, and a pair of potential control and treatment outcomes, Y0 and Y1. The values of the two covariates were sampled from a bivariate normal distribution with zero means, unit variances, and a correlation of .3. The pair of potential control and treatment outcomes was computed according to Y0i = β1·X1i + β2·X2i + εi and Y1i = β1·X1i + β2·X2i + τ·σ_Y0 + εi, where the additive treatment effect was assumed to be τ times the standard deviation of the potential control outcome, Y0. Thus, τ represents the effect in terms of standard deviations (SD). The error term εi was generated according to a normal distribution with zero mean and a standard deviation of .5. Note that the data-generating coefficients β1 and β2, as well as the error term εi, are the same for the potential control and treatment outcomes. The observed outcome, Yi, was determined according to the treatment assignment status of the cases in the randomized experiment and non-experiment. In the randomized experiment, half of the cases were randomly assigned to the treatment condition, the other half to the control condition (i.e., each case had an assignment probability of .5). In the non-experiment, the assignment probabilities were determined according to a logistic model with the two baseline covariates, logit(Zi) = γ0 + γ1·X1i + γ2·X2i.

To investigate the performance of the correspondence measures under different data-generating scenarios, we created three different target populations by varying the magnitude of the treatment effect and the direction of the selection process. Table 1 contains the parameter settings for the three populations.
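The data-generating process above can be sketched as follows. This is an illustrative sketch only: the β and γ values are placeholders (Table 1 is not reproduced in this excerpt), and the function name and return structure are ours.

```python
import math, random

def simulate_population(n, tau, beta=(1.0, 0.5), gamma=(0.0, 1.0, 0.5),
                        rho=0.3, sigma_eps=0.5, seed=1):
    """Generate n cases with correlated confounders X1, X2, potential
    outcomes Y0/Y1 (effect = tau SD of Y0), and logistic selection
    probabilities. beta/gamma are illustrative placeholders for the
    paper's Table 1 parameters."""
    rng = random.Random(seed)
    b1, b2 = beta
    g0, g1, g2 = gamma
    # SD of Y0 implied by the linear model with standard-normal
    # covariates correlated at rho; used to scale the effect to tau SD
    sd_y0 = math.sqrt(b1**2 + b2**2 + 2 * b1 * b2 * rho + sigma_eps**2)
    cases = []
    for _ in range(n):
        x1 = rng.gauss(0, 1)
        x2 = rho * x1 + math.sqrt(1 - rho**2) * rng.gauss(0, 1)  # corr(X1, X2) = rho
        eps = rng.gauss(0, sigma_eps)        # error term shared by Y0 and Y1
        y0 = b1 * x1 + b2 * x2 + eps
        y1 = y0 + tau * sd_y0                # additive effect of tau SD
        p = 1 / (1 + math.exp(-(g0 + g1 * x1 + g2 * x2)))  # selection probability
        cases.append((x1, x2, y0, y1, p))
    return cases
```

In the experimental arm one would ignore `p` and assign treatment with probability .5; in the non-experimental arm, assignment follows `p`.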
Here, Population 1 is characterized by a medium effect size (.3 SD) for both the benchmark and the non-experiment, and a positive selection process in the non-experiment, with X1 being a stronger confounder than X2 (X1 has the larger coefficients in both the outcome and the selection model). The selection process is of medium strength, producing a positive confounding bias of about .43 SD in the treatment effect. Population 2 has the same non-experimental selection process, but the treatment effect is only .05 SD in the benchmark and non-experimental conditions. Thus, for Population 2, it will be harder to demonstrate significant treatment effects in the experiment and non-experiment. Finally, Population 3 has the same effect size as Population 1 (.3 SD), but instead of a positive selection process, we have a negative selection process that induces a negative selection bias. A negative selection bias results because the confounders' coefficients are negative in the selection model but positive in the outcome model. Since the populations' coefficients of the selection model differ only in their sign but not in their magnitude, the strength of the selection process and the absolute bias are the same across all three populations.

For each population, we ran separate simulations. In each iteration of the simulation, we first drew a random sample of n_E + n_NE cases, and then randomly assigned the experimental and non-experimental cases according to their assignment probabilities. Depending on the assignment status, the potential control or treatment outcome was recorded as the observed outcome. For the experimental and non-experimental data, we then used regression estimators to estimate the treatment effects. Table 2 summarizes the regression models used to estimate treatment effects for both the experiment and the non-experiment. For the experiment, we estimated treatment effects without any covariance adjustment. For the non-experimental data, we considered estimators of four models: three doubly robust estimators that combine inverse-propensity weighting with a covariance adjustment, and one model that did not control for any confounding covariates. These four models produce non-experimental treatment effects with different degrees of remaining bias. The first model controls for both confounding covariates, X1 and X2; that is, both covariates were used for propensity score (PS) estimation and covariance adjustment. Thus, the corresponding treatment effect estimator is unbiased because both confounding covariates are used and correctly modeled. The second model controls only for X1 (in both the PS and outcome models), which results in an estimator with an expected bias of .08 SD. The third model controls only for X2 (the weaker confounder), leading to a bias of .24 SD. Finally, the fourth model does not control for any covariates, which implies a bias of .43 SD in the treatment effect.
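The inverse-propensity-weighting step of these estimators can be sketched as follows. This is a deliberately simplified illustration, not the authors' estimator: it assumes propensity scores are already available and omits the additional covariance (regression) adjustment on X1/X2 that makes the paper's estimators doubly robust.

```python
def ipw_effect(cases):
    """Inverse-propensity-weighted average treatment effect.

    cases: iterable of (y, z, p) with observed outcome y, treatment
    indicator z (1/0), and propensity score p. Uses Hajek-style
    normalized weights for stability; the paper's estimators add a
    covariance adjustment, omitted here for brevity.
    """
    wt_t = [(y / p, 1 / p) for y, z, p in cases if z == 1]
    wt_c = [(y / (1 - p), 1 / (1 - p)) for y, z, p in cases if z == 0]
    mean_t = sum(v for v, w in wt_t) / sum(w for v, w in wt_t)
    mean_c = sum(v for v, w in wt_c) / sum(w for v, w in wt_c)
    return mean_t - mean_c
```

Dropping a confounder from the propensity-score model (as in the second through fourth models) leaves the corresponding share of selection bias in this estimate.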
Varying the extent of remaining bias allows us to assess the performance of the correspondence measures when the observed covariates do not succeed in removing all of the selection bias in the non-experiment. For each iteration, we computed a series of measures for assessing the correspondence between the benchmark and the four non-experimental estimates. We used the conclusion-based correspondence in direction (C_D), magnitude (C_M, with a threshold of λ = .2 SD), and significance (C_S), based on the regression-based t-test with a Type-I error rate of α = .05. With respect to the distance-based measures, we computed the correspondence in magnitude (D_M) with a threshold of δ_M = .1 SD, the correspondence in significance (D_S) based on a t-test with α = .05, as well as the correspondence measure for a significant equivalence (D_E) with thresholds δ of .1, .3, and .5 SD.

Based on the significance in difference and the significance in equivalence, we then conducted the correspondence test (C_T) for the three equivalence thresholds. For each of the three populations we ran K = 10,000 iterations with different sample sizes for the experiment and non-experiment. For the benchmark RCT, we used sample sizes of 100, 400, and 1600 cases, and for the non-experiment we simulated 21 different sample sizes between 20 and
For each simulation setting, we then computed the correspondence probabilities as the proportion of correspondence indicated by each measure. For instance, for the conclusion-based correspondence in significance, the estimated correspondence probability is given by P(C_S = 1) = (1/K) Σ_{k=1}^{K} C_S,k, where C_S,k is the correspondence indicator for iteration k. For the correspondence test, which has four possible outcomes, we correspondingly computed the probabilities of equivalence, difference, trivial difference, and indeterminacy.

Results

Conclusion-based Correspondence Measures. The first three figures (Figures 2 to 4) show the performance of the conclusion-based correspondence measures for the three different populations: positive selection with a treatment effect of .3 SD, positive selection with a very weak treatment effect of .05 SD, and negative selection with an effect of .3 SD. The performance plots for Population 1 are depicted in Figure 2. The plots in the columns represent variations in experimental sample size and thus the experiment's power to detect the true treatment effect. The column headings indicate the RCT sample sizes of 100, 400, and 1600, which correspond to powers of .29, .81, and .999, respectively. Thus, the first column of plots represents a situation where the experiment is underpowered with respect to the true effect, the second column refers to an RCT with adequate power, and the third column to an extremely well-powered RCT.
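The iteration-level aggregation described above can be sketched as follows; the function names are ours, and the inputs are the per-iteration correspondence indicators and correspondence-test outcomes.

```python
def correspondence_probability(indicators):
    """Estimate P(C = 1) as the share of iterations indicating
    correspondence: (1/K) * sum_k C_k."""
    return sum(indicators) / len(indicators)

def outcome_probabilities(outcomes):
    """For the four-outcome correspondence test, tabulate the relative
    frequency of each outcome across the K iterations."""
    k = len(outcomes)
    labels = ("equivalence", "difference", "trivial difference", "indeterminacy")
    return {label: outcomes.count(label) / k for label in labels}
```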
It is important to realize that these power figures refer to the power with respect to the unknown true effect, and not to the power for a minimum detectable effect size as used in power calculations. The rows of Figure 2 differ in the extent of remaining bias in the non-experiment. For the first row of plots, no bias remains because the analyses conditioned on both confounding covariates, X1 and X2. The second row shows the results for the non-experimental analyses where X1 has been included but X2 has been omitted; here, the expected bias is .08 SD. The third row refers to the non-experimental analyses with X2 included but X1 omitted, which results in a bias of .24 SD. Finally, the last row reveals the performance of the correspondence measures when the bias is .43 SD, which we obtained by not controlling for any of the confounding covariates.

Each single plot shows the correspondence probabilities (ordinate) as a function of the non-experiment's sample size (abscissa, which is on a log scale). The abscissa shows two scales: the non-experiment's sample size and the corresponding power to detect the true effect if both confounders are conditioned on (i.e., the power for the correctly specified model).⁴ The three lines in each plot trace the correspondence probabilities of the three conclusion-based measures: solid lines for C_D, dashed lines for C_M, and dotted lines for C_S.

For the first row of plots, where the non-experiment's effect estimates are unbiased, we expect the measures to indicate correspondence because, given sufficient power, we should obtain the same results from the benchmark RCT and the non-experiment. Indeed, all three correspondence measures, C_D, C_M, and C_S, almost always indicate correspondence, but only if both studies have a power close to one (top right plot, with an RCT sample size of 1600 and a non-experimental sample size of at least 500). If the power of one of the studies, RCT or non-experiment, is less than .8, then correspondence in magnitude and significance never exceeds a probability of .8. This is not surprising because, with a power of .8, the RCT or non-experiment draws the correct conclusion about the existing effect (.3 SD) in only 80% of the cases. The correspondence in magnitude (C_M) is comparatively robust and almost always indicates correspondence, except when the power of one of the studies is very low, such that the estimates might be negative (due to the relatively large sampling error). It is also worth noting that, with an underpowered RCT (power of .29, top left plot), the correspondence probability of C_S diminishes as the non-experiment's power increases. That is, if both studies are highly underpowered, we will obtain correspondence in significance because both studies produce insignificant effect estimates. But as the sample size (and thus the power) of the non-experiment increases, the correspondence probability decreases because the non-experimental estimate will likely be significant while the RCT estimate remains insignificant. As the non-experimental sample size and power increase, the correspondence probability converges to the RCT's power (in this case, .29).

⁴ Though we characterize the correspondence probabilities as a function of sample sizes (of both the RCT and non-experiment), the sample size requirements for obtaining sufficient power also depend on the portion of explained variance (R²) after controlling for covariates. In our simulations, the R²s of the analytic models are 2% for the RCT (without any covariance adjustment), 45% for the non-experimental model with both confounders X1 and X2, 38% with X1 only, 24% with X2 only, and 11% with neither of the two confounders included. With higher R²s, sample sizes can be smaller; with lower R²s, sample sizes must be larger to achieve the same correspondence probabilities.

21 Analysis of Within-tudy omparisons The second to the fourth rows of plots display the corspondence probabilities as the bias in the non-experimental estimates incases form.08 to.24 and.43. Given a biased nonexperimental estimate, we expect that the conclusion-based corspondence measus should no longer indicate corspondence. However, this expectation does not bear out because the conclusion-based measus only focus on the corspondence of conclusions drawn about the tatment s impact. They a insensitive to the magnitude of biases in estimates when the bias does not sult in a diffent policy conclusion. ince the plots in Figu 2 fer to a data-generating model that used a positive tatment effect (.3 ) and a positive selection process (i.e., the confounders introduce a positive selection bias), failing to move all of the selection bias does not invalidate the conclusions drawn from the non-experiment. To the contrary, the mo selection bias that is left, the mo likely we conclude the is a significant positive tatment effect that exceeds our magnitude thshold. Thus, it is not surprising that the corspondence probabilities incase as the bias incases (from row one to row four), provided that the RT is sufficiently powed (at least gater than.8). In cases whe the non-experimental sample is large and the power in nearly one (such that the non-experiment will almost always draw the right conclusion), the corspondence probabilities a entily determined by the sample size (power) of the RT. With an underpowed RT, potential bias in the non-experimental tatment effect has almost no consequences on the corspondence probabilities. The sults discussed so far only hold for a positive tatment effect of.3 and a positive selection bias. The corspondence probabilities a quite diffent if the true tatment effect is close to zero (.05 ), or the selection process induces negative selection bias. Figu 3 shows the sults for Population 2 with a tatment effect of.05. 
Because we used the same sample sizes as for Population 1, the power to detect a significant treatment effect is now much lower for both the RCT and the non-experiment. The three columns of plots represent RCTs with powers of .06, .08, and .16. The correspondence probabilities differ strongly from those of Population 1. With an increasing sample size for the non-experiment, the correspondence in significance, C_S, tends to diminish, since the non-experiment's power to detect even a small effect increases while the RCT has no power for demonstrating an effect. Thus, the conclusions often do not correspond. Due to the small true effect and the RCT's low power, correspondence in the estimates' direction (C_D) and magnitude (C_M) is also less likely. If there is remaining bias in the non-experiment's treatment effect (second to fourth rows of Figure 3), then the conclusions based on the experimental and non-experimental estimates no longer correspond, particularly when the non-experimental sample size is large. Due to the positive bias, the non-experimental analyses indicate a positive treatment effect while the underpowered RCT analyses suggest the absence of an effect.

Figure 4 shows the results for Population 3, where the selection process induces a negative selection bias. Thus, failing to control for all confounding covariates results in an underestimation of the treatment effect. The four rows of plots represent the same four bias scenarios as before, except that the biases are now negative. Thus, given a true effect of .3 SD and biases of 0, −.08, −.23, and −.41 SD, the non-experiment's average effect estimates are .3, .22, .07, and −.11 SD, respectively (i.e., .3 − 0, .3 − .08, .3 − .23, and .3 − .41). Compared to Population 1, more bias now implies a smaller rather than larger effect estimate, which even becomes negative if neither of the two confounders is used to remove bias. Consequently, with increasing bias in the non-experimental estimate, we now expect to see decreasing rather than increasing correspondence. This is apparent from Figure 4. The correspondence probabilities diminish as one goes from the plots in the first row (no bias) down to the plots in the fourth row (bias of −.41 SD). However, in cases where all selection bias is removed by the non-experimental analysis (first row), the correspondence measures do not depend on the direction of the selection process.

Overall, the plots in Figures 2 to 4 demonstrate that the conclusion-based correspondence measures depend on: 1) both the benchmark RCT's and the non-experiment's power to detect the true but unknown effect, in particular the true but unknown treatment effect and the sample size; 2) the direction of the selection process; and 3) the extent of remaining bias in the non-experimental effect estimate. Importantly, a highly biased non-experimental estimate does not always result in a lack of correspondence in conclusion-based measures.
In the case of a positive (negative) treatment effect and positive (negative) selection bias, more bias can imply a higher correspondence probability. The plots presented here illustrate this characteristic of conclusion-based measures: for all three populations, the characteristics of the RCT and non-experiment were held constant, but the conclusions drawn from the analyses differed dramatically.

Distance-based Correspondence Measures. Since the characteristics of the RCT and non-experiment were held constant across the three populations, the results for the distance-based correspondence measures do not differ across the three populations. That is, in contrast to the conclusion-based measures, distance-based measures do not depend on the size of the true effect or the direction of the selection process; only the absolute distance between the two effect estimates matters. Therefore, we discuss results for Population 1 only (treatment effect of .3 SD and a positive selection process). Irrespective of the conclusions drawn from the RCT and non-experimental study (i.e., whether a treatment is effective or ineffective), distance-based correspondence measures assess whether the effect estimates are similar or different enough to conclude that the two methods perform equally well or differently. Thus, if both the RCT and non-experimental estimates are unbiased, we would expect the distance-based measures to indicate correspondence. If one estimate is biased (often the non-experimental estimate, provided the experiment is valid), we expect the distance measures to indicate non-correspondence.

Figure 5 shows, for Population 1, the correspondence probabilities of the correspondence in magnitude (D_M, with a threshold of .1 SD), the correspondence in significance (D_S), and the three equivalence measures (D_E with thresholds of .1, .3, and .5 SD). First, we consider the top row of plots, where the non-experimental estimates are unbiased and where we would expect the measures to indicate correspondence in estimates. D_M (solid lines) indicates correspondence only if the sample sizes of the RCT and non-experiment are large enough that the difference in estimates is reliably estimated. With a larger threshold, we would obtain greater correspondence probabilities. If the sample size of one of the studies is small, the performance of D_M is poor because sampling error regularly produces a difference that is larger than the threshold. Correspondence in significance, D_S (dashed lines), almost always suggests that the RCT and non-experimental estimates do not differ. This is what we would expect from a significance test with a null hypothesis of a zero difference: with a Type-I error rate of 5%, we do not reject the null hypothesis in 95% of the tests, given that the null hypothesis is true (which is the case here, because both the RCT and non-experimental estimates are unbiased).
For small non-experimental samples (less than about 200), the corspondence probabilities a slightly below the expected 95% because the t- test we employed is based on the normal approximation with identical variances instead of a t- distribution with Welch atterthwaite-corcted deges of fedom. Both measus, M and a sensitive to bias in the non-experimental estimate. With incasing bias (from the second to the fourth row of plots in Figu 5), the corspondence probabilities decline, particularly if the sample sizes of both studies a large. However, with small to medium-sized biases (.24 ) in the nonexperimental estimate, and small to medium RT and non-experimental sample sizes (up to about 500), the corspondence probabilities for the significance measu, a still high (gater than dpolicyworks Working Paper eries No. 46. April urry chool of ducation Frank Batten chool of Leadership and Public Policy University of Virginia 22

24 Analysis of Within-tudy omparisons.5). ue to small diffences in effect estimates or because of lack of power, the detect true diffences in RT and non-experimental sults. As discussed above, one option for addssing the undesirable but expected liberal measu fails to performance of the significance measu ( ) is to use an equivalence test ( ). The advantage he is that the equivalence test will not indicate corspondence in RT and non-experimental sults simply because of lack of statistical power in the W. However, an important consideration for equivalence tests is establishing an appropriate tolerance thshold for assessing corspondence in benchmark and non-experimental sults. If the tolerance thshold is set too small, the equivalence test will indicate lack of corspondence when W sample sizes a not large. In our simulation, we found that if the thshold is set at.1, then the equivalence test often had insufficient power for indicating corspondence, even when the was no actual bias. He, sample sizes exceeding 5,000 in the RT and non-experiment we quid. Figu 5 shows that with δ =.1 has a corspondence probability gater than zero only in the top right plot with 1600 RT cases and mo than 1500 non-experimental cases (dotted line). Incasing the thshold values to.3 or.5 allows for a mo liberal equivalence test when W sample sizes a not very large. As our simulation sults show, with a tolerance thshold psents indeterminacy. Figu 6 contains the plots for a thshold of.3. If the is no maining 23 dpolicyworks Working Paper eries No. 46. April urry chool of ducation Frank Batten chool of Leadership and Public Policy University of Virginia δ of.3 and.5 succeeded in showing equivalence when sample sizes we moderate (at least 400 in the RT and non-experiment, second plot in the first row). However, the tradeoff is that tolerance thsholds of.3 or.5 may be too large for assessing non-experimental performance when the is in fact bias. 
Our simulation results show that even with well-powered tests, the equivalence measure regularly indicates correspondence unless the remaining bias in the non-experiment is large (.43, last row of plots). However, if the bias exceeds the chosen threshold δ, the equivalence measure always suggests non-correspondence. The correspondence test, which combines the significance and equivalence measures, overcomes the weaknesses of each test on its own. Figures 6 to 8 show, for different threshold values (.3, .1, .5), the results for the correspondence test with Population 1. The structure of each figure is the same as before, but the ordinate now reflects the probabilities with which each outcome of the test occurs. The solid line shows the probability of equivalence, the dashed line the probability of difference, the dotted line indicates a trivial difference, and the dashed-dotted line presents indeterminacy.

Figure 6 contains the plots for a threshold of .3. If there is no remaining bias in the non-experimental estimate (first row of plots) and if both the RCT and non-experiment have sufficiently large sample sizes (1,600 in the RCT and at least about 500 in the non-experiment; top right plot), then the correspondence test almost always indicates equivalence (solid line). With decreasing sample sizes of the RCT or non-experiment, the correspondence test is less likely to suggest equivalence; instead, the probability of an indeterminate outcome increases (dashed-dotted line). Thus, with an unbiased non-experimental estimate, the correspondence test indicates equivalence only if the test is sufficiently powered; otherwise the test indicates indeterminacy. Only occasionally does the test suggest a difference (dashed line) or a trivial difference (dotted line). The plots in the second to fourth row of Figure 6 show that the correspondence test is also sensitive to bias in the non-experimental estimate. If the bias is large (.43, last row), then the test indicates either a difference or indeterminacy. With sufficient power (large enough sample sizes), the correspondence test almost surely indicates a significant difference; but as the power diminishes, indeterminacy is the more likely test outcome. Given a bias of .43 and a threshold of .3, the test almost never indicates equivalence of RCT and non-experimental estimates. If the bias in the non-experimental estimate is smaller (.08, second row), then the correspondence test tends to indicate equivalence, provided that sample sizes are large enough; otherwise indeterminacy is the more likely test outcome. However, given the slight bias, the probability of a difference or a trivial difference is now increased (see particularly the plot in the second row and last column). As the bias increases further (.24, third row), the probability of an equivalence outcome decreases and the probability of a difference increases. The plots in Figures 7 and 8 demonstrate that the correspondence test is sensitive to the choice of the threshold.
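The four-outcome logic of the correspondence test can be sketched by combining a two-sided difference test with a TOST equivalence test. This is an illustrative sketch under a normal approximation with independent arms; the function name and the exact decision rule coding are ours, not the paper's formal definition.

```python
import math

def norm_cdf(x):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def correspondence_test(est_rct, se_rct, est_ne, se_ne, delta, alpha=0.05):
    """Classify a WSC result by combining a two-sided difference test
    with a TOST equivalence test using tolerance threshold delta.
    Returns one of: 'equivalence', 'difference', 'trivial difference',
    'indeterminacy'."""
    diff = est_ne - est_rct
    se_diff = math.sqrt(se_rct**2 + se_ne**2)
    # two-sided test: is the difference significantly nonzero?
    p_diff = 2.0 * (1.0 - norm_cdf(abs(diff) / se_diff))
    significant = p_diff < alpha
    # TOST: is the difference significantly inside (-delta, +delta)?
    p_lower = 1.0 - norm_cdf((diff + delta) / se_diff)
    p_upper = norm_cdf((diff - delta) / se_diff)
    equivalent = max(p_lower, p_upper) < alpha
    if equivalent and not significant:
        return 'equivalence'
    if significant and not equivalent:
        return 'difference'
    if significant and equivalent:
        # significantly nonzero, yet within the tolerance threshold
        return 'trivial difference'
    return 'indeterminacy'

print(correspondence_test(0.30, 0.05, 0.31, 0.05, delta=0.3))  # equivalence
print(correspondence_test(0.30, 0.05, 0.73, 0.05, delta=0.3))  # difference
print(correspondence_test(0.30, 0.20, 0.38, 0.20, delta=0.1))  # indeterminacy
```

The third call mirrors the pattern in the figures: with large standard errors (small samples), neither the difference nor the equivalence null can be rejected, so the test withholds judgment rather than declaring correspondence by default.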
Though setting the threshold at .1 indicates that we are not willing to accept a difference in RCT and non-experimental estimates of .1 or larger as negligible, the correspondence test often lacks the power to show equivalence in this case. As the plots in the first row of Figure 7 show, the predominant test outcome is indeterminacy. Only when the sample sizes of both studies exceed 1,500 does the test occasionally indicate equivalence (top right plot). However, the test remains sensitive to moderate to large biases in the non-experimental estimate. For example, the correspondence test indicates a significant difference with sufficiently large sample sizes (at least 400) and a bias of .24 or more (last two rows of plots in Figure 6). If the correspondence test should be more sensitive to equivalence rather than difference in estimates, we can increase the threshold, for instance to .5 as shown in Figure 8. In the absence of any bias (first row of plots), the correspondence test indicates equivalence as long as both sample sizes are moderately large (at least 300). In the case of biased non-experimental estimates, the correspondence test more frequently indicates a trivial difference; that is, there is a significant difference in estimates, but simultaneously the equivalence test suggests that the estimates do not differ (due to the large tolerance threshold). Given a threshold of .5, one should interpret trivial differences as evidence against the equivalence of experimental and non-experimental estimates.

CONCLUSION

This paper examines methods for assessing correspondence in benchmark and non-experimental results, highlighting each approach's advantages and limitations. We have argued that the measures employed should reflect the purposes of the WSC design. For example, is the WSC design meant to address the policy-relevant question of whether both the benchmark and non-experimental method would result in the same conclusion by a decision-maker? If so, then the relevant measures include comparing the direction, magnitude, and statistical significance patterns of results from the experiment and non-experiment. These approaches indicate whether the policy-maker would arrive at the same decision about a program's efficacy using the benchmark or non-experimental design. However, our simulations show that results from conclusion-based measures are highly sensitive to the statistical power in the benchmark and non-experimental designs (i.e., the magnitude of the unknown true effect, the sample sizes, and error variances) and the direction of the initial selection bias in the non-experiment. Overall, we recommend that WSC analysts interpret their conclusion-based results cautiously, within the context of their study conditions. For example, when the true treatment impact is substantially large (.3) and there is positive selection bias in the non-experiment, all the conclusion-based measures are likely to indicate correspondence in benchmark and non-experimental results.
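As a hypothetical operationalization of the conclusion-based measures just described (the function name and the sign-and-significance rule below are our illustration, not the paper's formal definition):

```python
def same_policy_conclusion(est_rct, p_rct, est_ne, p_ne, alpha=0.05):
    """Conclusion-based correspondence: would a decision-maker draw the
    same conclusion from both designs?  Here, results 'correspond' when
    both estimates share the same sign and the same statistical-
    significance status."""
    same_direction = (est_rct > 0) == (est_ne > 0)
    same_significance = (p_rct < alpha) == (p_ne < alpha)
    return same_direction and same_significance

# A positively biased non-experimental estimate can still yield the same
# policy conclusion when the true effect is large and both tests are
# well powered:
print(same_policy_conclusion(0.30, 0.001, 0.54, 0.0001))  # True
```

Note that this rule returns True despite a substantial gap between the two estimates, which is exactly the sensitivity to power and bias direction discussed in the text.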
This is because even if substantial bias remains in the non-experiment, the policy-maker will draw the same conclusions based on the direction, size, and statistical significance of results in the benchmark and non-experimental designs. The only exception is in cases when one of the study designs is underpowered for detecting significant effects. In the case where the true treatment impact is near zero and there is positive selection bias in the non-experiment, comparing the direction and magnitude of the effects from the two study designs may be informative for assessing the performance of the non-experiment (depending on the threshold established for assessing "magnitude"). However, in this case as well, patterns of statistical significance are highly sensitive to the power of the benchmark and non-experimental designs. Thus, conclusion-based measures often perform poorly as an indicator of non-experimental bias; instead, these measures are informative only for addressing the policy question of whether a decision-maker would arrive at the same conclusion using the benchmark and non-experimental study designs in the given context. A lack of correspondence does not automatically imply the failure of the non-experiment; it might be caused by a Type II error of an underpowered test in the RCT.

Given that conclusion-based correspondence measures are highly sensitive to design attributes of a specific WSC, how should one assess the magnitude of the true treatment effect (and thus the statistical power in the experiment and non-experiment) and the direction of the selection bias in the non-experiment when these attributes are not directly observable to the researcher? Often, she may be able to roughly infer this information by examining results from prior research and baseline characteristics of WSC study participants. For example, results from earlier evaluations of a similar intervention may provide the WSC analyst with guidance about the magnitude of the program's true impact. Moreover, comparing baseline characteristics between treatment and comparison units in the non-experiment may indicate the direction of the initial selection bias.

To address the methodological question of how non-experimental methods perform in field settings, distance-based measures are useful. Here, the WSC analyst compares treatment effects from the two study designs to assess how well the non-experimental method is able to replicate benchmark results. Currently, the most popular approach for judging whether non-experimental and benchmark results correspond is to conduct statistical tests of difference between the two results. As we have shown, this criterion has severe limitations in the WSC context, especially when benchmark and non-experimental sample sizes are small. In these cases, a non-significant difference in benchmark and non-experimental results may indicate correspondence in the two estimates, or it may be due to lack of statistical power for rejecting the false null hypothesis.
In this paper, we have argued for an alternative approach for judging correspondence in WSC contexts: the correspondence test. This test indicates equivalence or a difference in methods when there is sufficient evidence for making these conclusions, but yields a result of indeterminacy when neither conclusion is warranted (often due to insufficient sample sizes). As a result, the correspondence test can help guard against implicit and explicit biases from the WSC researcher or reader. Correspondence tests also have the advantage of requiring the researcher to define in advance a tolerance threshold for when benchmark and non-experimental results are close enough to be considered equivalent. It is beyond the scope of this paper to advise exactly how one should establish these thresholds, but in general, they should be based on substantive or empirical grounds, such as results from a prior cost-benefit analysis. Our simulation results show that in a WSC context, a relatively large tolerance threshold of at least .3 was needed for establishing equivalence in unbiased benchmark and non-experimental estimates. In fact, equivalence was hardly ever indicated when the threshold was as small as .1. As a result, in the WSC context, the threshold should reflect the upper bound for what the analyst considers tolerable for equivalence in benchmark and non-experimental results.

The simulation also demonstrates that for the independent design, relatively large sample sizes are required for the correspondence test to indicate a reliable equivalence or difference in benchmark and non-experimental results. When the tolerance threshold is .3, the RCT required at least 400 units, and the non-experimental arm required between 600 and 1,600 participants, depending on the size of the bias. Given that the WSC researcher often plans independent designs prospectively, the approach may be feasible only in cases where the intervention is straightforward to implement, such as a text message or informational mailer, and the outcome is obtained through administrative records.

An alternative to independent arm designs is WSC approaches with dependent benchmark and non-experimental arms. For example, in the simultaneous design, some portion of the treatment or comparison group is shared simultaneously by the RCT and the non-experiment. Because the design is an ad hoc approach that uses existing RCT and observational data, it is often possible for the WSC researcher to include much larger sample sizes than would be feasible in prospective WSC designs. Moreover, even with the same overall WSC sample size, the simultaneous design has improved statistical power for assessing correspondence over the independent arm design. This is because residual variance in the outcome is reduced when units are shared between the benchmark and non-experimental conditions, as opposed to the case when units in the benchmark and non-experiment are independent.
Wong and Steiner (under review) discuss dependent and independent WSC designs further, highlighting validity concerns with each approach for evaluating non-experimental methods.

To assess correspondence in WSC designs, our general recommendation is that researchers begin by defining the research purpose of their evaluation and selecting correspondence measures that address their questions directly. Second, the researcher should consider attributes of their WSC design, including the likely true impact of the intervention; the statistical power of the benchmark, the non-experiment, and the WSC design itself; and the direction of the selection bias in the non-experiment. These factors will likely affect the sensitivity of the correspondence measures (particularly the conclusion-based measures), and the WSC analyst should interpret the results accordingly. Finally, given that readers of WSC studies often have both policy- and methodology-related interests, it may often be useful for WSC analysts to provide a table that summarizes all their results, including conclusion- and distance-based correspondence measures.

As standalone enterprises, individual WSC studies may have little to say about the performance of non-experimental methods in field settings. However, results from multiple WSC studies may begin to provide an empirically based mapping of the contexts and conditions for optimal non-experimental performance in specific domains. A critical component of this endeavor is clear and consistent criteria for assessing correspondence in benchmark and non-experimental results. This paper provides researchers with guidance for analyzing and interpreting WSC results.

REFERENCES

Aiken, L. S., West, S. G., Schwalm, D. E., Carroll, J., & Hsuing, S. (1998). Comparison of a randomized and two quasi-experimental designs in a single outcome evaluation: Efficacy of a university-level remedial writing program. Evaluation Review, 22.

Arceneaux, K., Gerber, A. S., & Green, D. P. (2010). A cautionary note on the use of matching to estimate causal effects: An empirical example comparing matching estimates to an experimental benchmark. Sociological Methods & Research, 39(2).

Barker, L., Luman, E. T., McCauley, M. M., & Chu, S. Y. (2002). Assessing equivalence: An alternative to the use of difference tests for measuring disparities in vaccination coverage. American Journal of Epidemiology, 156(11).

Bell, Orr, Bloomquist, & Cain (1995). Program Applicants as a Comparison Group in Evaluating Training Programs: Theory and a Test. Kalamazoo, MI: W. E. Upjohn Institute for Employment Research.

Berk, R., Barnes, G., Ahlman, L., & Kurtz (2010). When second best is good enough: A comparison between a true experiment and a regression discontinuity quasi-experiment. Journal of Experimental Criminology, 6(2).

Cook, T. D., Shadish, W. R., & Wong, V. C. (2008). The conditions under which experiments and observational studies often produce comparable causal estimates: New findings from within-study comparisons. Journal of Policy Analysis and Management, 27(4).

Dong, N., & Lipsey, M. W. (2014). How well do propensity score methods approximate experiments using pretest and demographic information in educational research? Paper presented at the 35th Annual Association for Public Policy Analysis and Management (APPAM) Research Conference, Albuquerque, NM.

Fraker, T., & Maynard, R. (1987). The adequacy of comparison group designs for evaluations of employment-related programs. Journal of Human Resources, 22(2).

Friedlander, D., & Robins, P. (1995). Evaluating program evaluations: New evidence on commonly used nonexperimental methods. American Economic Review, 85(4).

Glazerman, S., Levy, D. M., & Myers, D. (2003). Nonexperimental versus experimental estimates of earnings impacts. The Annals of the American Academy, 589.

Hallberg, K., Wong, V. C., & Cook, T. D. (under review). School Level Matching in Observational Studies.

Heckman, J. J., Ichimura, H., Smith, J. A., & Todd, P. (1998). Characterizing selection bias. Econometrica, 66(5).

What Works Clearinghouse (2014). Procedures and Standards Handbook, Version 3.0. U.S. Department of Education. Retrieved February 16, 2016.

Krueger, A. (1999). Experimental estimates of education production functions. The Quarterly Journal of Economics, 114(2).

LaLonde, R. (1986). Evaluating the econometric evaluations of training with experimental data. The American Economic Review, 76(4).

Munk, Hwang, & Brown (200). Testing average equivalence: Finding a compromise between theory and practice. Biometrical Journal, 5.

Office of Management and Budget (2005). What Constitutes Strong Evidence of a Program's Effectiveness? Washington, D.C., September 2005.

Paige, R. (2005). Scientifically based evaluation methods. Federal Register, 70(15). Retrieved February 16, 2016.

Schuirmann, D. J. (1987). A comparison of the two one-sided tests procedure and the power approach for assessing equivalence of average bioavailability. Journal of Pharmacokinetics and Biopharmaceutics, 15.

Shadish, W. R., Clark, M. H., & Steiner, P. M. (2008). Can nonrandomized experiments yield accurate answers? A randomized experiment comparing random to nonrandom assignment (with comments by Little/Long/Lin, Hill, and Rubin, and a rejoinder). Journal of the American Statistical Association, 103.

Shadish, W. R., Galindo, R., Wong, V. C., Steiner, P. M., & Cook, T. D. (2011). A randomized experiment comparing random to cutoff-based assignment. Psychological Methods, 16(2).

Tryon, W. W. (2001). Evaluating statistical difference, equivalence, and indeterminacy using inferential confidence intervals: An integrated alternative method of conducting null hypothesis statistical tests. Psychological Methods, 6.

Tryon, W. W., & Lewis, C. (2008). An inferential confidence interval method of establishing statistical equivalence that corrects Tryon's (2001) reduction factor. Psychological Methods, 13.

Wilde, E. T., & Hollister, R. (2007). How close is close enough? Testing nonexperimental estimates of impact against experimental estimates of impact with education test scores as outcomes. Journal of Policy Analysis and Management, 26(3).

Table 1. Parameter settings for the three target populations

  Population 1 (medium effect size & positive selection):
    Outcome model: β1 = .3, β2 = .2; treatment effect: τ = .3σY
    Selection model (non-experiment): γ0 = .5, γ1 = .5, γ2 = .3
  Population 2 (very small effect size & positive selection):
    Outcome model: β1 = .3, β2 = .2; treatment effect: τ = .05σY
    Selection model (non-experiment): γ0 = .5, γ1 = .5, γ2 = .3
  Population 3 (medium effect size & negative selection):
    Outcome model: β1 = .3, β2 = .2; treatment effect: τ = .3σY
    Selection model (non-experiment): γ0 = .5, γ1 = −.5, γ2 = −.3

Table 2. Regression models for estimating the treatment effect (τ)

  Randomized experiment: Y = β0 + τT + ε
  Non-experiment with both confounders (X1 and X2): Y = β0 + τT + β1X1 + β2X2 + ε, with inverse-PS weights based on X1 and X2
  Non-experiment with confounder X1: Y = β0 + τT + β1X1 + ε, with inverse-PS weights based on X1
  Non-experiment with confounder X2: Y = β0 + τT + β2X2 + ε, with inverse-PS weights based on X2
  Non-experiment without any confounder: Y = β0 + τT + ε
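The data-generating and estimation setup in Tables 1 and 2 can be illustrated with a small Monte Carlo sketch. Only the coefficient values come from Table 1; the standard-normal confounders and errors, the logistic selection link, the sample size, and the function name are our assumptions for illustration. The two estimators shown correspond to the first and last rows of Table 2 (unadjusted difference in means in each design).

```python
import math
import random

def simulate_population1(n=20000, beta1=0.3, beta2=0.2, tau=0.3,
                         g0=0.5, g1=0.5, g2=0.3, seed=1):
    """Monte Carlo sketch of Population 1 (Table 1): outcome model
    Y = beta1*X1 + beta2*X2 + tau*T + e, with positive self-selection
    on the confounders X1 and X2 in the non-experimental arm."""
    rng = random.Random(seed)
    rct_t, rct_c, ne_t, ne_c = [], [], [], []
    for _ in range(n):
        # experimental arm: T randomized, independent of the confounders
        x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)
        t = rng.random() < 0.5
        y = beta1 * x1 + beta2 * x2 + tau * t + rng.gauss(0, 1)
        (rct_t if t else rct_c).append(y)
        # non-experimental arm: units self-select as a function of X1, X2
        x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)
        t = rng.random() < 1 / (1 + math.exp(-(g0 + g1 * x1 + g2 * x2)))
        y = beta1 * x1 + beta2 * x2 + tau * t + rng.gauss(0, 1)
        (ne_t if t else ne_c).append(y)
    mean = lambda v: sum(v) / len(v)
    # unadjusted difference-in-means estimate of tau in each design
    return mean(rct_t) - mean(rct_c), mean(ne_t) - mean(ne_c)

tau_rct, tau_ne = simulate_population1()
print(round(tau_rct, 2))  # close to the true effect of .3
print(round(tau_ne, 2))   # inflated by the positive selection bias
```

Comparing the two estimates gives the kind of benchmark-versus-non-experiment gap that the correspondence measures discussed in the paper are meant to evaluate; adjusting the non-experimental model for X1 and X2 (second row of Table 2) would remove the gap by construction.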

Figure 1. Independent versus dependent within-study comparisons.
Panel 1: Independent arm approach. Panel 2: Dependent arm approach (simultaneous design).


11/24/2017. Do not imply a cause-and-effect relationship Correlational research is used to describe the relationship between two or more naturally occurring variables. Is age related to political conservativism? Are highly extraverted people less afraid of rejection

More information

Comparison of the Null Distributions of

Comparison of the Null Distributions of Comparison of the Null Distributions of Weighted Kappa and the C Ordinal Statistic Domenic V. Cicchetti West Haven VA Hospital and Yale University Joseph L. Fleiss Columbia University It frequently occurs

More information

Effects of propensity score overlap on the estimates of treatment effects. Yating Zheng & Laura Stapleton

Effects of propensity score overlap on the estimates of treatment effects. Yating Zheng & Laura Stapleton Effects of propensity score overlap on the estimates of treatment effects Yating Zheng & Laura Stapleton Introduction Recent years have seen remarkable development in estimating average treatment effects

More information

PSYCHOLOGY 300B (A01) One-sample t test. n = d = ρ 1 ρ 0 δ = d (n 1) d

PSYCHOLOGY 300B (A01) One-sample t test. n = d = ρ 1 ρ 0 δ = d (n 1) d PSYCHOLOGY 300B (A01) Assignment 3 January 4, 019 σ M = σ N z = M µ σ M d = M 1 M s p d = µ 1 µ 0 σ M = µ +σ M (z) Independent-samples t test One-sample t test n = δ δ = d n d d = µ 1 µ σ δ = d n n = δ

More information

Sample size and power calculations in Mendelian randomization with a single instrumental variable and a binary outcome

Sample size and power calculations in Mendelian randomization with a single instrumental variable and a binary outcome Sample size and power calculations in Mendelian randomization with a single instrumental variable and a binary outcome Stephen Burgess July 10, 2013 Abstract Background: Sample size calculations are an

More information

A SAS Macro to Investigate Statistical Power in Meta-analysis Jin Liu, Fan Pan University of South Carolina Columbia

A SAS Macro to Investigate Statistical Power in Meta-analysis Jin Liu, Fan Pan University of South Carolina Columbia Paper 109 A SAS Macro to Investigate Statistical Power in Meta-analysis Jin Liu, Fan Pan University of South Carolina Columbia ABSTRACT Meta-analysis is a quantitative review method, which synthesizes

More information

Scientific Research. The Scientific Method. Scientific Explanation

Scientific Research. The Scientific Method. Scientific Explanation Scientific Research The Scientific Method Make systematic observations. Develop a testable explanation. Submit the explanation to empirical test. If explanation fails the test, then Revise the explanation

More information

Version No. 7 Date: July Please send comments or suggestions on this glossary to

Version No. 7 Date: July Please send comments or suggestions on this glossary to Impact Evaluation Glossary Version No. 7 Date: July 2012 Please send comments or suggestions on this glossary to 3ie@3ieimpact.org. Recommended citation: 3ie (2012) 3ie impact evaluation glossary. International

More information

Chapter 5: Field experimental designs in agriculture

Chapter 5: Field experimental designs in agriculture Chapter 5: Field experimental designs in agriculture Jose Crossa Biometrics and Statistics Unit Crop Research Informatics Lab (CRIL) CIMMYT. Int. Apdo. Postal 6-641, 06600 Mexico, DF, Mexico Introduction

More information

Chapter 2 Norms and Basic Statistics for Testing MULTIPLE CHOICE

Chapter 2 Norms and Basic Statistics for Testing MULTIPLE CHOICE Chapter 2 Norms and Basic Statistics for Testing MULTIPLE CHOICE 1. When you assert that it is improbable that the mean intelligence test score of a particular group is 100, you are using. a. descriptive

More information

CHAMP: CHecklist for the Appraisal of Moderators and Predictors

CHAMP: CHecklist for the Appraisal of Moderators and Predictors CHAMP - Page 1 of 13 CHAMP: CHecklist for the Appraisal of Moderators and Predictors About the checklist In this document, a CHecklist for the Appraisal of Moderators and Predictors (CHAMP) is presented.

More information

The Research Roadmap Checklist

The Research Roadmap Checklist 1/5 The Research Roadmap Checklist Version: December 1, 2007 All enquires to bwhitworth@acm.org This checklist is at http://brianwhitworth.com/researchchecklist.pdf The element details are explained at

More information

Running Head: ADVERSE IMPACT. Significance Tests and Confidence Intervals for the Adverse Impact Ratio. Scott B. Morris

Running Head: ADVERSE IMPACT. Significance Tests and Confidence Intervals for the Adverse Impact Ratio. Scott B. Morris Running Head: ADVERSE IMPACT Significance Tests and Confidence Intervals for the Adverse Impact Ratio Scott B. Morris Illinois Institute of Technology Russell Lobsenz Federal Bureau of Investigation Adverse

More information

EXERCISE: HOW TO DO POWER CALCULATIONS IN OPTIMAL DESIGN SOFTWARE

EXERCISE: HOW TO DO POWER CALCULATIONS IN OPTIMAL DESIGN SOFTWARE ...... EXERCISE: HOW TO DO POWER CALCULATIONS IN OPTIMAL DESIGN SOFTWARE TABLE OF CONTENTS 73TKey Vocabulary37T... 1 73TIntroduction37T... 73TUsing the Optimal Design Software37T... 73TEstimating Sample

More information

JSM Survey Research Methods Section

JSM Survey Research Methods Section Methods and Issues in Trimming Extreme Weights in Sample Surveys Frank Potter and Yuhong Zheng Mathematica Policy Research, P.O. Box 393, Princeton, NJ 08543 Abstract In survey sampling practice, unequal

More information

Clinical trial design issues and options for the study of rare diseases

Clinical trial design issues and options for the study of rare diseases Clinical trial design issues and options for the study of rare diseases November 19, 2018 Jeffrey Krischer, PhD Rare Diseases Clinical Research Network Rare Diseases Clinical Research Network (RDCRN) is

More information

Appendix B Statistical Methods

Appendix B Statistical Methods Appendix B Statistical Methods Figure B. Graphing data. (a) The raw data are tallied into a frequency distribution. (b) The same data are portrayed in a bar graph called a histogram. (c) A frequency polygon

More information

Propensity Score Matching with Limited Overlap. Abstract

Propensity Score Matching with Limited Overlap. Abstract Propensity Score Matching with Limited Overlap Onur Baser Thomson-Medstat Abstract In this article, we have demostrated the application of two newly proposed estimators which accounts for lack of overlap

More information

Experimental Psychology

Experimental Psychology Title Experimental Psychology Type Individual Document Map Authors Aristea Theodoropoulos, Patricia Sikorski Subject Social Studies Course None Selected Grade(s) 11, 12 Location Roxbury High School Curriculum

More information

PLS 506 Mark T. Imperial, Ph.D. Lecture Notes: Reliability & Validity

PLS 506 Mark T. Imperial, Ph.D. Lecture Notes: Reliability & Validity PLS 506 Mark T. Imperial, Ph.D. Lecture Notes: Reliability & Validity Measurement & Variables - Initial step is to conceptualize and clarify the concepts embedded in a hypothesis or research question with

More information

CHAPTER NINE DATA ANALYSIS / EVALUATING QUALITY (VALIDITY) OF BETWEEN GROUP EXPERIMENTS

CHAPTER NINE DATA ANALYSIS / EVALUATING QUALITY (VALIDITY) OF BETWEEN GROUP EXPERIMENTS CHAPTER NINE DATA ANALYSIS / EVALUATING QUALITY (VALIDITY) OF BETWEEN GROUP EXPERIMENTS Chapter Objectives: Understand Null Hypothesis Significance Testing (NHST) Understand statistical significance and

More information

Analysis of Environmental Data Conceptual Foundations: En viro n m e n tal Data

Analysis of Environmental Data Conceptual Foundations: En viro n m e n tal Data Analysis of Environmental Data Conceptual Foundations: En viro n m e n tal Data 1. Purpose of data collection...................................................... 2 2. Samples and populations.......................................................

More information

Performance of the Trim and Fill Method in Adjusting for the Publication Bias in Meta-Analysis of Continuous Data

Performance of the Trim and Fill Method in Adjusting for the Publication Bias in Meta-Analysis of Continuous Data American Journal of Applied Sciences 9 (9): 1512-1517, 2012 ISSN 1546-9239 2012 Science Publication Performance of the Trim and Fill Method in Adjusting for the Publication Bias in Meta-Analysis of Continuous

More information

Special guidelines for preparation and quality approval of reviews in the form of reference documents in the field of occupational diseases

Special guidelines for preparation and quality approval of reviews in the form of reference documents in the field of occupational diseases Special guidelines for preparation and quality approval of reviews in the form of reference documents in the field of occupational diseases November 2010 (1 st July 2016: The National Board of Industrial

More information

Chapter 14: More Powerful Statistical Methods

Chapter 14: More Powerful Statistical Methods Chapter 14: More Powerful Statistical Methods Most questions will be on correlation and regression analysis, but I would like you to know just basically what cluster analysis, factor analysis, and conjoint

More information

Propensity Score Methods for Estimating Causality in the Absence of Random Assignment: Applications for Child Care Policy Research

Propensity Score Methods for Estimating Causality in the Absence of Random Assignment: Applications for Child Care Policy Research 2012 CCPRC Meeting Methodology Presession Workshop October 23, 2012, 2:00-5:00 p.m. Propensity Score Methods for Estimating Causality in the Absence of Random Assignment: Applications for Child Care Policy

More information

Incorporating Within-Study Correlations in Multivariate Meta-analysis: Multilevel Versus Traditional Models

Incorporating Within-Study Correlations in Multivariate Meta-analysis: Multilevel Versus Traditional Models Incorporating Within-Study Correlations in Multivariate Meta-analysis: Multilevel Versus Traditional Models Alison J. O Mara and Herbert W. Marsh Department of Education, University of Oxford, UK Abstract

More information

(CORRELATIONAL DESIGN AND COMPARATIVE DESIGN)

(CORRELATIONAL DESIGN AND COMPARATIVE DESIGN) UNIT 4 OTHER DESIGNS (CORRELATIONAL DESIGN AND COMPARATIVE DESIGN) Quasi Experimental Design Structure 4.0 Introduction 4.1 Objectives 4.2 Definition of Correlational Research Design 4.3 Types of Correlational

More information

Guidelines for reviewers

Guidelines for reviewers Guidelines for reviewers Registered Reports are a form of empirical article in which the methods and proposed analyses are pre-registered and reviewed prior to research being conducted. This format of

More information

Controlled Trials. Spyros Kitsiou, PhD

Controlled Trials. Spyros Kitsiou, PhD Assessing Risk of Bias in Randomized Controlled Trials Spyros Kitsiou, PhD Assistant Professor Department of Biomedical and Health Information Sciences College of Applied Health Sciences University of

More information

Sampling Weights, Model Misspecification and Informative Sampling: A Simulation Study

Sampling Weights, Model Misspecification and Informative Sampling: A Simulation Study Sampling Weights, Model Misspecification and Informative Sampling: A Simulation Study Marianne (Marnie) Bertolet Department of Statistics Carnegie Mellon University Abstract Linear mixed-effects (LME)

More information

Causal Mediation Analysis with the CAUSALMED Procedure

Causal Mediation Analysis with the CAUSALMED Procedure Paper SAS1991-2018 Causal Mediation Analysis with the CAUSALMED Procedure Yiu-Fai Yung, Michael Lamm, and Wei Zhang, SAS Institute Inc. Abstract Important policy and health care decisions often depend

More information

How many speakers? How many tokens?:

How many speakers? How many tokens?: 1 NWAV 38- Ottawa, Canada 23/10/09 How many speakers? How many tokens?: A methodological contribution to the study of variation. Jorge Aguilar-Sánchez University of Wisconsin-La Crosse 2 Sample size in

More information

04/12/2014. Research Methods in Psychology. Chapter 6: Independent Groups Designs. What is your ideas? Testing

04/12/2014. Research Methods in Psychology. Chapter 6: Independent Groups Designs. What is your ideas? Testing Research Methods in Psychology Chapter 6: Independent Groups Designs 1 Why Psychologists Conduct Experiments? What is your ideas? 2 Why Psychologists Conduct Experiments? Testing Hypotheses derived from

More information

Fixed-Effect Versus Random-Effects Models

Fixed-Effect Versus Random-Effects Models PART 3 Fixed-Effect Versus Random-Effects Models Introduction to Meta-Analysis. Michael Borenstein, L. V. Hedges, J. P. T. Higgins and H. R. Rothstein 2009 John Wiley & Sons, Ltd. ISBN: 978-0-470-05724-7

More information

Fixed Effect Combining

Fixed Effect Combining Meta-Analysis Workshop (part 2) Michael LaValley December 12 th 2014 Villanova University Fixed Effect Combining Each study i provides an effect size estimate d i of the population value For the inverse

More information

Chapter 02. Basic Research Methodology

Chapter 02. Basic Research Methodology Chapter 02 Basic Research Methodology Definition RESEARCH Research is a quest for knowledge through diligent search or investigation or experimentation aimed at the discovery and interpretation of new

More information

Study Guide for the Final Exam

Study Guide for the Final Exam Study Guide for the Final Exam When studying, remember that the computational portion of the exam will only involve new material (covered after the second midterm), that material from Exam 1 will make

More information

CASE STUDY 2: VOCATIONAL TRAINING FOR DISADVANTAGED YOUTH

CASE STUDY 2: VOCATIONAL TRAINING FOR DISADVANTAGED YOUTH CASE STUDY 2: VOCATIONAL TRAINING FOR DISADVANTAGED YOUTH Why Randomize? This case study is based on Training Disadvantaged Youth in Latin America: Evidence from a Randomized Trial by Orazio Attanasio,

More information

Six Sigma Glossary Lean 6 Society

Six Sigma Glossary Lean 6 Society Six Sigma Glossary Lean 6 Society ABSCISSA ACCEPTANCE REGION ALPHA RISK ALTERNATIVE HYPOTHESIS ASSIGNABLE CAUSE ASSIGNABLE VARIATIONS The horizontal axis of a graph The region of values for which the null

More information

Meta-Analysis and Publication Bias: How Well Does the FAT-PET-PEESE Procedure Work?

Meta-Analysis and Publication Bias: How Well Does the FAT-PET-PEESE Procedure Work? Meta-Analysis and Publication Bias: How Well Does the FAT-PET-PEESE Procedure Work? Nazila Alinaghi W. Robert Reed Department of Economics and Finance, University of Canterbury Abstract: This study uses

More information

Identifying Peer Influence Effects in Observational Social Network Data: An Evaluation of Propensity Score Methods

Identifying Peer Influence Effects in Observational Social Network Data: An Evaluation of Propensity Score Methods Identifying Peer Influence Effects in Observational Social Network Data: An Evaluation of Propensity Score Methods Dean Eckles Department of Communication Stanford University dean@deaneckles.com Abstract

More information

Basic Statistics and Data Analysis in Work psychology: Statistical Examples

Basic Statistics and Data Analysis in Work psychology: Statistical Examples Basic Statistics and Data Analysis in Work psychology: Statistical Examples WORK PSYCHOLOGY INTRODUCTION In this chapter we examine a topic which is given too little coverage in most texts of this kind,

More information

2 Critical thinking guidelines

2 Critical thinking guidelines What makes psychological research scientific? Precision How psychologists do research? Skepticism Reliance on empirical evidence Willingness to make risky predictions Openness Precision Begin with a Theory

More information

CHAPTER VI RESEARCH METHODOLOGY

CHAPTER VI RESEARCH METHODOLOGY CHAPTER VI RESEARCH METHODOLOGY 6.1 Research Design Research is an organized, systematic, data based, critical, objective, scientific inquiry or investigation into a specific problem, undertaken with the

More information

Confounding by indication developments in matching, and instrumental variable methods. Richard Grieve London School of Hygiene and Tropical Medicine

Confounding by indication developments in matching, and instrumental variable methods. Richard Grieve London School of Hygiene and Tropical Medicine Confounding by indication developments in matching, and instrumental variable methods Richard Grieve London School of Hygiene and Tropical Medicine 1 Outline 1. Causal inference and confounding 2. Genetic

More information

WELCOME! Lecture 11 Thommy Perlinger

WELCOME! Lecture 11 Thommy Perlinger Quantitative Methods II WELCOME! Lecture 11 Thommy Perlinger Regression based on violated assumptions If any of the assumptions are violated, potential inaccuracies may be present in the estimated regression

More information

A Brief Introduction to Bayesian Statistics

A Brief Introduction to Bayesian Statistics A Brief Introduction to Statistics David Kaplan Department of Educational Psychology Methods for Social Policy Research and, Washington, DC 2017 1 / 37 The Reverend Thomas Bayes, 1701 1761 2 / 37 Pierre-Simon

More information

How to interpret results of metaanalysis

How to interpret results of metaanalysis How to interpret results of metaanalysis Tony Hak, Henk van Rhee, & Robert Suurmond Version 1.0, March 2016 Version 1.3, Updated June 2018 Meta-analysis is a systematic method for synthesizing quantitative

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION Supplementary Statistics and Results This file contains supplementary statistical information and a discussion of the interpretation of the belief effect on the basis of additional data. We also present

More information

Identifying Endogenous Peer Effects in the Spread of Obesity. Abstract

Identifying Endogenous Peer Effects in the Spread of Obesity. Abstract Identifying Endogenous Peer Effects in the Spread of Obesity Timothy J. Halliday 1 Sally Kwak 2 University of Hawaii- Manoa October 2007 Abstract Recent research in the New England Journal of Medicine

More information

An Empirical Assessment of Bivariate Methods for Meta-analysis of Test Accuracy

An Empirical Assessment of Bivariate Methods for Meta-analysis of Test Accuracy Number XX An Empirical Assessment of Bivariate Methods for Meta-analysis of Test Accuracy Prepared for: Agency for Healthcare Research and Quality U.S. Department of Health and Human Services 54 Gaither

More information

Understanding Uncertainty in School League Tables*

Understanding Uncertainty in School League Tables* FISCAL STUDIES, vol. 32, no. 2, pp. 207 224 (2011) 0143-5671 Understanding Uncertainty in School League Tables* GEORGE LECKIE and HARVEY GOLDSTEIN Centre for Multilevel Modelling, University of Bristol

More information

Supplement 2. Use of Directed Acyclic Graphs (DAGs)

Supplement 2. Use of Directed Acyclic Graphs (DAGs) Supplement 2. Use of Directed Acyclic Graphs (DAGs) Abstract This supplement describes how counterfactual theory is used to define causal effects and the conditions in which observed data can be used to

More information

SkillBuilder Shortcut: Levels of Evidence

SkillBuilder Shortcut: Levels of Evidence SkillBuilder Shortcut: Levels of Evidence This shortcut sheet was developed by Research Advocacy Network to assist advocates in understanding Levels of Evidence and how these concepts apply to clinical

More information

Identifying Mechanisms behind Policy Interventions via Causal Mediation Analysis

Identifying Mechanisms behind Policy Interventions via Causal Mediation Analysis Identifying Mechanisms behind Policy Interventions via Causal Mediation Analysis December 20, 2013 Abstract Causal analysis in program evaluation has largely focused on the assessment of policy effectiveness.

More information

Addendum: Multiple Regression Analysis (DRAFT 8/2/07)

Addendum: Multiple Regression Analysis (DRAFT 8/2/07) Addendum: Multiple Regression Analysis (DRAFT 8/2/07) When conducting a rapid ethnographic assessment, program staff may: Want to assess the relative degree to which a number of possible predictive variables

More information

Ambiguous Data Result in Ambiguous Conclusions: A Reply to Charles T. Tart

Ambiguous Data Result in Ambiguous Conclusions: A Reply to Charles T. Tart Other Methodology Articles Ambiguous Data Result in Ambiguous Conclusions: A Reply to Charles T. Tart J. E. KENNEDY 1 (Original publication and copyright: Journal of the American Society for Psychical

More information

Political Science 15, Winter 2014 Final Review

Political Science 15, Winter 2014 Final Review Political Science 15, Winter 2014 Final Review The major topics covered in class are listed below. You should also take a look at the readings listed on the class website. Studying Politics Scientifically

More information

Basic Concepts in Research and DATA Analysis

Basic Concepts in Research and DATA Analysis Basic Concepts in Research and DATA Analysis 1 Introduction: A Common Language for Researchers...2 Steps to Follow When Conducting Research...2 The Research Question...3 The Hypothesis...3 Defining the

More information

Evidence-Based Medicine and Publication Bias Desmond Thompson Merck & Co.

Evidence-Based Medicine and Publication Bias Desmond Thompson Merck & Co. Evidence-Based Medicine and Publication Bias Desmond Thompson Merck & Co. Meta-Analysis Defined A meta-analysis is: the statistical combination of two or more separate studies In other words: overview,

More information

Chapter 11. Experimental Design: One-Way Independent Samples Design

Chapter 11. Experimental Design: One-Way Independent Samples Design 11-1 Chapter 11. Experimental Design: One-Way Independent Samples Design Advantages and Limitations Comparing Two Groups Comparing t Test to ANOVA Independent Samples t Test Independent Samples ANOVA Comparing

More information

26:010:557 / 26:620:557 Social Science Research Methods

26:010:557 / 26:620:557 Social Science Research Methods 26:010:557 / 26:620:557 Social Science Research Methods Dr. Peter R. Gillett Associate Professor Department of Accounting & Information Systems Rutgers Business School Newark & New Brunswick 1 Overview

More information

Multilevel analysis quantifies variation in the experimental effect while optimizing power and preventing false positives

Multilevel analysis quantifies variation in the experimental effect while optimizing power and preventing false positives DOI 10.1186/s12868-015-0228-5 BMC Neuroscience RESEARCH ARTICLE Open Access Multilevel analysis quantifies variation in the experimental effect while optimizing power and preventing false positives Emmeke

More information