Next: An Improved Method for Identifying Impacts in Regression Discontinuity Designs

Mark C. Long, University of Washington, Box 353055, Seattle, WA 98195-3055, marklong@uw.edu (corresponding author)
Jordan Rooklyn, University of Washington, jrooklyn@uw.edu

Working Paper, August 31, 2016

Abstract: This paper develops and advocates for a data-driven algorithm that simultaneously selects the polynomial specification and bandwidth combination that minimizes the predicted mean squared error at the threshold of a discontinuity. It achieves this selection by evaluating the combinations of specification and bandwidth that perform best in estimating the next point in the observed sequence on each side of the discontinuity. We illustrate this method by applying it to data with a simulated treatment effect to show its efficacy for regression discontinuity designs and reexamine the results of papers in the literature.

Keywords: Program evaluation; regression discontinuity

Acknowledgments: Brian Dillon, Dan Goldhaber, Jon Smith, Jake Vigdor, Ted Westling, and University of Washington seminar audience members provided helpful feedback on the paper. This research was partially supported by a grant from the U.S. Department of Education's Institute of Education Sciences (R305A140380).

1. Introduction and Literature Review

Regression discontinuity (RD) designs have become a very popular method for identifying the local average treatment effect of a program. In many policy contexts, estimating treatment effects via social experiments is not feasible due to either cost or ethical considerations. Furthermore, in many contexts, allocating a treatment on the basis of some score (often a score that reflects the individual's worthiness of receiving the treatment) seems natural. RD holds the promise of having some of the advantages of random treatment allocation (assuming that being just above or just below the threshold score for receiving the treatment is effectively random) without the adverse complications of full-blown randomized experiments.

However, RD designs present a challenge for researchers: how to identify the predicted value of the outcome (Y) as the score (X) approaches the threshold (T) from both the left and right hand side of that threshold. A number of guides to standard practice have been written during the past ten years (for other discussions of standard methods, see Imbens and Lemieux (2008), DiNardo and Lee (2010), Jacob et al. (2012), and Van Der Klaauw (2013)); the highly cited guide by Lee and Lemieux (2010) provides the following guidance: "When the analyst chooses a parametric functional form (say, a low-order polynomial) that is incorrect, the resulting estimator will, in general, be biased. When the analyst uses a nonparametric procedure such as local linear regression (essentially running a regression using only data points close to the cutoff) there will also be bias. ... Our main suggestion in estimation is to not rely on one particular method or specification" (p. 284).

To illustrate this point, Lee and Lemieux reanalyze the data from Lee (2008), who evaluated the impact of party incumbency on the probability that the incumbent party will retain the district's seat in the next election for the U.S. House of Representatives. In this analysis, X is defined as "the Democratic vote share in year t minus the vote share of the Democrats' strongest opponent (virtually always a Republican)" (Lee, 2008, p. 686). Lee and Lemieux estimate the treatment effect by using polynomials ranging from order zero (i.e., the average of prior values) up to a sixth-order polynomial, with the same order polynomial estimated for both sides of the discontinuity and with bandwidths ranging from 1% to 100% (i.e., using all of the data). For each bandwidth, they identify the optimal order of the polynomial by selecting the one with the lowest value of the Akaike information criterion (AIC). And they identify an optimal bandwidth by choosing "the value of h that minimizes the mean square of the difference between the predicted and actual value of Y" (p. 321). As shown in Table 2 of their paper, using the optimal bandwidth, which is roughly 5%, and the optimal order of the polynomial for this bandwidth (quadratic), the estimated effect of incumbency on the Democratic party's vote share in year t+1 is 0.100 (s.e. = 0.029).

While this model selection procedure has the nice feature of selecting the specification and bandwidth optimally, it has two limitations: (1) it imposes a single order of the polynomial and a single bandwidth on both sides of the discontinuity, and (2) the AIC evaluates the fit of the polynomial at all values of X and does not attempt to evaluate the fit of the polynomial as X approaches the threshold, which is more appropriate for RD treatment effect estimation.

Gelman and Imbens (2014) argue against using high-order polynomial regressions to estimate treatment effects in an RD context and instead recommend that researchers "control for local linear or quadratic polynomials or other smooth functions" (p. 2). We focus here on their second critique: "Results based on high order polynomial regressions are sensitive to the order of the polynomial. Moreover, we do not have good methods for choosing that order in a way that is optimal for the objective of a good estimator for the causal effect of interest. Often researchers choose the order by optimizing some global goodness of fit measure, but that is not closely related to the research objective of causal inference" (p. 2). The goal of our paper is to provide an optimal method for choosing the polynomial order (as well as the bandwidth) that Gelman and Imbens (2014) note is currently lacking in the literature.

Gelman and Zelizer (2015) illustrate the challenges that can come from using a higher-order polynomial by critiquing a prominent paper by Chen, Ebenstein, Greenstone, and Li (2013), described in greater detail below, which examines the effect of an air pollution policy on life expectancy. Gelman and Zelizer note:

"[Chen et al.'s] cubic adjustment gave an estimated effect of 5.5 years with standard error 2.4. A linear adjustment gave an estimate of 1.6 years with standard error 1.7. The large, statistically significant estimated treatment effect at the discontinuity depends on the functional form employed. ... the headline claim, and its statistical significance, is highly dependent on a model choice that may have a data-analytic purpose, but which has no particular scientific basis" (pp. 3-4).

Gelman and Zelizer conclude that "we are not recommending global linear adjustments as an alternative. In some settings a linear relationship can make sense. What we are warning against is the appealing but misguided view that users can correct for arbitrary dependence on the forcing variable by simply including several polynomial terms in a regression" (p. 6). In the case study in Section 3.3 of this paper, we re-examine the Chen et al. results using our method. We show that Gelman and Zelizer's concerns are well founded; our method shows that the estimated effect of pollution on life expectancy is much smaller.

In addition to finding the most appropriate form for the specification, researchers also face the challenge of deciding whether to estimate the selected specification over the whole range of X (that is, a global estimate of Y = f0(X) and Y = f1(X), where f0(.) and f1(.) reflect the function on the left and right sides of the threshold) or to estimate the selected specification over a narrower range of X near T, a local approach. Imbens and Kalyanaraman (2012) argue for using a local approach and develop a technique for finding the optimal bandwidth. The Imbens and Kalyanaraman bandwidth selection method is devised for the estimation of separate local linear regressions on each side of the threshold. They note that ad hoc approaches for bandwidth choice, such as standard plug-in and cross-validation methods, are typically based on objective functions which take into account the performance of the estimator of the regression function over the entire support and do not yield optimal bandwidths for the problem at hand (p. 934). Their method, in contrast, finds the bandwidth that minimizes mean squared error at the threshold. Imbens and Kalyanaraman caution that their method, which we henceforth label IK, gives a convenient starting point and benchmark for doing a sensitivity analysis regarding bandwidth choice (p. 940), and thus they remind the user to examine the results using other bandwidths.

While the IK method greatly helps researchers by providing a data-generated method for choosing the optimal bandwidth, it does so by assuming that the researcher is using a local linear regression on both sides of the threshold. This can introduce substantial bias if (1) a linear regression is the incorrect functional form and (2) the treatment changes the relationship between Y and X. Our method, thus, simultaneously selects the optimal polynomial order and the optimal bandwidth for each side of the discontinuity. We achieve this result by evaluating the performance of various combinations of order and bandwidth, with performance measured as mean squared error in predicting the observed values of Y as X approaches the threshold (from either side); estimating the mean squared error at the threshold as a weighted average of prior mean squared errors, with greater weight on mean squared errors close to the threshold; and identifying the specification/bandwidth combination that has the lowest predicted mean squared error at the threshold. We show that our method does modestly better than the IK method when applied to real data with a simulated treatment effect. We then apply our method to data from two prominent papers (Lee (2008) and Chen et al. (2013)) and we document the extent to which our method produces different results.

2. Method

The goal of RD studies is to estimate the local average treatment effect, defined as the expected change in the outcome for those whose score is at the threshold: $\tau = E[Y_{1i} - Y_{0i} \mid X_i = T]$, where $Y_{0i}$ is the value of $Y_i$ if observation i is untreated and $Y_{1i}$ is the value of $Y_i$ if the treatment is received. Assume that treatment occurs when $X_i \geq T$. [Footnote 2] Assume that there is a smooth and continuous relationship between $Y_0$ and X in the range $X < T$ and that this relationship can be expressed as $Y_0 = f_0(X)$. Likewise, assume that there is a smooth and continuous relationship between $Y_1$ and X in the range $X \geq T$ and that this relationship can be expressed as $Y_1 = f_1(X)$. Assuming that the only discontinuity in the relationship between Y and X at T is due to the impact of the treatment, the estimand, $\tau = f_1(T) - f_0(T)$, is estimated as the difference of the two estimated functions evaluated at the threshold: $\hat{\tau} = \hat{f}_1(T) - \hat{f}_0(T)$. Define the mean squared prediction error (MSPE) of the estimated treatment effect as $E[(\hat{\tau} - \tau)^2]$. Our goal is to select the bandwidths ($h_0$ and $h_1$) and order of the polynomials ($p_0$ and $p_1$) for estimating $f_0(\cdot)$ and $f_1(\cdot)$ such that MSPE is minimized: [Footnote 3]

(1)  $(\hat{h}_0, \hat{h}_1, \hat{p}_0, \hat{p}_1) = \arg\min_{h_0, h_1, p_0, p_1} E[(\hat{\tau} - \tau)^2]$
     $= \arg\min_{h_0, h_1, p_0, p_1} E[((\hat{f}_1(T) - \hat{f}_0(T)) - (f_1(T) - f_0(T)))^2]$
     $= \arg\min_{h_0, h_1, p_0, p_1} \{ E[(\hat{f}_1(T) - f_1(T))^2] + E[(\hat{f}_0(T) - f_0(T))^2] - 2E[(\hat{f}_1(T) - f_1(T))(\hat{f}_0(T) - f_0(T))] \}$

To this point, the minimization problem is unconstrained and standard. Imbens and Kalyanaraman (2012) add the following constraints to this problem: $h_0 = h_1$ and $p_0 = p_1 = 1$. That is, they assume linear relationships between Y and X in the ranges $T - h_0 \leq X < T$ and $T \leq X \leq T + h_1$, with the treatment effect, $\tau$, being identified as the jump between those two linear functions at $X = T$.

Footnote 2: Note that our method is designed for sharp regression discontinuities, where treatment is received by all those who are on one side of a threshold and not received by anyone on the other side of the threshold. In fuzzy contexts, where there is a discontinuity in the probability of receiving treatment at the threshold, one can obtain estimates of the local effect of the treatment on the treated by computing the ratio of the discontinuity in the outcome at the threshold and the discontinuity in the probability of receiving treatment at the threshold. When applied in the context of fuzzy RDs, our method will identify the intent-to-treat estimate for those at the threshold, but will not yield an estimate of the local average treatment on the treated effect.

Footnote 3: Note that choosing a higher bandwidth allows more data to be used in estimating $f(\cdot)$, which reduces the variance of the estimated parameters. But a larger bandwidth increases the chance that $f(\cdot)$ is not constant and smooth within the range in which it is estimated. A higher polynomial order can improve the fit of the function $f(\cdot)$ to the observed distribution of X and Y, and thus lowers the bias. But a higher polynomial order leads to increased variance of the prediction, particularly in the tails of the distribution (e.g., at $X = T$). By minimizing MSPE through the choice of these parameters, we balance between our desires for low bias and low variance.

We take a different approach, which involves a different set of simplifying assumptions. First, unlike IK, our approach allows the treatment to more flexibly change the functional relationship between Y and X, as we do not assume linear functions on either side of the discontinuity. Our method has $f_0(\cdot)$ estimated solely on data where $X < T$, and $f_1(\cdot)$ estimated solely on data where $X \geq T$. This approach is akin to the common practice in RD studies of estimating one regression which fully interacts the polynomial terms with the treatment indicator. [Footnote 4]

Second, we simplify the minimization problem considerably by dropping the last term of Equation 1 (i.e., $-2E[(\hat{f}_1(T) - f_1(T))(\hat{f}_0(T) - f_0(T))]$). Here is our justification for doing so. Suppose that for a given choice of $h_0$ and $p_0$, the prediction error on the left side of the threshold is positive (i.e., $\hat{f}_0(T) - f_0(T) > 0$). One could attempt to select $h_1$ and $p_1$ such that the prediction error on the right side of the threshold is also positive (i.e., $\hat{f}_1(T) - f_1(T) > 0$) and equal to the bias on the left so as to cancel it. In fact, one could carry this further and select $h_1$ and $p_1$ such that the error on the right side of the threshold is as positive as possible, thus making the last term as negative as possible (a point Imbens and Kalyanaraman note as well). However, doing so comes at a penalty of increasing the square of the prediction error on the right side (i.e., $E[(\hat{f}_1(T) - f_1(T))^2]$) and thereby results in a higher MSPE. Thus, there is little to be gained by selecting $h_1$ and $p_1$ on the basis of the last term in Equation 1. If we can ignore this term, we substantially simplify the task by breaking it into two separate problems:

(2)  $(\hat{h}_0, \hat{p}_0) = \arg\min_{h_0, p_0} E[(\hat{f}_0(T) - f_0(T))^2]$  and  $(\hat{h}_1, \hat{p}_1) = \arg\min_{h_1, p_1} E[(\hat{f}_1(T) - f_1(T))^2]$

The advantage of our approach is that we can directly evaluate how different choices of bandwidth and polynomial order perform in predicting observed outcomes before one reaches the threshold, and pick values that have demonstrated strong performance in terms of their mean squared prediction errors for observed values.

Footnote 4: Note that in such models $f_0(\cdot)$ and $f_1(\cdot)$ are in effect estimated solely based on data from their respective sides of the threshold, as the same coefficients could be obtained by separate polynomial regressions on each side of the threshold. Put differently, no information from the right hand side is being used to estimate the coefficients on the left hand side and vice versa.

Our key insight is that by focusing on data from one side of the threshold only, we can use that observed data to calculate a series of MSPEs and then predict $E[(\hat{f}_0(T) - f_0(T))^2]$ (and $E[(\hat{f}_1(T) - f_1(T))^2]$) as weighted averages of observed MSPEs (and confidence intervals around the weighted averages of observed MSPEs). We recognize, however, that if the treatment does not affect the functional relationship between Y and X (e.g., $f_0'(\cdot) = f_1'(\cdot)$), then our method would be inefficient (but unbiased), as one would gain power to estimate the common slope parameters of $f_0(\cdot)$ and $f_1(\cdot)$ by using data on both sides of the threshold.

Index the observed distinct values of X on the left side of the threshold as $j = 1$ to $J$, moving toward the threshold. Define $MSPE_j(p_0, h_0)$ as equal to $(\hat{Y}_j - Y_j)^2$, where $\hat{Y}_j$ is the predicted value at $X_j$ from a polynomial of order $p_0$ that is estimated over the interval from $X_{j-h_0}$ to $X_{j-1}$ using the observed distributions of X and Y in this interval, $h_0$ reflects the number of prior observations that are used to estimate the polynomial, and $Y_j$ is the observed value of Y when $X = X_j$. Note that this formula uses an adaptive bandwidth, measured as a number of prior observations rather than a fixed width in X (i.e., the width $X_{j-1} - X_{j-h_0}$ varies with $j$), to accommodate areas where the data are thin.

Suppose that we estimated $E[(\hat{f}_0(T) - f_0(T))^2]$ as a straight average of these calculated values of $MSPE_j$ (i.e., their unweighted mean), and then selected the parameters $p_0$ and $h_0$ that minimized this straight average. One disadvantage of doing so would be that it would ignore variance across the values of $MSPE_j$ and would not consider the number of observations of $MSPE_j$ used to compute this average. [Footnote 5] Less confidence should be placed in estimates that rely on fewer or more variable observations of $MSPE_j$. Thus, rather than select the parameters $p_0$ and $h_0$ that minimize the average, we select parameters $p_0$ and $h_0$ that minimize the upper bound of an 80% confidence interval around the average (i.e., such that there is only a 10% chance that the true, unknown, mean value of the broader distribution from which our observations are drawn is greater than this upper bound). [Footnote 6]

A second disadvantage of a straight average is that it places equal weight on the calculated values of $MSPE_j$ regardless of how far $X_j$ is from the threshold. So as to place more weight on the calculated values of $MSPE_j$ for which $X_j$ is close to T, we estimate $\widehat{MSPE}_T(p_0, h_0)$ as a weighted average of the calculated values of $MSPE_j$:

(3)  $\widehat{MSPE}_T(p_0, h_0) = \sum_{j} w_j \, MSPE_j(p_0, h_0)$,

where $w_j$ is a kernel function (defined below). We then find the parameters that solve $\arg\min_{p_0, h_0} \{\widehat{MSPE}_T(p_0, h_0) + c \cdot \widehat{se}[\widehat{MSPE}_T(p_0, h_0)]\}$, where $\widehat{se}[\cdot]$ is the estimated standard error of the weighted average and $c$ is the critical value corresponding to the upper bound of the 80% confidence interval. To find these parameters, we compute $MSPE_j(p_0, h_0)$ for all combinations of $p_0$ and $h_0$ subject to the following constraints: $p_0$ and $h_0$ are integers; $\max(h_{\min}, p_0 + 1) \leq h_0 \leq \min(h_{\max}, J - 1)$, where $h_{\min}$ and $h_{\max}$ are the minimum and maximum number of prior observations the researcher is willing to allow to be used in computing $\hat{Y}_j$; and $p_{\min} \leq p_0 \leq p_{\max}$, where $p_{\min}$ and $p_{\max}$ are the minimum and maximum polynomial orders the researcher is willing to consider, with $p_{\min} \geq 0$, and when $p_0 = 0$, $\hat{Y}_j$ is defined as the average of the $h_0$ prior values of Y. We select the combination of $p_0$ and $h_0$ (among those that are considered) that minimizes this upper bound. [Footnote 7]

In our empirical investigations below, we use an exponential kernel, defined as follows:

(4)  $w_j = B^{(j - J)/J} \Big/ \sum_{i=1}^{J} B^{(i - J)/J}$,

where $B$ is the base weight, and we alternately explore base weights equal to 1, $10^3$, $10^6$, and $10^{10}$. When $B = 1$, $w_j$ is the uniform kernel, which gives uniform weight to each value of $MSPE_j$ when estimating $\widehat{MSPE}_T$. When $B = 10^3$ ($10^6$) [$10^{10}$], while all MSPEs get some positive weight, 50% (75%) [90%] of the weight is placed on the last 10% of MSPEs that are closest to the threshold. That is, higher values of $B$ give more emphasis to MSPEs closer to the threshold than further away.

We repeat this process to estimate the parameters that solve $\arg\min_{p_1, h_1} \{\widehat{MSPE}_T(p_1, h_1) + c \cdot \widehat{se}[\widehat{MSPE}_T(p_1, h_1)]\}$, with the only difference being that we index the observed distinct values of X from the right extreme in toward the threshold, so that the analysis moves from the extreme right in towards T.

Footnote 5: The number of observations of $MSPE_j$ declines by 1 for each unit increase in either $h_0$ or $h_1$.

Footnote 6: As with all confidence intervals, the choice of 80% is arbitrary. Different values can be set by the user of our Stata program for executing this method (Long and Rooklyn, 2016).

Footnote 7: Note that if using a linear specification with only the last two data points (or any polynomial of order $p$ using the last $p + 1$ data points), there will be no variance in the estimate of Y at the threshold. If this occurs for both sides of the discontinuity, there would be no variance to the estimate of the jump at the threshold. Such a lack of variance of the difference at the discontinuity would disallow hypothesis testing (or would conclude that there is an infinite t-statistic). As this result is unsatisfactory in most contexts, the reader may want to disallow such specification/bandwidth combinations.
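To make the selection rule above concrete, the following is a minimal Python sketch of one side of the procedure as we read it, assuming numpy and scipy are available: for each candidate (polynomial order, number of prior observations) pair, it computes the series of one-step-ahead squared prediction errors, forms the base-weighted average of Equations (3) and (4), and picks the pair with the smallest upper confidence bound. The weighted standard-error formula and the normal critical value are conventional choices on our part, not necessarily the exact formulas used in the authors' Stata implementation, and all function and variable names are ours.

```python
import numpy as np
from scipy.stats import norm


def mspe_series(x, y, order, n_prior):
    """One-step-ahead squared prediction errors for a candidate specification.

    x, y must be sorted so that the last element is the observation closest to
    the threshold (reverse the arrays to work on the right-hand side). Each
    y[j] is predicted from a polynomial of the given order fit to the n_prior
    observations immediately before it."""
    errs = []
    for j in range(n_prior, len(x)):
        xs, ys = x[j - n_prior:j], y[j - n_prior:j]
        if order == 0:
            pred = ys.mean()  # order zero = average of the prior values
        else:
            pred = np.polyval(np.polyfit(xs, ys, order), x[j])
        errs.append((pred - y[j]) ** 2)
    return np.array(errs)


def upper_bound(errs, base_weight=1e3, level=0.80):
    """Kernel-weighted mean of the MSPE series plus a one-sided confidence bound."""
    J = len(errs)
    j = np.arange(1, J + 1)              # j = J is the error closest to the threshold
    w = base_weight ** ((j - J) / J)     # exponential kernel; base_weight = 1 is uniform
    w /= w.sum()
    mean = np.sum(w * errs)
    # Sandwich-style variance of the weighted mean (one common convention; the
    # paper's exact formula may differ).
    var_mean = np.sum(w ** 2 * (errs - mean) ** 2) * J / max(J - 1, 1)
    z = norm.ppf(0.5 + level / 2.0)      # about 1.28 for an 80% interval
    return mean + z * np.sqrt(var_mean)


def select_spec(x, y, p_max=5, h_min=5, min_mspes=5, base_weight=1e3):
    """Return the (order, n_prior) pair minimizing the upper bound of the weighted MSPE."""
    best, best_ub = None, np.inf
    for order in range(0, p_max + 1):
        for n_prior in range(max(h_min, order + 1), len(x)):
            errs = mspe_series(x, y, order, n_prior)
            if len(errs) < min_mspes:
                continue
            ub = upper_bound(errs, base_weight)
            if ub < best_ub:
                best, best_ub = (order, n_prior), ub
    return best, best_ub
```

The defaults mirror the settings used in most of the paper (polynomial orders zero to five, at least five prior observations, at least five MSPEs, base weight of 10^3, 80% confidence interval); for the right-hand side of the discontinuity, the same functions can be applied to the reversed arrays.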

To illustrate our method, suppose we had a series of six data points with (X, Y) coordinates (1,12), (2,15), (3,16), (4,13), (5,10), and (6,7), and we would like to use this information to estimate the next value of Y (when X=7). These six points are shown in Panel A of Figure 1. Our task is to find the specification that generally performs well in predicting the next value of Y, and more specifically, as discussed above, has a low MSPE for X=7.

The argument for imposing a limited bandwidth, and not using all of the data points to predict the next value of Y, is a presumption that there has been a change in the underlying relationship between Y and X; for example, a discrete jump in the value of Y (perhaps unrelated to X), or a change in the function defining the relationship Y = f(X). If such a change occurred, then limiting the bandwidth would (ideally) constrain the analysis to the range in which f(X) is steady. In the example discussed above, there does appear to be a change in f(X), as the function appears to become linear after X=3. Of course, this apparent change could be a mirage and the underlying relationship could in fact be quadratic with no change. If there is no change in the relationship between Y and X, then one would generally want to use all available data points to best estimate f(X). Our method for adjudicating between these specification and bandwidth choices is to compare all possibilities based on the predicted MSPE at the threshold (and the upper bound of its confidence interval).

Panels B through F of Figure 1 show the performance of possible candidate estimators. The corresponding Table 1 illustrates our method, where Panel A gives the predicted values based on polynomial orders in the range 0 to 2, and Panel B gives the calculation of each MSPE for the feasible combinations of polynomial order and number of prior data points. Note that since the last four observations happen to be on a line (i.e., (3,16), (4,13), (5,10), and (6,7)), the linear specification using two prior data points has no error in predicting the values of Y when X equals 5 or 6, and the same is true for either the linear or quadratic specifications using three prior values for predicting the value of Y when X equals 6.

[Insert Figure 1]
[Insert Table 1]
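As a quick check on the worked example, the short script below (assuming numpy; variable names are ours) reproduces the order-1, two-prior-points column of Table 1: predictions of 18, 17, 10, and 7 for X = 3 through 6, squared errors of 4, 16, 0, and 0, and a prediction of 4 for the unobserved point at X = 7.

```python
import numpy as np

# The six observed points from Figure 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([12.0, 15.0, 16.0, 13.0, 10.0, 7.0])

# Linear fit to the two points immediately before each X (the order 1,
# two-prior-points column of Table 1), then the squared prediction error.
for j in range(2, 6):
    pred = np.polyval(np.polyfit(x[j-2:j], y[j-2:j], 1), x[j])
    print(int(x[j]), round(pred, 6), round((pred - y[j]) ** 2, 6))

# The same specification's prediction for the next, unobserved value at X = 7.
print(round(np.polyval(np.polyfit(x[4:6], y[4:6], 1), 7.0), 6))  # 4.0
```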

Panel C of Table 1 shows the weighted averages using various kernels. A linear specification using two prior data points has the lowest weighted average MSPE using all four base weights (the order 1, two-prior-points column). [Footnote 8] This result is not surprising given the perfect linear relation of Y and X for the last four data points. As one can see, as the base weight increases, the weighted average approaches the value of the last MSPE in the series. There is clearly a trade-off involved here. With greater weight placed on the last MSPEs in the series, one gets less bias in the estimate of MSPE at the threshold, as less weight is placed on MSPEs far away from the threshold. However, relying solely on the last MSPE could invite error: a particular specification might accidentally produce a near perfect prediction for the last values of Y before the threshold, and thus have a lower MSPE, but incorrectly predict the unknown value of Y at the threshold.

Panel D of Table 1 presents the upper bound of the 80% confidence interval around the weighted average MSPE. Note that the linear specification using two prior data points has the lowest upper bound for three of the four base weights (with the exception being the uniform weight). Since high base weights produce wider confidence intervals, as they increase the sample standard deviation of the weighted average, using this upper bound of the confidence interval helps avoid "unhappy accidents" that could occur when using only the last MSPE. When we apply our method to simulated data, we find that the performance is relatively insensitive to the base weight, although we favor B = 10^3 given its strong performance documented below.

Our Stata program for executing this method (Long and Rooklyn, 2016) allows the user to (a) select the minimum number of MSPEs that must be included in the analysis (at least 2), excluding from consideration combinations of bandwidth and polynomial order that result in few observations of MSPE, and thus to avoid "unhappy accidents"; (b) select the minimum and maximum order of the polynomial that the user is willing to consider; (c) select the minimum number of observations the researcher is willing to allow to be used to estimate the next observation; and (d) select the desired confidence interval. For the rest of the paper (excluding Section 3.3), we set the minimum number of MSPEs to five, the minimum and maximum polynomial orders to zero and five, the minimum number of observations to five, and the confidence interval to 80%.

Footnote 8: If there are ties for the lowest value, which did not occur in Table 1, we select the specification with the lowest order polynomial (and ties for a given specification are adjudicated by selecting the smaller bandwidth). We make these choices given the preference in the literature for narrower bandwidths and lower-order polynomials.

In the next section, we illustrate the method by applying it to simulated data and use the method to re-evaluate examples from the existing literature.

3. Case Studies That Illustrate the Method

3.1 Case Study 1: Method Applied to Jacob et al. (2012) with a Simulated Treatment

Jacob, Zhu, Somers, and Bloom (2012) provide a primer on how to use RD methods. They illustrate contemporary methods using a real data set with a simulated treatment effect, described as follows: "The simulated data set is constructed using actual student test scores on a seventh-grade math assessment. From the full data set, we selected two waves of student test scores and used those two test scores as the basis for the simulated data set. One test score (the pretest) was used as the rating variable and the other (the posttest) was used as the outcome. We picked the median of the pretest (= 215) as the cut-point (so that we would have a balanced ratio between the treatment and control units) and added a treatment effect of 10 scale score points to the posttest score of everyone whose pretest score fell below the median" (pp. 7-8).

We utilize these data provided by Jacob et al. to illustrate the efficacy of our method. Since the test scores are given in integers, and since the number of students located at each value of the pretest scores differs, we add a frequency weight to the regressions in constructing our predicted values, and the weight for computing the weighted average MSPE becomes $w_j \cdot n_j$, where $n_j$ is the number of observations that have that value of X.

In the first panel of Table 2, we estimate the simulated treatment effect (which should be -10 by construction) with the threshold at 215. Our method selects a linear specification using 23 data points for the left hand side and a quadratic specification with 33 data points for the right hand side (these selections are not sensitive to the base weight). Compared to the IK method, which selects a bandwidth of 6.3 for both sides, our method selected a much larger bandwidth. [Footnote 9]

Footnote 9: To estimate these IK bandwidths and resulting treatment effect estimates, we use the rd command for Stata that was developed by Nichols (2011) and use local linear regressions (using triangular ("edge") kernel weights) within the selected bandwidth. We also find nearly identical results using the rdob program for Stata written by Fuji, Imbens, and Kalyanaraman (2009).

Our method outperforms IK with a slightly better estimate of the treatment effect (-9.36 versus -10.68) and smaller standard errors (0.73 versus 1.27). The much smaller standard error provides our method with more power than IK to correctly identify smaller treatment effects.

[Insert Table 2]

The second and third panels of Table 2 reset the threshold for the simulated effect to 205 and 225, which are respectively at the 19th and 77th percentiles of the distribution of X. With the threshold at 205, our model produces estimates of the simulated treatment effect in the range of -9.96 to -10.09 with base weights of 1 to 10^6 and -8.60 with a base weight of 10^10. Regardless of the base weight, our method selects a quadratic specification using the first 47 observations on the right side of the discontinuity. In contrast, the IK method uses a bandwidth of only 7.3 on both sides of the discontinuity and yields an inferior estimate of the treatment effect (-8.25) with a higher standard error. Our method and the IK method produce comparable estimates of the treatment effect when the threshold is set at 225 (-11.67 to -11.78 for our method versus -11.74 for IK), yet our method again has smaller standard errors due to more precision in the estimates of the regression line. Figure 3 illustrates our preferred specifications and bandwidths for these three thresholds using 10^3 as the base weight.

[Insert Figure 3]

The next analysis, which is shown in Table 3, evaluates how our method performs when there is a zero simulated treatment effect. We restore the Jacob et al. data to have no effect and then estimate placebo treatment effects with the threshold set at 200, 205, ..., 230. We are testing whether our method generates false positives: apparent evidence of a treatment effect when there is no treatment. Our model yields estimated treatment effects that are generally small and lie in the range of -1.67 to 0.64. The bad news is that 2 of the 7 estimates are significant at the 10% level (1 at the 5% level). Thus, a researcher who uses our method would be more likely to incorrectly claim a small estimated treatment effect to be significant. The IK method does better at not finding significant placebo effects in the Jacob et al. (2012) data (none of the IK estimates are significant). However, the IK estimates have a broader range of -2.27 to 1.75. Thus the researcher using the IK method would be more inclined to incorrectly conclude that the treatment had a sizable effect even when the policy had no effect.

The mean absolute error for this set of estimates is 0.76 using our method versus 0.97 using the IK method. The only reason that our method is more likely to incorrectly find significant effects is our lower standard errors, which lie in the range of 0.68 to 1.10, versus the IK standard errors, which lie in the range of 1.22 to 1.89. Thus, we conclude that our higher rate of incorrectly finding significant effects is not a bug but a feature. The researcher who uses our method and finds an insignificant effect can argue that it is "a well estimated zero," while that advantage is less likely to be present using IK.

[Insert Table 3]

To further investigate the efficacy of our method and to compare it to IK's method, we augment the Jacob et al. (2012) data by altering the outcome as follows: $posttest_{aug} = posttest + 5 + (X - 200) - 0.1(X - 200)^2 + 0.0015(X - 200)^3$. This cubic augmentation increases up to a local maximum of 7.7 points at X = 206, then declines to a local minimum of -19.1 at X = 239, and then curves upward again. We then estimate simulated treatment effects of 10 points for those below various thresholds, alternately set at 200, 205, ..., 230. This simulated treatment effect added to an underlying cubic relation between X and the augmented posttest should be harder to identify using the IK method, as it relies on an assumption of local linear relations. We furthermore evaluate our method relative to IK where the augmentation of the posttest only occurs on the left or right side of the threshold. Note that since a treatment could have heterogeneous effects, and thus larger or smaller effects away from the threshold, it is possible for the treatment to not only have a level effect at the threshold, but also alter the relationship between the outcome (Y) and the score (X). [Footnote 10] Our method should have a better ability to handle such cases, and thus to derive a better estimate of the local effect at the threshold.

Footnote 10: When we add the augmentation to the left hand side only, we level-shift the right hand side up or down so that there is a simulated effect of -10 points at the threshold, and vice-versa.

The results are shown in Table 4 and the corresponding graphical representations are shown in Figure 4. In Panel A of Table 4, we show the results with the cubic augmentation applied to both sides of the threshold. Across the seven estimations, our method produces an average absolute error of 0.94, which is a 7% improvement on the absolute error found using the IK method, where the average absolute error was 1.00. In Panels B and C of Table 4, we show the results with the cubic augmentation applied to the left and right sides of the threshold, respectively.
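The augmentation formula above is our reconstruction of a garbled expression in the source text; the short check below (assuming numpy, with names of our own choosing) confirms that the reconstructed cubic reproduces the turning points quoted in the text, a local maximum of about 7.7 points near X = 206 and a local minimum of about -19.1 near X = 239.

```python
import numpy as np

def cubic_bump(x):
    """The cubic term added to the posttest score (our reconstruction)."""
    u = x - 200.0
    return 5.0 + u - 0.1 * u**2 + 0.0015 * u**3

xs = np.linspace(200.0, 260.0, 6001)
g = cubic_bump(xs)

left = xs <= 220.0                 # the local maximum lies below X = 220
i_max = np.argmax(g[left])
i_min = np.argmin(g)               # the interior dip is the minimum on this range
print(round(xs[left][i_max], 1), round(g[left][i_max], 1))  # ~205.7, 7.7
print(round(xs[i_min], 1), round(g[i_min], 1))              # ~238.7, -19.1
```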

Our method is particularly advantageous when the augmentation is applied to the right side: for these estimations, our method produces an average absolute error that is 30% lower than the average absolute error using the IK method. As shown in Figure 4, the principal advantage of our method is the adaptability of the bandwidth and curvature given the available evidence on each side of the threshold.

[Insert Table 4]
[Insert Figure 4]

Having now (hopefully) established the utility of our method, in the next two sections we apply the method to two prominent papers in the RD literature.

3.2 Case Study 2: Method Applied to Data from Lee (2008)

Our second case study applies our method to re-estimate findings in Lee (2008), discussed in Section 1. First, we re-examine the result shown in Lee's Figure 2a. Y is an indicator variable that equals 1 if the Democratic Party won the election in that district in year t+1. The key identifying assumption is that there is a modest random component to the final vote share (e.g., rain on Election Day) that cannot be fully controlled by the candidates and that, effectively, "whether the Democrats win in a closely contested election is...determined as if by a flip of a coin" (p. 684). Lee's data come from U.S. Congressional election returns from 1946 to 1998 (see Lee (2008) for a full description of the data). [Footnote 11]

Footnote 11: We obtained these data on January 2, 2015 from http://economics.mit.edu/faculty/angrist/data1/mhe/lee.

The Lee data present a practical challenge for our method. They contain 4,341 and 5,332 distinct values of X on the left and right sides of the discontinuity. Using every possible number of prior values of X to predict Y at all distinct values of X, while possible, requires substantial computer processing time. To reduce our processing time, we compute the average value of X and Y within 200 bins on each side of the discontinuity, with each bin having a width of 0.5% (since X ranges from -100% to +100% with the discontinuity at 0%). Binning the data as such has the disadvantage of throwing out some information (i.e., the upwards or downwards sloping relationship between X and Y within the bin); yet, for most practical applications this information loss is minor if the bins are kept narrow.
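A minimal sketch of the binning step just described (0.5%-wide bins on one side of the cutoff, averaging X and Y within each bin and dropping empty bins) is shown below, assuming numpy; the function and variable names are ours, and the bin counts it returns can serve as the frequency weights used later.

```python
import numpy as np

def bin_side(margin, y, lo=-100.0, hi=0.0, width=0.5):
    """Average the running variable and outcome within fixed-width bins.

    margin: vote-share margin in percent on one side of the cutoff (here the
    losing side, -100 to 0); y: outcome (e.g., 1 if the party wins in t+1).
    Returns one row per non-empty bin: mean X, mean Y, count."""
    edges = np.arange(lo, hi + width, width)
    which = np.digitize(margin, edges) - 1        # bin index for each observation
    rows = []
    for b in range(len(edges) - 1):
        in_bin = which == b
        if in_bin.any():                          # bins with no elections are dropped
            rows.append((margin[in_bin].mean(), y[in_bin].mean(), in_bin.sum()))
    return np.array(rows)
```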

To estimate the treatment effect, Lee applies "a logit with a 4th order polynomial in the margin of victory, separately, for the winners and the losers" (Lee, 2001, p. 14) using all of the data on both sides of the discontinuity. Given that our binning results in fractional values that lie in the interval from 0% to 100%, we use a generalized linear model with a logit link function, as recommended by Papke and Wooldridge (1996) for modeling proportions. [Footnote 12] We find that a specification in which the index $X'\beta$ is linear in X and uses less than half of the data points is best for both the left and the right sides (64 and 28 values on the right and left respectively, with the corresponding bandwidth range for the assignment variable being -32.0% to 13.0%). [Footnote 13] We estimate that the Democratic Party has a 15.3% chance of winning the next election if it was barely below 50% in the prior election, and a 57.7% chance of winning the next election if it is just to the right of the discontinuity. Figure 5 shows the estimated curves. Our estimate of the treatment effect (i.e., of barely winning the prior election), 42.3% (s.e. = 3.5%), is smaller than Lee's estimate, which is found in Lee (2001): 45.0% (s.e. = 3.1%).

[Insert Figure 5]

Next, we re-examine the result shown in Lee's Figure 4a, where Y is now defined as the Democratic Party's vote share in year t+1. Lee (2008) used a 4th order polynomial in X for each side of the discontinuity and concluded that the impact of incumbency on vote share was 0.077 (s.e. = 0.011). That is, being the incumbent raised the expected vote share in the next election by 7.7 percentage points. Applying our method (as shown in Figure 6), we find that the best specification/bandwidth choice uses a quadratic specification based on the last 171 observations on the left hand side and a 5th order polynomial based on the 188 observations to the right of the discontinuity (with the corresponding bandwidth range for the assignment variable being -94.8% to 93.7%). Our estimated treatment effect is smaller than Lee's and has a smaller standard error: 0.057 (s.e. = 0.003).

[Insert Figure 6]

Footnote 12: See also Baum (2008).

Footnote 13: After binning the data, we end up with 145 distinct values of X on the left side, as some bins have no data (i.e., no elections in which the Democratic vote share in year t minus the strongest opponent's share fell in that range).
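For the binned win-probability outcome described above, a fractional-response fit of the kind recommended by Papke and Wooldridge (1996) can be sketched with a binomial-family GLM and logit link, estimated separately on each side of the cutoff with the bin counts as frequency weights. This is a hedged sketch using statsmodels; the polynomial order and all variable names are placeholders of ours, not the authors' code.

```python
import numpy as np
import statsmodels.api as sm

def fit_fractional_logit(x_bin, y_share, counts, order=1):
    """GLM with a logit link for a proportion outcome, weighting each bin by
    the number of underlying observations."""
    X = np.vander(np.asarray(x_bin), order + 1)   # polynomial terms plus a constant
    model = sm.GLM(np.asarray(y_share), X,
                   family=sm.families.Binomial(),
                   freq_weights=np.asarray(counts))
    return model.fit()

# Example usage (hypothetical arrays for the losing side of the cutoff):
# res = fit_fractional_logit(x_left, share_left, n_left, order=1)
# res.predict(np.vander(np.array([0.0]), 2))  # predicted win probability at the cutoff
```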

optimal bandwidth/specification resulted in a larger estimate of the effect of incumbency (0.100) and a larger standard error (0.029). Scanning across their Table 2, the smallest estimated effect that they found was 0.048. Thus, our estimate is not outside of the range of their estimates. Nonetheless, our estimate is smaller than what would be selected using Lee and Lemieux s twostep method for selecting the optimal bandwidth and then optimal specification for that bandwidth. Imbens and Kalyanaraman s found that the optimal bandwidth for a linear specification on both sides was 0.29 and using this bandwidth/specification produced an estimate of the treatment effect of 0.080 (s.e. = 0.008). Again, their preferred estimate is somewhat larger than the estimate found using our method and with a higher standard error. 14 3.3 Case Study 5: Method Applied to Data from Chen, Ebenstein, Greenstone, & Li (2013) Our final case study is a replication of a prominent paper by Chen et al. (2013) that alarmingly concludes that an arbitrary Chinese policy that greatly increases total suspended particulates (TSPs) air pollution is causing the 500 million residents of Northern China to lose more than 2.5 billion life years of life expectancy (p. 12936). This policy established free coal to aid winter heating of homes north of the Huai River and Qinling Mountain range. Chen et al. used the distance from this boundary as the assignment variable with the treatment discontinuity being the border itself. As shown in the first column of our Figure 7 (which reprints their Figures 2 and 3), Chen et al. estimate that being north of the boundary significantly raises TSP by 248 points and significantly lowers life expectancy by 5.04 years. These estimates are also shown in Panel A of Table 5. [Insert Figure 7] [Insert Table 5] We have attempted to replicate these results. Unfortunately, the primary data are proprietary and not easy to obtain; permission for their use can only be granted by the Chinese 14 Note however, that when we apply the Stata programs written by Fuji, Imbens, and Kalyraman (2009) and Nichols (2011) that produce treatment estimates using the Imbens and Kalyanaraman (2011) method, we find the optimal bandwidth for a linear specification on both sides was 0.11 and using this bandwidth/specification produced an estimate of the treatment effect of 0.059 (s.e. = 0.002), which are quite similar to our estimates. 17

Center for Disease Control. [Footnote 15] Rather than use the underlying primary data, we are treating the data shown in their Figures 2 and 3 as if they were the actual data. To do so, we have manually measured the X and Y coordinates of each data point in these figures as well as the diameter of each circle (where the circle's area is proportional to the population of localities represented in the bin). [Footnote 16] The middle column of Figure 7 and Panel B of Table 5 present our replication applying their specification (a global cubic polynomial in latitude with a treatment jump at the discontinuity) to these data. We obtain similar results, although the magnitudes are smaller and less significant; our replication of their specification produces estimates that being north of the boundary raises TSP by 178 points (p-value 0.069) and insignificantly lowers life expectancy by 3.94 years (p-value 0.389). Comparing the first and second columns of Figure 7, note that the shapes of the estimated polynomial specifications are generally similar, with the modest discrepancies showing that there is a bit of information lost by binning the data.

In Panel C of Table 5, we apply our method to estimate these treatment effects. [Footnote 17] We find significant effects on TSP, with TSP rising significantly by 146 points (using IK's method, TSP is found to rise significantly by 197 points). Thus, Chen et al.'s conclusion that TSP rises significantly appears to be reasonable and robust to alternative specifications. However, as shown in the second column in Panel D of Table 5, the estimated treatment impact on life expectancy is much smaller; we estimate that being north of the boundary significantly lowers life expectancy by 0.40 years, which is roughly one-tenth the effect size we estimated using their global cubic polynomial specification. The fragility of these results should not be surprising given a visual inspection of the scatterplot, which does not reveal a clear pattern to the naked eye. In fact, for the right hand side of the threshold for life expectancy, we find that a simple averaging of the 8 data points to the left of the threshold gives the best prediction at the threshold. We agree with Gelman and Zelizer's (2015) critique that the result "indicates to us that neither the linear nor the cubic nor any other polynomial model is appropriate here. Instead, there are other variables not included in the model which distinguish the circles in the graph" (p. 4).

Footnote 15: Personal communication with Michael Greenstone, March 16, 2015.

Footnote 16: We have taken two separate measurements for each figure and use the average of these two measurements for the X and Y coordinates and the median of our four measurements of the diameter of each circle.

Footnote 17: Given that there are a small number of observations of X and Y on each side of the discontinuity, we placed no constraint on the minimum number of observations or the minimum number of MSPEs that are required to be included. We considered polynomials of order 0 to 5.

4. Conclusion

While regression discontinuity design has over a 50-year history for estimating treatment impacts (going back to Thistlethwaite and Campbell (1960)), the appropriate method for selecting the specification and bandwidth to implement the estimation has yet to be settled. This paper's contribution is the provision of a method for optimally and simultaneously selecting a bandwidth and polynomial order for both sides of a discontinuity. We identify the combination that minimizes the estimated mean squared prediction error at the threshold of a discontinuity. Our paper builds on Imbens and Kalyanaraman (2012), but differs from their approach, which solves for the optimal bandwidth assuming that a linear specification will be used on both sides of the discontinuity. Our insight is that one can use the information on each side of the discontinuity to see which bandwidth/polynomial-order combinations do well in predicting the next data point as one moves closer and closer to the discontinuity. We apply our method to reexamine several notable papers in the literature. While some of these papers' results are shown to be robust, others are shown to be more fragile, suggesting the importance of using optimal methods for specification and bandwidth selection.

References

Baum, C.F. (2008). Modeling proportions. Stata Journal 8: 299-303.

Chen, Y., Ebenstein, A., Greenstone, M., and Li, H. (2013). Evidence on the impact of sustained exposure to air pollution on life expectancy from China's Huai River policy. Proceedings of the National Academy of Sciences 110: 12936-12941.

DiNardo, J., and Lee, D. (2010). Program evaluation and research designs. In Ashenfelter and Card (eds.), Handbook of Labor Economics, Vol. 4.

Fuji, D., Imbens, G., and Kalyanaraman, K. (2009). Notes for Matlab and Stata regression discontinuity software. https://www.researchgate.net/publication/228912658_notes_for_matlab_and_stata_regression_discontinuity_software. Software downloaded on July 2, 2015 from http://faculty-gsb.stanford.edu/imbens/regressiondiscontinuity.html.

Gelman, A., and Imbens, G. (2014). Why high-order polynomials should not be used in regression discontinuity designs. National Bureau of Economic Research, Working Paper 20405, http://www.nber.org/papers/w20405.

Gelman, A., and Zelizer, A. (2015). Evidence on the deleterious impact of sustained use of polynomial regression on causal inference. Research & Politics 2(1): 1-7.

Imbens, G., and Kalyanaraman, K. (2012). Optimal bandwidth choice for the regression discontinuity estimator. Review of Economic Studies 79: 933-959.

Imbens, G., and Lemieux, T. (2008). Regression discontinuity designs: A guide to practice. Journal of Econometrics 142: 615-635.

Jacob, R., Zhu, P., Somers, M., and Bloom, H. (2012). A practical guide to regression discontinuity. MDRC. Accessed via http://www.mdrc.org/sites/default/files/regression_discontinuity_full.pdf.

Lee, D.S. (2001). The electoral advantage to incumbency and voters' valuation of politicians' experience: A regression discontinuity analysis of elections to the U.S. House. National Bureau of Economic Research, Working Paper 8441.

Lee, D.S. (2008). Randomized experiments from non-random selection in U.S. House elections. Journal of Econometrics 142: 675-697.

Lee, D.S., and Lemieux, T. (2010). Regression discontinuity designs in economics. Journal of Economic Literature 48: 281-355.

Long, M.C., and Rooklyn, J. (2016). Next: A Stata program for regression discontinuity. University of Washington.

Nichols, A. (2011). rd 2.0: Revised Stata module for regression discontinuity estimation. http://ideas.repec.org/c/boc/bocode/s456888.html.

Papke, L.E., and Wooldridge, J. (1996). Econometric methods for fractional response variables with an application to 401(k) plan participation rates. Journal of Applied Econometrics 11: 619-632.

Thistlethwaite, D., and Campbell, D. (1960). Regression-discontinuity analysis: An alternative to the ex post facto experiment. Journal of Educational Psychology 51(6): 309-317.

Van Der Klaauw, W. (2008). Regression-discontinuity analysis: A survey of recent developments in economics. Labour 22: 219-245.

Figure 1: Predicting the next value after six observed data points

[Six panels, each plotting Y (0 to 20) against X (1 to 6):]
Panel A: Data available to predict next Y
Panel B: Predicting Y given X = 2 using prior value of X
Panel C: Predicting Y given X = 3 using prior two values of X
Panel D: Predicting Y given X = 4 using prior three values of X
Panel E: Predicting Y given X = 5 using prior four values of X
Panel F: Predicting Y given X = 6 using prior five values of X

Table 1: Computing Mean Squared Prediction Error (MSPE) and Selecting the Optimal Specification and Bandwidth

Column headings give the polynomial order / number of prior data points for each candidate specification.

Panel A: Prediction of Y
  X    Y |   0/1   0/2   0/3   0/4   0/5 |   1/2   1/3   1/4   1/5 |   2/3   2/4   2/5
  1   12 |
  2   15 |  12.0
  3   16 |  15.0  13.5                   |  18.0
  4   13 |  16.0  15.5  14.3             |  17.0  18.3             |  15.0
  5   10 |  13.0  14.5  14.7  14.0       |  10.0  12.7  15.0       |   6.0   7.5
  6    7 |  10.0  11.5  13.0  13.5  13.2 |   7.0   7.0   9.0  11.4 |   7.0   4.0   3.4

Panel B: Squared Prediction Error
  X      |   0/1   0/2   0/3   0/4   0/5 |   1/2   1/3   1/4   1/5 |   2/3   2/4   2/5
  2      |   9.0
  3      |   1.0   6.3                   |   4.0
  4      |   9.0   6.3   1.8             |  16.0  28.4             |   4.0
  5      |   9.0  20.3  21.8  16.0       |   0.0   7.1  25.0       |  16.0   6.3
  6      |   9.0  20.3  36.0  42.3  38.4 |   0.0   0.0   4.0  19.4 |   0.0   9.0  13.0

Panel C: Predicted Value of MSPE given X = Threshold (i.e., Weighted Average of MSPEs)
                          |   0/1   0/2   0/3   0/4   0/5 |   1/2   1/3   1/4   1/5 |   2/3   2/4   2/5
Base weight = 1 (uniform) |   7.4  13.3  19.9  29.1  38.4 |   5.0  11.9  14.5  19.4 |   6.7   7.6  13.0
Base weight = 10^3        |   8.9  19.4  31.6  37.0  38.4 |   0.8   2.7   8.2  19.4 |   3.2   8.4  13.0
Base weight = 10^6        | 8.998  20.2  35.0  40.7  38.4 |   0.1   0.5   5.2  19.4 |   1.0   8.8  13.0
Base weight = 10^10       | 8.99999 20.25 35.9 42.0  38.4 | 0.002   0.1   4.2  19.4 |   0.2  8.97  13.0

Panel D: Upper Bound of 80% Confidence Interval Around Predicted Value of MSPE given X = Threshold
                          |   0/1   0/2   0/3   0/4 |   1/2   1/3   1/4 |   2/3   2/4
Base weight = 1 (uniform) |   9.9  19.9  38.6  69.5 |  11.2  28.0  46.8 |  15.7  11.9
Base weight = 10^3        |  13.2  29.7  57.1  84.1 |  11.2  26.2  47.6 |  17.5  13.9
Base weight = 10^6        |  14.1  32.6  65.5  94.5 |  12.1  27.6  49.6 |  16.5  14.8
Base weight = 10^10       |  14.4  33.4  68.0  98.6 |  12.4  27.9  49.8 |  15.9  15.0

Note: Combinations that yield only a single MSPE (the */5 columns) are omitted from Panel D because no confidence interval can be computed for them.

Table 2: Estimating a Simulated Treatment Effect of -10 with Jacob et al. (2012) Data

Threshold = 215
  Base weight:                          1          1,000      10^6       10^10
  Left side of threshold
    Optimal specification               Linear     Linear     Linear     Linear
    Optimal # prior observations        23         23         23         23
    Total # prior observations          42         42         42         42
  Right side of threshold
    Optimal specification               Quadratic  Quadratic  Quadratic  Quadratic
    Optimal # prior observations        33         33         33         33
    Total # prior observations          42         42         42         42
  Our estimate of treatment effect
    Estimate                            -9.36      -9.36      -9.36      -9.36
    s.e. (estimate)                     (0.73)     (0.73)     (0.73)     (0.73)
  Using Imbens and Kalyanaraman's (2012) optimal bandwidth for a linear specification
    Bandwidth                           6.3
    Estimate                            -10.68
    s.e. (estimate)                     (1.27)

Threshold = 205
  Base weight:                          1          1,000      10^6       10^10
  Left side of threshold
    Optimal specification               Linear     Linear     Linear     Linear
    Optimal # prior observations        19         19         17         6
    Total # prior observations          32         32         32         32
  Right side of threshold
    Optimal specification               Quadratic  Quadratic  Quadratic  Quadratic
    Optimal # prior observations        47         47         47         47
    Total # prior observations          52         52         52         52
  Our estimate of treatment effect
    Estimate                            -9.96      -9.96      -10.09     -8.60
    s.e. (estimate)                     (0.93)     (0.93)     (0.95)     (1.38)
  Using Imbens and Kalyanaraman's (2012) optimal bandwidth for a linear specification
    Bandwidth                           7.3
    Estimate                            -8.25
    s.e. (estimate)                     (1.50)

Threshold = 225
  Base weight:                          1          1,000      10^6       10^10
  Left side of threshold
    Optimal specification               Cubic      Cubic      Cubic      Cubic
    Optimal # prior observations        44         44         44         33
    Total # prior observations          52         52         52         52
  Right side of threshold
    Optimal specification               Linear     Linear     Linear     Linear
    Optimal # prior observations        20         20         20         20
    Total # prior observations          32         32         32         32
  Our estimate of treatment effect
    Estimate                            -11.67     -11.67     -11.67     -11.78
    s.e. (estimate)                     (0.98)     (0.98)     (0.98)     (1.07)
  Using Imbens and Kalyanaraman's (2012) optimal bandwidth for a linear specification
    Bandwidth                           7.3
    Estimate                            -11.74
    s.e. (estimate)                     (1.44)

Figure 2: Selection of Specification and Bandwidth Using Data from Jacob et al. (2012) With Simulated Treatment Effect of -10 at Various Thresholds

[Three panels, each plotting the outcome against the pretest score over the range 180 to 280:]
Simulated Threshold = 205; Estimated Treatment Effect = -9.39 (s.e. = 0.24)
Simulated Threshold = 215; Estimated Treatment Effect = -9.96 (s.e. = 0.18)
Simulated Threshold = 225; Estimated Treatment Effect = -11.67 (s.e. = 0.14)