Predictions Without Models? Using Natural Experiments to Test the Performance of Machine Learning Algorithms


Predictions Without Models? Using Natural Experiments to Test the Performance of Machine Learning Algorithms

Devesh Raval (Federal Trade Commission), Ted Rosenbaum (Federal Trade Commission), and Nathan E. Wilson (Federal Trade Commission)

January 12, 2017. Preliminary and Incomplete.

Abstract: In recent years, scholars and practitioners in many areas have become interested in machine learning algorithms. Typically, these emphasize predictive accuracy and are agnostic about what theoretical models suggest should determine behavior. However, in changing environments, predictions based upon historical choice patterns may be difficult, or even impossible, without a model of behavior. In this paper, we test several standard machine learning models' performance in a context where the relevant policy question often concerns how consumers would respond to a significant change in their choice environment. Specifically, we assess how well machine learning algorithms perform in predicting where patients receive hospital-based treatment when their first-choice hospital is no longer available. To do this, we exploit natural disasters that abruptly closed one or more general acute care hospitals but left the surrounding area relatively unaffected. Our results suggest that, relative to commonly used econometric techniques, common machine learning algorithms often do perform better at counterfactual prediction of individual patients' choices, although their performance degrades for patients facing a larger change in their choice environment post-disaster.

JEL Codes: C18, I11, L1, L41. Keywords: machine learning, hospitals, natural experiment, patient choice, prediction.

The views expressed in this article are those of the authors. They do not necessarily represent those of the Federal Trade Commission or any of its Commissioners. We are grateful to Jonathan Byars, Gregory Dowd, Aaron Keller, Laura Kmitch, and Peter Nguon for their excellent research assistance. We also thank Dave Schmidt for his comments on this draft. The usual caveat applies.

1 Introduction

The increasing ability of firms to observe and collect large amounts of information about individual consumers and their decisions has led to the rise of what Breiman (2001b) calls the algorithmic modeling culture. In this approach, often labeled machine learning, analysts predict choices using algorithms that are not directly derived from formal models of behavior. 1 For example, a regularization procedure such as LASSO might be used to select the variables most relevant for predicting a purchase from a large set of possible regressors. This model-agnostic approach towards prediction stands in sharp contrast to the traditional Cowles Commission approach, in which economists use economic theory to identify the relevant variables and functional form. For example, McFadden (1981) shows how a model of utility-maximizing consumers leads to the multinomial logit econometric model of consumer choice.

Recently, economists have begun to use machine learning methods to answer economic questions (Kalouptsidi, 2014; Gilchrist and Sands, forthcoming; Goel et al., 2016a,b; Athey and Imbens, 2016). This rising interest may be traced to the fact that for many empirical questions, all that is required is prediction (Kleinberg et al., 2015). Therefore, the algorithmic approach may be appropriate to answer the question at hand. Moreover, comparisons of machine learning algorithms to canonical econometric models have suggested the former often outperform the latter (Bajari et al., 2015a,b).

To date, however, economists' applications of algorithmic prediction have focused on settings where the support of out-of-sample outcomes is effectively observed in the historical data. Such settings include classic problems like the response of demand to a tax change or the treatment effect of a job training program. However, there are many economic problems

1 Broad details on the methodological underpinnings of machine learning models can be found in various texts, including Hastie et al. (2005) and James et al. (2013). Varian (2014) provides a readable introduction for economists.

where this is not true. In these types of cases, there is no way to train an algorithm to predict an outcome, since there is no data to train on (Nevo and Whinston, 2010). For example, antitrust policymakers cannot rely on historical data to predict the effects of a proposed merger, nor can marketers leverage brands' past performance to predict market shares in new product categories. For such problems, it is less obvious whether machine learning methods can usefully be applied. As Marschak (1974) notes, "It follows that a theory may appear unnecessary for policy decisions until a certain structural change is expected or intended. It becomes necessary then."

In this paper, we demonstrate how algorithmic prediction models can be straightforwardly integrated into the canonical random utility model of rational decision-making in order to address these types of questions. We then use a set of natural experiments to compare the relative performance of machine learning algorithms to standard econometric models after a major structural change in the choice environment. The setting for our experiments is local hospital markets that were shocked by natural disasters that severely damaged or destroyed hospitals but left the majority of the surrounding area undisturbed. These natural disasters exogenously altered consumers' choice sets, creating a benchmark against which to assess the performance of different predictive models. As in our prior work focused on the relative performance of different commonly used econometric specifications (Raval et al., 2015a), we use the pre-disaster data to estimate consumers' preferences for different hospital characteristics. Then, we predict consumer decisions after the disaster has changed consumers' choice sets. By comparing the different models' predictions to actual post-disaster choices, we are able to evaluate their performance in both absolute and relative terms, and whether any differences are likely to matter for policy decisions.

We make no claims to exhaustively consider how all possible machine learning models perform, as the set of different algorithms is already large and continues to expand rapidly. Instead, we focus on a set of approaches that are considered highly accurate, are already implemented

within existing software packages, and can be applied straightforwardly to multinomial choice problems. In particular, we examine decision trees, random forests, gradient boosted trees, and elastic net regularized conditional logit models.

Across all of our natural experiments, we find that the gradient boosted tree model does particularly well, and is usually one of the best models at predicting aggregate shares, aggregate diversion ratios, and individual choices. When we perform an explicit model combination approach to compare the performance of all the models, the gradient boosting model and random forest model together receive about three-quarters of the model weight on average.

We do find, however, that the performance of machine learning models does not always dominate that of parametric models. For example, the performance of the machine learning models worsens for patients who were more likely to have gone to the destroyed hospital, and so were more likely to have to change their preferred hospital post-disaster. In addition, parametric logit models perform better at individual prediction for the service area with the largest share of the destroyed hospital. These results indicate that parametric logit models may still have an important role to play in counterfactual prediction when there are large changes in the environment or comparatively little data on which to train the model.

We also find that, while the random forest model does very well at predicting aggregate diversion ratios, it does very badly at predicting aggregate shares, is often worse than a single tree at individual prediction, and overfits the data much more than the other models. This likely reflects the fact that the random forest's predictions are biased, as it estimates probabilities by averaging each tree's predicted hospital rather than each tree's predicted probabilities.

This highlights the need to adapt machine learning models to the questions that economists pose. Econometricians have already started on this project; for example, Wager and Athey (2015) and Belloni et al. (2012) examine how to develop unbiased probability estimates for random forests and regularized models, respectively.

Overall, our work contributes to the emerging literature in economics on the application

of machine learning techniques (Varian, 2014; Athey, 2015; Kleinberg et al., 2015). Within this literature, the work most similar to our own is by Bajari et al. (2015a,b), who consider the relative out-of-sample performance of several machine learning models compared to simple econometric aggregate demand models. They find that many of the machine learning models outperform simple linear and logit demand models. A major difference between their work and ours is that the choice environment faced by consumers changes dramatically in our test, which is precisely the environment for which a model-based approach could be more fruitful.

The paper proceeds as follows. Section 2 describes our data and experimental settings. In Section 3, we lay out the theoretical framework underpinning the models considered in this paper. Section 4 describes the models we compare. In Section 5, we present our results on model performance. Section 6 concludes.

2 Natural Experiments

2.1 Disasters

We exploit the unexpected closures of six hospitals in four different markets following a natural disaster. Table I below describes the disasters. The Americus tornado struck a community hospital in rural Georgia, while the Moore tornado hit a small local hospital in the suburbs of Oklahoma City. Hurricane Sandy flooded portions of New York City, leading three hospitals to close in Manhattan and Brooklyn. These hospitals included NYU Hospital, one of the highest-ranked hospitals in the country, and Bellevue Hospital Center, a flagship hospital of the NYC public system. The Northridge earthquake hit Los Angeles, causing the closure of one hospital in Santa Monica. Because there is considerable heterogeneity in the treated groups, we expect any results that appear consistent across our experimental settings to have a high degree of external validity.

Table I: Natural Disasters

Location        Month/Year  Severe Weather    Hospital(s) Closed        Share Destroyed
Northridge, CA  Jan-94      Earthquake        St. John's Hospital       17.4%
Americus, GA    Mar-07      Tornado           Sumter Regional Hospital  50.4%
New York, NY    Oct-12      Superstorm Sandy  NYU Langone               8.9%
                                              Bellevue Hospital Center  10.8%
                                              Coney Island Hospital     18.2%
Moore, OK       May-13      Tornado           Moore Medical Center      11.0%

For a natural disaster to provide a good natural experiment to assess choice models, it must satisfy several criteria. First, the service area must be large enough, and the post-disaster period for which the hospital is closed long enough, that we have enough power to compare different demand models. Second, the destroyed hospital must have had a large enough market share in its service area, because the experiment is informative about model performance only when the choice environment undergoes a substantial change. Finally, the damage from the disaster must be narrow enough that the change in patient decision-making is limited to the change in the choice set. As described in detail in Raval et al. (2015a), our set of disasters meets these criteria.

For each experiment, our primary data come from the inpatient hospital discharge records collected by state departments of health. Such patient-hospital data have been used previously by researchers (Capps et al., 2003; Ciliberto and Dranove, 2006), and provide a host of characteristics describing the patient receiving care as well as the type of clinical care being provided. The details on the construction of our estimation samples are provided in Appendix B of Raval et al. (2015a). The set of affected patients are those living within the zip codes making up the destroyed hospitals' 90% service areas. We identify the choice set of affected consumers as those hospitals that have a share above 1% for patients in the 90% service area in at least one month (quarter for the smaller Sumter and Moore experiments).
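As a concrete sketch of this choice-set rule, the hypothetical Python fragment below flags hospitals whose monthly admission share among service-area patients exceeds 1% in at least one month. The input format and names are our own illustration, not the paper's actual data-construction code.

```python
from collections import defaultdict

def choice_set(records, threshold=0.01):
    """Hospitals with an admission share above `threshold` in at least one month.

    `records` is an iterable of (month, hospital) pairs for patients living in
    the destroyed hospital's 90% service area (hypothetical input format).
    """
    counts = defaultdict(lambda: defaultdict(int))  # month -> hospital -> admissions
    for month, hospital in records:
        counts[month][hospital] += 1
    chosen = set()
    for by_hosp in counts.values():
        total = sum(by_hosp.values())
        for hosp, n in by_hosp.items():
            if n / total > threshold:  # strictly above the 1% cutoff
                chosen.add(hosp)
    return chosen
```

For the smaller Sumter and Moore experiments, one would pass (quarter, hospital) pairs instead, matching the quarterly aggregation described above.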
The last column in Table I contains the share of the destroyed hospital in the choice set for each experiment. While all markets were significantly affected, the choice environment

changed differentially across experiments. Sumter Regional, the hospital hit by a tornado in rural Georgia, had about a 50% share of admissions in its service area. For the other hospitals, the share of the destroyed hospital ranges from 9% for NYU to 18% for Coney Island Hospital. We leverage this variation to try to better understand the relative performance of different models.

3 Predicting Choices in a Changing Environment

3.1 The ARUM Framework

To predict multinomial choices, economists have typically turned to additive random utility models (ARUM). Such models presume that decision-makers choose from a defined set of options so as to maximize their expected utility. Utility, in turn, is assumed to be a linearly separable combination of a deterministic component based on observable elements and an idiosyncratic shock, i.e., u_ij = δ_ij + ε_ij, where u is utility, δ is the deterministic component of utility, and ε is the random component, while i and j index decision-maker and choice, respectively.

This framework applies straightforwardly to a patient's choice of hospitals. Consider a patient i who becomes ill with condition c. Needing care, the patient chooses the specific hospital h from the set of available hospitals H (h = 1, ..., N) based on the expected utility of going and receiving treatment there. The utility patient i with condition c receives from care at hospital h can be represented as:

u_ihc = δ_ihc + ε_ihc = f(X_ic, Y_h; θ) + ε_ihc,    (1)

where X_ic are observable characteristics of the patient and their condition, Y_h are observable characteristics of the hospital's ability to treat the condition, f(·) is a function of

X and Y with parameters θ, and ε_ihc is a random shock affecting the relative likelihood that patient i chooses hospital h.

To make predictions, the economist fully specifies f(·) and the distribution of ε ex ante. She then uses historical data to estimate the parameters of f(·). To make out-of-sample predictions, the recovered θ̂ are applied to the observable characteristics to generate a predicted value of δ_ihc. When combined with knowledge of the distribution of the unobserved information, the estimates pin down the likelihood of observing different choices out-of-sample. This approach holds irrespective of whether the choice environment is changing or not. Predictive models' performance will vary depending on the reasonableness of the assumptions the economist made about f(·), including what elements should be in X and Y, and the distribution of ε.

3.2 Integrating Machine Learning into the ARUM Framework

Machine learning models have also been applied to multinomial choice problems. The different algorithms recover predictions about the likelihood of different outcomes conditional on the choice set and observables. For the most part, assumptions about the distribution of ε are not made, and are not required in order to make out-of-sample predictions so long as the choice set has not changed. However, absent modification, these models cannot be used to make predictions when a consumer's choice set differs from those observed in the training data. 2

To circumvent this problem, an econometrician may impose a distributional assumption. With this assumption, counterfactual predictions are straightforward, and the machine learning approach converges to the ARUM framework. Whereas the economist is solely responsible for identifying the elements of f(·) in the canonical framework, when using machine learning algorithms, this structure is endogenously recovered from the data.

2 Regularized versions of standard maximum likelihood models are a counterexample to this, because they involve distributional assumptions.

Once it has been recovered using historical data, the econometrician generates estimates of δ̂ for the out-of-sample data using the observable information. These are transformed into predicted probabilities for the changed environment using the assumed distribution of the error, just as in canonical implementations of the ARUM framework.

While it is clear that machine learning can easily be integrated into the ARUM framework, one still must answer the question of what distributional assumption to make in any particular context. Although ARUM frameworks have used a multiplicity of distributional assumptions on ε, the economic literature on hospital choice has almost invariably assumed that the ε are independent and identically distributed draws from the type-I extreme value distribution. While this implies that consumers with identical δs will on average exhibit similar preference patterns that are independent of irrelevant alternatives (IIA), the availability of highly detailed micro data is assumed to enable the analyst to account efficiently for heterogeneity in patients' choices. 3 Our prior assessment of many commonly applied logit models suggests this assumption is not obviously a bad one. Therefore, in our assessment of the relative performance of different machine learning models, we make the logit assumption. This implies that the probability that patient i with condition c receives care at hospital h takes the familiar form:

s_ihc = exp(δ_ihc) / Σ_{j ∈ H} exp(δ_ijc).    (2)

3 See, e.g., discussion in Ackerberg et al. (2007, p. 4185).
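Under the logit assumption, the mapping from deterministic utilities to choice probabilities, and hence to counterfactual predictions after a hospital leaves the choice set, can be sketched in a few lines of Python. The hospital names and δ values here are purely hypothetical.

```python
import math

def logit_shares(delta, choice_set):
    """Choice probabilities from equation (2): s_h = exp(delta_h) / sum_j exp(delta_j)."""
    denom = sum(math.exp(delta[j]) for j in choice_set)
    return {j: math.exp(delta[j]) / denom for j in choice_set}

# Hypothetical deterministic utilities for three hospitals (illustrative values only).
delta = {"A": 1.0, "B": 0.5, "C": 0.0}

pre = logit_shares(delta, ["A", "B", "C"])
# Counterfactual: hospital A is destroyed, so it drops out of the choice set.
post = logit_shares(delta, ["B", "C"])
```

Because of IIA, the relative shares of the surviving hospitals are unchanged: post["B"] / post["C"] equals pre["B"] / pre["C"], and the destroyed hospital's share is reallocated proportionally.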

4 Models Compared

4.1 Econometric Models of Patient Choice

In this paper, we compare the predictive performance of machine learning algorithms to three standard econometric models. One is a very rich parametric model (Inter) that performed better than the other parametric logit models in our previous work (Raval et al., 2015a). It includes interactions of hospital indicators with acuity, major diagnostic category, and time, as well as many interactions between patient characteristics and travel time.

The second is a grouping model from our previous paper: a semiparametric bin estimator (Semipar), similar to that outlined in Raval et al. (2015b), which we found to be the most accurate in most disaster settings. This model assumes one can flexibly account for consumer heterogeneity across choices by constructing small and homogeneous groups based upon a small set of patient characteristics, including zip code, age, disease acuity, and diagnosis category. It then leverages the assumption that IIA holds within groups, so that hospital choice probabilities change proportionally to the observed shares of the group when the choice set changes. In our implementation of this approach, we allow for group sizes as small as twenty, such that for some groups very few patients are used to predict substitution patterns. As discussed in Carlson et al. (2013) and Raval et al. (2015b), this flexible approach is computationally efficient despite being equivalent to including a fixed effect for each group-hospital interaction in a multinomial logit model.

The third econometric model (Indic) assumes there is no patient-level heterogeneity. In other words, everyone within the relevant area has, on average, the same preferences for each hospital. As a result, patient choices can be modeled as being proportional to aggregate market shares, and δ can be estimated using only hospital indicators as covariates. In other

words, this model could be estimated with aggregate data. As in our prior work, we use it as a reference point.

4.2 Machine Learning Models of Patient Choice

Decision Tree Models

We now examine several machine learning models that are also grouping models, like Semipar, in that they partition patients based on characteristics and estimate the same probabilities for all patients in the same group. The main difference is that Semipar defines the set of groups ex ante, while the decision tree models we examine use information on patients' choices to create the groups via sophisticated algorithms.

Estimating a decision tree model requires one to partition patients into groups. While there are many possible approaches, we examine perhaps the most popular, CART (Breiman et al., 1984). CART is a greedy algorithm: at each node, it splits the data into two groups using the split that minimizes the error criterion, and it recursively partitions the data by growing the tree through successive splits. The major advantage of the decision tree model is that it allows complex interactions between the variables considered; as mentioned above, the parametric and semiparametric logit models have to pre-specify a set of interactions.

This greedy algorithm runs the risk of overfitting the data by creating too many splits. Typically, the tree model is then pruned by removing excessive splits that likely contribute little to the out-of-sample performance of the tree; the model we estimate does so by limiting the tree to a pre-specified depth and removing splits beyond that depth. For example, we might set the depth to five and remove splits that result in more than five levels.

While decision trees are simple to understand and interpret, they are known not to

provide the best predictive power. As Breiman (2001b) notes, "While trees rate an A+ on interpretability, they are good, but not great, predictors. Give them, say, a B on prediction." Models that improve on the decision tree by averaging the predictions of many trees are known to provide much better predictive power. In this paper, we examine two such models: random forests (which Breiman (2001b) assigns an A+ for prediction) and gradient boosted trees.

A random forest model, originally due to Breiman (2001a), provides two major improvements over the basic decision tree by injecting two sources of stochasticity into the formation of trees. First, a whole forest of trees is built by estimating different tree models on bootstrap samples of the original dataset; this procedure is known as bagging. Second, the set of variables considered for splitting is random for each tree. Random forests are clearly difficult to interpret, as the model is a collection of hundreds or thousands of base decision tree models. However, they are known to perform very well for prediction. Breiman (2001b) cites considerable evidence in the machine learning literature that random forests outperform other machine learning models. In the economics literature, Bajari et al. (2015a) find that random forest models were the most accurate at predicting aggregate demand out of sample in their study.

The second approach to improving the performance of decision trees is gradient boosting (Freund and Schapire, 1995; Friedman et al., 2000; Friedman, 2001). Gradient boosting estimates the decision tree model repeatedly, with each iteration overweighting observations that were classified incorrectly in the previous iteration. For example, with linear regression, a boosting procedure would overweight observations with large residuals. The final prediction is then a weighted average across all of the different models produced.
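To fix ideas, here is a minimal sketch of squared-loss gradient boosting on a toy one-dimensional regression problem, using depth-one trees (stumps) as base learners. This illustrates the general mechanism of fitting each new tree to the current residuals, not the paper's actual GBM implementation; all data and parameter values are hypothetical.

```python
def fit_stump(x, r):
    """Best single split (depth-1 tree) minimizing squared error against residuals r."""
    best = None
    for t in sorted(set(x)):
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((ri - lm) ** 2 for ri in left) + sum((ri - rm) ** 2 for ri in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def boost(x, y, n_trees=50, shrinkage=0.1):
    """L2 boosting: each stump fits the residuals; shrinkage scales its contribution."""
    pred = [0.0] * len(y)
    stumps = []
    for _ in range(n_trees):
        r = [yi - pi for yi, pi in zip(y, pred)]  # residuals = negative gradient of squared loss
        s = fit_stump(x, r)
        stumps.append(s)
        pred = [pi + shrinkage * s(xi) for xi, pi in zip(x, pred)]
    return lambda xi: sum(shrinkage * s(xi) for s in stumps)
```

With a step-function target, the boosted predictions approach the true values geometrically at a rate governed by the shrinkage parameter.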
Boosting can be thought of as an additive expansion in a set of elementary basis functions (in our case, trees). A shrinkage parameter scales how much each new tree adds to the overall prediction, and acts as a form of regularization. Boosting is also an extremely good prediction algorithm,

and has been called the best off-the-shelf classifier in the world (Hastie et al., 2005).

Regularization

In contrast to the grouping models, we also apply a machine learning framework to the parametric models that have been used in the literature. In particular, we use an elastic net penalized regression framework to select the most relevant variables from the set of all variables used in the parametric models described in Raval et al. (2015a). The elastic net penalty can be viewed as a weighted average of the LASSO and ridge penalties. LASSO regression shrinks coefficients towards zero and is helpful in choosing among many variables to avoid overfitting; ridge regression helps to choose between potentially highly collinear variables. Since the set of variables we are considering is both large and highly correlated, we utilize this penalized framework (Hastie et al., 2005). We use the version of elastic net regularization that was developed for the conditional logit by Reid and Tibshirani (2014). 4 In the case of the conditional logit model, the penalized objective is:

log L(β) − λ ( α Σ_{k=1}^{K} |β_k| + (1/2)(1 − α) Σ_{k=1}^{K} β_k² ),

where the first sum is the LASSO penalty, the second is the ridge penalty, and λ and α are tuning parameters. We use the clogitL1 package in R to estimate this model and cross-validate to select the tuning parameter λ. We use a value of α = 0.95.

As outlined in Belloni et al. (2011) and Belloni et al. (2012), the coefficients of a LASSO estimator can be biased towards zero. Therefore, we apply the two-step approach outlined in that paper to estimate unbiased coefficients. While that paper suggests this approach for standard LASSO penalized regression, we apply it here in the case of elastic net penalized

4 While the specific framework in that paper is different than ours, the McFadden logit model can be viewed as a special case of the one described in that article.

regression on the grounds that similar concerns likely apply.

Inputs to Machine Learning Models

Estimating any of the machine learning models requires us to set the variables used for estimation and the values of the algorithm's hyperparameters. We use nine variables from the patient characteristics available in the discharge data: the patient's zip code, disease acuity (DRG weight), the Major Diagnosis Category (MDC) of the patient's diagnosis, the patient's age, an indicator for medical vs. surgical admission, an indicator for emergency admission, an indicator for whether the patient was black, and an indicator for whether the patient was female. The first four of these variables were used in the semiparametric bin model Semipar.

For the decision tree model we estimate (Tree), we use the R package rpart (Therneau et al., 2010). The two main hyperparameters are the minimum size of any node in the tree and the number of levels of the tree. We set the minimum size of the node to 20, the same value we use for Semipar and all other tree models, and use 5-fold cross validation to set the number of levels of the tree. For the random forest model we estimate (RF), we also use 5-fold cross validation to set the number of levels of the tree. We use the R package randomForest (Liaw and Wiener, 2002), based on the original work of Breiman (2001a), to estimate the model, and set the number of trees to 500. For gradient boosting (GBM), we use the R package gbm (Ridgeway, 2006). We have not been able to cross-validate the hyperparameters of this algorithm yet; we set the number of trees to 1500 and the maximum tree depth to 10, along with the shrinkage rate. We implement all of these models using the R package caret (Kuhn, 2008).
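The elastic net penalty described in the Regularization subsection can be written as a short function; the sketch below is a direct transcription of the penalty term, with λ and β values chosen purely for illustration.

```python
def elastic_net_penalty(beta, lam, alpha=0.95):
    """Penalty added to the negative log-likelihood:
    lam * (alpha * sum_k |beta_k| + 0.5 * (1 - alpha) * sum_k beta_k^2)."""
    lasso = sum(abs(b) for b in beta)          # L1 (LASSO) part
    ridge = 0.5 * sum(b * b for b in beta)     # L2 (ridge) part
    return lam * (alpha * lasso + (1 - alpha) * ridge)
```

Setting alpha = 1 recovers a pure LASSO penalty and alpha = 0 a pure ridge penalty; the paper's choice of alpha = 0.95 puts almost all of the weight on the LASSO part while retaining a small ridge component to stabilize selection among correlated variables.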

5 Prediction

We estimate all of the models in Section 4 on data from the period before the disaster, and assess each model's predictive performance on data from the period after the disaster. Each model is thus out of sample along two dimensions: first, it is estimated on an earlier time period, and second, the choice set available to patients has changed with the disaster. The change in the choice set is crucial for seeing how well each model predicts patients' choices after a major change in market structure.

5.1 Relative Performance

We compare the relative performance of the models on their predictions of aggregate market shares, aggregate diversion ratios post-disaster, and individual hospital choices for each destroyed hospital's service area.

Aggregate Shares

A simple way to assess performance on aggregate shares is to plot the time series of predictions against observed shares. In Figure 1, we do this for the Sumter disaster for four models (Semipar, Inter, GBM, and RF) and six hospitals. The observed shares are the dotted red line. The grey dot-dash vertical line depicts the quarter of the disaster. With the disaster, Sumter Regional's market share falls from about 50 percent to zero. The Semipar, Inter, and GBM models closely track one another, as well as the actual changes in market shares for most of the remaining hospitals. For example, they all get the observed market shares for Flint River approximately correct. RF, on the other hand, makes fairly different predictions. Often it performs poorly, with predictions in the pre-period that are far from observed shares. But it does correctly predict the post-disaster share for Phoebe Putney, which the other models underpredict, as well as the market share for

Palmyra Medical at the end of the period. All of the models overpredict the share going to the outside option.

Figure 1: Aggregate Market Shares, Predicted and Observed, for Sumter. Note: The red dotted line is the observed series of market shares. The grey vertical dot-dash line depicts the quarter of the disaster.

To see if these broad patterns are general, we examine the performance of all of the models across all of the destroyed hospitals using the criterion of root mean squared error (RMSE). At the aggregate level, the RMSE is defined as:

RMSE = sqrt( (1/N_J) Σ_j (y_j − ŷ_j)² ).

Here y_j is the share of alternative j, ŷ_j the model prediction, and N_J the total number of alternatives. To look at relative differences across models, we examine the percent improvement in

Figure 2: Relative Improvement in RMSE of Aggregate Predictions. (a) Aggregate Share; (b) Aggregate Diversion Ratio. Note: Improvement is the percentage improvement in RMSE for each model over the Indic model. Parametric models are circles, semiparametric models are triangles, and machine learning models are diamonds.

RMSE for each model over the baseline of the Indic model. The Indic model provides a useful baseline, as it is a simple model that only requires data on market shares. We define the percent improvement as:

1 − RMSE_Model / RMSE_Indic.

Our results are shown in Figure 2a, which depicts the relative improvement in RMSE for each destroyed hospital's service area and model in the period after the disaster. Each row is a different destroyed hospital's service area. The models are distinguished both by color and by shape, with the parametric models as circles, the semiparametric model as a triangle, and the three machine learning tree models as diamonds.

As expected, while the decision tree model Tree does not perform badly, it is never one of the best models at prediction. It always predicts worse than Semipar, with Semipar between 0.5 and 10 percentage points better across service areas. Surprisingly, the random forest model performs much worse than Tree. It has much higher RMSE than the other models in all cases except Sumter, performing between 50 and 300 percent worse than Indic. For Sumter, however, it outperforms all of the models. This poor performance is most likely due to the way RF predicts: the random forest model does not average the probabilities across trees; instead, it constructs probabilities from each tree's class prediction. This procedure likely underweights probabilities for small hospitals, which may have a small probability of choice but never be any tree's class prediction.

Across all of the hospitals, the differences between the parametric model Inter, the semiparametric model Semipar, and the gradient boosting model GBM are relatively small. For example, Semipar ranges from 9 percentage points worse than GBM to 12 percentage points better across the hospital service areas. However, GBM is usually the best of these models.
Of these models, GBM is the best model in four cases, Semipar is the best in one case, and RF in one case, although all models underperform our baseline model, Indic, for Bellevue.
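The text attributes RF's poor share predictions to its vote-based probability aggregation. A minimal numerical sketch (the per-tree probabilities below are hypothetical) contrasts vote-share aggregation with averaging each tree's probabilities:

```python
import numpy as np

# Hypothetical per-tree probabilities for one patient over three
# hospitals (columns); each row is one tree. Hospital 2 always gets
# some probability but is never any tree's top choice.
tree_probs = np.array([
    [0.60, 0.25, 0.15],
    [0.55, 0.30, 0.15],
    [0.30, 0.55, 0.15],
    [0.50, 0.35, 0.15],
])

# Averaging the trees' probabilities keeps mass on the small hospital.
avg_probs = tree_probs.mean(axis=0)

# Vote-based aggregation, as described in the text: each tree votes for
# its highest-probability class, and a hospital's "probability" is the
# fraction of trees voting for it.
votes = tree_probs.argmax(axis=1)
vote_probs = np.bincount(votes, minlength=3) / len(tree_probs)

print(avg_probs)   # hospital 2 keeps probability 0.15
print(vote_probs)  # hospital 2 gets probability 0 under vote aggregation
```

The small hospital's choice probability is driven to exactly zero under vote aggregation, consistent with the underweighting described above.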

These results illustrate the extent to which models match the levels of consumer choice probabilities before and after a choice was eliminated. However, in many applications, the change in consumers' choice probabilities for different options after removing an object from the choice set is itself the object of interest, such as the diversion ratio referred to in Garmon (2016) and the Horizontal Merger Guidelines. Therefore, we examine the RMSE of the aggregate diversion ratio following the disaster. We define the aggregate diversion ratio for hospital j as (y_{j,1} − y_{j,0}) / y_{dest,0}, where y_{j,1} is the share of hospital j in the period after the disaster, y_{j,0} the share of hospital j before the disaster, and y_{dest,0} the share of the destroyed hospital. Assuming that all changes in market shares after the disaster are due to the closure of the destroyed hospital, the diversion ratio tells us the fraction of the destroyed hospital's patients that went to hospital j. For the New York hospitals, the denominator of the diversion ratio includes all destroyed hospitals in the choice set.

Figure 2b depicts the relative improvement in RMSE for each model over Indic. We see a very different picture for aggregate diversions. While the Tree model again slightly underperforms Semipar, the random forest model does much better on diversion ratios than it did on shares; its earlier poor performance stems primarily from missing the levels of shares. RF is the best model for aggregate diversions in three cases, and is substantially better than the next best model in all three of those cases. It only performs badly for Bellevue, where it is the worst model, performing about 50% worse than Indic. In the other three cases, GBM is the best model. Thus, the decision tree based machine learning models conclusively beat the parametric and semiparametric logit models for aggregate diversions.
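The aggregate diversion ratio defined above is straightforward to compute from pre- and post-period shares. A minimal sketch, using hypothetical share values:

```python
# Hypothetical pre- and post-disaster market shares; "destroyed" is the
# closed hospital's share.
shares_pre  = {"A": 0.50, "B": 0.30, "destroyed": 0.20}
shares_post = {"A": 0.62, "B": 0.38, "destroyed": 0.00}

def diversion_ratio(j, pre, post, destroyed="destroyed"):
    """(y_{j,1} - y_{j,0}) / y_{dest,0}: the fraction of the destroyed
    hospital's patients inferred to have gone to hospital j."""
    return (post[j] - pre[j]) / pre[destroyed]

print(round(diversion_ratio("A", shares_pre, shares_post), 6))  # 0.6
print(round(diversion_ratio("B", shares_pre, shares_post), 6))  # 0.4
```

Under the maintained assumption that all post-disaster share changes are due to the closure, the diversions across surviving hospitals sum to one.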

5.1.2 Individual Predictions

Since the shape of demand is determined by individual heterogeneous consumers, predictions of individual choice are key for assessing welfare in differentiated product markets. Figure 3 depicts the percent improvement over Indic for all models and across all of the hospitals for individual choices. We again measure model performance using RMSE, although we found in our previous paper general agreement across alternative performance metrics.5 The gradient boosting model almost always performs the best, on average, across all of our experimental settings: GBM performs the best for all of the destroyed hospitals except Sumter, for which it is second best after Inter. Across hospitals, GBM is 2 to 8 percentage points better than Semipar, and between 4 and 14 percentage points better than Indic. The other tree models Tree and RF are always clearly worse than GBM; Semipar is better than RF in five cases and better than Tree in four. Again, the Tree model is better than RF in half of the cases, indicating that the probability estimation procedure of RF may understate probabilities for hospitals that patients are relatively unlikely to visit.

In most situations, researchers will not have access to natural experiments like ours in order to assess models, but could use in-sample model performance to evaluate them. We examine whether in-sample performance can provide a good guide to out-of-sample performance in Figure 4. For each of the destroyed hospitals, we compare each model's performance for individual predictions in the period before the disaster to its performance after the disaster. The blue line is the linear best-fit line across the models. Unlike our results with parametric and semiparametric logit models, the performance of a model before the disaster is not necessarily a good guide to its performance afterwards; two of the linear relationships are not even upward sloping. The main culprit is RF, as it tends to be in the lower right corner of each figure, performing much better in the pre-period than it does in the post-period. This is consistent with the hypothesis that RF overfits the data. The GBM model, on the other hand, does not appear to be overfitting. However, all of the models do tend to do worse compared to Indic in the post-period than they did in the pre-period.

5 These alternative metrics are Mean Absolute Error, zero-one loss based on whether the patient went to the choice with the highest probability, and relative entropy (a log-likelihood based statistic).

Figure 3: Relative Improvement in RMSE of Individual Predictions. Note: Improvement is the percentage improvement in RMSE for each model over the Indic model.

Figure 4: Relative Improvement in RMSE of Individual Predictions, Post-Period vs. Pre-Period. Note: Improvement is the percentage improvement in RMSE for each model over the Indic model.

5.2 Prediction Under a Changing Environment

The above results demonstrate that, on average, the flexible machine learning models tend to predict very well after the disaster-induced change in the choice set. However, it is possible that this is because we include in our estimates many patients whose preferred hospital was unaffected by the disaster, so that the destruction of a non-preferred hospital had no impact on their choices. The greater the number of patients in our calculations that prefer a non-destroyed hospital, the more our out-of-sample validation resembles more traditional split-the-sample validation. In that environment, the models' flexibility may reflect a type of overfitting that delivers good predictions in the existing choice environment, but fails at extrapolations out of that environment.

We focus on the patients who were more likely to experience the elimination of their preferred hospital following the natural disaster by examining patients whose characteristics place them in bins with a greater share of discharges from the destroyed hospital in the pre-disaster period. We first calculate the RMSE for each bin produced by Semipar and examine how bin-level performance varies with the bin's share of the destroyed hospital; Figure 5 depicts this relationship for the Semipar model for Sumter and NYU. The size of each point is proportional to the number of patients in its bin.

Figure 5: Bin Level RMSE by Destroyed Hospital Share for the Semipar Model. (a) Sumter; (b) NYU. Note: Each point is the RMSE for a particular bin, with its size proportional to the number of patients in the bin; the blue solid line is the loess trend, weighting each bin by its number of patients.

For Sumter, the average RMSE increases as the share of the destroyed hospital increases, flattens out, and then increases again. For NYU, the average RMSE about doubles when going from the lowest pre-disaster share to the highest pre-disaster share. This pattern is intuitive: when there is a change in the choice environment, the models generally do not predict as well.

While we should not expect the models to perform as well when there is a change in the choice environment, some models may perform relatively better than others. Therefore, we examine relative performance by plotting the loess trend for each model across the bins. Figure 6 depicts these graphs for all of the hospitals. Except for Sumter, where Inter is the best model for very low shares of the destroyed hospital, GBM is always the best model for low shares of the destroyed hospital. For Moore and St. John's, GBM is the best model throughout; for the other disasters, the parametric logit model performs better at high shares of the destroyed hospital, although the share cutoff above which Inter is better is typically above 40 percent. This may explain why Inter outperforms GBM at Sumter; Sumter is the only disaster where the overall pre-disaster share of the destroyed hospital is about 50%. Except for Moore, Inter always outperforms RF and Tree.

Figure 6: Bin Level RMSE by Destroyed Hospital Share for All Models. Note: Each line is the loess trend for a different model, weighting each bin by its number of patients in the pre-period.

5.3 Model Combination

So far, we have examined the performance of each model separately. However, one major finding of machine learning is that ensembles of models can perform better than any individual model (Van der Laan et al., 2007). In this study, both GBM and RF are already combinations of hundreds of base learners and perform very well. These findings suggest that combining the predictions from multiple models may lead to better predictions of behavior than using any single model.

While there are several ways to combine models, we apply a simple regression-based approach developed in the literature on optimally combining macroeconomic forecasts (Timmermann, 2006). To apply the method to our context, we treat each patient as an observation and regress observed patient behavior on the predictions from all of the models. We constrain the coefficients on the models' predictions to be non-negative and to sum to one. Thus, each coefficient in the regression can be interpreted as a model weight, and many models will be given zero weight. We perform this analysis separately for each disaster, which enables us to see the variation in our findings across the different settings. The regression framework implicitly deals with the correlations in predictions across models: if two models are very highly correlated but one is a better predictor than the other, only the better of the two models might receive weight in the optimal model combination.

Formally, we regress each patient's choice of hospital on the predicted probabilities from all of the models in the period after the disaster, without including a constant, as below:

y_ih = β_Semipar ŷ_ih^Semipar + ... + β_RF ŷ_ih^RF + ε_ih

where y_ih is the observed choice for patient i and hospital h and ŷ_ih^Semipar is the predicted probability for patient i and hospital h from Semipar. We include all of the parametric logit models tested in Raval et al. (2015a) as well as the machine learning decision tree models tested in this paper. Table II displays the model weights from these regressions for all models with positive weight in some experiment.

We highlight three major findings. First, there is no one preferred model. Within a given disaster, no single model receives all of the weight; the largest weight any model receives is 61%.

Table II: Model Weights for Optimal Model Combination (rows: All Parametric Logit Models, RF, GBM, Tree, Semipar; columns: Sumter, Moore, NYU, Coney, Bellevue, St. John's, Average). Note: The second through seventh columns provide the model weights for the optimal model combination for each experiment's service area in the period after the disaster. The last column provides the average weight for each model across the different experiments.

On average, two of the machine learning models receive about three-quarters of the weight, with GBM receiving almost half, at 48%, and RF receiving 23%. All of the parametric logit models combined receive about one-fourth of the weight. Thus, there appears to be a role for grading models like the parametric logits as well as grouping models like GBM and RF in optimal prediction. Second, the Semipar model appears to be dominated by the decision tree models; it receives zero weight on average, and its highest share is 2% for one service area. Since Semipar is a grouping model like the decision tree models, it appears not to provide extra information for prediction once the machine learning models are included. Third, the one disaster where the parametric logit models receive a majority of the weight (54%) is Sumter. For Sumter, the destroyed hospital had about half of the market share in the service area. For groups in which the destroyed hospital has a large share, choice probabilities will be based upon data from only a few individuals and so will have high variance. Grouping models like RF and GBM may not perform as well as grading models, which are more global and so use data on all patients in the market. Thus, machine learning models may underperform traditional parametric logit models in cases where the change in the choice environment is very large.
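The constrained combination regression described above can be illustrated in the two-model special case, where the non-negative, sum-to-one constraint reduces to a single weight w in [0, 1] with a closed-form solution. This is a minimal sketch with simulated, entirely hypothetical choice data, not the paper's actual estimation:

```python
import numpy as np

# Simulated data: y stacks 0/1 choice indicators; p1 and p2 are two
# models' predicted choice probabilities (all values hypothetical).
rng = np.random.default_rng(0)
p_true = rng.uniform(0.1, 0.9, size=500)
y = (rng.uniform(size=500) < p_true).astype(float)
p1 = np.clip(p_true + rng.normal(0.0, 0.05, size=500), 0.0, 1.0)  # accurate model
p2 = np.clip(p_true + rng.normal(0.0, 0.25, size=500), 0.0, 1.0)  # noisy model

# Minimizing ||y - (w*p1 + (1-w)*p2)||^2 subject to 0 <= w <= 1 has a
# closed form: project the unconstrained least-squares optimum onto [0, 1].
d = p1 - p2
w = float(np.clip(np.dot(y - p2, d) / np.dot(d, d), 0.0, 1.0))

# Because w = 0 and w = 1 are feasible, the in-sample fit of the
# combination is never worse than either model alone.
def sse(p):
    return float(np.sum((y - p) ** 2))

print(w)  # estimated weight on the first (more accurate) model
```

With more than two models, the same idea requires a constrained quadratic program over the simplex, but the interpretation of the coefficients as model weights is unchanged.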

6 Conclusion

TBD

References

Ackerberg, Daniel, C. Lanier Benkard, Steven Berry, and Ariel Pakes, "Econometric Tools for Analyzing Market Outcomes," Handbook of Econometrics, 2007, 6.
Athey, Susan, "Machine Learning and Causal Inference for Policy Evaluation," in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2015.
Athey, Susan and Guido Imbens, "The State of Applied Econometrics: Causality and Policy Evaluation," arXiv preprint.
Bajari, Patrick, Denis Nekipelov, Stephen P. Ryan, and Miaoyu Yang, "Demand Estimation with Machine Learning and Model Combination," Technical Report, National Bureau of Economic Research, 2015.
Bajari, Patrick, Denis Nekipelov, Stephen P. Ryan, and Miaoyu Yang, "Machine Learning Methods for Demand Estimation," The American Economic Review, 2015, 105 (5).
Belloni, Alexandre, Daniel Chen, Victor Chernozhukov, and Christian Hansen, "Sparse Models and Methods for Optimal Instruments with an Application to Eminent Domain," Econometrica, 2012, 80 (6).
Belloni, Alexandre, Victor Chernozhukov, and Christian Hansen, "LASSO Methods for Gaussian Instrumental Variables Models," MIT Department of Economics Working Paper.
Breiman, Leo, "Random Forests," Machine Learning, 2001, 45 (1), 5-32.
Breiman, Leo, "Statistical Modeling: The Two Cultures," Statistical Science, 2001, 16 (3).
Breiman, Leo, Jerome Friedman, Charles J. Stone, and R.A. Olshen, Classification and Regression Trees, Chapman and Hall.
Capps, Cory, David Dranove, and Mark Satterthwaite, "Competition and Market Power in Option Demand Markets," RAND Journal of Economics, 2003, 34 (4).
Carlson, Julie A., Leemore S. Dafny, Beth A. Freeborn, Pauline M. Ippolito, and Brett W. Wendling, "Economics at the FTC: Physician Acquisitions, Standard Essential Patents, and Accuracy of Credit Reporting," Review of Industrial Organization, 2013, 43 (4).
Ciliberto, Federico and David Dranove, "The Effect of Physician-Hospital Affiliations on Hospital Prices in California," Journal of Health Economics, 2006, 25 (1).
Van der Laan, Mark J., Eric C. Polley, and Alan E. Hubbard, "Super Learner," Statistical Applications in Genetics and Molecular Biology, 2007, 6 (1).
Freund, Yoav and Robert E. Schapire, "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting," in European Conference on Computational Learning Theory, Springer, 1995.
Friedman, Jerome H., "Greedy Function Approximation: A Gradient Boosting Machine," Annals of Statistics, 2001.
Friedman, Jerome, Trevor Hastie, and Robert Tibshirani, "Additive Logistic Regression: A Statistical View of Boosting," The Annals of Statistics, 2000, 28 (2).
Garmon, Christopher, "The Accuracy of Hospital Merger Screening Methods," mimeo, 2016.
Gilchrist, Duncan Sheppard and Emily Glassberg Sands, "Something to Talk About: Social Spillovers in Movie Consumption," Journal of Political Economy, forthcoming.
Goel, Sharad, Justin M. Rao, and Ravi Shroff, "Personalized Risk Assessments in the Criminal Justice System," American Economic Review, May 2016, 106 (5).
Goel, Sharad, Justin M. Rao, and Ravi Shroff, "Precinct or Prejudice? Understanding Racial Disparities in New York City's Stop-and-Frisk Policy," Annals of Applied Statistics, 2016, 10 (1).
Hastie, Trevor, Robert Tibshirani, Jerome Friedman, and James Franklin, "The Elements of Statistical Learning: Data Mining, Inference and Prediction," The Mathematical Intelligencer, 2005, 27 (2).
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani, An Introduction to Statistical Learning, Vol. 112, Springer.
Kalouptsidi, Myrto, "Time to Build and Fluctuations in Bulk Shipping," The American Economic Review, 2014, 104 (2).
Kleinberg, Jon, Jens Ludwig, Sendhil Mullainathan, and Ziad Obermeyer, "Prediction Policy Problems," The American Economic Review, 2015, 105 (5).
Kuhn, Max, "Caret Package," Journal of Statistical Software, 2008, 28 (5).
Liaw, Andy and Matthew Wiener, "Classification and Regression by randomForest," R News, 2002, 2 (3).
Marschak, Jacob, "Economic Measurements for Policy and Prediction," in Economic Information, Decision, and Prediction, Springer, 1974.
McFadden, Daniel, "Econometric Models of Probabilistic Choice," in Daniel McFadden and Charles F. Manski, eds., Structural Analysis of Discrete Data and Econometric Applications, Cambridge: The MIT Press.
Nevo, Aviv and Michael D. Whinston, "Taking the Dogma out of Econometrics: Structural Modeling and Credible Inference," The Journal of Economic Perspectives, 2010.
Raval, Devesh, Ted Rosenbaum, and Nathan E. Wilson, "Industrial Reorganization: Learning about Patient Substitution Patterns from Natural Experiments," mimeo, 2015.
Raval, Devesh, Ted Rosenbaum, and Steven A. Tenn, "A Semiparametric Discrete Choice Model: An Application to Hospital Mergers," mimeo.
Reid, Stephen and Rob Tibshirani, "Regularization Paths for Conditional Logistic Regression: The clogitL1 Package," Journal of Statistical Software, 2014, 58 (12).
Ridgeway, Greg, "gbm: Generalized Boosted Regression Models," R package version, 2006, 1 (3).
Therneau, Terry M., Beth Atkinson, and Brian Ripley, "rpart: Recursive Partitioning," R package version, 2010, 3.
Timmermann, Allan, "Forecast Combinations," Handbook of Economic Forecasting, 2006, 1.
Varian, Hal R., "Big Data: New Tricks for Econometrics," The Journal of Economic Perspectives, 2014, 28 (2).
Wager, Stefan and Susan Athey, "Estimation and Inference of Heterogeneous Treatment Effects Using Random Forests," arXiv preprint.

A Disaster Timelines

In this section, we give brief narrative descriptions of the destruction in the areas surrounding the destroyed hospitals.

A.1 St. John's (Northridge Earthquake)

On January 17th, 1994, an earthquake rated 6.7 on the Richter scale hit the Los Angeles metropolitan area, 32 km northwest of Los Angeles. The earthquake killed 61 people, injured 9,000, and seriously damaged 30,000 homes. According to the USGS, the neighborhoods worst affected by the earthquake were the San Fernando Valley, Northridge, and Sherman Oaks, while Fillmore, Glendale, Santa Clarita, Santa Monica, Simi Valley, and western and central Los Angeles also suffered significant damage. Over 1,600 housing units in Santa Monica alone were damaged, at a total cost of $70 million. The earthquake damaged a number of the area's major highways; in our service area, the most important was I-10 (the Santa Monica Freeway), which passes through Santa Monica. It reopened on April 11, 1994, and by the same time many of those with damaged houses had found new housing. Santa Monica Hospital, located close to St. John's, remained open but at a reduced capacity of 178 beds, compared to 298 beds before the disaster. In July 1995, Santa Monica Hospital merged with UCLA Medical Center. St. John's reopened for inpatient services on October 3, 1994, although with only about half of its employees and inpatient beds and without its North Wing (which was razed).

Figure 7: Damage Map in Los Angeles, CA. Note: Darker green areas indicate greater earthquake intensity as measured by the Modified Mercalli Intensity (MMI); an MMI value of 7 reflects non-structural damage and a value of 8 moderate structural damage. Areas with an MMI below 7 are not colored. The zip codes included in the service area are outlined in pink. Sources: USGS ShakeMap, OSHPD Discharge Data.

A.2 Sumter (Americus Tornado)

On March 1, 2007, a tornado went through the center of the town of Americus, GA, damaging 993 houses and 217 businesses. The tornado also completely destroyed Sumter Regional Hospital. An inspection of the damage map in the text and GIS maps of destroyed structures suggests that the damage was relatively localized: the northwest part of the city was not damaged, and very few people in the service area outside of the town of Americus were affected. Despite the tornado, employment remained roughly constant in the Americus Micropolitan Statistical Area after the disaster, at 15,628 in February 2007 before the disaster and 15,551 in February 2008 one year later. While Sumter Regional slowly reintroduced some services such as urgent care, it did not reopen for inpatient admissions until April 1, 2008, in a temporary facility with 76 beds and 71,000 square feet of space. Sumter Regional subsequently merged with Phoebe Putney Hospital in October 2008, with the full merger completed on July 1. In December 2011, a new facility was built with 76 beds and 183,000 square feet of space.

A.3 NYU, Bellevue, and Coney Island (Superstorm Sandy)

Superstorm Sandy hit the New York metropolitan area on October 28th-29th, 2012. The storm caused severe localized damage and flooding, shut down the New York City Subway system, and caused many people in the area to lose electrical power. By November 5th, normal service had been restored on the subways (with minor exceptions). Major bridges reopened on October 30th

and NYC schools reopened on November 5th. By November 5th, power had been restored to 70 percent of New Yorkers, and to all New Yorkers by November 15th. FEMA damage inspection data reveal that most of the damage from Sandy occurred in areas adjacent to water. Manhattan was relatively unaffected, with even areas next to the water suffering little damage. In the Coney Island area, the tip of the island suffered more damage, but even there, most block groups suffered less than 50 percent damage. Areas on the Long Island Sound farther east of Coney Island, such as Long Beach, were much more affected.

NYU Langone Medical Center suffered about $1 billion in damage due to Sandy, with its main generators flooded. While some outpatient services reopened in early November, it only partially reopened inpatient services on December 27, 2012, including some surgical services and medical and surgical intensive care. The maternity unit and pediatrics reopened on January 14th, 2013. While NYU Langone opened an urgent care center on January 17, 2013, a true emergency room did not open until April 24, 2014, more than a year later. Bellevue Hospital Center reopened limited outpatient services on November 19th, 2012. However, Bellevue did not fully reopen inpatient services until February 7th, 2013. Coney Island Hospital opened an urgent care center by December 3, 2012, but patients were not admitted as inpatients. It had reopened ambulance service and most of its inpatient beds by February 20th, 2013, although at that time trauma care and labor and delivery remained closed. The labor and delivery unit did not reopen until June 13th, 2013.

A.4 Moore (Moore Tornado)

A tornado went through the Oklahoma City suburb of Moore on May 20, 2013. The tornado destroyed two schools and more than 1,000 buildings (damaging more than 1,200 more) in the area of Moore and killed 24 people. Interstate 35 was briefly closed for a few hours due to the storm. Maps of the tornado's path demonstrate that while some areas were severely damaged, nearby areas were relatively unaffected.

Figure 8: Damage Map in Manhattan, NY. Note: Green shading indicates flood-affected areas. The zip codes included in the service area for Bellevue are outlined in pink. Sources: FEMA, NY Discharge Data.

Figure 9: Damage Map in Coney Island, NY. Note: Green shading indicates flood-affected areas. The zip codes included in the service area are outlined in pink. Sources: FEMA, NY Discharge Data.

Figure 10: Damage Map in Moore, OK. Note: The green area indicates the damage path of the tornado. The zip codes included in the service area are outlined in pink. Sources: NOAA, OK Discharge Data.

Emergency services, but not inpatient admissions, temporarily reopened at Moore Medical Center on December 2, 2013. Groundbreaking for a new hospital took place on May 20, 2014, with a tentative opening in the fall.

B Dataset Construction

For each dataset, we drop newborns, transfers, and court-ordered admissions. Newborns do not decide which hospital to be born in (admissions of their mothers, who do, are included in the dataset); similarly, government officials or physicians, rather than patients, may choose the hospital for court-ordered admissions and transfers. We drop diseases of the eye, psychological diseases, and rehabilitation based on Major Diagnostic Category (MDC) codes, as patients with these diseases may have other options for treatment beyond general hospitals. We also drop patients whose MDC code is uncategorized (0), and neo-natal patients above age one. We also exclude patients who are missing gender or an indicator for whether the admission is for a Medical Diagnosis Related Group (DRG). Finally, we remove patients not going to General Acute Care hospitals.

For each disaster, we estimate models on the period prior to the disaster and then validate them on the period after the disaster. We omit the month of the disaster from either period.
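The exclusion rules described in this appendix can be sketched as a simple record filter. The field names and MDC codes below are hypothetical stand-ins for the actual discharge-data variables:

```python
# Illustrative sketch of the sample-exclusion rules; field names and
# the specific MDC codes are assumptions, not the paper's actual codes.
EXCLUDED_ADMISSION_TYPES = {"newborn", "transfer", "court_ordered"}
EXCLUDED_MDC = {0, 2, 19, 23}  # uncategorized, eye, psychological, rehab (codes assumed)
NEONATAL_MDC = 15              # code assumed

def keep_patient(p):
    """Return True if a discharge record stays in the estimation sample."""
    if p["admission_type"] in EXCLUDED_ADMISSION_TYPES:
        return False
    if p["mdc"] in EXCLUDED_MDC:
        return False
    if p["mdc"] == NEONATAL_MDC and p["age"] > 1:  # neo-natal patients above age one
        return False
    if p["gender"] is None or p["medical_drg"] is None:  # missing fields
        return False
    if not p["general_acute_care"]:  # not a General Acute Care hospital
        return False
    return True

patients = [
    {"admission_type": "routine", "mdc": 5, "age": 40, "gender": "F",
     "medical_drg": True, "general_acute_care": True},
    {"admission_type": "transfer", "mdc": 5, "age": 40, "gender": "M",
     "medical_drg": True, "general_acute_care": True},
]
print([keep_patient(p) for p in patients])  # [True, False]
```

The order of the checks does not matter here; each rule independently removes a record from the sample.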


More information

Comparison of discrimination methods for the classification of tumors using gene expression data

Comparison of discrimination methods for the classification of tumors using gene expression data Comparison of discrimination methods for the classification of tumors using gene expression data Sandrine Dudoit, Jane Fridlyand 2 and Terry Speed 2,. Mathematical Sciences Research Institute, Berkeley

More information

THE USE OF MULTIVARIATE ANALYSIS IN DEVELOPMENT THEORY: A CRITIQUE OF THE APPROACH ADOPTED BY ADELMAN AND MORRIS A. C. RAYNER

THE USE OF MULTIVARIATE ANALYSIS IN DEVELOPMENT THEORY: A CRITIQUE OF THE APPROACH ADOPTED BY ADELMAN AND MORRIS A. C. RAYNER THE USE OF MULTIVARIATE ANALYSIS IN DEVELOPMENT THEORY: A CRITIQUE OF THE APPROACH ADOPTED BY ADELMAN AND MORRIS A. C. RAYNER Introduction, 639. Factor analysis, 639. Discriminant analysis, 644. INTRODUCTION

More information

Analysis of Rheumatoid Arthritis Data using Logistic Regression and Penalized Approach

Analysis of Rheumatoid Arthritis Data using Logistic Regression and Penalized Approach University of South Florida Scholar Commons Graduate Theses and Dissertations Graduate School November 2015 Analysis of Rheumatoid Arthritis Data using Logistic Regression and Penalized Approach Wei Chen

More information

Russian Journal of Agricultural and Socio-Economic Sciences, 3(15)

Russian Journal of Agricultural and Socio-Economic Sciences, 3(15) ON THE COMPARISON OF BAYESIAN INFORMATION CRITERION AND DRAPER S INFORMATION CRITERION IN SELECTION OF AN ASYMMETRIC PRICE RELATIONSHIP: BOOTSTRAP SIMULATION RESULTS Henry de-graft Acquah, Senior Lecturer

More information

KARUN ADUSUMILLI OFFICE ADDRESS, TELEPHONE & Department of Economics

KARUN ADUSUMILLI OFFICE ADDRESS, TELEPHONE &   Department of Economics LONDON SCHOOL OF ECONOMICS & POLITICAL SCIENCE Placement Officer: Professor Wouter Den Haan +44 (0)20 7955 7669 w.denhaan@lse.ac.uk Placement Assistant: Mr John Curtis +44 (0)20 7955 7545 j.curtis@lse.ac.uk

More information

Citation for published version (APA): Ebbes, P. (2004). Latent instrumental variables: a new approach to solve for endogeneity s.n.

Citation for published version (APA): Ebbes, P. (2004). Latent instrumental variables: a new approach to solve for endogeneity s.n. University of Groningen Latent instrumental variables Ebbes, P. IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please check the document

More information

Sawtooth Software. MaxDiff Analysis: Simple Counting, Individual-Level Logit, and HB RESEARCH PAPER SERIES. Bryan Orme, Sawtooth Software, Inc.

Sawtooth Software. MaxDiff Analysis: Simple Counting, Individual-Level Logit, and HB RESEARCH PAPER SERIES. Bryan Orme, Sawtooth Software, Inc. Sawtooth Software RESEARCH PAPER SERIES MaxDiff Analysis: Simple Counting, Individual-Level Logit, and HB Bryan Orme, Sawtooth Software, Inc. Copyright 009, Sawtooth Software, Inc. 530 W. Fir St. Sequim,

More information

The Logic of Data Analysis Using Statistical Techniques M. E. Swisher, 2016

The Logic of Data Analysis Using Statistical Techniques M. E. Swisher, 2016 The Logic of Data Analysis Using Statistical Techniques M. E. Swisher, 2016 This course does not cover how to perform statistical tests on SPSS or any other computer program. There are several courses

More information

A Comparison of Collaborative Filtering Methods for Medication Reconciliation

A Comparison of Collaborative Filtering Methods for Medication Reconciliation A Comparison of Collaborative Filtering Methods for Medication Reconciliation Huanian Zheng, Rema Padman, Daniel B. Neill The H. John Heinz III College, Carnegie Mellon University, Pittsburgh, PA, 15213,

More information

Some Thoughts on the Principle of Revealed Preference 1

Some Thoughts on the Principle of Revealed Preference 1 Some Thoughts on the Principle of Revealed Preference 1 Ariel Rubinstein School of Economics, Tel Aviv University and Department of Economics, New York University and Yuval Salant Graduate School of Business,

More information

RISK PREDICTION MODEL: PENALIZED REGRESSIONS

RISK PREDICTION MODEL: PENALIZED REGRESSIONS RISK PREDICTION MODEL: PENALIZED REGRESSIONS Inspired from: How to develop a more accurate risk prediction model when there are few events Menelaos Pavlou, Gareth Ambler, Shaun R Seaman, Oliver Guttmann,

More information

arxiv: v3 [stat.ml] 27 Mar 2018

arxiv: v3 [stat.ml] 27 Mar 2018 ATTACKING THE MADRY DEFENSE MODEL WITH L 1 -BASED ADVERSARIAL EXAMPLES Yash Sharma 1 and Pin-Yu Chen 2 1 The Cooper Union, New York, NY 10003, USA 2 IBM Research, Yorktown Heights, NY 10598, USA sharma2@cooper.edu,

More information

Part [2.1]: Evaluation of Markers for Treatment Selection Linking Clinical and Statistical Goals

Part [2.1]: Evaluation of Markers for Treatment Selection Linking Clinical and Statistical Goals Part [2.1]: Evaluation of Markers for Treatment Selection Linking Clinical and Statistical Goals Patrick J. Heagerty Department of Biostatistics University of Washington 174 Biomarkers Session Outline

More information

An Introduction to Bayesian Statistics

An Introduction to Bayesian Statistics An Introduction to Bayesian Statistics Robert Weiss Department of Biostatistics UCLA Fielding School of Public Health robweiss@ucla.edu Sept 2015 Robert Weiss (UCLA) An Introduction to Bayesian Statistics

More information

Estimating population average treatment effects from experiments with noncompliance

Estimating population average treatment effects from experiments with noncompliance Estimating population average treatment effects from experiments with noncompliance Kellie Ottoboni Jason Poulos arxiv:1901.02991v1 [stat.me] 10 Jan 2019 January 11, 2019 Abstract This paper extends a

More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014 Exam policy: This exam allows two one-page, two-sided cheat sheets (i.e. 4 sides); No other materials. Time: 2 hours. Be sure to write

More information

The Prevalence of HIV in Botswana

The Prevalence of HIV in Botswana The Prevalence of HIV in Botswana James Levinsohn Yale University and NBER Justin McCrary University of California, Berkeley and NBER January 6, 2010 Abstract This paper implements five methods to correct

More information

Progress in Risk Science and Causality

Progress in Risk Science and Causality Progress in Risk Science and Causality Tony Cox, tcoxdenver@aol.com AAPCA March 27, 2017 1 Vision for causal analytics Represent understanding of how the world works by an explicit causal model. Learn,

More information

An Improved Algorithm To Predict Recurrence Of Breast Cancer

An Improved Algorithm To Predict Recurrence Of Breast Cancer An Improved Algorithm To Predict Recurrence Of Breast Cancer Umang Agrawal 1, Ass. Prof. Ishan K Rajani 2 1 M.E Computer Engineer, Silver Oak College of Engineering & Technology, Gujarat, India. 2 Assistant

More information

Positive and Unlabeled Relational Classification through Label Frequency Estimation

Positive and Unlabeled Relational Classification through Label Frequency Estimation Positive and Unlabeled Relational Classification through Label Frequency Estimation Jessa Bekker and Jesse Davis Computer Science Department, KU Leuven, Belgium firstname.lastname@cs.kuleuven.be Abstract.

More information

Positive and Unlabeled Relational Classification through Label Frequency Estimation

Positive and Unlabeled Relational Classification through Label Frequency Estimation Positive and Unlabeled Relational Classification through Label Frequency Estimation Jessa Bekker and Jesse Davis Computer Science Department, KU Leuven, Belgium firstname.lastname@cs.kuleuven.be Abstract.

More information

Write your identification number on each paper and cover sheet (the number stated in the upper right hand corner on your exam cover).

Write your identification number on each paper and cover sheet (the number stated in the upper right hand corner on your exam cover). STOCKHOLM UNIVERSITY Department of Economics Course name: Empirical methods 2 Course code: EC2402 Examiner: Per Pettersson-Lidbom Number of credits: 7,5 credits Date of exam: Sunday 21 February 2010 Examination

More information

White Paper Estimating Complex Phenotype Prevalence Using Predictive Models

White Paper Estimating Complex Phenotype Prevalence Using Predictive Models White Paper 23-12 Estimating Complex Phenotype Prevalence Using Predictive Models Authors: Nicholas A. Furlotte Aaron Kleinman Robin Smith David Hinds Created: September 25 th, 2015 September 25th, 2015

More information

3. Model evaluation & selection

3. Model evaluation & selection Foundations of Machine Learning CentraleSupélec Fall 2016 3. Model evaluation & selection Chloé-Agathe Azencot Centre for Computational Biology, Mines ParisTech chloe-agathe.azencott@mines-paristech.fr

More information

Observational Category Learning as a Path to More Robust Generative Knowledge

Observational Category Learning as a Path to More Robust Generative Knowledge Observational Category Learning as a Path to More Robust Generative Knowledge Kimery R. Levering (kleveri1@binghamton.edu) Kenneth J. Kurtz (kkurtz@binghamton.edu) Department of Psychology, Binghamton

More information

Statistical Techniques. Masoud Mansoury and Anas Abulfaraj

Statistical Techniques. Masoud Mansoury and Anas Abulfaraj Statistical Techniques Masoud Mansoury and Anas Abulfaraj What is Statistics? https://www.youtube.com/watch?v=lmmzj7599pw The definition of Statistics The practice or science of collecting and analyzing

More information

Score Tests of Normality in Bivariate Probit Models

Score Tests of Normality in Bivariate Probit Models Score Tests of Normality in Bivariate Probit Models Anthony Murphy Nuffield College, Oxford OX1 1NF, UK Abstract: A relatively simple and convenient score test of normality in the bivariate probit model

More information

Chapter 11: Advanced Remedial Measures. Weighted Least Squares (WLS)

Chapter 11: Advanced Remedial Measures. Weighted Least Squares (WLS) Chapter : Advanced Remedial Measures Weighted Least Squares (WLS) When the error variance appears nonconstant, a transformation (of Y and/or X) is a quick remedy. But it may not solve the problem, or it

More information

Obesity and health care costs: Some overweight considerations

Obesity and health care costs: Some overweight considerations Obesity and health care costs: Some overweight considerations Albert Kuo, Ted Lee, Querida Qiu, Geoffrey Wang May 14, 2015 Abstract This paper investigates obesity s impact on annual medical expenditures

More information

Instrumental Variables Estimation: An Introduction

Instrumental Variables Estimation: An Introduction Instrumental Variables Estimation: An Introduction Susan L. Ettner, Ph.D. Professor Division of General Internal Medicine and Health Services Research, UCLA The Problem The Problem Suppose you wish to

More information

Detecting Anomalous Patterns of Care Using Health Insurance Claims

Detecting Anomalous Patterns of Care Using Health Insurance Claims Partially funded by National Science Foundation grants IIS-0916345, IIS-0911032, and IIS-0953330, and funding from Disruptive Health Technology Institute. We are also grateful to Highmark Health for providing

More information

Sawtooth Software. The Number of Levels Effect in Conjoint: Where Does It Come From and Can It Be Eliminated? RESEARCH PAPER SERIES

Sawtooth Software. The Number of Levels Effect in Conjoint: Where Does It Come From and Can It Be Eliminated? RESEARCH PAPER SERIES Sawtooth Software RESEARCH PAPER SERIES The Number of Levels Effect in Conjoint: Where Does It Come From and Can It Be Eliminated? Dick Wittink, Yale University Joel Huber, Duke University Peter Zandan,

More information

Glossary From Running Randomized Evaluations: A Practical Guide, by Rachel Glennerster and Kudzai Takavarasha

Glossary From Running Randomized Evaluations: A Practical Guide, by Rachel Glennerster and Kudzai Takavarasha Glossary From Running Randomized Evaluations: A Practical Guide, by Rachel Glennerster and Kudzai Takavarasha attrition: When data are missing because we are unable to measure the outcomes of some of the

More information

Testing the Predictability of Consumption Growth: Evidence from China

Testing the Predictability of Consumption Growth: Evidence from China Auburn University Department of Economics Working Paper Series Testing the Predictability of Consumption Growth: Evidence from China Liping Gao and Hyeongwoo Kim Georgia Southern University and Auburn

More information

Model reconnaissance: discretization, naive Bayes and maximum-entropy. Sanne de Roever/ spdrnl

Model reconnaissance: discretization, naive Bayes and maximum-entropy. Sanne de Roever/ spdrnl Model reconnaissance: discretization, naive Bayes and maximum-entropy Sanne de Roever/ spdrnl December, 2013 Description of the dataset There are two datasets: a training and a test dataset of respectively

More information

Appendix III Individual-level analysis

Appendix III Individual-level analysis Appendix III Individual-level analysis Our user-friendly experimental interface makes it possible to present each subject with many choices in the course of a single experiment, yielding a rich individual-level

More information

Hospital Readmission Ratio

Hospital Readmission Ratio Methodological paper Hospital Readmission Ratio Methodological report of 2015 model 2017 Jan van der Laan Corine Penning Agnes de Bruin CBS Methodological paper 2017 1 Index 1. Introduction 3 1.1 Indicators

More information

Quasi-experimental analysis Notes for "Structural modelling".

Quasi-experimental analysis Notes for Structural modelling. Quasi-experimental analysis Notes for "Structural modelling". Martin Browning Department of Economics, University of Oxford Revised, February 3 2012 1 Quasi-experimental analysis. 1.1 Modelling using quasi-experiments.

More information

Computer Age Statistical Inference. Algorithms, Evidence, and Data Science. BRADLEY EFRON Stanford University, California

Computer Age Statistical Inference. Algorithms, Evidence, and Data Science. BRADLEY EFRON Stanford University, California Computer Age Statistical Inference Algorithms, Evidence, and Data Science BRADLEY EFRON Stanford University, California TREVOR HASTIE Stanford University, California ggf CAMBRIDGE UNIVERSITY PRESS Preface

More information

Advanced Bayesian Models for the Social Sciences. TA: Elizabeth Menninga (University of North Carolina, Chapel Hill)

Advanced Bayesian Models for the Social Sciences. TA: Elizabeth Menninga (University of North Carolina, Chapel Hill) Advanced Bayesian Models for the Social Sciences Instructors: Week 1&2: Skyler J. Cranmer Department of Political Science University of North Carolina, Chapel Hill skyler@unc.edu Week 3&4: Daniel Stegmueller

More information

Following in Your Father s Footsteps: A Note on the Intergenerational Transmission of Income between Twin Fathers and their Sons

Following in Your Father s Footsteps: A Note on the Intergenerational Transmission of Income between Twin Fathers and their Sons D I S C U S S I O N P A P E R S E R I E S IZA DP No. 5990 Following in Your Father s Footsteps: A Note on the Intergenerational Transmission of Income between Twin Fathers and their Sons Vikesh Amin Petter

More information

PARTIAL IDENTIFICATION OF PROBABILITY DISTRIBUTIONS. Charles F. Manski. Springer-Verlag, 2003

PARTIAL IDENTIFICATION OF PROBABILITY DISTRIBUTIONS. Charles F. Manski. Springer-Verlag, 2003 PARTIAL IDENTIFICATION OF PROBABILITY DISTRIBUTIONS Charles F. Manski Springer-Verlag, 2003 Contents Preface vii Introduction: Partial Identification and Credible Inference 1 1 Missing Outcomes 6 1.1.

More information

Rise of the Machines

Rise of the Machines Rise of the Machines Statistical machine learning for observational studies: confounding adjustment and subgroup identification Armand Chouzy, ETH (summer intern) Jason Wang, Celgene PSI conference 2018

More information

Challenges of Automated Machine Learning on Causal Impact Analytics for Policy Evaluation

Challenges of Automated Machine Learning on Causal Impact Analytics for Policy Evaluation Challenges of Automated Machine Learning on Causal Impact Analytics for Policy Evaluation Prof. (Dr.) Yuh-Jong Hu and Shu-Wei Huang hu@cs.nccu.edu.tw, wei.90211@gmail.com Emerging Network Technology (ENT)

More information

Exploring the Influence of Particle Filter Parameters on Order Effects in Causal Learning

Exploring the Influence of Particle Filter Parameters on Order Effects in Causal Learning Exploring the Influence of Particle Filter Parameters on Order Effects in Causal Learning Joshua T. Abbott (joshua.abbott@berkeley.edu) Thomas L. Griffiths (tom griffiths@berkeley.edu) Department of Psychology,

More information

Hierarchical Bayesian Modeling of Individual Differences in Texture Discrimination

Hierarchical Bayesian Modeling of Individual Differences in Texture Discrimination Hierarchical Bayesian Modeling of Individual Differences in Texture Discrimination Timothy N. Rubin (trubin@uci.edu) Michael D. Lee (mdlee@uci.edu) Charles F. Chubb (cchubb@uci.edu) Department of Cognitive

More information

Study of cigarette sales in the United States Ge Cheng1, a,

Study of cigarette sales in the United States Ge Cheng1, a, 2nd International Conference on Economics, Management Engineering and Education Technology (ICEMEET 2016) 1Department Study of cigarette sales in the United States Ge Cheng1, a, of pure mathematics and

More information

NEW METHODS FOR SENSITIVITY TESTS OF EXPLOSIVE DEVICES

NEW METHODS FOR SENSITIVITY TESTS OF EXPLOSIVE DEVICES NEW METHODS FOR SENSITIVITY TESTS OF EXPLOSIVE DEVICES Amit Teller 1, David M. Steinberg 2, Lina Teper 1, Rotem Rozenblum 2, Liran Mendel 2, and Mordechai Jaeger 2 1 RAFAEL, POB 2250, Haifa, 3102102, Israel

More information

PRINCIPLES OF EFFECTIVE MACHINE LEARNING APPLICATIONS IN REAL-WORLD EVIDENCE

PRINCIPLES OF EFFECTIVE MACHINE LEARNING APPLICATIONS IN REAL-WORLD EVIDENCE PRINCIPLES OF EFFECTIVE MACHINE LEARNING APPLICATIONS IN REAL-WORLD EVIDENCE Prepared and Presented by: Gorana Capkun-Niggli, PhD, Global Head of Innovation, Health Economics and Outcomes Research, Novartis,

More information

Chapter 1. Introduction

Chapter 1. Introduction Chapter 1 Introduction 1.1 Motivation and Goals The increasing availability and decreasing cost of high-throughput (HT) technologies coupled with the availability of computational tools and data form a

More information

Supplementary Materials

Supplementary Materials Supplementary Materials July 2, 2015 1 EEG-measures of consciousness Table 1 makes explicit the abbreviations of the EEG-measures. Their computation closely follows Sitt et al. (2014) (supplement). PE

More information

Measurement and meaningfulness in Decision Modeling

Measurement and meaningfulness in Decision Modeling Measurement and meaningfulness in Decision Modeling Brice Mayag University Paris Dauphine LAMSADE FRANCE Chapter 2 Brice Mayag (LAMSADE) Measurement theory and meaningfulness Chapter 2 1 / 47 Outline 1

More information

Practical propensity score matching: a reply to Smith and Todd

Practical propensity score matching: a reply to Smith and Todd Journal of Econometrics 125 (2005) 355 364 www.elsevier.com/locate/econbase Practical propensity score matching: a reply to Smith and Todd Rajeev Dehejia a,b, * a Department of Economics and SIPA, Columbia

More information

International Journal of Pharma and Bio Sciences A NOVEL SUBSET SELECTION FOR CLASSIFICATION OF DIABETES DATASET BY ITERATIVE METHODS ABSTRACT

International Journal of Pharma and Bio Sciences A NOVEL SUBSET SELECTION FOR CLASSIFICATION OF DIABETES DATASET BY ITERATIVE METHODS ABSTRACT Research Article Bioinformatics International Journal of Pharma and Bio Sciences ISSN 0975-6299 A NOVEL SUBSET SELECTION FOR CLASSIFICATION OF DIABETES DATASET BY ITERATIVE METHODS D.UDHAYAKUMARAPANDIAN

More information

Empirical Validation in Agent-Based Models

Empirical Validation in Agent-Based Models Empirical Validation in Agent-Based Models Giorgio Fagiolo Sant Anna School of Advanced Studies, Pisa (Italy) giorgio.fagiolo@sssup.it https://mail.sssup.it/~fagiolo Max-Planck-Institute of Economics Jena,

More information

Advanced Bayesian Models for the Social Sciences

Advanced Bayesian Models for the Social Sciences Advanced Bayesian Models for the Social Sciences Jeff Harden Department of Political Science, University of Colorado Boulder jeffrey.harden@colorado.edu Daniel Stegmueller Department of Government, University

More information

On Algorithms and Fairness

On Algorithms and Fairness On Algorithms and Fairness Jon Kleinberg Cornell University Includes joint work with Sendhil Mullainathan, Manish Raghavan, and Maithra Raghu Forming Estimates of Future Performance Estimating probability

More information

EPSE 594: Meta-Analysis: Quantitative Research Synthesis

EPSE 594: Meta-Analysis: Quantitative Research Synthesis EPSE 594: Meta-Analysis: Quantitative Research Synthesis Ed Kroc University of British Columbia ed.kroc@ubc.ca March 28, 2019 Ed Kroc (UBC) EPSE 594 March 28, 2019 1 / 32 Last Time Publication bias Funnel

More information

Syllabus.

Syllabus. Business 41903 Applied Econometrics - Spring 2018 Instructor: Christian Hansen Office: HPC 329 Phone: 773 834 1702 E-mail: chansen1@chicagobooth.edu TA: Jianfei Cao E-mail: jcao0@chicagobooth.edu Syllabus

More information

Applied Quantitative Methods II

Applied Quantitative Methods II Applied Quantitative Methods II Lecture 7: Endogeneity and IVs Klára Kaĺıšková Klára Kaĺıšková AQM II - Lecture 7 VŠE, SS 2016/17 1 / 36 Outline 1 OLS and the treatment effect 2 OLS and endogeneity 3 Dealing

More information

Modeling Sentiment with Ridge Regression

Modeling Sentiment with Ridge Regression Modeling Sentiment with Ridge Regression Luke Segars 2/20/2012 The goal of this project was to generate a linear sentiment model for classifying Amazon book reviews according to their star rank. More generally,

More information

Data and Statistics 101: Key Concepts in the Collection, Analysis, and Application of Child Welfare Data

Data and Statistics 101: Key Concepts in the Collection, Analysis, and Application of Child Welfare Data TECHNICAL REPORT Data and Statistics 101: Key Concepts in the Collection, Analysis, and Application of Child Welfare Data CONTENTS Executive Summary...1 Introduction...2 Overview of Data Analysis Concepts...2

More information

Macroeconometric Analysis. Chapter 1. Introduction

Macroeconometric Analysis. Chapter 1. Introduction Macroeconometric Analysis Chapter 1. Introduction Chetan Dave David N. DeJong 1 Background The seminal contribution of Kydland and Prescott (1982) marked the crest of a sea change in the way macroeconomists

More information

11/18/2013. Correlational Research. Correlational Designs. Why Use a Correlational Design? CORRELATIONAL RESEARCH STUDIES

11/18/2013. Correlational Research. Correlational Designs. Why Use a Correlational Design? CORRELATIONAL RESEARCH STUDIES Correlational Research Correlational Designs Correlational research is used to describe the relationship between two or more naturally occurring variables. Is age related to political conservativism? Are

More information

Human and Optimal Exploration and Exploitation in Bandit Problems

Human and Optimal Exploration and Exploitation in Bandit Problems Human and Optimal Exploration and ation in Bandit Problems Shunan Zhang (szhang@uci.edu) Michael D. Lee (mdlee@uci.edu) Miles Munro (mmunro@uci.edu) Department of Cognitive Sciences, 35 Social Sciences

More information

Applying Machine Learning Methods in Medical Research Studies

Applying Machine Learning Methods in Medical Research Studies Applying Machine Learning Methods in Medical Research Studies Daniel Stahl Department of Biostatistics and Health Informatics Psychiatry, Psychology & Neuroscience (IoPPN), King s College London daniel.r.stahl@kcl.ac.uk

More information

Supplementary appendix

Supplementary appendix Supplementary appendix This appendix formed part of the original submission and has been peer reviewed. We post it as supplied by the authors. Supplement to: Callegaro D, Miceli R, Bonvalot S, et al. Development

More information

Supplementary materials for: Executive control processes underlying multi- item working memory

Supplementary materials for: Executive control processes underlying multi- item working memory Supplementary materials for: Executive control processes underlying multi- item working memory Antonio H. Lara & Jonathan D. Wallis Supplementary Figure 1 Supplementary Figure 1. Behavioral measures of

More information

Prediction and Inference under Competing Risks in High Dimension - An EHR Demonstration Project for Prostate Cancer

Prediction and Inference under Competing Risks in High Dimension - An EHR Demonstration Project for Prostate Cancer Prediction and Inference under Competing Risks in High Dimension - An EHR Demonstration Project for Prostate Cancer Ronghui (Lily) Xu Division of Biostatistics and Bioinformatics Department of Family Medicine

More information

Analysis of Environmental Data Conceptual Foundations: En viro n m e n tal Data

Analysis of Environmental Data Conceptual Foundations: En viro n m e n tal Data Analysis of Environmental Data Conceptual Foundations: En viro n m e n tal Data 1. Purpose of data collection...................................................... 2 2. Samples and populations.......................................................

More information

G5)H/C8-)72)78)2I-,8/52& ()*+,-./,-0))12-345)6/3/782 9:-8;<;4.= J-3/ J-3/ "#&' "#% "#"% "#%$

G5)H/C8-)72)78)2I-,8/52& ()*+,-./,-0))12-345)6/3/782 9:-8;<;4.= J-3/ J-3/ #&' #% #% #%$ # G5)H/C8-)72)78)2I-,8/52& #% #$ # # &# G5)H/C8-)72)78)2I-,8/52' @5/AB/7CD J-3/ /,?8-6/2@5/AB/7CD #&' #% #$ # # '#E ()*+,-./,-0))12-345)6/3/782 9:-8;;4. @5/AB/7CD J-3/ #' /,?8-6/2@5/AB/7CD #&F #&' #% #$

More information

Propensity Score Methods for Estimating Causality in the Absence of Random Assignment: Applications for Child Care Policy Research

Propensity Score Methods for Estimating Causality in the Absence of Random Assignment: Applications for Child Care Policy Research 2012 CCPRC Meeting Methodology Presession Workshop October 23, 2012, 2:00-5:00 p.m. Propensity Score Methods for Estimating Causality in the Absence of Random Assignment: Applications for Child Care Policy

More information

EMPIRICAL STRATEGIES IN LABOUR ECONOMICS

EMPIRICAL STRATEGIES IN LABOUR ECONOMICS EMPIRICAL STRATEGIES IN LABOUR ECONOMICS University of Minho J. Angrist NIPE Summer School June 2009 This course covers core econometric ideas and widely used empirical modeling strategies. The main theoretical

More information

Assignment 4: True or Quasi-Experiment

Assignment 4: True or Quasi-Experiment Assignment 4: True or Quasi-Experiment Objectives: After completing this assignment, you will be able to Evaluate when you must use an experiment to answer a research question Develop statistical hypotheses

More information

Minimizing Uncertainty in Property Casualty Loss Reserve Estimates Chris G. Gross, ACAS, MAAA

Minimizing Uncertainty in Property Casualty Loss Reserve Estimates Chris G. Gross, ACAS, MAAA Minimizing Uncertainty in Property Casualty Loss Reserve Estimates Chris G. Gross, ACAS, MAAA The uncertain nature of property casualty loss reserves Property Casualty loss reserves are inherently uncertain.

More information

Module 14: Missing Data Concepts

Module 14: Missing Data Concepts Module 14: Missing Data Concepts Jonathan Bartlett & James Carpenter London School of Hygiene & Tropical Medicine Supported by ESRC grant RES 189-25-0103 and MRC grant G0900724 Pre-requisites Module 3

More information

DRAFT (Final) Concept Paper On choosing appropriate estimands and defining sensitivity analyses in confirmatory clinical trials

DRAFT (Final) Concept Paper On choosing appropriate estimands and defining sensitivity analyses in confirmatory clinical trials DRAFT (Final) Concept Paper On choosing appropriate estimands and defining sensitivity analyses in confirmatory clinical trials EFSPI Comments Page General Priority (H/M/L) Comment The concept to develop

More information