How Do Machine Learning Algorithms Perform in Changing Environments? Evidence from Disaster-Induced Hospital Closures

Devesh Raval (Federal Trade Commission), Ted Rosenbaum (Federal Trade Commission), Nathan E. Wilson (Federal Trade Commission)

August 24, 2018

Abstract

Researchers have found that machine learning methods can be better at prediction than econometric models. However, this finding comes from contexts where the choice environment is stable. We study the relative performance of machine learning algorithms in environments where the choice environment changes substantially, examining patients' decisions when natural disasters closed previously available hospitals. While machine learning algorithms outperform traditional econometric models in prediction even following changes in the choice set, the gain they provide shrinks for large changes in the choice set. We further show that traditional methods that impose greater structure on patients' utility functions can provide important additional information when there are major changes in the choice environment.

JEL Codes: C18, I11, L1, L41
Keywords: machine learning, hospitals, natural experiment, patient choice, prediction

The views expressed in this article are those of the authors. They do not necessarily represent those of the Federal Trade Commission or any of its Commissioners. We are grateful to Jonathan Byars, Gregory Dowd, Aaron Keller, Laura Kmitch, and Peter Nguon for their excellent research assistance. We also thank Chris Garmon, Marty Gaynor, Dan Hosken, Dave Schmidt, and participants at the 2017 IIOC, the 2018 ASHEcon Meetings, and 2018 Bank of Canada Data Science for their comments. The usual caveat applies.

1 Introduction

The proliferation of rich consumer-level datasets has led to the rise of what Breiman (2001b) calls the "algorithmic modeling culture," wherein analysts treat the statistical model as a black box and predict choices using algorithms trained on existing datasets. This theory-agnostic approach to prediction stands in sharp contrast to the traditional economic method of using theory to derive a statistical model. Nevertheless, economists have also begun to use algorithmic methods, in part because comparisons of machine learning algorithms to canonical econometric models have suggested that the former often outperform the latter for prediction. [Footnote 1: See Hastie et al. (2005) and James et al. (2013) for an introduction to machine learning; Varian (2014) provides a readable introduction for economists. Kalouptsidi (2014), Gilchrist and Sands (forthcoming), Goel et al. (2016a,b), and Athey and Imbens (2016) all provide recent applications of machine learning to economic questions, while Kleinberg et al. (2015), Bajari et al. (2015a,b), and Rose (2016) compare machine learning models to econometric models.]

Existing evaluations of algorithmic prediction have focused on settings where individuals face the same choices in the training and test data sets. However, evaluating policy questions often involves modeling a substantial shift in the choice environment. For example, a health insurance reform may change the set of insurance products that consumers can buy, a merger may alter the products available in the marketplace, and the exit of one brand may alter the sales performance of a multi-product firm's remaining offerings. For such questions, it is less obvious whether machine learning methods can usefully be applied. As Athey (2017) remarks:

"[M]uch less attention has been paid to the limitations of pure prediction methods. When SML [supervised machine learning] applications are used off the shelf without understanding the underlying assumptions or ensuring that conditions like stability [of the environment] are met, then the validity and usefulness of the conclusions can be compromised."

The issue is that machine learning models do not aim to capture economic primitives. Therefore, it is uncertain how well they can be used to predict behavior in a changed environment where only individuals' primitives would be constant.

In this paper, we evaluate the performance of various prediction models after major changes in the choice environment for hospital markets. Specifically, we exploit a set of natural disasters that closed one or more hospitals but left the majority of the surrounding area relatively undisturbed. [Footnote 2: We examine the performance of non-machine-learning models for these disasters in Raval et al. (2017a).] These shocks exogenously altered consumers' choice sets, creating a benchmark (patients' actual choices in the post-disaster period) against which to assess the performance of different predictive models calibrated on pre-disaster data. We implement our comparison by first estimating consumer choice models on pre-disaster data, and then predicting consumer decisions after the disaster has changed consumers' choice sets. By comparing the different models' predictions to actual post-disaster choices, we can compare the out-of-sample predictive performance of traditional econometric models to various machine learning algorithms when the choice environment has changed. Our main prediction criterion is the fraction of actual choices that we correctly predict, using the highest estimated probability as the predicted choice. [Footnote 3: We find similar results using root mean squared error as a criterion.]

We evaluate two types of machine learning algorithms. First, building upon Raval et al. (2017b), we examine a set of models that partition the space of patients into types and estimate choice probabilities separately for each type. We examine an individual decision tree, as well as two methods, random forests and gradient boosted trees, that are well known to improve prediction performance over a single tree.

Second, we use a regularized multinomial logit model to select the variables most relevant for predicting choice from the large set of possible interactions of a set of predictor variables. We compare all of the machine learning methods to a conditional logit model similar to those previously used in the analysis of local hospital markets (Ho, 2006; Gowrisankaran et al., 2015).

We find that the gradient boosting and random forest models calibrated on pre-disaster data generally outperform all other models at predicting patient choice after a disaster has closed a hospital. The random forest and gradient boosting models are both about 21% better than the benchmark conditional logit model on the percent of patient choices that are correctly predicted, for an overall improvement of seven percentage points. A regularized multinomial logit model performs almost as well, improving on the baseline model by 19%. Either the random forest or the gradient boosting model is the best model for all of the experiments, and they are the best two models for four of the six experiments. However, we do find a large difference between these two models in computational time. For our largest dataset, the random forest takes minutes to run, while the gradient boosting model takes several hours. The regularized logit, the next best model in accuracy, is far slower still, taking almost a week for our largest dataset.

While we consistently find that the machine learning models perform best at prediction, their relative performance deteriorates for patients who were more likely to have had a major change in their choice set. This result consistently emerged across a variety of analyses. First, we identify sets of patients who were especially likely to have gone to the destroyed hospital, and who therefore experienced a larger shock to their environment than other patients. The random forest and gradient boosting models are 12% better than the benchmark conditional logit for patients whose predicted probability of going to the destroyed hospital was above 30%, compared to 24% better for patients whose predicted probability was less than 10%.

For the disaster where the destroyed hospital had a 50% share of the market, all of the machine learning models perform worse than our benchmark for patients who were predicted to have a 50% or greater probability of going to the destroyed hospital. Second, we examine patients who previously went to the destroyed hospital; for these patients, both the random forest and gradient boosting models perform less well relative to our baseline econometric model than they do on the full sample. These results indicate that models that rely on the data to speak will see their performance deteriorate, relative to models that impose structure on the choice process, when there are few data points to which to listen.

We formalize this insight through a model combination approach that estimates weights on the probabilities from different models, in order to show how the relative importance of models changes with larger changes in environment. Using a validation sample in which patients see no change in choice set, the benchmark conditional logit model receives a weight of 2% on average. On the test sample, where we look at patients following the disaster, it receives a weight of 9%, while for test-sample patients whose predicted share of the destroyed hospital was above 30%, it receives a weight of 18%. The conditional logit model has almost half the weight using test datasets for the disaster with the largest post-disaster change in choice set. These results indicate that traditional parametric logit models may still have an important role to play in counterfactual prediction when there are large changes in the environment.

In most situations, an analyst will not have an experiment to use to evaluate model performance. Instead, they will only be able to gauge accuracy by holding out a portion of their data and testing how well different models calibrated on the remainder of the data do at predicting outcomes in the hold-out sample.

We explore whether success in this traditional testing framework is a good proxy for predictive accuracy in changing environments. We find that predictive accuracy in a validation sample formed by holding out 20% of the training data provides a good guide to which models do best at predicting choices after disasters. Just as in the test sample, model performance on the validation sample indicates that the gradient boosting and random forest models are the two best models.

Overall, our work contributes to the emerging literature in economics and quantitative social science on the application of machine learning techniques (Varian, 2014; Athey, 2015; Kleinberg et al., 2015). Within this literature, Bajari et al. (2015a,b) and Rose (2016) are the papers most similar to our own. They consider the relative out-of-sample performance of several machine learning models compared to econometric models of aggregate demand and health care expenditures, respectively. Both find that machine learning models often outperform traditional econometric models. A major difference between their work and ours is that the choice environment changes dramatically in our test, which is precisely where an empirical approach grounded in a formal model of human behavior might be necessary.

The paper proceeds as follows. Section 2 discusses our data and experimental settings. Then, in Section 3, we describe the different models we test. Section 4 presents the results on model performance, Section 5 examines the computational time required for the machine learning algorithms, and Section 6 examines how model performance deteriorates for patients experiencing a greater change in environment. Section 7 examines the robustness of our findings. Finally, we conclude in Section 8.

2 Natural Experiments

2.1 Disasters

We exploit the unexpected closures of six hospitals in four different regions following a natural disaster. Table 1 below lists the locations of the disasters, when they took place, the nature of the event, and the hospital(s) affected. Our sample includes various types of disasters affecting urban as well as rural markets, and elite academic medical centers as well as community health centers. Because of this considerable heterogeneity in the treated groups, we expect any results that appear consistent across our experimental settings to have a high degree of external validity.

Table 1: Natural Disasters

Location        Month/Year  Severe Weather    Hospital(s) Closed
Northridge, CA  Jan-94      Earthquake        St. John's Hospital
Americus, GA    Mar-07      Tornado           Sumter Regional Hospital
New York, NY    Oct-12      Superstorm Sandy  NYU Langone; Bellevue Hospital Center; Coney Island Hospital
Moore, OK       May-13      Tornado           Moore Medical Center

For a natural disaster to provide a good natural experiment for assessing choice models, it must satisfy several criteria. First, the data on hospital choices both pre- and post-disaster must be rich enough that we can estimate and meaningfully compare different demand models. Second, the choice environment must in fact undergo a substantial change post-disaster. Finally, the damage from the disaster must be narrow enough that individuals' choice sets are the only thing that changes. Below, we describe the data we have on the hospital markets affected by these disasters, paying special attention to why they satisfy our criteria.

2.2 Service Areas and Choice Sets

Our primary sources of data are the inpatient hospital discharge data collected by state departments of health. [Footnote 4: For the most part, we rely on data provided directly by the relevant state agencies. For New York, we use both data provided by the state agency and discharge data separately obtained from HCUP. The HCUP data have the desirable characteristic of including a patient identifier that allows us to observe whether a patient had visited the destroyed hospital in the recent past, which we exploit in one of our extensions discussed below.] Such patient-hospital data have frequently been used by researchers (Capps et al., 2003; Ciliberto and Dranove, 2006; Garmon, 2017). They include a host of characteristics describing the patient receiving care as well as the type of clinical care being provided. [Footnote 5: Precise details on the construction of our estimation samples are provided in Appendix B.]

To assess the performance of different predictive methods when consumers' choice sets change, we first identify the population exposed to the loss of a choice. We do this by constructing the 90% service area for each destroyed hospital using the discharge data. We define a hospital's 90% service area as the smallest set of zip codes that accounts for at least 90% of the hospital's admissions. To address concerns that this set may include many zip codes where the hospital is competitively insignificant, we exclude any zip code where the hospital's share in the pre-disaster period is below 4%. We assume that any individual who lived in this service area and required care at a general acute care hospital would have considered the destroyed hospital as a possible choice.

We define the set of relevant substitute hospitals by including in the choice set all hospitals with a share above 1% among patients in the 90% service area in a given month (quarter for the smaller Sumter and Moore markets). Hospitals not meeting this threshold are combined into an outside option.
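To make the service area construction concrete, the sketch below computes a 90% service area from discharge records. It is a minimal illustration with pandas rather than the authors' code; the column names ("zip", "hospital") and the toy data are ours, and a real application would restrict to pre-disaster admissions as described above.

```python
import pandas as pd

def service_area(discharges, hosp, coverage=0.90, min_share=0.04):
    # Zip codes ranked by the hospital's admissions, most to fewest.
    counts = (discharges.loc[discharges["hospital"] == hosp, "zip"]
              .value_counts())
    cum = counts.cumsum() / counts.sum()
    # Smallest prefix of zip codes reaching the coverage threshold.
    zips = counts.index[: cum.searchsorted(coverage) + 1]
    # Drop zip codes where the hospital's share of all discharges is too low.
    share = (discharges[discharges["zip"].isin(zips)]
             .groupby("zip")["hospital"]
             .apply(lambda h: (h == hosp).mean()))
    return share[share >= min_share].index.tolist()

discharges = pd.DataFrame({  # toy data for illustration
    "zip": ["31709", "31709", "31719", "31735"],
    "hospital": ["Sumter", "Sumter", "Sumter", "Other"],
})
print(service_area(discharges, "Sumter"))  # ['31709', '31719']
```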

We test the different demand models on admissions taking place after the disaster, but we exclude the period immediately surrounding the disaster. [Footnote 6: In general, this is the month of the disaster. We describe the periods dropped for each disaster in Appendix B.] We drop observations during this period in order to avoid including injuries from the disaster and to ensure that the choice environment resembles the pre-period as much as possible. For example, we wish to avoid the possibility of changes in traffic patterns due to residual damage, which might affect our analysis.

Table 2 displays some characteristics of each destroyed hospital's market environment, including the number of admissions before and after the disaster, the share of the service area that went to the destroyed hospital before the disaster, the number of zip codes in the service area, and the number of rival hospitals. We refer to the data from the time period before the disaster as the training data and the data from the time period after the disaster as the test data. We also indicate the average acuity of patients choosing the destroyed hospital during the pre-disaster period, measured by average MS-DRG weight. [Footnote 7: DRG weights are designed to measure the complexity of patients' treatments, so reporting average weights is a way of measuring variation in treatment complexity between hospitals and regions.]

The service area for Sumter Regional experienced a massive change post-disaster, as the share of the destroyed hospital was over 50 percent. For the other disasters, the disruption was smaller though still significant, as the share of the destroyed hospital ranges from 9 to 19 percent. Thus, the destroyed hospitals consistently have a large enough share in each service area that patients' choice environments are likely to have changed substantially. We also have a substantial number of patient admissions before and after each disaster with which to parameterize and test the different models. The specific number of admissions during either the pre- or post-period ranges from the thousands for Moore and Sumter to tens of thousands for the New York hospitals and St. John's. [Footnote 8: The New York service areas do overlap. The service area for NYU is much larger than Bellevue's, so most of the zip codes for Bellevue are also in the service area for NYU, but the reverse is not true. NYU has about a 3 percent share in the Coney service area.]

Thus, we have enough admissions in the pre- and post-periods for all of the disasters to estimate and reasonably test different discrete choice models.

Table 2: Descriptive Statistics of Affected Hospital Service Areas

            Pre-Period   Post-Period   Zip     Choice     Destroyed   Destroyed
            Admissions   Admissions    Codes   Set Size   Share       Acuity
Sumter      6,940        5,…           …       …          …%          1.02
NYU         79,950       16,…          …       …          …%          1.41
Coney       46,588       9,…           …       …          …%          1.16
Bellevue    46,260       9,…           …       …          …%          1.19
Moore       3,184        10,…          …       …          …%          0.91
St. Johns   97,030       18,…          …       …          …%          1.30

Note: Columns indicate, in order, the number of admissions before the disaster, the number of admissions in the post-period, the number of zip codes included in the data set, the number of choices (including the outside option), the share of pre-period admissions from the 90% service area that went to the destroyed hospital, and the average DRG weight of destroyed hospital admissions in the pre-period.

2.3 Disaster Damage

An important question for the validity of our research design is whether the disaster meaningfully disrupted the wider area in addition to causing the closure of particular facilities. In such a circumstance, one might worry that predictions based on the pre-period would not be meaningful following the disaster. Therefore, we have carefully assessed whether the disasters' effects could reasonably be characterized as narrow. Below, we present graphical evidence of the comparatively modest scope of damage in Sumter, Moore, New York (NYU, Bellevue, and Coney Island), and Los Angeles in Figures 1 through 4. In each figure, zip codes in the service area are outlined.

In all four circumstances, the figures suggest that the extent of the damage was limited compared to the size of the affected hospitals' service areas. For example, Figure 1 shows the path of the storm that destroyed Sumter Regional Hospital as a green line. Its path was very narrow, cutting through the city of Americus without affecting the areas surrounding it. The Moore tornado had a similar effect for the city of Moore relative to its neighboring suburbs. The flooding from Hurricane Sandy (the damage from which is depicted in green shading) primarily affected areas adjacent to water. We show the damage surrounding NYU, Bellevue, and Coney Island and their service areas in Figure 3. All of these figures illustrate that the actual damage in Manhattan from Sandy (most of which was classified by FEMA as "Minor" damage) was concentrated in a relatively small part of the Manhattan hospitals' service areas. For Coney Island, most of the flooding affected the three zip codes at the bottom of the service area that are directly adjacent to Long Island Sound. Figure 4 shows that the damage in the Los Angeles area was more widespread than in the other disasters; we depict the intensity of earthquake shaking with darker green shading. While the Santa Monica area was particularly hard hit, many areas nearby received little structural damage from the earthquake.

Overall, our analysis and the figures show that the disasters used in our analysis generally affected only narrow slices of the overall service areas studied. As a result, we believe our analyses are unlikely to suffer from significant biases due to unobserved disruptions. However, we also include robustness checks in which we separately analyze areas with different levels of damage.

Figure 1: Damage Map in Americus, GA
Note: The green line indicates the path of the tornado, and the shaded area around it is the government-designated damage area. The zip codes included in the service area are outlined in pink. Sources: City of Americus; GA Discharge Data.

Figure 2: Damage Map in Moore, OK
Note: The green area indicates the damage path of the tornado. The zip codes included in the service area are outlined in pink. Sources: NOAA; OK Discharge Data.

Figure 3: Damage Map in New York, NY
Note: Green dots indicate buildings with damage classified as Minor, Major, or Destroyed by FEMA. The zip codes included in the service area for Bellevue are outlined in gray, those for NYU in pink, and those for Coney Island in blue. The other border colors are for zip codes that are in the service areas of multiple hospitals (maroon for NYU and Bellevue, red for NYU and Coney Island). Sources: FEMA; NY Discharge Data.

Figure 4: Damage Map in Los Angeles, CA
Note: Darker green areas indicate greater earthquake intensity as measured by the Modified Mercalli Intensity (MMI); an MMI value of 7 reflects non-structural damage and a value of 8 moderate structural damage. Areas that experienced the quake with greater intensity are shaded in a darker color; areas with an MMI below 7 are not colored. The zip codes included in the service area are outlined in pink. Sources: USGS Shakemap; OSHPD Discharge Data.

3 Models

In this section, we describe the different models that we test in this paper. We start with the traditional econometric framework for hospital choice, and then describe machine learning models based on regularization and on decision trees.

3.1 Econometric Models of Patient Choice

To predict multinomial choices, economists have typically turned to the additive random utility model, which presumes that decision-makers choose from a defined set of options to maximize their expected utility. In this model, patient i's utility from receiving care at a given hospital h is assumed to be a linearly separable combination of a deterministic component based on observable elements and an idiosyncratic shock:

$u_{ih} = \delta_{ih} + \varepsilon_{ih}$. (1)

The hospital choice literature has generally assumed that $\varepsilon_{ih}$ is distributed Type I extreme value. In that case, the probability that patient i receives care at hospital h takes the familiar logit form:

$s_{ih} = \frac{\exp(\delta_{ih})}{\sum_{h' \in H} \exp(\delta_{ih'})}$, (2)

where H is the set of all possible hospital choices. Given an assumption on the functional form of $\delta_{ih}$, one can make counterfactual predictions of how patient choices change as the utility of hospitals changes.

The existing empirical economics literature on hospital choice has typically specified ex ante a model for the deterministic utility $\delta_{ih}$ that applies to all patients. In this paper, we examine a model of the following form:

$\delta_{ih} = \sum_k \gamma_k z_{ik} \alpha_h + \sum_k f(d_{ih}, z_{ik}) + \alpha_h$, (3)

where i indexes patients, h indexes hospitals, and k indexes patient characteristics. Here, $z_{ik}$ are patient characteristics (e.g., age, condition, location), $\alpha_h$ are hospital indicators (alternative-specific constants, in the language of McFadden et al. (1977)), and $d_{ih}$ is the travel time between the patient's zip code and the hospital. The function $f(d_{ih}, z_{ik})$ represents distance interacted with patient characteristics. [Footnote 9: For travel time, we use ArcGIS to calculate the travel time (including traffic) between the centroid of the patient's zip code of residence and each hospital's address.] Thus, the first term represents interactions between patient characteristics and hospital indicators, the second term some function of distance interacted with patient characteristics, and the third term, $\alpha_h$, alternative-specific constants.
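As a concrete illustration of equation (2), the following minimal sketch (ours, not the paper's code) computes logit choice probabilities from a matrix of deterministic utilities:

```python
import numpy as np

def logit_shares(delta):
    """delta: (n_patients, n_hospitals) deterministic utilities delta_ih."""
    # Subtract each row's max before exponentiating for numerical stability.
    v = np.exp(delta - delta.max(axis=1, keepdims=True))
    return v / v.sum(axis=1, keepdims=True)  # each row sums to one

delta = np.array([[1.0, 0.2, -0.5],   # two patients, three hospitals (toy)
                  [0.1, 0.9, 0.3]])
s = logit_shares(delta)               # s[i, h] is equation (2)
```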

In this paper, we consider one such model (Inter) that includes interactions of hospital indicators with disease acuity, major diagnostic category, and travel time, as well as interactions of several patient characteristics (disease acuity, race, sex, age, and diagnostic category) with travel time and the square of travel time. In Appendix C, we show that this model performs better post-disaster than several other parametric logit models used in the literature, including those used in Capps et al. (2003), Ho (2006), Gowrisankaran et al. (2015), and Garmon (2017). We estimate this model via maximum likelihood.

3.2 Machine Learning Models

We view machine learning models as providing alternative methods for parameterizing the deterministic utility $\delta$ in equation (1). We further assume that hospitals are always differentiated by idiosyncratic logit shocks, even when this assumption is not used in the estimation process. With this approach, we can straightforwardly use both machine learning and econometric models for counterfactual predictions. [Footnote 10: The distributional assumption is not required for counterfactual predictions when using the binary metric of whether or not the model correctly predicted a patient's choice. In that case, one only needs to assume that removing a choice does not alter the ordering of patients' preferences. However, to consider predictive performance using root mean squared error (RMSE), one must specify the distribution of the idiosyncratic error.]
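Under the logit assumption, a closed hospital's demand is reallocated to the surviving options in proportion to their shares (the IIA property). The sketch below is one minimal way to operationalize the post-disaster prediction step under that assumption; it is our illustration, not the authors' code.

```python
import numpy as np

def remove_choice(shares, closed):
    """shares: (n_patients, n_hospitals) pre-disaster predicted probabilities;
    closed: column index of the destroyed hospital. Returns renormalized
    probabilities over the surviving choice set."""
    s = np.delete(shares, closed, axis=1)
    return s / s.sum(axis=1, keepdims=True)
```

Note that, as footnote 10 observes, the percent-correct criterion only requires that removing a choice preserves the ordering of the remaining options' probabilities, which this renormalization does.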

We examine two types of machine learning models. The first type is a regularization model that estimates a large set of interactions but shrinks their coefficients toward zero. The second type groups patients and estimates preferences for each group, using decision trees to construct the groups. For all of the machine learning models, we use the same set of predictor variables: the patient's zip code, disease acuity (DRG weight), the Major Diagnostic Category (MDC) of the patient's diagnosis, the patient's age, and indicators for medical vs. surgical admission, whether the patient was black, whether the patient was female, and whether the admission was on an emergency basis. We estimate all of the models using the scikit package in Python, and set tuning parameters using cross-validation.

3.2.1 Regularization

The Inter model described above includes many interactions of hospital indicator variables and travel time with patient characteristics. However, it is difficult to know which interactions are the most important to include. A standard approach in machine learning to variable selection is to estimate a regularization algorithm, which penalizes the magnitude of coefficients. We implement a LASSO regression (Tibshirani, 1996), which penalizes the absolute value of the coefficients, minimizing

$-\log L(\beta) + \lambda \sum_{k=1}^{K} |\beta_k|$,

where $L(\beta)$ is the likelihood of a multinomial logit model, $\beta$ are the coefficients of the model, and $\lambda$ is a tuning parameter regulating the degree of shrinkage. The LASSO regularization can shrink coefficients to zero, and thus provides an approach to variable selection. In particular, we estimate a model in which hospital indicator variables are interacted with two-way interactions of our set of predictors. To give an example of such an interaction, a hospital indicator interacted with a zip code and an MDC code allows patients from a particular zip code coming to the hospital for a particular condition, such as a cardiac condition or pregnancy, to have their own valuation of hospital quality. This procedure can generate hundreds or thousands of interactions.
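A hedged sketch of this step with scikit-learn, which the paper uses for its machine learning models: an L1-penalized multinomial logit is fit on two-way interactions of the predictors, with hospital-specific coefficients standing in for the hospital-indicator interactions. The data, feature construction, and penalty value here are illustrative, not the paper's.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))        # patient characteristics (toy data)
y = rng.integers(0, 4, size=500)     # chosen hospital index (toy data)

# Two-way interactions of the predictor set.
inter = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_int = inter.fit_transform(X)

lasso_logit = LogisticRegression(
    penalty="l1", solver="saga",     # saga supports L1 with multinomial loss
    C=1.0,                           # C = 1/lambda; tuned by cross-validation
    max_iter=5000,
)
lasso_logit.fit(X_int, y)
probs = lasso_logit.predict_proba(X_int)  # many coefficients end up exactly zero
```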

3.2.2 Grouping through Decision Trees

An alternative approach to estimating demand partitions the space of all patients into a large set of mutually exclusive groups, and then assumes homogeneous preferences within each of those groups. Choice probabilities are estimated using only the observations in each group. Thus, a grouping model makes no assumptions about the similarity of preferences across groups, and an additional observation will only affect the predicted outcomes of members of its group. A grouping approach models deterministic utility as:

$\delta_{ih} = \delta_{g(z_i)h}$, (4)

for some set of groups $g(z_i)$ that depend upon patient characteristics $z_i$. This model is equivalent to including a fixed effect for each group-hospital interaction in a multinomial logit model, with the IIA property of proportional substitution holding within each group. Given a set of groups, predicted choice probabilities can be estimated as the empirical shares of hospitals within each group. The main question with grouping models is how to determine the set of groups $g(z_i)$.
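A minimal sketch of this grouping estimator with pandas: predicted choice probabilities are the empirical hospital shares within each group. The single grouping variable, column names, and toy data are ours; the Semipar estimator described next also enforces a minimum group size.

```python
import pandas as pd

df = pd.DataFrame({                          # toy data
    "zip": ["31709", "31709", "31709", "31719"],
    "hospital": ["A", "A", "B", "B"],
})
shares = (df.groupby("zip")["hospital"]
            .value_counts(normalize=True)    # within-group empirical shares
            .rename("prob")
            .reset_index())
print(shares)  # zip 31709: P(A) = 2/3, P(B) = 1/3; zip 31719: P(B) = 1
```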

Setting the size of the groups involves a bias-variance tradeoff, as a smaller group size allows greater flexibility to model potentially very heterogeneous preferences, but also increases the variance of the predicted choice probabilities.

For hospital choice, the first grouping model that has been used in the literature is a semiparametric bin estimator similar to that outlined in Raval et al. (2017b) (Semipar). This model constructs groups based on an exogenously determined set of patients' observable characteristics. In this paper, we group patients based on their zip code, disease acuity (DRG weight), age group, and area of diagnosis (MDC), in that order. The model is estimated using the procedure in Raval et al. (2017b). The main parameter in the model is a minimum group size; if patients are initially placed in groups below this minimum size, grouping variables are dropped for those patients until the size of the group is above the minimum. Thus, an initial grouping would be based on zip code, acuity, age, and diagnosis category; if the group size was too small, diagnosis category would be dropped. We drop variables in the reverse order of how they were listed above.

While intuitive, this method for grouping is ad hoc and requires the ex ante definition of the order of observable characteristics used for prediction. A set of machine learning models based on decision trees provides algorithmic approaches that allow the data to determine the optimal groups.

The first grouping machine learning model (DT) we estimate is known as a decision tree. While there are many possible ways of estimating tree models, we employ the approach known as CART (Breiman et al., 1984), which is perhaps the most popular. The CART approach is a greedy algorithm, separating the data into two groups at each node based on the split that minimizes the prediction error criterion. Thus, it recursively partitions the data by growing the tree through successive splits.
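A hedged sketch of this step with scikit-learn's DecisionTreeClassifier, treating hospital choice as classification over patient characteristics; the synthetic data and parameter values are ours. min_samples_leaf plays the role of the minimum node size, and cost-complexity pruning (ccp_alpha) corresponds to the pruning step described below.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))     # patient characteristics (toy data)
y = rng.integers(0, 6, size=1000)  # chosen hospital index (toy data)

tree = DecisionTreeClassifier(min_samples_leaf=50, ccp_alpha=0.001,
                              random_state=0)
tree.fit(X, y)
probs = tree.predict_proba(X)      # within-leaf empirical choice shares
```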

The CART algorithm runs the risk of overfitting the data by creating too many splits. To address this problem, the tree model is pruned by removing excessive splits that likely contribute little to the out-of-sample performance of the tree. While a single decision tree is easy to understand and interpret, the literature has tended to conclude that the approach does not provide the best predictive power. For example, Breiman (2001b) noted, "While trees rate an A+ on interpretability, they are good, but not great, predictors. Give them, say, a B on prediction." Models that average the predictions of many tree models are known to have much better predictive power.

Our second grouping machine learning model (RF) leverages this insight, injecting stochasticity into the tree construction process so as to create a random forest. This type of model, which Breiman (2001b) gives an A+ for prediction, was originally presented in Breiman (2001a). Our random forest introduces two sources of stochasticity into the formation of trees. First, a whole forest of trees is built by estimating different tree models on bootstrap samples of the original dataset; this procedure is known as bagging. Second, the set of variables considered for splitting is different and randomly selected for each tree. Breiman (2001b) cites considerable evidence in the machine learning literature that random forests outperform other machine learning models. In the economics literature, Bajari et al. (2015a) find that random forest models were the most accurate in predicting aggregate demand out of sample.
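A hedged sketch of the random forest with scikit-learn, on the same kind of synthetic data; the parameter values are illustrative. bootstrap=True implements the bagging step, and max_features randomly restricts the variables considered at each split, the second source of stochasticity described above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))     # patient characteristics (toy data)
y = rng.integers(0, 6, size=1000)  # chosen hospital index (toy data)

forest = RandomForestClassifier(n_estimators=500, min_samples_leaf=50,
                                max_features="sqrt", bootstrap=True,
                                n_jobs=-1, random_state=0)
forest.fit(X, y)
probs = forest.predict_proba(X)    # probabilities averaged across the trees
```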

Our third grouping machine learning model (GBM) also derives from decision tree modeling, but uses gradient boosting to generate a multiplicity of trees (Freund and Schapire, 1995; Friedman et al., 2000; Friedman, 2001). Whereas random forest models introduce new sources of randomness by varying the training data and the available explanatory variables, gradient boosting builds off of a single underlying tree structure. However, it creates multiple generations of the original model by overweighting observations that were classified incorrectly in the previous iteration. [Footnote 11: In a linear regression, a boosting procedure would overweight observations with large residuals.] The final prediction is then a weighted average across all of the different models produced. Boosting can be thought of as an additive expansion in a set of elementary basis functions (in our case, trees). A shrinkage parameter scales how much each new tree adds to the overall prediction, and acts as a form of regularization. Boosting is an extremely good prediction algorithm, and was at one time deemed the best off-the-shelf classifier in the world (Hastie et al., 2005).

We estimate each of the local machine learning models using the Scikit package in Python. For each of these methods, the main tuning parameter is the minimum size of the node ("leaf"), which we cross-validate separately for each experiment, testing values of 10, 25, 50, and 100. For the random forest and gradient boosting methods, we set the number of estimators to a fixed value.
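A hedged sketch of the gradient boosting model together with the paper's cross-validated grid for the minimum leaf size; the synthetic data and the number of trees are illustrative. learning_rate is the shrinkage parameter that scales each new tree's contribution.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))     # patient characteristics (toy data)
y = rng.integers(0, 6, size=1000)  # chosen hospital index (toy data)

search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=200, learning_rate=0.1),
    param_grid={"min_samples_leaf": [10, 25, 50, 100]},
    cv=5,
)
search.fit(X, y)
gbm = search.best_estimator_       # tuned boosted-tree model
```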

4 Average Predictive Performance

We estimate all of the models in Section 3 on training data from the period before the disaster, and assess each model's predictive performance on test data from the period after the disaster. Our primary measure is the share of choices correctly predicted by the models in the post-disaster period. We consider a model to predict a choice when the choice probability for that choice is higher than for any of the alternatives. This goodness-of-fit measure is easily interpretable and transparent.

We first show how well the models perform on an absolute basis on the percent of choices correctly predicted, averaging across the experiments. Figure 5 contains these estimates. On average, our baseline econometric model Inter predicts choices correctly 39.6% of the time. Semipar, the semiparametric bin model, predicts choices correctly 41.4% of the time. The machine learning models do significantly better: RF and GBM correctly predict 46.4% of choices, Regular 45.6%, and DT 44%.

Figure 5: Percent Correct Averaged over all Experiments

Henceforth, we compare all of the machine learning models to the econometric model Inter, and present the percent improvement in predictive accuracy for all other models relative to Inter, as it is the best performing model akin to those standard in the econometric literature. [Footnote 12: In Appendix C, we show that Inter and Semipar perform better than all other models used in the econometric literature that we test.]
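The percent-correct criterion above reduces to a one-liner; the sketch below (ours) scores a matrix of predicted probabilities against observed choices:

```python
import numpy as np

def percent_correct(probs, chosen):
    """probs: (n_patients, n_hospitals) predicted probabilities;
    chosen: (n_patients,) index of the hospital actually chosen.
    A model predicts the choice with the highest estimated probability."""
    return float(np.mean(probs.argmax(axis=1) == chosen))
```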

Figure 6 depicts the percent improvement in the share of correct predictions relative to Inter, averaged over all of the experiments. GBM and RF perform the best, providing a 20.5% and 20.6% increase in predictive accuracy, respectively, for about a seven percentage point increase in average percent correct. Regular performs 18.8% better than Inter, and so is slightly worse than the two best machine learning models. These models outperform DT and Semipar, which are 15% and 6% better than Inter, respectively. Thus, the two models that build upon an individual decision tree do perform somewhat better than an individual tree.

Figure 6: Percent Improvement in Predictive Accuracy using Percent Correct, Averaged over all Experiments
Note: Predictive accuracy is measured as the percent correct, averaged over all experiments and measured relative to Inter.

We next consider performance at the individual experiment level; in Figure 7, we plot these results, with Semipar in purple, RF in blue, GBM in green, DT in yellow, and Regular in red. The models' performance varies substantially across disasters. For example, in Sumter, none of the models perform substantially better than Inter, with RF the best at a 2.2% improvement; the DT and Semipar models perform worse than Inter. The machine learning models perform dramatically better for Moore, with RF and Regular having a 63% higher share of correct predictions. For the other four experiments, RF and GBM perform between 10 and 20% better than Inter, and one of the two is the best model. In general, RF and GBM consistently improve upon the predictive performance of the best of the standard econometric specifications, and are the best two models for 4 of the 6 disasters.

4.1 Validation Sample Performance

In most situations, researchers will not have access to natural experiments like ours with which to assess models, but they could use performance on a validation sample that is similar to the training sample to gauge expected performance. We examine whether performance on a validation sample can provide a good guide to performance after a major change in the choice set by estimating the models on a random 80% sample of the training data (the train sample) and then examining their performance on the excluded 20% of the sample (the validation sample). We then compare model performance on these samples to performance in the test sample of post-disaster data in Figure 8. We find a similar ordering of relative model performance between the validation sample and the test sample. For example, GBM and RF are the two best models using the training, validation, and test samples.
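A minimal sketch of this 80/20 split (X and y are synthetic stand-ins for the pre-disaster design matrix and chosen hospitals):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 6, size=1000)

# 80% train sample, 20% validation sample, drawn from pre-disaster data.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,
                                                  random_state=0)
```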

Figure 7: Percent Improvement in Predictive Accuracy using Percent Correct, by Experiment
Note: Percent correct measured relative to Inter.

The main exception is the decision tree model DT, which performs better than Regular in the validation sample but worse in the test sample. Thus, our experiments suggest that a validation sample can provide a good guide to model performance even after a major change in environment. However, the validation sample tends to suggest cardinally better model performance: all of the machine learning models perform worse in the validation sample than in the training sample, and worse in the test sample than in the validation sample. For example, the RF model is 35% better than Inter in the training data, 24% better in the validation sample, and 21% better in the test sample.

Figure 8: Average Percent Improvement in Predictive Accuracy using Percent Correct, on the Training, Validation, and Test Samples
Note: Percent correct measured relative to Inter. The training sample is a random 80% of the pre-disaster data, the validation sample the remaining 20% of the pre-disaster data, and the test sample the post-disaster data.

4.2 Model Combination

One major finding of machine learning is that ensembles of models can perform better than any one individual model (Van der Laan et al., 2007). [Footnote 13: In this study, both GBM and RF are already combinations of hundreds of base learners and perform very well.] These findings suggest that combining the predictions from multiple models may lead to better predictions of behavior than using a single preferred model. In this section, we examine how well the model combination approach does in prediction compared to non-hybrid approaches.

While there are several ways to combine models, we apply a simple regression-based approach that has been developed in the literature on optimal model combination for macroeconomic forecasts (Timmermann, 2006).

To apply the method to our context, we treat each patient as an observation and regress observed patient behavior on the predictions from all of the models. We constrain the coefficients on the models' predictions to be non-negative and to sum to one. Thus, each coefficient in the regression can be interpreted as a model weight, and many models will be given zero weight. We perform this analysis separately for each disaster, which enables us to see the variation in our findings across the different settings. The regression framework implicitly deals with the correlations in predictions across models: if two models are very highly correlated but one is a better predictor than the other, only the better of the two models receives weight in the optimal model combination.

Formally, we regress each patient's choice of hospital on the predicted probabilities from all of the models, without including a constant, as below:

$y_{ih} = \beta_{Semipar} \hat{y}_{ih}^{Semipar} + \ldots + \beta_{RF} \hat{y}_{ih}^{RF} + \epsilon_{ih}$,

where $y_{ih}$ indicates the observed choice for patient i and hospital h, and $\hat{y}_{ih}^{Semipar}$ is the predicted probability for patient i and hospital h from Semipar. We include all of the models tested in this paper in the model combination.
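A hedged sketch of the constrained regression: each model's predicted probabilities are stacked as regressors, and non-negative weights summing to one are found by minimizing the sum of squared errors. This is our own minimal implementation of the idea, not the authors' code.

```python
import numpy as np
from scipy.optimize import minimize

def combination_weights(preds, y_onehot):
    """preds: list of (n_patients, n_hospitals) probability arrays, one per
    model; y_onehot: (n_patients, n_hospitals) indicator of observed choices."""
    P = np.stack([p.ravel() for p in preds], axis=1)   # (n*H, n_models)
    y = y_onehot.ravel()
    n_models = P.shape[1]

    def sse(w):
        return np.sum((y - P @ w) ** 2)

    res = minimize(sse, x0=np.full(n_models, 1.0 / n_models),
                   bounds=[(0.0, 1.0)] * n_models,
                   constraints=[{"type": "eq",
                                 "fun": lambda w: w.sum() - 1.0}],
                   method="SLSQP")
    return res.x  # one weight per model; many can be (near) zero
```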

Figure 9: Average Percent Improvement in Predictive Accuracy Averaged over all Experiments, Including the Model Combination Approach
Note: Percent correct measured relative to Inter.

We examine the performance of the model combination predictions, estimated on the 20% validation sample (allowing the estimated weights to vary by disaster), in the period after the disaster. In Figure 9, we depict the percent improvement over the Inter model, including the model combination, averaging across all of the service areas. The model combination is the best model, performing slightly better than RF and GBM: it provides a 21.4% improvement on Inter, compared to 20.6% and 20.5% for RF and GBM. Across experiments, the model combination is the best model for 4 of the 6 experiments, although again by small margins over the best individual model. Thus, we find evidence that a model combination performs better, but not substantially better, than the best individual model.

5 Estimation Time

While the RF and GBM models frequently have similar predictive accuracy, the computational time required for each of them is very different for the parameters that we use.

The computational time for the algorithms is reported in Table 3 for St. Johns, one of our largest datasets, and Moore, one of our smallest. St. Johns has about 30 times the number of admissions as Moore. These computations were done using the Scikit package in Python 3 on a server where the algorithms were permitted to use up to 40 cores at a time.

The fastest machine learning algorithm is the decision tree: it took 0.7 seconds to estimate the decision tree for Moore and about 5 seconds for St. Johns. Of the machine learning algorithms that generalize decision trees, RF is by far the fastest; for Moore it took 45 seconds, while for St. Johns it took 6 minutes. GBM is two orders of magnitude slower than the random forest, taking 16 hours for St. Johns and 30 minutes for Moore. Finally, Regular is three orders of magnitude slower than the random forest, taking nearly 6 days to run for the California dataset and about 2 hours for Moore. The time difference between GBM and RF is likely largely due to the fact that the trees in a random forest can be constructed in parallel, while the trees in a gradient boosting model are constructed sequentially.

We caution that our results on computational time are to some extent specific to the model hyperparameters and software. Nevertheless, looking at computational time and prediction together, it becomes clear that RF and GBM should be the preferred machine learning models for this class of discrete choice problems.

Table 3: Computational Time by Machine Learning Algorithm

Model     St. Johns     Moore
DT        4.9 seconds   0.7 seconds
RF        6.3 minutes   45 seconds
GBM       16.2 hours    30 minutes
Regular   6.0 days      2.1 hours

6 Performance in a Changing Environment

The above results demonstrate that, on average, the flexible machine learning models tend to predict better than conventional econometric models after the disaster-induced change in the choice set. However, this could be because we include many patients whose preferred hospital was unaffected by the disaster, so that the destruction of a non-preferred hospital had no effect on their choices. To assess this possibility, we focus on the patients who were more likely to experience the elimination of their preferred hospital following the natural disaster, in three ways. First, we examine predictive accuracy across patients as a function of the probability that a similar patient would have gone to the destroyed hospital in the pre-disaster period; we calculate these likelihoods based on the bins constructed by Semipar. Second, for our New York and California hospitals, we look at patients who had an admission at the destroyed hospital in the pre-period. Third, we examine the weight that the model combination approach places on the different models in the validation versus the test datasets.

6.1 Probability of Using Destroyed Hospital

In Figure 10, we show the performance of the machine learning models relative to Inter, broken down by the destroyed hospital's share of the group's discharges in the pre-disaster period, using both the percent correct and RMSE criteria. The figure shows that the relative improvement of all of the machine learning models over Inter is declining in the share of the destroyed hospital. GBM and RF continue to improve over Inter for groups whose pre-disaster share of the destroyed hospital was above 30%, but they are only 11.7% and 12.4% better, compared to a 24.1% and 23.8% improvement for groups for which the share of the destroyed hospital was below 10%.

DT is only 3% better than Inter for groups for which the pre-disaster share of the destroyed hospital is above 30%, while Semipar performs worse than Inter for these groups.

While this figure is instructive, for many disasters only a few patients had a predicted share of the destroyed hospital greater than 30%. Therefore, we also look at results separately for Sumter, because the pre-disaster share of the destroyed hospital was 50% and there was significant variation across groups in the pre-disaster share of the destroyed hospital. We display these results in Figure 11. In general, the machine learning models did worse in areas with a larger share of the destroyed hospital, with all of the models performing worse than Inter for a destroyed hospital share of 50% or greater. For example, RF is 1.2% worse than Inter for groups with a predicted destroyed hospital share between 50% and 80%, and 0.7% worse for groups with a predicted destroyed hospital share above 80%. GBM is 5.1% worse than Inter for groups with a predicted destroyed hospital share between 50% and 80%, and 3.9% worse for groups with a predicted destroyed hospital share above 80%. In contrast, many of the models perform better than Inter for groups with a predicted share of the destroyed hospital between 15% and 50%.

We also quantify these results using our model combination approach. In Table 4, we display the average model weights on the validation sample data, the post-disaster test data, and only those observations in the post-disaster test sample for which the destroyed hospital had at least a 30% share. We highlight two major findings. First, there is no one preferred model: within a given disaster, no one model receives all of the weight. On average, using the test data, the RF model received almost 26% of the weight, with the GBM model receiving 43% and Inter 9% of the weight.

Figure 10: Percent Improvement in Predictive Accuracy for Percent Correct, by Share of Destroyed Hospital, Averaged over all Experiments
Note: Percent correct measured relative to Inter.

Second, we find that the parametric logit model has a much larger share in the post-disaster model combination, especially when the disaster had a large effect on the choice environment. The share of Inter rises from 2% using the validation sample, to 9% using the test sample, to 18% using the test sample restricted to observations with a destroyed hospital share above 30%. In addition, looking just at the results for Sumter, which had the biggest change in choice set, we found that the model combination's share of Inter is very large, at 48% to 57% for the datasets based on the test data, but only 11% in the validation dataset. The Regular model, which like Inter estimates a linear set of interactions, also sees its weight increase, from 0% in the validation sample, to 15% in the test sample, to 20% in the test sample restricted to observations with a destroyed hospital share above 30%.

Figure 11: Percent Improvement in Predictive Accuracy for Percent Correct, by Share of Destroyed Hospital: Sumter
Note: Percent correct measured relative to Inter.

Again, machine learning models perform relatively less well than traditional parametric logit models where the change in the choice environment is very large. [Footnote 14: Individual model combination results are available upon request.]

6.2 Previous Patients

For our second approach, we focus on predictions for patients with a previous admission at the destroyed hospital, who would likely have gone to the destroyed hospital in the absence of the disaster. We have a total of 633, 491, 624, and 1,036 admissions for such patients for NYU, Bellevue, Coney, and St. Johns, respectively.

Table 4: Average Model Weights for Optimal Model Combination

Model     Validation   Test   Test, Destroyed Share ≥ 30%
Inter     2%           9%     18%
RF        …            26%    …
GBM       …            43%    …
DT        …            …      …
Regular   0%           15%    20%
Semipar   …            …      …

Note: The columns provide the average weight for each model across the different experiments using the 20% validation sample, the test sample after the disaster, and only those observations in the post-disaster test sample for which the destroyed hospital had at least a 30% share, respectively.

We depict these results in Figure 12, broken down on an experiment-by-experiment basis. For these patients, we continue to find that the best machine learning models outperform our benchmark model Inter: GBM is 5 to 7% better than Inter for each experiment for the previous admission patients, and RF is 1.7 to 13.5% better. However, this rate of improvement is substantially lower than we found for patients in general, for whom these models were 10 to 20% better than Inter in the same experiments. The DT model is worse than Inter for 3 of the 4 experiments, while Regular is the best of the machine learning models in 2 cases and worse than Inter in one case. In general, the relative performance of machine learning models falls when examining only previously admitted patients.

Overall, these results suggest that local machine learning approaches perform less well relative to standard econometric approaches when focusing on people with the largest change in their choice environment.

Figure 12: Percent Improvement in Predictive Accuracy for Percent Correct, by Share of Destroyed Hospital, for Previous Patients
Note: Percent correct measured relative to Inter.

7 Robustness

In this section, we evaluate the robustness of our conclusions to removing areas with more damage from the disaster, to examining only specific patient groups, and to using RMSE instead of percent correct as the prediction criterion. We find that none of these changes leads to substantially different conclusions from those described above.

7.1 Removing Destroyed Areas

Our first approach to evaluating the robustness of our conclusions is to consider the effect of removing the areas most affected by the disaster from our estimates of model performance after the disaster. If destruction from the disaster affects how patients make decisions beyond just the change in the choice set (for example, if they are forced to move), then models estimated before the disaster may not be able to predict patients' decisions after the disaster.

We focus on Sumter, Coney Island, and Northridge. We do not remove any areas for NYU or Bellevue, as the areas immediately nearby these hospitals had very little post-Sandy damage. For Moore, removing the zip codes through which the tornado traversed would remove almost all of the patients from the choice set, so we do not conduct this robustness check for Moore. For Sumter, we remove the two zip codes comprising the city of Americus, as the tornado mainly damaged the city rather than its outlying areas. For Coney Island, we remove the three zip codes that submitted the most post-disaster claims to FEMA; these zip codes are on the Long Island Sound and likely suffered more from flooding after Sandy. For St. Johns, we remove zip codes with an average Modified Mercalli Intensity (MMI) of 8 or above, based on zip code level data from an official report on the Northridge disaster for the state of California. The US Geological Survey defines MMI values of 8 and above as causing structural damage. This removes 9 zip codes, including all 5 zip codes in Santa Monica. [Footnote 15: The zip codes removed are … for Sumter; 90025, 90064, 90401, 90402, 90403, 90404, 90405, 91403, and … for St. Johns; and 11224, 11235, and … for Coney. See home/webmap/viewer.html?webmap=f27a0d274df34a77986f6e38deba2035 for Census block level estimates of Sandy damage based on FEMA reports. See ftp://ftp.ecn.purdue.edu/ayhan/aditya/northridge94/OES%20Reports/NR%20EQ%20Report_Part%20A.pdf, Appendix C, for the Northridge MMI data by zip code.]

The areas removed tend to have higher market shares for the destroyed hospital. Removing the destroyed areas cuts Sumter's market share from about 50 percent to 31 percent, St. John's from 17 to 14 percent, and Coney's from about 18 to 10 percent. We estimate the models on the full pre-disaster sample, but evaluate our performance measures separately based on whether the patient came from an area with or without significant damage.

In Figure 13, we display these results for the damaged areas in the left panel and for the relatively non-damaged areas in the right panel. We find that the machine learning models almost always outperform the econometric models for Coney and St. Johns in both the damaged and non-damaged areas, although their margin of improvement is larger in the destroyed areas for Coney and smaller for St. Johns. Only DT appears to perform substantially worse in the destroyed areas. For Sumter, GBM and RF slightly underperform Inter in the damaged areas and outperform Inter in the non-destroyed areas, consistent with our evidence on how the models performed with different shares of the destroyed hospital in the previous section. Regular and Semipar perform slightly better than Inter in the destroyed areas, but much worse in the non-destroyed areas.

7.2 Patient Heterogeneity

For our second robustness check, we consider the performance of the different predictive models for different types of patients. First, we examine seven important classes of patients: cardiac patients (with a Major Diagnostic Category of 5), obstetrics patients (with a Major Diagnostic Category of 14), non-emergency vs. emergency patients, and terciles of disease acuity, measuring disease acuity by DRG weight.

(a) Damaged Areas   (b) Non-Damaged Areas
Figure 13: Relative Improvement in Percent Correct, Damaged vs. Non-Damaged Areas
Note: Improvement is the percentage improvement in percent correct for each model over Inter.

We estimate the models on all patients, but then separately examine their performance for patients in the given groups. Figure 14 displays the results of these robustness checks. The local machine learning models continue to do significantly better than Inter and Semipar for all of the groups. Their relative performance is better for emergency than for non-emergency patients, and for pregnancy than for cardiac patients. For example, RF is 44% better than Inter for emergency patients compared to 35% better for non-emergency patients, and is 35% better for pregnancy patients compared to 24% better for cardiac patients. We find similar relative performance for the machine learning models across low acuity, medium acuity, and high acuity patients.

In addition, we check whether our conclusions hold if we restrict the data sample to only the Medicare population, and then examine the performance of the models for this population after the disaster.16 The Medicare sample is considerably smaller than the full dataset, which may hurt the performance of the local models because of power considerations. The Medicare sample also has a different mix of treatment conditions than the full sample; for example, Medicare patients probably do not value the pediatric or obstetrics departments of a hospital. We depict these estimates in Figure 15; the machine learning models continue to improve over our baseline econometric model for Medicare patients, though by smaller margins. For example, on average, RF is only 15.5% better than Inter, and GBM is 14.0% better than Inter, for Medicare patients, compared to 20.6% and 20.5%, respectively, for patients in the full sample.

16 For the states for which Fee-for-Service Medicare and Managed Care Medicare are distinguished, we exclude Managed Care Medicare patients. The Medicare sample should have unrestricted access to all of the hospitals in the choice set.

Figure 15: Average Relative Improvement for Individual Predictions: Medicare Patients. Note: Improvement is the percentage improvement in percent correct for each model over the Inter model. We examine Medicare patients only and re-estimate all of the models on the Medicare-only sample in order to develop predictions.

7.3 RMSE as Prediction Criterion

Our baseline prediction criterion of percent correct ignores the models' estimated probabilities for non-selected choices. However, estimates of welfare depend on the probabilities of all hospitals in the choice set, not just the chosen hospital. Therefore, we also present many of our results using root mean squared error (RMSE) across the probabilities of all choices, which penalizes models that incorrectly predict probabilities for low-probability hospitals that none of the models would select as the most likely choice.
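One way to operationalize this criterion, in a minimal sketch of our own rather than the authors' code, is to compare the matrix of predicted choice probabilities against one-hot indicators of the realized choices; the percent improvement reported below is then the negative of the percentage change in RMSE relative to Inter. Array shapes and names are assumptions.

```python
import numpy as np

def choice_rmse(probs, chosen):
    # probs:  (n_patients, n_hospitals) predicted choice probabilities,
    #         each row summing to one.
    # chosen: (n_patients,) column indices of the chosen hospitals.
    outcomes = np.zeros_like(probs)
    outcomes[np.arange(len(chosen)), chosen] = 1.0
    # RMSE is taken across the probabilities of all choices, so errors on
    # low-probability hospitals are penalized as well.
    return np.sqrt(np.mean((probs - outcomes) ** 2))

def rmse_improvement(probs_model, probs_inter, chosen):
    # Percent improvement over Inter: the negative of the percentage
    # change in RMSE, so a lower RMSE appears as a positive improvement.
    base = choice_rmse(probs_inter, chosen)
    return -100 * (choice_rmse(probs_model, chosen) - base) / base
```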

Using RMSE as the prediction criterion, we find results similar to our earlier results with percent correct. In Figure 16, we depict the percent improvement in RMSE (that is, the negative of the percentage change in RMSE) relative to Inter, averaged over all of the experiments. We again find that GBM and RF are the best models. However, the margin of improvement over Inter is much smaller; GBM and RF are both about 3.7% better than Inter. The regularization model Regular is about 2.5% better than Inter. The only major difference compared to our results for percent correct is that the DT model performs relatively much worse, at only a 0.6% improvement over Inter. In Figure 17, we show these results by disaster. For Sumter, all of the machine learning models are now worse than Inter; RF is 0.7% worse and GBM is 1.5% worse. However, for all of the other experiments, we find that RF and GBM are the two best models.

8 Conclusion

In this paper, we have shown that machine learning models perform significantly better than traditional econometric models in predicting patient decisions, even when the environment experiences significant changes. In addition, the relative performance of these algorithms on a validation (holdout) sample mirrored their performance on the post-disaster test sample. However, we also showed that machine learning algorithms are not a panacea; their performance relative to econometric models deteriorated with larger changes in the choice environment, and it may be desirable to supplement an algorithmic model with a standard parametric one.

Figure 16: Percent Improvement in Predictive Accuracy using RMSE, Averaged over all Experiments. Note: Predictive accuracy is measured as RMSE, averaged over all experiments and measured relative to Inter.

Structural econometric models are still required to answer many counterfactual policy questions, such as the effects of entry and product repositioning.17 For example, forecasting entry would require modeling the quality of the entering hospital, which would be extremely difficult to do across hundreds of endogenously determined groups and thousands of trees in a random forest model. Future work should examine how machine learning models could be applied, or blended with traditional econometric models, to tackle these questions.

17 See Raval and Rosenbaum (2018) and Raval and Rosenbaum (forthcoming) for examples of each in the hospital context.

In addition, as Raval and Rosenbaum (2018) show, distance to the provider is perhaps the most important variable in spatial models, but all of the machine learning models we have employed include zip codes rather than distance. This is because the machine learning models are variants of a multinomial, rather than a conditional, logit: only coefficients, and not predictor values, are allowed to vary across choices for the same individual. Future work should examine how machine learning algorithms could apply to the conditional logit.18
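To make this distinction concrete, the following minimal sketch (ours, not the paper's) contrasts the two data layouts; all dimensions, names, and coefficient values are hypothetical.

```python
import numpy as np

n, H, k = 4, 3, 5  # patients, hospitals, patient features (toy sizes)
rng = np.random.default_rng(0)

# Multinomial logit, as in the machine learning models: the predictors are
# patient-level only (e.g., zip code dummies), and the coefficients vary
# across the H hospitals.
z = rng.normal(size=(n, k))      # one feature vector per patient
beta = rng.normal(size=(k, H))   # choice-specific coefficients
v_multinomial = z @ beta         # (n, H) matrix of utilities

# Conditional logit: predictors such as travel time vary across choices,
# with a coefficient shared by all hospitals.
d = rng.uniform(5, 60, size=(n, H))  # patient-to-hospital travel times
gamma = -0.1                         # hypothetical travel-time coefficient
v_conditional = gamma * d            # (n, H) matrix of utilities

# Either utility matrix maps to choice probabilities through the usual
# logit formula, P_ih = exp(v_ih) / sum over h' of exp(v_ih').
```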

Figure 17: Percent Improvement in Predictive Accuracy using RMSE, By Experiment. Note: RMSE measured relative to Inter.

Finally, this paper has examined the performance of machine learning models in hospital choice. However, our experimental technique could be used in other choice environments by exploiting firm experiments on which products to stock (Conlon and Mortimer, 2013), temporary stockouts, failures in vertical contract negotiations, or even product exit.

18 Some work has been done to this effect; see Reid and Tibshirani (2014).

References

Aguirregabiria, Victor, Empirical Industrial Organization: Models, Methods, and Applications, 2011.
Athey, Susan, "Machine Learning and Causal Inference for Policy Evaluation," in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2015, pp. 5-6.
Athey, Susan, "Beyond Prediction: Using Big Data for Policy Problems," Science, 2017, 355 (6324).
Athey, Susan and Guido Imbens, "The State of Applied Econometrics: Causality and Policy Evaluation," arXiv preprint, 2016.
Bajari, Patrick, Denis Nekipelov, Stephen P. Ryan, and Miaoyu Yang, "Demand Estimation with Machine Learning and Model Combination," Technical Report, National Bureau of Economic Research, 2015.
Bajari, Patrick, Denis Nekipelov, Stephen P. Ryan, and Miaoyu Yang, "Machine Learning Methods for Demand Estimation," The American Economic Review, 2015, 105 (5).
Breiman, Leo, "Random Forests," Machine Learning, 2001, 45 (1), 5-32.
Breiman, Leo, "Statistical Modeling: The Two Cultures," Statistical Science, 2001, 16 (3).
Breiman, Leo, Jerome Friedman, Charles J. Stone, and R.A. Olshen, Classification and Regression Trees, Chapman and Hall, 1984.
Capps, Cory, David Dranove, and Mark Satterthwaite, "Competition and Market Power in Option Demand Markets," RAND Journal of Economics, 2003, 34 (4).
Ciliberto, Federico and David Dranove, "The Effect of Physician-Hospital Affiliations on Hospital Prices in California," Journal of Health Economics, 2006, 25 (1).
Conlon, Christopher T. and Julie Holland Mortimer, "An Experimental Approach to Merger Evaluation," NBER Working Paper, 2013.
van der Laan, Mark J., Eric C. Polley, and Alan E. Hubbard, "Super Learner," Statistical Applications in Genetics and Molecular Biology, 2007, 6 (1).
Freund, Yoav and Robert E. Schapire, "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting," in European Conference on Computational Learning Theory, Springer, 1995.
Friedman, Jerome H., "Greedy Function Approximation: A Gradient Boosting Machine," Annals of Statistics, 2001.
Friedman, Jerome, Trevor Hastie, and Robert Tibshirani, "Additive Logistic Regression: A Statistical View of Boosting," The Annals of Statistics, 2000, 28 (2).
Garmon, Christopher, "The Accuracy of Hospital Merger Screening Methods," The RAND Journal of Economics, 2017, 48 (4).

Gilchrist, Duncan Sheppard and Emily Glassberg Sands, "Something to Talk About: Social Spillovers in Movie Consumption," Journal of Political Economy, forthcoming.
Goel, Sharad, Justin M. Rao, and Ravi Shroff, "Personalized Risk Assessments in the Criminal Justice System," American Economic Review, May 2016, 106 (5).
Goel, Sharad, Justin M. Rao, and Ravi Shroff, "Precinct or Prejudice? Understanding Racial Disparities in New York City's Stop-and-Frisk Policy," Annals of Applied Statistics, 2016, 10 (1).
Gowrisankaran, Gautam, Aviv Nevo, and Robert Town, "Mergers when Prices are Negotiated: Evidence from the Hospital Industry," American Economic Review, 2015, 105 (1).
Hastie, Trevor, Robert Tibshirani, Jerome Friedman, and James Franklin, "The Elements of Statistical Learning: Data Mining, Inference and Prediction," The Mathematical Intelligencer, 2005, 27 (2).
Ho, Katherine, "The Welfare Effects of Restricted Hospital Choice in the US Medical Care Market," Journal of Applied Econometrics, 2006, 21 (7).
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani, An Introduction to Statistical Learning, Vol. 112, Springer, 2013.
Kalouptsidi, Myrto, "Time to Build and Fluctuations in Bulk Shipping," The American Economic Review, 2014, 104 (2).
Kleinberg, Jon, Jens Ludwig, Sendhil Mullainathan, and Ziad Obermeyer, "Prediction Policy Problems," The American Economic Review, 2015, 105 (5).
Lancaster, Kelvin J., "A New Approach to Consumer Theory," The Journal of Political Economy, 1966.
May, Sean M., "How Well Does Willingness-to-Pay Predict the Price Effects of Hospital Mergers?," mimeo, 2013.
McFadden, Daniel, Antti Talvitie, Stephen Cosslett, Ibrahim Hasan, Michael Johnson, Fred Reid, and Kenneth Train, Demand Model Estimation and Validation, Vol. 5, Institute of Transportation Studies, 1977.
Raval, Devesh and Ted Rosenbaum, "Why is Distance Important for Hospital Choice? Separating Home Bias from Transport Costs," mimeo, 2018.
Raval, Devesh and Ted Rosenbaum, "Why Do Previous Choices Matter for Hospital Demand? Decomposing Switching Costs from Unobserved Preferences," Review of Economics and Statistics, forthcoming.
Raval, Devesh, Ted Rosenbaum, and Nathan E. Wilson, "Using Disaster Induced Hospital Closures to Evaluate Models of Hospital Choice," mimeo, 2017.
Raval, Devesh, Ted Rosenbaum, and Steven A. Tenn, "A Semiparametric Discrete Choice Model: An Application to Hospital Mergers," Economic Inquiry, 2017, 55.
Reid, Stephen and Rob Tibshirani, "Regularization Paths for Conditional Logistic Regression: The clogitL1 Package," Journal of Statistical Software, 2014, 58 (12).

Rose, Sherri, "Machine Learning for Risk Adjustment," Health Services Research, 2016, 51 (6).
Tibshirani, Robert, "Regression Shrinkage and Selection via the Lasso," Journal of the Royal Statistical Society, Series B (Methodological), 1996.
Timmermann, Allan, "Forecast Combinations," Handbook of Economic Forecasting, 2006, 1.
Varian, Hal R., "Big Data: New Tricks for Econometrics," The Journal of Economic Perspectives, 2014, 28 (2).

A Disaster Timelines

In this section, we give brief narrative descriptions of the destruction in the areas surrounding the destroyed hospitals.

A.1 St. John's (Northridge Earthquake)

On January 17th, 1994, an earthquake rated 6.7 on the Richter scale hit the Los Angeles metropolitan area, 32 km northwest of Los Angeles. This earthquake killed 61 people, injured 9,000, and seriously damaged 30,000 homes. According to the USGS, the neighborhoods worst affected by the earthquake were the San Fernando Valley, Northridge, and Sherman Oaks, while the neighborhoods of Fillmore, Glendale, Santa Clarita, Santa Monica, Simi Valley, and western and central Los Angeles also suffered significant damage. Over 1,600 housing units in Santa Monica alone were damaged, with a total cost of $70 million.

The earthquake damaged a number of the area's major highways; in our service area, the most important was the I-10 (Santa Monica Freeway), which passed through Santa Monica. It reopened on April 11, 1994. By the same time, many of those with damaged houses had found new housing.

Santa Monica Hospital, located close to St. John's, remained open but at a reduced capacity of 178 beds, compared to 298 beds before the disaster. In July 1995, Santa Monica Hospital merged with UCLA Medical Center. St. John's hospital reopened for inpatient services on October 3, 1994, although with only about half of its employees and inpatient beds and without its North Wing (which was razed).

A.2 Sumter (Americus Tornado)

On March 1, 2007, a tornado went through the center of the town of Americus, GA, damaging 993 houses and 217 businesses. The tornado also completely destroyed Sumter Regional Hospital. An inspection of the damage map in the text and GIS maps of destroyed structures suggests that the damage was relatively localized: the northwest part of the city was not damaged, and very few people in the service area outside of the town of Americus were affected. Despite the tornado, employment remained roughly constant in the Americus Micropolitan Statistical Area, at 15,628 in February 2007, before the disaster, and 15,551 in February 2008, one year later.

While Sumter Regional slowly re-introduced some services, such as urgent care, it did not reopen for inpatient admissions until April 1, 2008, in a temporary facility with 76 beds and 71,000 square feet of space. Sumter Regional subsequently merged with Phoebe Putney Hospital in October 2008, with the full merger completed on July 1. In December 2011, a new facility was built with 76 beds and 183,000 square feet of space.

A.3 NYU, Bellevue, and Coney Island (Superstorm Sandy)

Superstorm Sandy hit the New York metropolitan area on October 28th-29th, 2012. The storm caused severe localized damage and flooding, shut down the New York City Subway system, and caused many people in the area to lose electrical power. By November 5th, normal service had been restored on the subways (with minor exceptions). Major bridges reopened on October 30th, and NYC schools reopened on November 5th. Power was restored to 70 percent of New Yorkers by November 5th, and to all New Yorkers by November 15th.

FEMA damage inspection data reveal that most of the damage from Sandy occurred in areas adjacent to water. Manhattan was relatively unaffected, with even areas next to the water suffering little damage. In the Coney Island area, the island tip suffered more damage, but even there, most block groups suffered less than 50 percent damage. Areas on the Long Island Sound farther east of Coney Island, such as Long Beach, were much more affected.

NYU Langone Medical Center suffered about $1 billion in damage due to Sandy, with its main generators flooded. While some outpatient services reopened in early November, it only partially reopened inpatient services on December 27, 2012, including some surgical services and medical and surgical intensive care. The maternity unit and pediatrics reopened on January 14th, 2013. While NYU Langone opened an urgent care center on January 17, 2013, a true emergency room did not open until April 24, 2014, more than a year later.

Bellevue Hospital Center reopened limited outpatient services on November 19th, 2012. However, Bellevue did not fully reopen inpatient services until February 7th, 2013. Coney Island Hospital opened an urgent care center by December 3, 2012, but patients were not admitted for inpatient care.

It had reopened ambulance service and most of its inpatient beds by February 20th, 2013, although at that time trauma care and labor and delivery remained closed. The labor and delivery unit did not reopen until June 13th, 2013.

A.4 Moore (Moore Tornado)

A tornado went through the Oklahoma City suburb of Moore on May 20, 2013. The tornado destroyed two schools and more than 1,000 buildings (damaging more than 1,200 more) in the area of Moore, and killed 24 people. Interstate 35 was closed for a few hours due to the storm. Maps of the tornado's path demonstrate that while some areas were severely damaged, nearby areas were relatively unaffected. Emergency services, but not inpatient admissions, temporarily reopened at Moore Medical Center on December 2, 2013. Groundbreaking for a new hospital took place on May 20, 2014, with a tentative opening in the fall.

B Dataset Construction

For each dataset, we drop newborns, transfers, and court-ordered admissions. Newborns do not decide which hospital to be born in (admissions of their mothers, who do, are included in the dataset); similarly, government officials or physicians, and not patients, may decide the hospital for court-ordered admissions and transfers. We drop diseases of the eye, psychological diseases, and rehabilitation based on Major Diagnostic Category (MDC) codes, as patients with these diseases may have other options for treatment beyond general hospitals. We also drop patients whose MDC code is uncategorized (0), neo-natal patients above age one, and patients who are missing gender or the indicator for whether the admission is for a medical Diagnosis Related Group (DRG). Finally, we remove patients not going to General Acute Care hospitals.

For each disaster, we estimate models on the pre-period prior to the disaster and then validate them on the period after the disaster. We omit the month of the disaster from either period, excluding anyone either admitted or discharged in the disaster month. The length of the pre-period and post-period in general depends upon the length of the discharge data that we have available.
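As a concrete illustration of these sample restrictions, the following is a minimal pandas sketch of our own; all column names are hypothetical, and the set of excluded MDC codes is a placeholder, since the text gives only code 0 (uncategorized) explicitly.

```python
import pandas as pd

# Placeholder set of excluded MDC codes: the text names eye, psychological,
# and rehabilitation diagnoses plus uncategorized (0); only 0 is given
# explicitly, so the remaining codes must be filled in.
EXCLUDED_MDC = {0}

def restrict_sample(df: pd.DataFrame) -> pd.DataFrame:
    # Sample restrictions from Appendix B; all column names are hypothetical.
    keep = (
        ~df["newborn"]
        & ~df["transfer"]
        & ~df["court_ordered"]
        & ~df["mdc"].isin(EXCLUDED_MDC)
        & ~(df["neonatal"] & (df["age"] > 1))   # neo-natal patients above age one
        & df["gender"].notna()                  # drop missing gender
        & df["medical_drg"].notna()             # drop missing medical-DRG indicator
        & (df["hospital_type"] == "General Acute Care")
    )
    return df[keep]
```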

Table B-1 contains the disaster date and the pre-period and post-period for each disaster, where months are defined by time of admission. NYU hospital began limited inpatient service on December 27, 2012; unfortunately, we only have the month, and not the date, of admission, and so cannot remove all patients admitted after December 27th. We therefore drop the 65 patients admitted to NYU in December; this patient population is very small compared to the size and typical capacity of NYU. For California, we exclude all patients going to Kaiser hospitals, as Kaiser is a vertically integrated insurer: almost all patients with Kaiser insurance go to Kaiser hospitals, and very few patients without Kaiser insurance go to Kaiser hospitals. This is in line with the literature examining hospital choice in California, including Capps et al. (2003). We also exclude February through April 1994, as the I-10 Santa Monica Freeway that goes through Santa Monica only reopened in April.

Table B-1: Pre and Post Periods for Disasters

Hospital    Closure Date   Pre-Period      Post-Period      Partial Reopen   Full Reopen
St. Johns   1/17/94        1/92 to 1/94    5/94 to 9/94     10/3/94          10/3/94
Sumter      3/1/07         1/06 to 2/07    4/07 to 3/08     4/1/08           4/1/08
NYU         10/29/12       1/12 to 9/12    11/12 to 12/12   12/27/12         4/24/14
Bellevue    10/31/12       1/12 to 9/12    11/12 to 12/12   2/7/13           2/7/13
Coney       10/29/12       1/12 to 9/12    11/12 to 12/12   2/20/13          6/11/13
Moore       5/20/13        1/12 to 4/13    6/13 to 12/13    NA               NA

C Econometric Model Performance

In this section, we examine the performance of several parametric logit models used in the literature, and show that Inter and Semipar perform better than all of these models. Below, we discuss a flexible parameterization of δ_ih, given in equation (5), that encompasses the different models used in the literature:

\delta_{ih} = \sum_{m} \sum_{k} \beta_{mk} x_{hm} z_{ik} + \sum_{k} \gamma_{k} z_{ik} \alpha_{h} + f(d_{ih}; x_{hm}, z_{ik}) + \alpha_{h}, \qquad (5)

where i indexes patients, h indexes hospitals, k indexes patient characteristics, and m indexes hospital characteristics. Here, z_ik are patient characteristics (e.g., age, condition, location), x_hm are hospital characteristics (e.g., academic medical center, trauma center status), α_h are hospital indicators (alternative specific constants, in the language of McFadden et al. (1977)), and d_ih is the travel time between the patient's zip code and the hospital. The function f(d_ih; x_hm, z_ik) represents distance interacted with hospital and patient characteristics.39
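All of these logit models are estimated by maximum likelihood. As a minimal sketch of our own, not the authors' code, and assuming the standard logit form P_ih = exp(δ_ih) / Σ_h' exp(δ_ih'), the log-likelihood for a matrix of utilities can be computed as:

```python
import numpy as np

def log_likelihood(delta, chosen):
    # delta:  (n_patients, n_hospitals) representative utilities, e.g.
    #         built from the terms of equation (5).
    # chosen: (n_patients,) column indices of the chosen hospitals.
    delta = delta - delta.max(axis=1, keepdims=True)  # numerical stability
    log_probs = delta - np.log(np.exp(delta).sum(axis=1, keepdims=True))
    # Sum the log probability of each patient's observed choice.
    return log_probs[np.arange(len(chosen)), chosen].sum()
```

Maximum likelihood estimation then searches over the β, γ, and α parameters entering δ_ih to maximize this objective.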

Thus, the first term represents interactions between patient characteristics and hospital characteristics, the second term interactions between patient characteristics and hospital indicators, the third term some function of distance interacted with patient and/or hospital characteristics, and the fourth term alternative specific constants.

Our first and simplest model (Indic) uses just the set of alternative specific constants α_h to model δ. In other words, this specification assumes there is no patient-level heterogeneity (i.e., only the α_h are non-zero in equation (5)), so all patients within the relevant area have, on average, the same preferences for each hospital. In this framework, patient choices can be modeled as being proportional to observed aggregate market shares, and so the model can be estimated with aggregate data (we sketch this share-based prediction after the footnotes below).

The most important differentiator among traditional econometric models that allow for patient-level heterogeneity is whether or not they assume that patients' choices can be modeled exclusively in characteristics space (Lancaster, 1966; Aguirregabiria, 2011). We consider two models (CDS and Char) that model δ using just a set of interaction terms between patient attributes (age, sex, income, condition, diagnosis) and hospital characteristics (for-profit status, teaching hospital, nursing intensity, presence of a delivery room, etc.), together with several interactions between patient characteristics and travel time;40 the γ_k and α_h are set to zero. CDS is based on Capps et al. (2003), while Char is based on Garmon (2017). Although these models lack controls for unobserved hospital heterogeneity, they have the virtue of straightforwardly allowing the econometrician to predict demand for hospitals that do not yet exist.

A contrasting set of specifications relaxes the strong assumption of no unobserved hospital characteristics by including alternative specific constants (α_h) and functions thereof (z_ik α_h). These models differ in the richness of the included variables other than the hospital indicators. The first such model (Time) just includes hospital indicators, travel time, and travel time squared; May (2013) claims that this parsimonious specification performs just as well as more complicated models. The second model (Ho) is based on Ho (2006), and includes α_h and interactions between hospital characteristics and patient characteristics (so the β_mk are non-zero). However, it excludes interactions with hospital indicators (so the γ_k are always zero). The third model (GNT) is based on Gowrisankaran et al. (2015), and includes a large set of interactions between travel time and patient characteristics, but only a small number of interactions of hospital indicators and hospital characteristics with patient characteristics.

39 For travel time, we use ArcGIS to calculate the travel time (including traffic) between the centroid of the patient's zip code of residence and each hospital's address.

40 We obtain these hospital characteristics from the annual hospital characteristics data provided by the American Hospital Association (AHA) Guide and the Medicare Cost Reports; they include such details as for-profit status, whether or not a hospital is an academic medical center or a children's hospital, the number of beds, the ratio of nurses to beds, the presence of different hospital services such as an MRI or cardiac ICU, and the number of residents per bed.
For a few hospitals in California, New York, and Oklahoma, the AHA and Medicare Cost Reports only contain data on the total hospital system rather than on individual hospitals.
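As referenced above, here is a minimal sketch, ours rather than the authors', of the share-based prediction implicit in Indic: every patient in the relevant area is assigned the aggregate pre-period market shares as predicted choice probabilities. The data are hypothetical.

```python
import pandas as pd

# Hypothetical pre-period admissions: the chosen hospital for each patient.
pre = pd.Series(["A", "A", "B", "C", "A", "B"], name="hospital")

# With only alternative specific constants, the predicted choice
# probabilities equal the aggregate market shares, identical for every
# patient in the relevant area.
shares = pre.value_counts(normalize=True)
print(shares)

# The modal prediction used for percent correct is the largest-share hospital.
print(shares.idxmax())
```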

Finally, we consider a fourth model (Inter) that includes interactions of hospital indicators with acuity, major diagnostic category, and travel time, as well as interactions between patient characteristics and travel time. Since it includes hospital indicators, it excludes interactions between hospital characteristics and patient characteristics (which would be absorbed). As is standard, we estimate all of these models using maximum likelihood.

Figure 18 shows the average percent of observations with correct predictions for the econometric models discussed above, as well as Semipar. In these graphs, this percent is computed for each experiment independently and then averaged across all experiments; we equally weight experiments, not patients. Figure 18 indicates that the two best-performing models are the grouping model Semipar and the flexible econometric model Inter. Semipar correctly predicts 41% of the time on average, while Inter correctly predicts 39% of the time on average. By comparison, Indic correctly predicts 28% of the time on average, while Time correctly predicts 36% of the time. Ho and GNT perform slightly better than Time but not as well as Inter. This suggests that performance can be greatly improved by including drive time, and that additional controls have modest but consistent effects. Interestingly, the inclusion of hospital indicators is not critical for individual-level predictive accuracy: the two models that omit hospital indicators (CDS and Char) have roughly similar predictive accuracy to Time, Ho, and GNT.

In Figure 19, we show these results broken down on an experiment-by-experiment basis. While the share accurately predicted varies across experiments, the general narrative about the models' relative performance remains. Semipar and Inter typically perform the best (or near the best), while Indic and the characteristics-based models generally do significantly worse.41

41 The one exception where Indic performs relatively well is Coney.

Below, we display these results using RMSE as an evaluation criterion in Figure 20 and Figure 21. Semipar and Inter continue to perform better than the other econometric models for all of the experiments using RMSE as a metric.
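To pin down the weighting convention behind the averages in Figures 18 through 21, here is a minimal sketch of our own (data and column names hypothetical): percent correct is computed within each experiment, and the experiment-level values are then averaged with equal weight, so experiments, not patients, count equally.

```python
import pandas as pd

# Hypothetical patient-level results: experiment label and whether each
# model's prediction was correct for that admission.
results = pd.DataFrame({
    "experiment":      ["Sumter"] * 3 + ["Coney"] * 2,
    "correct_semipar": [1, 0, 1, 1, 0],
    "correct_inter":   [1, 0, 0, 1, 0],
})

# Percent correct within each experiment, then an unweighted average across
# experiments, so each experiment counts equally regardless of its size.
per_experiment = results.groupby("experiment").mean(numeric_only=True)
print(100 * per_experiment.mean())
```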

Figure 18: Percent Correct Averaged over all Experiments

Figure 19: Percent Correct Shown by Experiment

Figure 20: RMSE Averaged over all Experiments

Figure 21: RMSE Shown by Experiment


More information

Understanding Uncertainty in School League Tables*

Understanding Uncertainty in School League Tables* FISCAL STUDIES, vol. 32, no. 2, pp. 207 224 (2011) 0143-5671 Understanding Uncertainty in School League Tables* GEORGE LECKIE and HARVEY GOLDSTEIN Centre for Multilevel Modelling, University of Bristol

More information

Nature Neuroscience: doi: /nn Supplementary Figure 1. Behavioral training.

Nature Neuroscience: doi: /nn Supplementary Figure 1. Behavioral training. Supplementary Figure 1 Behavioral training. a, Mazes used for behavioral training. Asterisks indicate reward location. Only some example mazes are shown (for example, right choice and not left choice maze

More information

A NOVEL VARIABLE SELECTION METHOD BASED ON FREQUENT PATTERN TREE FOR REAL-TIME TRAFFIC ACCIDENT RISK PREDICTION

A NOVEL VARIABLE SELECTION METHOD BASED ON FREQUENT PATTERN TREE FOR REAL-TIME TRAFFIC ACCIDENT RISK PREDICTION OPT-i An International Conference on Engineering and Applied Sciences Optimization M. Papadrakakis, M.G. Karlaftis, N.D. Lagaros (eds.) Kos Island, Greece, 4-6 June 2014 A NOVEL VARIABLE SELECTION METHOD

More information

INDUCTIVE LEARNING OF TREE-BASED REGRESSION MODELS. Luís Fernando Raínho Alves Torgo

INDUCTIVE LEARNING OF TREE-BASED REGRESSION MODELS. Luís Fernando Raínho Alves Torgo Luís Fernando Raínho Alves Torgo INDUCTIVE LEARNING OF TREE-BASED REGRESSION MODELS Tese submetida para obtenção do grau de Doutor em Ciência de Computadores Departamento de Ciência de Computadores Faculdade

More information

An Experimental Investigation of Self-Serving Biases in an Auditing Trust Game: The Effect of Group Affiliation: Discussion

An Experimental Investigation of Self-Serving Biases in an Auditing Trust Game: The Effect of Group Affiliation: Discussion 1 An Experimental Investigation of Self-Serving Biases in an Auditing Trust Game: The Effect of Group Affiliation: Discussion Shyam Sunder, Yale School of Management P rofessor King has written an interesting

More information

Section 6: Analysing Relationships Between Variables

Section 6: Analysing Relationships Between Variables 6. 1 Analysing Relationships Between Variables Section 6: Analysing Relationships Between Variables Choosing a Technique The Crosstabs Procedure The Chi Square Test The Means Procedure The Correlations

More information

the body and the front interior of a sedan. It features a projected LCD instrument cluster controlled

the body and the front interior of a sedan. It features a projected LCD instrument cluster controlled Supplementary Material Driving Simulator and Environment Specifications The driving simulator was manufactured by DriveSafety and included the front three quarters of the body and the front interior of

More information

The Impact of a Mouthwash Program on the Risk of Nosocomial Pneumonia at the ICU

The Impact of a Mouthwash Program on the Risk of Nosocomial Pneumonia at the ICU The Impact of a Mouthwash Program on the Risk of Nosocomial Pneumonia at the ICU Marcela Granados, Fernando Rosso, Álvaro J. Riascos, Natalia Serna June 5, 2017 Abstract During June 2006 a high complexity

More information