%CEM: A SAS MACRO TO PERFORM COARSENED EXACT MATCHING


STEFANO VERZILLO, PAOLO BERTA, MATTEO BOSSI
Working Paper n DECEMBER 2015
DIPARTIMENTO DI ECONOMIA, MANAGEMENT E METODI QUANTITATIVI
Via Conservatorio Milano tel (21522) - fax (21505) E Mail: dipeco@unimi.it

S. Verzillo (Univ. of Milan), P. Berta (CRISP-Univ. of Milan Bicocca), M. Bossi (DLG)
September 16, 2013

%CEM is a SAS macro that allows researchers to perform the recently introduced Coarsened Exact Matching (CEM) technique. CEM is a non-parametric matching method that avoids the confounding influence of pre-treatment control variables, directly improving causal inference in quasi-experimental studies. The CEM authors originally provided software solutions only for R, Stata and SPSS to perform their matching algorithm. The %CEM macro complements these alternatives by introducing a completely automated Coarsened Exact Matching macro for SAS users. Both the matching strategy, including some standard coarsening options, and the associated L1 multivariate imbalance measure are provided. An empirical application estimating the causal effect of regional health systems on in-hospital mortality, using multiple artificial datasets drawn from a large administrative database, completes the paper.

Keywords: Coarsened Exact Matching, Causal inference, SAS, SAS/IML

1 Introduction

An important branch of the literature on causal inference consists of observational studies, which are often the only available alternative when randomization is lacking, especially in non-U.S. countries. The key goal of these studies is to measure the effect of a binary treatment (T) on an outcome of interest by comparing two subgroups of individuals: treated and controls. The main difficulty, unfortunately, lies in the absence of random assignment of units to the treatment and control states. As a result, the treatment and control groups may differ substantially in the multidimensional distribution of their observable covariates [9]. To avoid the confounding influence of pre-treatment control variables on the estimated causal effect, different econometric techniques have been introduced in the literature.
Methods available to control for this bias may be parametric or non-parametric, and essentially consist of propensity score-based techniques, matching algorithms and model stratification [1]. Propensity score matching [8] is the most commonly used parametric method [6]. It consists of estimating, for all the selected individuals, the conditional probability of assignment to the treatment status given their observed covariates. The estimated propensity score can then be used in model estimation to accommodate general heterogeneity in different ways: as a regression covariate, as a matching parameter or as a stratification rule. However, finding a matching solution with the propensity score does not usually guarantee good balance on all of the selected covariates. In fact, improving balance on most of them can leave the remaining ones unbalanced, often introducing more bias than was present in the initial distribution. In addition, propensity score matching has the drawback of violating the congruence principle, which requires congruence between the metrics of the data space and of the analysis space (here the metrics of the two spaces differ by definition). It is well known that parametric methods force the covariates of the input data from their original multidimensional space into a new space, usually defined by the univariate propensity score. Mielke and Berry [7] show that violating the congruence principle produces less robust inferences.

Matching, by contrast, is a non-parametric method that can be highly effective in removing imbalance in observed covariates between treatment and control groups. With exactly balanced data there is no need to control for the observables (X), so researchers can estimate causal effects through a simple difference in means between the selected groups of individuals. An additional problem is that most existing matching methods guarantee a priori the sample size of the two subgroups but only occasionally reduce the imbalance between treated and control units (and hence only occasionally reduce the bias of the estimated effect).
To avoid these substantial problems, Iacus, King and Porro [5] introduced a new class of matching algorithms: Monotonic Imbalance Bounding (MIB) methods. Within this class, a specific method that appears really helpful in empirical applications is Coarsened Exact Matching (CEM). The CEM authors originally provided software solutions only for standard packages such as R, Stata and SPSS. Our %CEM SAS macro complements these alternatives by introducing a completely automated Coarsened Exact Matching macro for SAS users. The paper is organized as follows: Section 2 is a short review of the CEM algorithm; Section 3 introduces the %CEM macro's list of parameters; Section 4 reports an empirical application, with some results obtained by testing the macro's speed with different variable-binning options and an increasing number of records. Conclusions complete the paper.

2 Coarsened Exact Matching

CEM belongs to the set of matching approaches based on stratification (also known in the literature as sub-classification). In comparison with other methods, CEM firstly meets the principle of not reducing the original data space, since it operates in the multidimensional variable space itself. A second innovation of CEM, as a member of the MIB class of methods, is that it does not fix the number of matched observations ex ante; instead, the number of matched units results from the coarsening parameters set on the observables. CEM is thus expressly designed to overcome the problem of increasing imbalance on some variables while improving it on others, which is a serious problem with parametric methods such as the propensity score. With CEM the researcher chooses the maximal level of allowed imbalance ex ante, and CEM then produces a matched sample of a priori unknown size.

In addition, CEM has interesting computational advantages, because the algorithm represents all the observable information in a single text string associated with each observation. As a result, Coarsened Exact Matching has the same complexity as a simple frequency tabulation. On the other hand, the most serious drawback of CEM is that setting too fine a level of coarsening means discarding many units. Choosing the level of coarsening appropriately is therefore the crucial point when running CEM. If the bins are too wide, important information, potentially useful for better matching results, may be lost. Conversely, the finer the coarsening, the larger the number of discarded observations, and a solution may be unavailable or less efficient. The main results of CEM are thus threefold: less covariate imbalance, less model dependence and less resulting statistical bias. The CEM authors documented in their original paper how, in many empirical applications, CEM eliminates much of the heterogeneity when producing causal estimates.
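The string-signature idea described above can be sketched in a few lines. This is an illustrative Python sketch, not the %CEM macro's SAS code: the function names, the covariate names (`age`, `sex`) and the cutpoints are all hypothetical. Each unit's coarsened covariates are joined into one text key, a frequency tabulation counts treated and controls per key, and only strata containing both are kept.

```python
from collections import Counter

def coarsen(value, cutpoints):
    """Return the bin index of a numeric value given ascending cutpoints."""
    return sum(value > c for c in cutpoints)

def signature(unit, covariates, cutpoints_by_var):
    """Join the coarsened covariates of one unit into a single text key."""
    parts = []
    for name in covariates:
        if name in cutpoints_by_var:   # continuous: replace the value by its bin index
            parts.append(str(coarsen(unit[name], cutpoints_by_var[name])))
        else:                          # categorical: keep the category as it is
            parts.append(str(unit[name]))
    return "|".join(parts)

def matched_units(units, covariates, cutpoints_by_var, treat="treat"):
    """Keep units whose stratum holds at least one treated and one control."""
    keys = [signature(u, covariates, cutpoints_by_var) for u in units]
    counts = Counter((k, u[treat]) for k, u in zip(keys, units))
    return [u for k, u in zip(keys, units)
            if counts[(k, 1)] > 0 and counts[(k, 0)] > 0]

# Hypothetical toy data; age is coarsened into <=40, 41-60 and >60.
units = [
    {"id": 1, "treat": 1, "age": 34, "sex": "M"},
    {"id": 2, "treat": 0, "age": 38, "sex": "M"},
    {"id": 3, "treat": 1, "age": 71, "sex": "F"},
    {"id": 4, "treat": 0, "age": 52, "sex": "F"},
    {"id": 5, "treat": 0, "age": 29, "sex": "M"},
]
kept = matched_units(units, ["age", "sex"], {"age": [40, 60]})
print([u["id"] for u in kept])  # prints [1, 2, 5]
```

With the finer hypothetical cutpoints `[30, 35, 40, 60]` the same call returns no units at all, which is the trade-off discussed above: the finer the coarsening, the more observations are discarded.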
Given that coarsening choices have to be made on the basis of the researcher's substantive knowledge, and in order to support SAS users in this task, we have automated a series of standard coarsening options that choose, case by case, the bin widths for continuous variables. These standard alternatives are automatically produced by the %CEM macro. The CEM procedure is structured as follows:

1. it coarsens each of the observed variables (X) according to the researcher's choices (differently for categorical and continuous variables);
2. it applies a matching algorithm (1:1 or 1:n) to the strata identified by the values of the coarsened variables;
3. strata that do not contain at least one treated and one control unit are discarded, while strata with at least one treated and one control unit are retained;
4. CEM weights are computed for each stratum (s) as follows:

$$ w_i = \frac{m_C}{m_T} \cdot \frac{m_T^s}{m_C^s} $$

where $m_T^s$ and $m_C^s$ are the frequencies of treated and control individuals in stratum s, and $m_T$ and $m_C$ are the frequencies of matched treated and control individuals overall. Weights of zero are given to unmatched units;
5. finally, an imbalance measure called L1 is computed.

The L1 measure was introduced by the CEM authors [3] to measure the distance between the multivariate histograms (H) of treated and controls, producing a measure of global balance that can be compared between the original and the matched populations. The L1 balance measure is computed within the %CEM macro by invoking an ad hoc macro called %L1. An alternative multidimensional balance measure called GI has recently been introduced in the literature [2], together with its %GI SAS macro code. A SAS user could easily combine this macro with %CEM by substituting the %GI macro for our %L1 macro in the original %CEM code (only adapting it to the names of our macro parameters). The original L1 is defined as follows:

$$ L_1(H) = \frac{1}{2} \sum_{\ell_1 \cdots \ell_k \in H(X)} \left| f_{\ell_1 \cdots \ell_k} - g_{\ell_1 \cdots \ell_k} \right| \tag{1} $$

where $f_{\ell_1 \cdots \ell_k}$ and $g_{\ell_1 \cdots \ell_k}$ are the relative frequencies of treated and controls belonging to the cell with coordinates $\ell_1 \cdots \ell_k$ in the multivariate cross-tabulation (H). L1 has an easy interpretation: conditional on the coarsening level, if the empirical distributions of treated and controls are completely separated then L1 = 1, while if they perfectly overlap then L1 = 0. In general, L1 takes values in [0, 1]. For example, L1 = 0.81 means that 19% of the two multidimensional histograms overlap. A good matching performance is reached if the L1 of the matched population is less than or equal to the L1 of the original population. To optimize the computation of the absolute differences between treated and control relative frequencies over the full matrix (H), we adopt a reduced approach using the SAS/IML language. The differences are computed separately for each sub-matrix (both off and on the principal diagonal) of the complete original matrix (H), and the additive property of L1 then guarantees the same result.
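Steps 4 and 5 can be made concrete with a small numerical sketch. The Python code below is only an illustration under stated assumptions, not the SAS/IML code of the %CEM or %L1 macros: the function names are hypothetical, strata are represented by arbitrary labels, and matched treated units are given weight 1, which is the standard CEM convention rather than something stated in the text above.

```python
from collections import Counter

def cem_weights(strata, treat):
    """CEM weights: 1 for matched treated, (m_C/m_T)*(m_T^s/m_C^s) for matched
    controls, 0 for units in strata lacking either treated or controls."""
    n_t = Counter(s for s, t in zip(strata, treat) if t == 1)
    n_c = Counter(s for s, t in zip(strata, treat) if t == 0)
    matched = {s for s in n_t if n_c[s] > 0}
    m_t = sum(n_t[s] for s in matched)   # matched treated overall
    m_c = sum(n_c[s] for s in matched)   # matched controls overall
    weights = []
    for s, t in zip(strata, treat):
        if s not in matched:
            weights.append(0.0)
        elif t == 1:
            weights.append(1.0)          # assumption: standard CEM convention
        else:
            weights.append((m_c / m_t) * (n_t[s] / n_c[s]))
    return weights

def l1_imbalance(treated_cells, control_cells):
    """Half the sum of absolute differences between the relative cell
    frequencies of treated and controls, as in equation (1)."""
    if not treated_cells or not control_cells:
        raise ValueError("both groups must be non-empty")
    f = Counter(treated_cells)
    g = Counter(control_cells)
    cells = set(f) | set(g)
    return 0.5 * sum(abs(f[c] / len(treated_cells) - g[c] / len(control_cells))
                     for c in cells)

# Hypothetical strata labels: A has 1 treated and 2 controls, B has 1 and 1,
# C has a control only and is therefore unmatched.
strata = ["A", "A", "A", "B", "B", "C"]
treat = [1, 0, 0, 1, 0, 0]
print(cem_weights(strata, treat))   # prints [1.0, 0.75, 0.75, 1.0, 1.5, 0.0]
print(l1_imbalance(["A", "A", "B", "B"], ["A", "B", "B", "B"]))  # prints 0.25
```

In the weights example the control weights sum to 3, the number of matched controls, so the weighted control group keeps its size while mirroring the distribution of the treated across strata.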
Then, to compare different predefined coarsening levels simultaneously, %CEM produces a standard proc gplot of their L1 measures, allowing researchers to choose the most efficient binning solution for their purposes.

3 List of Macro Parameters

Based on recent contributions to the matching literature [3], we wrote a SAS (SAS Institute Inc.) macro program that performs Coarsened Exact Matching and evaluates the global imbalance of the pre- and post-matching populations. The complete list of %CEM parameters is the following:

%CEM (lib=, data=, id=, treat=, del_mis=, match_type=, coddataset=)

where:

* lib: name of the directory containing the original dataset;
* data: name of the SAS dataset to be read. It must be organized with one row for each observation to be matched (individuals or firms), K observed continuous or categorical covariates, a treatment indicator variable and the ID primary-key variable;
* id: ID primary-key variable;
* treat: a dummy indicator variable, 1 for treated and 0 for untreated;
* del_mis: option for missing values: 0 to keep them as additional categories, 1 to delete them before matching;
* match_type: 1 (1:1 matching) or N (1:N matching with associated strata weights);
* coddataset: option to assign a code to each of the datasets tested.

The macro computes CEM between treated and controls using both the SAS and SAS/IML languages. To this end, after the type of matching (1:1 or 1:n) and the missing-value option have been defined, it runs the matching algorithm on the provided subjects (data). %CEM automatically creates multiple sets of strata, depending on the level of coarsening, each with equal values of the observable covariates (X). For continuous variables the macro uses by default quintiles, quartiles, percentiles and original values as matching alternatives. For nominal and ordinal variables the macro assumes that the user has already coded the data in the desired number of categories. Categorical variables cannot be coarsened by default without specific choices about how the coarsening should take place. Indeed, coarsening choices are assumptions that rest on substantive and extensive knowledge of both the interpretation and the measurement scale of each variable. Subjects belonging to strata with at least one treated and one control unit are then retained, while the remaining ones are pruned from the sample. Depending on the specified matching option, %CEM then either performs exact matching by randomly selecting the desired number of treated and controls, or includes all of them and calculates CEM weights for each stratum.
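The four default alternatives for a continuous variable can be illustrated as follows. This is a hedged Python sketch rather than the macro's SAS code: `quantile_bins` and `coarsening_alternatives` are hypothetical names, and Python's `statistics.quantiles` is used as a stand-in because the text does not say which quantile definition the macro applies.

```python
import bisect
import statistics

def quantile_bins(values, n):
    """Assign each value to one of n quantile-based bins (0 .. n-1)."""
    cuts = statistics.quantiles(values, n=n)   # returns n - 1 cut points
    return [bisect.bisect_left(cuts, v) for v in values]

def coarsening_alternatives(values):
    """Return the four default codings described for continuous variables."""
    return {
        "quintiles": quantile_bins(values, 5),
        "quartiles": quantile_bins(values, 4),
        "percentiles": quantile_bins(values, 100),
        "original": list(values),
    }

# Hypothetical ages; coarser codings produce fewer distinct levels,
# hence larger strata and fewer discarded units.
ages = [23, 35, 41, 47, 52, 58, 64, 79]
for name, coded in coarsening_alternatives(ages).items():
    print(name, len(set(coded)), "distinct levels")
```

Each coding would then enter the stratum key in place of the raw variable, giving one candidate matching solution per coarsening level.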
Finally, the L1 multidimensional imbalance measure is computed for each of the default coarsening options and compared with the others in a simple graphical representation.

4 Empirical Application

This section details the application of the %CEM macro to match treated and control patients in an artificial regional study focused on the incidence of in-hospital mortality in the Lombardy Region (Italy). The purpose of this study is to assess whether there is a different risk of mortality for citizens resident in Lombardy and citizens resident elsewhere in Italy but discharged from a Lombard hospital.

The Italian National Healthcare System (NHS) provides universal healthcare coverage, but a policy of devolution (2001) has transferred several important administrative and organizational responsibilities from the central government to the 20 regions. Among the 20 regions, Lombardy is one of the most important in socio-demographic and economic terms. It has about 10 million citizens (16% of the Italian population) and ranks among the most competitive areas in Europe. The Lombardy healthcare system comprises approximately 200 hospitals, 2 million discharges annually and 16 billion euros of healthcare expenditure (73% of the total regional budget). A regional reform in 1997 radically transformed the healthcare system into a quasi-market in which citizens can freely choose their provider, regardless of ownership (private for-profit, private not-for-profit, or public). The Italian NHS provides each citizen with free hospitalization in any of the Italian regions. As a result, each year about 150,000 Italian citizens from other regions are admitted to Lombard hospitals.

The empirical application in this work aims to understand whether there is a difference in in-hospital mortality between patients from other regions and Lombard inhabitants. For this reason we apply the %CEM macro to extract two subgroups, one of treated (Lombard) patients and one of control (non-Lombard) patients, selected according to specific characteristics, in order to assess the average difference in their mortality risk.
The database was originally abstracted from the regional administrative healthcare information system, which collects data on the patients admitted to hospitals in the Lombardy region in [...]. In 2011 discharges were around [...], of which 79% were ordinary and 21% were day-hospital or day-surgery. Moreover, hospitalizations of residents outside the Lombardy region account for 10% of all admissions. The hospital discharge data contain basic demographic information (age, gender), information on the hospitalization (length of stay, special-care unit use, transfers within the same hospital or to other facilities, within-hospital mortality, ...) and six diagnosis codes and procedures defined according to the International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM). Only ordinary hospitalizations for patients aged more than 2 years were retained in the sample. The response variable was a binary in-hospital mortality index, indicating whether or not the patient died during the hospitalization. The patient-level variables were chosen as reasonable major determinants of patient mortality in our example.

To apply the %CEM macro we control for the patient's age (AGE, expressed in years), gender (SEX, a dummy variable equal to 1 if the patient is male and 0 otherwise), coexisting conditions expressed by the Elixhauser index (COMORB; Elixhauser, 1998), presence of selected comorbidities at admission such as cardiovascular diseases (CARDIO, expressed as a dummy variable) and cancer (ONCO, expressed as a dummy variable), length of stay (LOS, expressed in days), transit in an Intensive Care Unit (ICU), presence of a principal diagnosis indicating an emergency admission (EMERG, expressed as a dummy variable), Diagnostic Related Group (DRG) and Major Class of DRG (MDC). In addition, in order to test the macro artificially on a substantial number of variables, we introduced the following characteristics, which usually do not affect mortality: financing category (rehabilitation, long-term care, etc.), length of stay before surgery, type of discharge (voluntary, to a different hospital, at home, etc.), ward of discharge, and four different variables indicating a re-admission according to its type and to the number of days between a previous discharge and the following hospitalization. The following code was submitted:

%CEM (lib=health, data=discharges, id=patient_id, treat=treat, del_mis=1, match_type=1, coddataset=1)

To measure the speed of the macro we built 28 datasets with different numbers of records and different numbers of numeric and categorical variables. Table 1 describes the characteristics of these datasets, which range from 100,000 to 1,000,000 records and from 2 to 18 variables. To monitor all the parameters that could influence the speed of the macro, we arbitrarily varied the number of numeric and categorical variables in each dataset. The speed of the %CEM macro was tested on a notebook with the following technical characteristics:

* OS Windows 7 (X64);
* Quad-Core processor Intel(R) Core(TM) i5-2430M 2.40GHz;
* 4.00 GB RAM.

Execution times are calculated as differences in seconds. At each test session the SAS Log, the output and the Work library are cleaned. Table 2 presents the main results. The codes identifying the datasets refer to the codes assigned in Table 1.
As expected, execution time increases with the number of records and variables, ranging from 7 minutes for the dataset with 100,000 records and 2 variables (1 numeric and 1 categorical) to nearly 2 days and 2 hours for the dataset with 1,000,000 records and 18 variables (8 numeric and 10 categorical). The executions of the macro naturally produce different selections of patients depending on how the stratification code is constructed. Each numeric variable is classified according to 4 criteria: its exact value, its percentile, its quartile and its quintile. The categorical variables are combined into a single code, defined as the whole set of the individual categories. Combining these 4 values for each numerical variable with the original values of the categorical variables produces a new stratum code, which is assigned to each patient. At this point %CEM combines records with the same stratum code according to the specified matching rule. In this way the %CEM macro selects either an equal number of records for treated and control patients (1:1 matching) or a different number of treated and control records in each stratum, assigning them a specific CEM weight (1:n matching). The different selection criteria produce more or less balanced subpopulations. At the end of the execution the macro calculates the L1 measure, as discussed above, and plots all the L1 balance values. By analyzing the L1 plot we can choose the best matching scheme, corresponding to the lowest L1 value (an example is given in Figure 1).

Table 1: Speed Test: Dataset's characteristics
DatasetCode   Tot Record   Tot Var   Tot Numerical Var   Tot Class Var
[...]

Table 2: Speed Test: Time of execution
DatasetCode   Time HH:MM:SS   Time seconds
[...]

Figure 1: L1 Plot

Table 3: Speed Test: Dataset's characteristics
              Model1 Model2 Model3 Model4 Model5 Model6 Model7 Model8
Intercept     *** *** *** *** *** *** *** ***
TotRec        *** *** *** *** ***
TotVar        *** ***
TotVarClass   *** *** *
TotVarNum     *** *** *
R-square
*** = p-value ** = p-value * = p-value 0.01

To analyze how the number of records and the number and type of variables contribute to the speed of the %CEM macro, we estimate a model of execution time as a function of the different characteristics of the dataset. Results are presented in Table 3. The analysis shows that the number of numeric variables is the main predictor of the macro's execution time. The number of variables included and the variable type (categorical or numerical) contribute to explaining execution time in all the models, as does the number of records. The models with the best goodness of fit are models 5, 6, 7 and 8, which combine the effect of the number of records with the different counts of variables. We also tested the interaction between the number of records and each of the numbers

of variables, but the estimated parameters were not significant.

To test the validity of the matching we set up an artificial model that estimates the probability of being a Lombard patient versus the probability of being a patient resident in a different Italian region. If the matching removes all the imbalance, we expect to obtain non-significant coefficients for all the covariates included:

$$ \ln\left(\frac{Treat_i}{1 - Treat_i}\right) = \alpha_0 + \sum_{j} \beta_j X_{ji} + \varepsilon_i \tag{2} $$

where:

* i = 1...I patients, j = 1...J covariates at patient level;
* $Treat_i$ is a dummy variable equal to 1 if the patient is resident in Lombardy and 0 otherwise;
* $X_{ji}$ are the J covariates at patient level, corresponding to the variables used to match the patients.

Table 4 confirms the hypothesis tested. Finally, we estimate a standard treatment model to measure the effect of being Lombard on in-hospital mortality in our empirical example: we analyze the difference in in-hospital mortality in the original data and, after applying %CEM, on the matched observations. Results are shown in Tables 5 and 6.

$$ \ln\left(\frac{p_i}{1 - p_i}\right) = \alpha_0 + \beta \, Treat_i + \varepsilon_i \tag{3} $$

where:

* i = 1...I patients;
* $p_i$ is the probability of dying in hospital for the i-th patient;
* $Treat_i$ is a dummy variable equal to 1 if the patient is resident in Lombardy and 0 otherwise.

The model estimated in Table 5 shows that for patients who live in Lombardy the risk of dying in hospital is 3 times higher than for patients from other regions. This could mean that patients from other regions come to Lombardy for specific hospitalizations, mostly related to highly specialized care with a lower risk of death. Patients from other regions with a high risk of dying may also ask to return home, after which we cannot follow them. After applying the %CEM macro we compare patients with the same characteristics, and the model estimated in Table 6 shows no significant difference in risk between patients living in Lombardy and patients living outside Lombardy.
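As a reading aid for the logit model of in-hospital mortality, and assuming that the "3 times higher" figure refers to the odds ratio implied by the coefficient $\beta$ on $Treat_i$ (the estimates themselves are not reproduced here, so this is an interpretation rather than a reported result), the link between the coefficient and the odds ratio can be written as:

$$ \frac{p_i / (1 - p_i) \,\big|_{Treat_i = 1}}{p_i / (1 - p_i) \,\big|_{Treat_i = 0}} = e^{\beta}, \qquad e^{\beta} = 3 \iff \beta = \ln 3 \approx 1.10 $$

Under the same reading, a non-significant $\beta$ after matching corresponds to an odds ratio that cannot be distinguished from 1.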

Table 4: Test for unbalance
Variable   Estimate   StdErr   P Value
Intercept Sex (F vs M) Age Cardio COMORB (1 vs 5+) COMORB (2 vs 5+) COMORB (3 vs 5+) COMORB (4 vs 5+) LOS Mdc Mdc Mdc Mdc NA Rep Rep Rep Rep Clafi1 DD Clafi1 DO Clafi1 ZU Urg TYPE DISCH (1 vs 7) TYPE DISCH (2 vs 7) TYPE DISCH (3 vs 7) TYPE DISCH (4 vs 7) TYPE DISCH (5 vs 7) TYPE DISCH (6 vs 7) Onco ICU Drg Drg Drg Drg Drg Drg Drg LOS Pre Surg Readm S GG Readm MDC S GG Readm MDC S AC Readm MDC GG

Table 5: Analysis of in-hospital mortality before %CEM application
Variable   Estimate   StdErr   P Value
Intercept
Lombardo

Table 6: Analysis of in-hospital mortality after %CEM application
Variable   Estimate   StdErr   P Value
Intercept
Lombardo

Concluding Remarks

Coarsened Exact Matching is a non-parametric matching method introduced to avoid the confounding influence of pre-treatment control variables, directly improving causal inference in quasi-experimental studies. The CEM authors originally provided software solutions only for R, Stata and SPSS. %CEM now allows researchers to perform the Coarsened Exact Matching (CEM) technique in SAS as well, taking advantage of the performance of this software. The %CEM macro thus fills an important gap in the existing software literature, making it possible to run the CEM matching algorithm on a very large number of records with considerable time savings. The macro code is illustrated using an ad hoc example that matches treated and control patients in a regional Italian study of the incidence of in-hospital mortality by individual place of residence. The macro's execution time depends on the number of records (results are provided from 100,000 up to 1 million) and on the number of variables (from 2 to 18, both numerical and categorical) included. The analysis of execution times shows that time depends strongly on both the number of records and the number of numerical variables (goodness of fit around 0.93). Our empirical application shows how, in our example, unbalanced observable characteristics of treated and control patients directly affect the treatment estimate.

References

[1] Cochran, W. G., The effectiveness of adjustment by sub-classification in removing bias in observational studies, Biometrics, vol. 24, (1968);
[2] Camillo, F., D'Attoma, I., %GI SAS Macro: A SAS Macro for Measuring and Testing Global Imbalance of Covariates within Subgroups, Journal of Statistical Software, vol. 51, Code Snippet 1, (2012);
[3] Iacus, S. M., King, G., Porro, G., cem: Software for Coarsened Exact Matching, Journal of Statistical Software, 30(9), 1-27, (2009);
[4] Iacus, S. M., Porro, G., Random Recursive Partitioning: A Matching Method for the Estimation of the Average Treatment Effect, Journal of Applied Econometrics, vol. 24, pp [349], (2009);
[5] Iacus, S. M., King, G., Porro, G., Multivariate matching methods that are Monotonic Imbalance Bounding, Journal of the American Statistical Association, (2011);
[6] Imbens, G., Wooldridge, J. M., Recent developments in the Econometrics of Program Evaluation, IZA Discussion Paper, No. 3640, (2008);
[7] Mielke, P. W., Berry, K. J., Permutation methods: A distance function approach, New York: Springer, (2007);
[8] Rosenbaum, P. R., Rubin, D. B., The central role of the propensity score in observational studies for causal effects, Biometrika, vol. 70, pp. 41-55, (1983);
[9] Rubin, D. B., Practical implications of modes of statistical inference for causal effects and the critical role of the assignment mechanism, Biometrics, vol. 47, (1991).
