Using Classification Tree Analysis to Generate Propensity Score Weights
Ariel Linden, DrPH 1,2, Paul R. Yarnold, PhD 3

1 President, Linden Consulting Group, LLC, Ann Arbor, Michigan, USA
2 Research Scientist, Division of General Medicine, Medical School - University of Michigan, Ann Arbor, Michigan, USA
3 President, Optimal Data Analysis, LLC, Chicago, Illinois, USA

Corresponding Author Information: Ariel Linden, DrPH, Linden Consulting Group, LLC, 1301 North Bay Drive, Ann Arbor, MI, USA. Phone: (971) alinden@lindenconsulting.org

Key Words: propensity score, machine learning, classification tree analysis, causal inference

Running Header: classification tree analysis and propensity scores

Acknowledgement: We wish to thank Julia Adler-Milstein for her review and feedback on the manuscript. Dr. Yarnold's work was partly funded by a grant from the National Cancer Institute (1R01CA A1).

This is the author manuscript accepted for publication and has undergone full peer review but has not been through the copyediting, typesetting, pagination and proofreading process, which may lead to differences between this version and the Version of Record. Please cite this article as doi: /jep.12744
ABSTRACT

Rationale, aims and objectives: Widely used in the evaluation of non-randomized interventions, propensity scores estimate the probability that an individual will be assigned to the treatment group given the observed characteristics. Machine learning algorithms have been proposed as an alternative to conventional logistic regression-based modelling of propensity scores in order to avoid many limitations of linear methods. In this paper we introduce the use of classification tree analysis (CTA) to generate propensity scores. CTA is a decision-tree-like classification model that provides accurate, parsimonious decision rules that are easy to visually display and interpret, reports P values derived via permutation tests performed at each node, and evaluates potential model cross-generalizability.

Method: Using empirical data, we identify all statistically valid CTA propensity score models and then use them to compute strata-specific, observation-level propensity score weights that are subsequently applied in outcomes analyses. We compare findings obtained using this framework to the conventional method (logistic regression) and a popular alternative machine learning approach (boosted regression) by evaluating covariate balance using standardized differences, model predictive accuracy, and treatment effect estimates obtained using median regression and a weighted CTA outcomes model.

Results: While all models had some imbalanced covariates, main-effects logistic regression yielded the lowest average standardized difference, whereas CTA yielded the greatest predictive accuracy. Nevertheless, treatment effect estimates were generally consistent across all models.
Conclusions: Assessing standardized differences in means as a test of covariate balance is inappropriate for machine learning algorithms that segment the sample into two or more strata. Because the CTA algorithm identifies all statistically valid propensity score models for a sample, it is most likely to identify a correctly specified propensity score model, and should be considered as an alternative approach to modeling the propensity score.
1. INTRODUCTION

Introduced in 1983, the propensity score joined other widely-used methods (e.g. instrumental variables [1,2]) that explicitly model treatment assignment in order to estimate treatment effects in non-randomized studies. The propensity score is defined as the probability of assignment to the treatment group given the observed characteristics [3]. It has been demonstrated that, in sufficiently large samples, if treatment and control groups have similar distributions of the propensity score they generally have similar distributions of the covariates used to create the propensity score (i.e., they exhibit covariate balance). The observed baseline covariates can thus be considered independent of treatment assignment (as if they were randomized), and therefore will not bias treatment effect estimates [3].

Currently there is no consensus regarding how best to estimate the propensity score. In a survey of the literature, Weitzen et al. [4] reported that propensity score estimation is nearly universally performed via logistic regression, and that there is tremendous inconsistency in how models are estimated. For example, some investigators estimate models in which the variable selection process includes only main effects, while others estimate completely saturated models (including all possible interactions, and squared and cubed terms), and still others use automated forward or backward stepwise procedures to select variables for model inclusion.

The fundamental concern with this heterogeneous approach to propensity score estimation is that the resulting propensity score model is likely to be misspecified -- that is, the estimated probability of being in the treatment group may differ substantially from the corresponding true probability [5]. With increasing degrees of misspecification it may become
implausible to assume that the propensity score accurately represents the underlying covariate distributions, and instead that individuals are not conditionally exchangeable between study groups. In short, a misspecified propensity score may fail to achieve covariate balance between treatment groups, which will subsequently bias treatment effect estimates -- the greater the imbalance, the stronger the bias [6,7].

To avoid the limitations of conventional statistical methods, several investigators have suggested the use of machine learning algorithms as an alternate approach for estimating the propensity score [8-15]. Machine learning algorithms find the best-fitting model through automated processes that search through the data to detect patterns that may include interactions between variables, as well as interactions within subsets of variables. This is in contrast to conventional statistics, where a model is chosen and estimated based on an a priori hypothesis about the relationship between the variables, and then statistical tests are performed to evaluate whether the data fit crucial assumptions underlying the validity of the findings [16]. In short, machine learning allows the data to dictate the form of the model, whereas conventional statistics attempts to fit the data to an investigator-specified model.

While there are hundreds of machine learning classification algorithms to choose from [17], the models most often examined in the propensity score literature are classification and regression trees (CART) [8,9,11,12], neural networks [11], and ensemble methods such as boosted regression [10,12,14] and random forests [12]. Studies that have conducted head-to-head comparisons between machine learning algorithms and logistic regression for estimating the
propensity score have generally found that machine-learning models outperform logistic regression in terms of reduced bias (i.e. the difference between the estimated effect and the true effect) in the outcome [10-12].

In this paper, we introduce classification tree analysis (CTA) [18,19] and assess whether it offers a superior alternative to logistic regression and boosted regression for estimating propensity scores. CTA is a decision-tree-like classification model that provides accurate, parsimonious decision rules that are easy to visually display and interpret, while reporting P values derived via permutation tests performed at each node -- making this approach particularly attractive to investigators coming from statistics-based disciplines as compared to other machine learning approaches. In our proposed approach, once a CTA model is generated, strata-specific propensity score weights are computed for all observations in the sample. These weights are then applied in the subsequent outcomes analysis. We illustrate the implementation of the CTA-weighting framework and compare it to weighting approaches using propensity scores derived from conventional logistic regression as well as from boosted logistic regression, which is presently the most popular machine learning approach for estimating the propensity score [10].

The paper is organized as follows. In the Methods section we provide a brief introduction to CTA, and describe the data source and analytic framework employed in the current study. The Results section reports and compares the results of the logistic regression, boosted regression and CTA-weighting frameworks. The Discussion section describes the specific advantages of the CTA-weighting framework for estimating the propensity score and evaluating treatment effects
compared with logistic regression and other machine learning approaches, and discusses how CTA can be applied more broadly within the causal inferential framework.

2. METHODS

2.1 A brief introduction to Classification Tree Analysis

In its simplest form, CTA is an optimal discriminant analysis (ODA) model [20]. ODA is a machine-learning algorithm that finds the cutpoint(s) on an ordered attribute (variable) that maximally discriminates between two or more classes (e.g. treatment groups) [21]. The optimal cutpoint is determined by iterating through each value on the attribute and calculating the effect strength for sensitivity (ESS), which is the mean sensitivity amongst the classes, standardized to a 0-100% scale where 0 represents the discriminatory accuracy expected by chance and 100% represents perfect discrimination. By definition, the maximally accurate predictive model uses the optimal cutpoint achieving the highest ESS. This model is further subjected to a nonparametric permutation test to assess the statistical validity of that cutpoint. Finally, reproducibility and generalizability of the model are assessed using cross-validation methods [18,22,23].

CTA models use one or more attributes to classify a sample of observations into two or more subgroups that are represented as model endpoints (these are called terminal nodes in alternative decision-tree methods). Subgroups are known as sample strata because the CTA model stratifies the sample into subgroups of observations that -- with respect to model attributes -- are homogeneous within and heterogeneous between strata [18]. The initial hierarchically-optimal CTA algorithm involves chained ODA models in which the initial ("root") node represents the attribute achieving the highest ESS value for the entire sample, and additional nodes yielding the greatest ESS are iteratively added at every step on all model branches [24]. In contrast, the enumerated-optimal CTA algorithm explicitly evaluates all possible combinations of the first three nodes, which dominate the solution [25]. The most robust globally-optimal (GO) CTA algorithm explicitly evaluates all possible solutions (called the descendant family), and identifies the model reflecting the best combination of ESS and parsimony (i.e. the model yielding the highest ESS using the fewest strata). The software that implements ODA and CTA models provides users with a vast array of options for controlling the modeling process, and a comprehensive description can be found elsewhere [18].

2.2 Data

We use data from a primary care-based medical home pilot program that invited patients to enroll if they had a chronic illness or were predicted to have high costs in the following year [26]. The goal of the program was to lower healthcare costs for program participants by providing intensified primary care [27]. The retrospectively collected data consist of observations for 374 program participants and 1,628 non-participants. Eleven pre-intervention characteristics were available; these included demographic variables (age and gender), health services utilization (primary care visits, other outpatient visits, laboratory tests, radiology tests, prescriptions filled, hospitalizations, emergency department visits, and home-health visits), and
total medical costs. The outcome was total medical costs in the program year (see [26] for a more comprehensive description).

2.3 Estimating the propensity score

This study compared three different modelling approaches for estimating propensity scores. Fundamental characteristics of each approach are described in this section, and corresponding methods for computing propensity scores are described in the next section.

The first approach, which is the most commonly used in practice, involves estimating a logistic regression model to predict program participation status using the eleven pre-intervention covariates described above, all entered as main effects. We also estimate a fully saturated logistic regression model which includes the eleven main effects, all possible interactions (including squared terms), and cubed terms for continuous variables. The fully saturated model represents the extreme use of logistic regression for estimating the propensity score, in which every possible relationship between the covariates and outcome (treatment assignment) is explored.

The second approach uses a popular machine learning algorithm called boosted logistic regression for estimating the propensity score [10]. Boosted regression is a procedure in a family of machine learning classifiers called ensemble methods, which combine a large number of relatively simple models (e.g. decision trees) adaptively to optimize predictive performance. Boosting follows a sequential process in which decision trees are fitted iteratively to random subsets of the data, gradually increasing emphasis on observations modelled poorly by the
existing collection of trees. The final boosted model is a linear combination of many trees (usually hundreds to thousands) that can be thought of as a regression model where each term is a tree [28]. Here we apply the boosted approach to estimating propensity scores as described by McCaffrey et al. [10], and we implement it in Stata using the user-written program BOOST [29], setting the maximum iterations at 20,000, the shrinkage factor to , the percentage of data to be used as training data at 80%, and the fraction of training observations to be used to fit an individual tree at 50%, and allowing up to seven interactions to be assessed. All eleven pre-intervention covariates were used to predict program participation.

The third approach uses the GO-CTA algorithm [18]. For any given dataset, multiple propensity score models having 90% power to test a non-directional hypothesis with experimentwise P < 0.05 may be generated, depending on the subset of covariates and interactions included. We utilize the GO-CTA approach to identify and select the optimal model in the family of all statistically valid CTA models that exist for the sample, evaluating all eleven pre-intervention covariates for inclusion. Point estimates and exact discrete 95% confidence intervals (CIs) are computed for ESS and D (ESS normed for parsimony) for every model in the family, for model performance as well as for chance: if the model and chance 95% CIs overlap, then the model is judged to be statistically invalid. The GO model is defined as the CTA model within the family of models which has the smallest D statistic. Generalizability of model performance is presently estimated using leave-one-out (LOO) cross-validation. We constrained all CTA models
to yield identical predictive accuracy in training and LOO analysis [18,30]. Once all the CTA models were generated, weights were computed for individuals in all end-point strata.

2.4 Generating propensity score weights

Reflecting conventional practice, the inverse probability of treatment weight (IPTW) was computed for each individual in the sample [31]. The IPTW is based on the conditional probability of an individual receiving his/her own treatment: IPTW_i = (Z_i / p_i) + ([1 - Z_i] / [1 - p_i]). In this approach an individual i in the treatment group (Z = 1) receives a weight equal to the inverse of the estimated propensity score p, and an individual in the control group (Z = 0) receives a weight equal to the inverse of 1 minus p. The IPTW weights the treated and control groups to reflect the characteristics of the combined sample in order to estimate the average treatment effect [32,33].

In contrast, for CTA models a stratified weight is generated for each individual based on both their actual treatment assignment and their specific stratum (model endpoint): observations have identical weights if they are classified into the same endpoint and they have the same actual treatment assignment (i.e. treated or non-treated). CTA model-based stratified weights are computed using the following formula:

w_{z,s} = [n_s x Pr(Z = z)] / n_{Z=z,s} (1)

where n_s is the total number of individuals in a given stratum s, Pr(Z = z) is the estimated probability of assignment to treatment group z (i.e., the proportion of individuals actually
receiving treatment z in the sample), and n_{Z=z,s} is the total number of individuals in stratum s who were actually assigned to treatment z. Thus, the weight is proportional to the ratio of the number of individuals in a given stratum relative to the number of individuals within that stratum who do (not) receive treatment. Taken together, the stratification reduces bias in the observed covariates used to create the propensity score, and the weighting standardizes each treatment group to the target population. We developed this stratified weighting approach for the CTA models to ensure that weights conform exactly to the underlying geometry and findings of the CTA model. Although stratified weighting has been shown to produce less bias than IPTW when the propensity score is misspecified [34], we apply IPTW in the comparison models to be consistent with other (prior) studies using machine learning for generating propensity score weights [12].

2.5 Estimating treatment effects

For all propensity score weighted models, we estimated treatment effects using two approaches. In the first approach we estimate treatment effects using quantile (median) regression, in which the outcome variable (medical costs in the program year) is regressed on the treatment indicator, with the weights specified as sampling weights, and standard errors and confidence intervals computed via a bootstrap procedure with 2,000 repetitions [35]. Quantile regression is used because medical costs are highly skewed and contain several outliers.

In the second approach, we estimate treatment effects using another CTA model (other than the initial model that generated the propensity scores). Here, medical costs are specified as
the attribute, treatment assignment is specified as the class variable, and the weights are used for adjustment. Exact P values were estimated using 25,000 Monte Carlo experiments, and LOO analysis (single-case jackknife analysis) was performed to assess potential cross-generalizability of the model in correctly classifying individuals outside of the sample used for model estimation [23,36].

2.6 Performance metrics

We use several methods for assessing performance across the propensity score estimation models. First, we use the absolute standardized difference statistic for assessing whether weighting on the propensity score successfully balanced the covariates [37]:

SD = |X̄_T - X̄_C| / √[(S_T² + S_C²) / 2] (2)

where the numerator is the absolute difference in means between the treatment and control groups (denoted T and C, respectively) and the denominator is a 50:50 pooled standard deviation [38]. While there is currently no universally-recognized cut-off point as to what is considered the upper limit of balance, Normand et al. [39] suggest that a standardized difference of less than 0.10 is indicative of good balance.

We use ESS to assess the accuracy of fit amongst the various outcome models. The ESS statistic is a chance-corrected (0 = the level of accuracy expected by chance) and maximum-corrected (100 = perfect prediction) index of predictive accuracy. The formula for computing ESS for binary case classification is [40]:

ESS = [(Mean Percent Accuracy in Classification - 50) / 50] x 100% (3)

where

Mean Percent Accuracy in Classification = [(sensitivity + specificity) / 2] x 100 (4)

Based on simulation studies, Yarnold and Soltysik [40] consider ESS values less than 25% to indicate a relatively weak effect, 25% to 50% a moderate effect, 50% to 75% a relatively strong effect, and 75% or greater a strong effect. Using ESS, an investigator may directly compare performance among the various propensity score and outcome models, regardless of structural features of the analyses, such as sample size and the measurement metric.

While ESS compares the predictive accuracy of every given model versus chance, different models may achieve the same level of normed accuracy using different numbers of sample strata. Because model complexity increases as the number of sample strata increases, the D (for "distance") statistic standardizes model ESS for parsimony. The formula for computing D for binary case classification is [18]:

D = [100 / (ESS / 2)] - 2 (5)

where the resulting value gives the number of additional effects of identical strength (i.e., ESS) observed for the model that are needed to obtain a theoretically ideal model having perfect accuracy using the minimum number of strata possible for the sample: if accuracy is perfect then D = 0 [41].
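The ESS and D computations in equations (3)-(5) can be sketched in code. This is an illustrative sketch only; the function names are ours, not from the ODA/CTA software:

```python
def ess(sensitivity: float, specificity: float) -> float:
    """Effect strength for sensitivity (ESS) for binary classification.

    Mean Percent Accuracy in Classification (PAC) is the average of
    sensitivity and specificity (equation 4); ESS rescales PAC so that
    0 = chance (PAC of 50%) and 100 = perfect prediction (equation 3).
    """
    pac = (sensitivity + specificity) / 2 * 100  # equation (4), in percent
    return (pac - 50) / 50 * 100                 # equation (3), in percent


def d_statistic(ess_value: float) -> float:
    """D ('distance') statistic (equation 5): the number of additional
    effects of identical strength needed to reach a theoretically ideal
    model; D = 0 when accuracy is perfect (ESS = 100)."""
    return 100 / (ess_value / 2) - 2


# A hypothetical classifier with 90% sensitivity and 80% specificity:
print(round(ess(0.90, 0.80), 1))  # 70.0 -> 'relatively strong' per [40]
print(d_statistic(100.0))         # 0.0 -> perfect accuracy
```

Note that ESS depends only on sensitivity and specificity, so it is unaffected by the treated/control group-size imbalance in the sample, which is why the paper can compare models on it directly.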
Finally, we assess the generalizability (external validity) of the models using LOO cross-validation. We conduct these analyses to assess how well the model predicts treatment assignment for new study participants who may have somewhat different characteristics than those in the original sample. The ESS of the cross-validated model is compared to that of the original model using the entire dataset. The model is considered generalizable if the accuracy measures remain consistent with those of the original model. Current practice guidelines recommend constraining CTA models to have identical ESS in training (total sample) and LOO analysis as a means of inhibiting overfitting and maximizing cross-generalizability [18,42].

2.7 Analytic software

Stata 14.1 (StataCorp., College Station, TX, USA) was used to perform logistic regression and boosted logistic regression for estimating the propensity score, and quantile regression for estimating treatment effects (outcome model). We estimated the two logistic regression models (main effects only, and fully saturated) using a user-written command for Stata, LOOCLASS [43], which performs LOO and produces several classification measures. We estimated the boosted logistic regression by implementing the user-written program BOOST [29] within a modified wrapper program of LOOCLASS to provide the LOO estimates for the boosted model. Standardized differences were computed using a user-written command for Stata, COVBAL [44]. GO-CTA was conducted to generate and assess the accuracy of propensity score models, and to model outcomes, and was performed using CTA software [18,25].

3. RESULTS
Table 1 presents the observed pre-intervention characteristics of the participants and non-participants in the pilot study [26]. Continuous variables are summarized by the mean and standard deviation, and categorical variables are presented as number and percent. For balance measures, we report the standardized difference, for which perfect balance is zero, and the conventional P value, where variables with values < 0.05 may be considered imbalanced. It is clear that the participant group differed markedly from the non-participant group on every characteristic. On average, participants were older, were less likely to be female, and had higher utilization and costs than non-participants. All standardized differences exceeded the recommended value of 0.10, and all P values were < 0.05. Thus, it is readily apparent that this non-randomized study exhibits substantial selection bias.

Table 2 summarizes the structure (number of strata, smallest stratum N) and performance (ESS, D) of all CTA models that emerged for discriminating between study participants and non-participants. Disqualified models either failed to achieve the minimum denominator criterion (N > 34) specified in power analysis (steps 1-4), or had 95% CIs for D (steps 13-14) lower than for the globally-optimal model in the family (step 12). The descendant family (DF) thus consisted of the eight models in steps 5-12: note that all eight models had ESS 95% CIs that overlapped and reflected relatively strong normed predictive accuracy, and all eight models had chance ESS 95% CIs that overlapped and reflected relatively weak normed predictive accuracy. However, model 12 is unambiguously identified as the GO model since its 95% CI for D lay below the corresponding 95% CIs for all other models in the DF [18,41].
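The screening logic behind Table 2 can be sketched schematically. The candidate models, their values, and the simple point-estimate comparison below are illustrative assumptions only; the actual procedure compares exact discrete 95% CIs computed by the CTA software:

```python
from dataclasses import dataclass


@dataclass
class CandidateModel:
    name: str
    n_strata: int
    min_stratum_n: int  # size of the smallest stratum
    ess: float          # normed predictive accuracy, percent
    d: float            # ESS normed for parsimony; smaller is better


def select_go_model(models, min_n=35):
    """Schematic GO-model selection: drop models failing the minimum
    stratum-size criterion from the power analysis (N > 34), then pick
    the model with the smallest D (best combination of accuracy and
    parsimony). The real procedure compares exact discrete 95% CIs,
    not point estimates."""
    valid = [m for m in models if m.min_stratum_n >= min_n]
    return min(valid, key=lambda m: m.d)


# Hypothetical descendant family (all values illustrative only):
family = [
    CandidateModel("CTA-9", 9, 40, ess=62.0, d=5.1),
    CandidateModel("CTA-6", 6, 48, ess=60.5, d=4.0),
    CandidateModel("CTA-4", 4, 55, ess=63.0, d=3.2),
    CandidateModel("CTA-2", 2, 90, ess=58.0, d=1.4),
    CandidateModel("tiny",  3, 12, ess=70.0, d=0.9),  # fails the N criterion
]
print(select_go_model(family).name)  # CTA-2
```

The sketch captures the paper's key design choice: a slightly less accurate but far more parsimonious model (here "CTA-2") is preferred over more complex models with similar accuracy.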
Although all eight models in the DF may be used to construct propensity scores, for this exposition we limit our analysis to the four models illustrated in Appendix Figures 1-4, respectively. Results are reported for the least complex two-strata GO model [CTA-2] (Table 2, step 12); the most accurate (highest ESS) next-least complex four-strata model [CTA-4] (step 9); the sole intermediate-complexity six-strata model [CTA-6] (step 8); and the nine-strata model [CTA-9] -- the first and most complex member of the DF, offering the greatest stratification granularity (step 5). All of these models had overlapping 95% CIs for ESS, and all correctly classified at least 3 of 4 non-participants and 4 of 5 program participants.

Table 3 presents standardized differences of all covariates and the average absolute standardized difference for each of the seven (logistic-main effects, logistic-saturated, boosted, and four CTA) weighted propensity score models. The main-effects-only logistic regression model with IPTW achieves the lowest average standardized difference amongst the models, but is far from ideal in achieving covariate balance. For example, age remains substantially unbalanced between participants and non-participants, and to a lesser degree so do the number of prescriptions filled, emergency department visits, and other outpatient visits. Interestingly, the saturated logistic regression model performed worse in achieving covariate balance than the main-effects-only model. This may be due to the very large number of covariates used in the estimation model (166) relative to the number of observations (2,002), resulting in data patterns known as complete or quasi-complete separation [45]. All other models performed substantially worse than logistic regression with IPTW in achieving covariate balance.
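The balance comparisons in Table 3 rest on the absolute standardized difference of equation (2). In practice the Stata command COVBAL [44] was used; a minimal sketch of the statistic itself (unweighted means, function name ours):

```python
import math


def abs_standardized_difference(treated, control):
    """Absolute standardized difference (equation 2): the absolute
    difference in group means divided by a 50:50 pooled standard
    deviation. Values below 0.10 are conventionally taken to indicate
    good balance [39]."""
    def mean(xs):
        return sum(xs) / len(xs)

    def var(xs):  # sample variance
        m = mean(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    pooled_sd = math.sqrt((var(treated) + var(control)) / 2)
    return abs(mean(treated) - mean(control)) / pooled_sd


# Identically distributed toy covariate -> zero standardized difference:
t = [3.0, 4.0, 5.0, 6.0, 7.0]
c = [3.0, 4.0, 5.0, 6.0, 7.0]
print(abs_standardized_difference(t, c))  # 0.0
```

Because the statistic compares whole-distribution means, it is well suited to single-equation weighting schemes but, as the Discussion argues, not to models that stratify the sample.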
Table 4 presents treatment effect estimates using median regression for the seven weighted propensity score models and also for a naïve estimate, which is simply a regression of the outcome on the treatment indicator without adjustment for confounding. All models show that the median costs of the participants in the program year were higher than the median costs of non-participants. The treatment effect estimates for the seven weighted models span a relatively narrow range between $738 and $1536, and all models except for the saturated logistic model (P <0.094) achieve statistical significance (P <0.0001).

Table 5 presents treatment effect estimates using weighted CTA outcome models for the seven weighted propensity score models and also for the naïve estimate. For every analysis, the first row of data is for the training (full sample) analysis, and the second row for leave-one-out (LOO) one-sample jackknife analysis. For all models, observations having costs less than or equal to the tabled threshold value (each threshold value is computed using the indicated model) are predicted to be from the non-participant group, and observations having costs that are greater than the tabled threshold are predicted to be from the participant group. For example, the cutpoint in the unweighted naïve model indicates that non-participants were predicted to have medical costs ≤ $2664 while participants were predicted to have costs > $2664. The accuracy (and LOO cross-generalizability) of these predictions is represented by the respective sensitivities, overall ESS, and permutation P values. In the case of the naïve estimate, the full sample sensitivity of the non-participant group was 68.2%, indicating that 68.2% of non-participants were accurately predicted to have costs ≤ $2664. Similarly, the full sample sensitivity
of the participant group was 82.6%, indicating that 82.6% of participants were accurately predicted to have costs > $2664. The ESS for the naïve model was 50.8%, indicative of relatively strong overall classification accuracy [40]. Furthermore, the exact P value for the naïve model indicates that the participant group had statistically higher costs than the non-participant group. Finally, the model is generalizable, as indicated by LOO values that are nearly identical to those of the full sample.

All weighted models were statistically significant (exact P<0.0001). Full sample weighted ESS (WESS) values ranged between 28.5% and 37.3%, and LOO WESS values ranged between 26.3% and 36.6%, indicative of moderate classification accuracy and cross-generalizability [40]. Taken together, the findings of the CTA outcomes analyses were qualitatively similar to findings derived via median regression. That is, the participant group had statistically higher costs than the non-participant group across all models. We close with concluding comments.

4. DISCUSSION

Given that main-effects logistic regression generated propensity score weights that yielded the lowest mean standardized difference measure of covariate balance, and produced median-regression-based treatment effect estimates that were consistent with the estimates of all the other models, one may question the value of using alternative approaches to generate propensity scores. However, in using empirical data where the true treatment effect is never known, we
highlight challenges investigators face when developing propensity score models using logistic regression to derive the best (i.e. least biased) estimate. First, neither the main effects nor the fully saturated logistic regression model generated propensity scores that yielded good covariate balance, as indicated by standardized differences for several covariates that were substantially higher than the recommended upper bound of 0.10 [39]. If in fact it is possible to attain a correctly specified logistic regression model for the present sample, then it lies somewhere between these extreme (main effects only versus completely saturated) specifications. However, a correctly specified logistic regression model is unlikely to be discovered using a manual variable selection approach. Second, as an increasing number of variables, interactions, and polynomial terms are added to the model, violations of the statistical assumptions underlying the validity of the model estimates become increasingly likely. This underscores a clear advantage of using automated machine learning algorithms, which require no statistical assumptions in selecting model terms, as an alternative to logistic regression for generating propensity scores. Third, as expected, varying the specifications for estimating the logistic regression model yielded qualitatively different findings. The main effects logistic regression model produced estimated treatment effects consistent in magnitude and statistical significance with the estimates of all weighted models except for the saturated logistic regression model, which produced an estimated treatment effect that was substantially lower than that obtained by all other models, and was not statistically significant (Table 4). This finding supports conducting sensitivity analysis to assess
the consistency of findings obtained by different models (or specifications) as a standard practice in the propensity score modeling process, in order to increase confidence in the validity of the analytic results [46]. By design, the CTA framework conducts such a sensitivity analysis for the propensity score models. In the present study, the analysis identified all 14 potential CTA-based propensity score models that exist for the study data, of which eight met all statistical validity criteria (for exposition we proceeded with four of these eight models). All CTA models produced overlapping outcomes, cutpoints, ESS, and P values for all weighted models, and exhibited consistency in the degree of generalizability of the estimates. In achieving similar outcomes under different propensity score model specifications, we gain confidence that the analytic approach produces valid results.

Our empirical results also reveal that the standard approach of assessing covariate balance as an indicator of comparability between study groups is problematic. None of the machine learning-based models (nor the saturated or boosted logistic models) achieved covariate balance using the criterion of an average standardized difference <0.10. This indicates that the standardized difference is not an appropriate metric for assessing comparability between study groups when such models are implemented. The standardized difference measures the difference in the means of two (assumedly normal) distributions. However, machine learning algorithms rarely deal with the entire distribution of a variable, but rather with subsets -- and interactions between subsets -- of the available variables. Therefore, metrics based on distributional assumptions about the entire variable are not relevant to machine learning models.
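The balance metric discussed above can be made concrete. The following is a minimal sketch (Python, standard library only; function and variable names are illustrative, not from the authors' software) of the standardized difference for a single covariate, both unweighted and under observation-level weights such as propensity score weights:

```python
import math

def standardized_difference(treated, control):
    """Absolute standardized difference between two samples:
    |mean_t - mean_c| / sqrt((var_t + var_c) / 2), using sample variances."""
    m_t = sum(treated) / len(treated)
    m_c = sum(control) / len(control)
    v_t = sum((x - m_t) ** 2 for x in treated) / (len(treated) - 1)
    v_c = sum((x - m_c) ** 2 for x in control) / (len(control) - 1)
    return abs(m_t - m_c) / math.sqrt((v_t + v_c) / 2)

def weighted_standardized_difference(values, weights, groups):
    """Same statistic on weighted data (e.g. after applying propensity
    score weights): weighted means and variances computed per group,
    with groups coded 1 = treated, 0 = control."""
    def wstats(g):
        w = [wt for wt, grp in zip(weights, groups) if grp == g]
        x = [v for v, grp in zip(values, groups) if grp == g]
        m = sum(wi * xi for wi, xi in zip(w, x)) / sum(w)
        v = sum(wi * (xi - m) ** 2 for wi, xi in zip(w, x)) / sum(w)
        return m, v
    m_t, v_t = wstats(1)
    m_c, v_c = wstats(0)
    return abs(m_t - m_c) / math.sqrt((v_t + v_c) / 2)
```

Under the rule of thumb cited above [39], a covariate whose (weighted) standardized difference exceeds 0.10 would be flagged as imbalanced.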
On the other hand, CTA models by design provide results in a decision-tree-like format that allows for direct inspection of balance: all individuals that end in the same stratum (terminal node) are comparable on all the attributes that define that terminal node. Concomitantly, this format also indicates the degree of overlap between study groups in these covariate patterns. More specifically, any terminal node that contains 100% of observations from a single study group has no counterfactual, and thus causal inferences cannot be made about the effects of the intervention on that subset of observations. In such cases, all observations with no counterfactual within a given terminal node may be dropped from the analysis, and the CTA model should be re-estimated. While this methodology is applicable to CTA and to classification and regression tree (CART) algorithms that provide results in a decision-tree or decision-rules format, it is not clear how best to assess covariate balance when using black-box algorithms (e.g., boosted regression, random forests, support vector machines).

An important issue associated with the use of machine learning tools for generating propensity score models is the choice and number of variables determined by the model. The recommended approach when estimating the propensity score is to be liberal in including variables that may be associated with treatment assignment and/or the outcomes [46]. However, classification algorithms are specifically designed to exclude variables that do not contribute to predictive accuracy. Indeed, CTA explicitly maximizes ESS, so forcing additional variables into a CTA model will reduce ESS and/or D. Moreover, many black-box machine learning algorithms do not report the number or identity of the variables included in the model. Taken
together, it is clear that recommendations for estimating propensity score models could be improved by including the application of machine learning techniques.

CTA methodology holds several advantages over conventional logistic regression for estimating propensity score models, such as using an automated process for optimizing variable selection, being unencumbered by the assumptions required of parametric models, and being insensitive to skewed data and outliers [18]. Moreover, the built-in sensitivity analysis of GO-CTA is more likely to consistently identify a correctly specified propensity score model than is logistic regression. Additionally, while this paper has demonstrated the implementation of the CTA framework to generate propensity score weights for pretest-posttest studies with a binary treatment, the approach can be extended to any study design that may utilize propensity score weights (see, for example, [48-52]).

CTA methodology also carries advantages over other machine learning algorithms for estimating propensity score models. In contrast to the more computationally intensive machine learning techniques typically favored for generating propensity scores, CTA models offer transparency in the computational approach, interpretable formulae, and straightforward visual displays of the final model [53]. Moreover, the GO-CTA algorithm identifies all statistically valid propensity score models for a sample, which vary in terms of predictive accuracy (ESS) as well as parsimony (number of strata) [18]. As a general rule, a simpler model is always preferred over a more complex model, assuming both have the same classification accuracy. Finally, CTA includes permutation tests, adjusted for multiple comparisons, to ensure that the final model
meets rigorous statistical assumptions, and can use multiple methods to assess potential cross-generalizability [18]. Thus, one may consider CTA an all-in-one classification algorithm that combines the synergies of machine learning and conventional statistics: the machine learning component ensures that the final model achieves maximum accuracy (as measured by cross-validated ESS), and the permutation tests, performed at each node, ensure that the model's discriminatory ability meets accepted levels of statistical significance.

The primary limitation of the CTA framework -- as is the case with every approach used to evaluate non-randomized studies -- is that the models are generated using only the available data. No matter how sophisticated the algorithm, unobservable factors such as unmeasured motivation to change health behaviors may confound the outcomes in healthcare interventions [54,55].

5. CONCLUSION

In summary, this paper introduced a novel machine learning framework for generating propensity score weights to evaluate treatment effects in observational studies. This framework offers many advantages over both logistic regression and other machine learning algorithms, such as explicit maximization of accuracy, parsimony, sensitivity, statistical robustness, and transparency. Because the CTA algorithm identifies all statistically valid propensity score models for a sample, it is most likely to identify a correctly specified propensity score model, and should be considered as an alternative approach to modeling the propensity score.
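To make the two ingredients described above concrete, the following sketch (Python; illustrative only, and not the authors' ODA/CTA software) shows an exhaustive search for the single-attribute cutpoint that maximizes ESS -- here computed for the binary case as sensitivity% + specificity% - 100 -- followed by a Monte Carlo permutation test of that observed maximum:

```python
import random

def ess(pred, actual):
    """Effect strength for sensitivity in the binary case:
    sensitivity% + specificity% - 100 (0 = chance, 100 = perfect).
    Assumes both classes are present in `actual` (0/1 coding)."""
    tp = sum(1 for p, a in zip(pred, actual) if p and a == 1)
    tn = sum(1 for p, a in zip(pred, actual) if not p and a == 0)
    sens = 100 * tp / sum(actual)
    spec = 100 * tn / (len(actual) - sum(actual))
    return sens + spec - 100

def best_cutpoint(x, y):
    """Exhaustive search for the cutpoint on one ordered attribute that
    maximizes ESS -- the univariate step performed at each tree node."""
    best = (None, -100.0)
    for c in sorted(set(x)):
        score = ess([xi > c for xi in x], y)
        if score > best[1]:
            best = (c, score)
    return best

def permutation_p(x, y, n_perm=2000, seed=0):
    """Monte Carlo permutation P value for the observed maximum ESS:
    shuffle the class labels and re-run the cutpoint search each time."""
    rng = random.Random(seed)
    _, observed = best_cutpoint(x, y)
    yy = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(yy)
        if best_cutpoint(x, yy)[1] >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

In this sketch a small P value indicates that a cutpoint this discriminating is unlikely under random assignment of class labels; the full framework additionally adjusts such node-level tests for multiple comparisons.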
REFERENCES

1. Angrist JD, Imbens GW, Rubin DB. Identification of causal effects using instrumental variables. Journal of the American Statistical Association 1996;91:
2. Linden A, Adams J. Evaluating disease management program effectiveness: an introduction to instrumental variables. Journal of Evaluation in Clinical Practice 2006;12:
3. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika 1983;70:
4. Weitzen S, Lapane KL, Toledano AY, Hume AL, Mor V. Principles for modeling propensity scores in medical research: a systematic literature review. Pharmacoepidemiology and Drug Safety 2004;13:841
5. Brookhart MA, Schneeweiss S, Rothman KJ, Glynn RJ, Avorn J, Stürmer T. Variable selection for propensity score models. American Journal of Epidemiology 2006;163:
6. Drake C. Effects of misspecification of the propensity score on estimators of treatment effect. Biometrics 1993;49:
7. Smith J, Todd P. Reconciling conflicting evidence on the performance of propensity-score matching methods. American Economic Review 2001;91:
8. Cook EF, Goldman L. Asymmetric stratification. An outline for an efficient method for controlling confounding in cohort studies. American Journal of Epidemiology 1988;127:
9. Barosi G, Ambrosetti A, Centra A, Falcone A, Finelli C, Foa P, et al. Splenectomy and risk of blast transformation in myelofibrosis with myeloid metaplasia. Italian Cooperative Study Group on Myelofibrosis with Myeloid Metaplasia. Blood 1998;91:
10. McCaffrey DF, Ridgeway G, Morral AR. Propensity score estimation with boosted regression for evaluating causal effects in observational studies. Psychological Methods 2004;9:
11. Setoguchi S, Schneeweiss S, Brookhart MA, Glynn RJ, Cook EF. Evaluating uses of data mining techniques in propensity score estimation: a simulation study. Pharmacoepidemiology and Drug Safety 2008;17:
12. Lee BK, Lessler J, Stuart EA. Improving propensity score weighting using machine learning. Statistics in Medicine 2010;29(3):
13. Westreich D, Lessler J, Funk MJ. Propensity score estimation: neural networks, support vector machines, decision trees (CART), and meta-classifiers as alternatives to logistic regression. Journal of Clinical Epidemiology 2010;63(8):
14. Wyss R, Ellis AR, Brookhart MA, Girman CJ, Funk MJ, LoCasale R, Stürmer T. The role of prediction modeling in propensity score estimation: an evaluation of logistic regression, bCART, and the covariate-balancing propensity score. American Journal of Epidemiology 2014;180:
15. Neugebauer R, Schmittdiel JA, van der Laan MJ. A case study of the impact of data-adaptive versus model-based estimation of the propensity scores on causal inferences from three inverse probability weighting estimators. The International Journal of Biostatistics 2016;12:
16. Breiman L. Statistical modeling: the two cultures (with comments and a rejoinder by the author). Statistical Science 2001;16:
17. Fernández-Delgado M, Cernadas E, Barro S, Amorim D. Do we need hundreds of classifiers to solve real world classification problems? The Journal of Machine Learning Research 2014;15:
18. Yarnold PR, Soltysik RC. Maximizing Predictive Accuracy. Chicago, IL: ODA Books. DOI: /RG
19. Yarnold PR, Soltysik RC, Bennett CL. Predicting in-hospital mortality of patients with AIDS-related Pneumocystis carinii pneumonia: an example of hierarchically optimal classification tree analysis. Statistics in Medicine 1997;16:
20. Yarnold PR, Soltysik RC. Theoretical distributions of optima for univariate discrimination of random data. Decision Sciences 1991;22:
21. Linden A, Yarnold PR. Combining machine learning and propensity score weighting to estimate causal effects in multivalued treatments. Journal of Evaluation in Clinical Practice 2016;22:
22. Linden A, Yarnold PR, Nallamothu BK. Using machine learning to model dose-response relationships. Journal of Evaluation in Clinical Practice 2016;22:
23. Linden A, Yarnold PR. Using machine learning to identify structural breaks in single-group interrupted time series designs. Journal of Evaluation in Clinical Practice 2016;22:
24. Yarnold PR. Discriminating geriatric and non-geriatric patients using functional status information: an example of classification tree analysis via UniODA. Educational and Psychological Measurement 1996;56:
25. Soltysik RC, Yarnold PR. Automated CTA software: fundamental concepts and control commands. Optimal Data Analysis 2010;1:
26. Linden A. Identifying spin in health management evaluations. Journal of Evaluation in Clinical Practice 2011;17:
27. Linden A, Adler-Milstein J. Medicare disease management in policy context. Health Care Financing Review 2008;29:
28. Elith J, Leathwick JR, Hastie T. A working guide to boosted regression trees. Journal of Animal Ecology 2008;77:
29. Schonlau M. Boosted regression (boosting): an introductory tutorial and a Stata plugin. Stata Journal 2005;5:
30. Yarnold PR, Linden A. Novometric analysis with ordered class variables: the optimal alternative to linear regression analysis. Optimal Data Analysis 2016;5:
31. Linden A, Adams JL. Using propensity score-based weighting in the evaluation of health management programme effectiveness. Journal of Evaluation in Clinical Practice 2010;16:
32. Robins JM, Hernán MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology 2000;11:
33. Rosenbaum PR. Model-based direct adjustment. Journal of the American Statistical Association 1987;82:
34. Hong G. Marginal mean weighting through stratification: adjustment for selection bias in multilevel data. Journal of Educational and Behavioral Statistics 2010;35:
35. Linden A, Adams J, Roberts N. Evaluating disease management program effectiveness: an introduction to the bootstrap technique. Disease Management and Health Outcomes 2005;13:
36. Linden A, Adams J, Roberts N. The generalizability of disease management program results: getting from here to there. Managed Care Interface 2004;17:
37. Flury BK, Riedwyl H. Standard distance in univariate and multivariate analysis. The American Statistician 1986;40:
38. Linden A, Samuels SJ. Using balance statistics to determine the optimal number of controls in matching studies. Journal of Evaluation in Clinical Practice 2013;19:
39. Normand SLT, Landrum MB, Guadagnoli E, Ayanian JZ, Ryan TJ, Cleary PD, McNeil BJ. Validating recommendations for coronary angiography following an acute myocardial infarction in the elderly: a matched analysis using propensity scores. Journal of Clinical Epidemiology 2001;54:
40. Yarnold PR, Soltysik RC. Optimal Data Analysis: Guidebook with Software for Windows. Washington, D.C.: APA Books.
41. Yarnold PR, Linden A. Theoretical aspects of the D statistic. Optimal Data Analysis 2016;5:
42. Yarnold PR. Determining jackknife ESS for a CTA model with chaotic instability. Optimal Data Analysis 2016;5:
43. Linden A. LOOCLASS: Stata module for generating classification statistics of leave-one-out cross-validation for binary outcomes. Statistical Software Components s458032, Boston College Department of Economics. Downloadable from [Accessed on 12 January 2017].
44. Linden A. COVBAL: Stata module for generating covariate balance statistics. Statistical Software Components s458188, Boston College Department of Economics. [Accessed on 12 January 2017].
45. Allison PD. Convergence failures in logistic regression. SAS Global Forum 2008;360:
46. Linden A, Adams J, Roberts N. Strengthening the case for disease management effectiveness: unhiding the hidden bias. Journal of Evaluation in Clinical Practice 2006;12:
47. Stuart EA. Matching methods for causal inference: a review and a look forward. Statistical Science 2010;25:1-21.
48. Linden A, Adams JL. Evaluating health management programmes over time: application of propensity score-based weighting to longitudinal data. Journal of Evaluation in Clinical Practice 2010;16:
49. Linden A, Adams JL. Applying a propensity score-based weighting model to interrupted time series data: improving causal inference in program evaluation. Journal of Evaluation in Clinical Practice 2011;17:
50. Linden A, Adams JL. Combining the regression-discontinuity design and propensity score-based weighting to improve causal inference in program evaluation. Journal of Evaluation in Clinical Practice 2012;18(2):
51. Linden A. Combining propensity score-based stratification and weighting to improve causal inference in the evaluation of health care interventions. Journal of Evaluation in Clinical Practice 2014;20(6):
52. Linden A, Adams J, Roberts N. Evaluating disease management program effectiveness: an introduction to survival analysis. Disease Management 2004;7:
53. Linden A, Yarnold PR. Using data mining techniques to characterize participation in observational studies. Journal of Evaluation in Clinical Practice 2016;22:
54. Linden A, Roberts N. Disease management interventions: What's in the black box? Disease Management 2004;7:
55. Linden A, Butterworth S, Roberts N. Disease management interventions II: What else is in the black box? Disease Management 2006;9:73-85.
Table 1: Baseline (12 months) characteristics of program participants and non-participants (Linden [2011]).

                               Participants    Non-Participants   Standardized
                               (N=374)         (N=1628)           difference     P-value a
Demographic characteristics
  Age                          54.9 (6.71)     43.4 (11.99)                      <0.001
  Female                       211 (56.4%)     807 (49.6%)
Utilization and cost
  Primary care visits          11.3 (7.30)     4.6 (4.35)                        <0.001
  Other outpatient visits      18.0 (16.65)    7.2 (10.61)                       <0.001
  Laboratory tests             6.1 (5.27)      2.4 (3.31)                        <0.001
  Radiology tests              3.2 (4.46)      1.3 (2.48)                        <0.001
  Prescriptions filled         40.6 (29.96)    11.9 (17.14)                      <0.001
  Hospitalizations             0.2 (0.52)      0.1 (0.29)                        <0.001
  Emergency department visits  0.4 (1.03)      0.2 (0.50)                        <0.001
  Home-health visits           0.1 (0.88)      0.0 (0.38)
  Total costs                  8236 (9830)     3047 (5817)                       <0.001

[Standardized-difference values, and P-values for Female and Home-health visits, are not legible in this transcription.]

a A two-tailed t-test for independent samples was used for continuous variables, and a Chi-square test was used for dichotomous variables. Continuous variables are reported as mean (standard deviation) and dichotomous variables are reported as N (percent).
Table 2: All CTA models discriminating between study participants and non-participants

Columns: Step | Strata | Smallest strata N | ESS for model (95% CI) | ESS for chance (95% CI) | D (95% CI)

[The entries for the 14 models are not legible in this transcription.]

Notes: Strata is the number of model endpoints (terminal nodes); smallest strata N is the number of observations in the endpoint with the smallest number of observations among all endpoints in the model; ESS is a measure of normed predictive accuracy (0 = accuracy expected by chance; 100 = perfect accuracy); exact 95% confidence intervals for model and chance ESS are computed using 10,000 bootstrap and Monte Carlo iterations, respectively; and the D statistic indicates the number of additional effects with equivalent ESS.
More informationState-of-the-art Strategies for Addressing Selection Bias When Comparing Two or More Treatment Groups. Beth Ann Griffin Daniel McCaffrey
State-of-the-art Strategies for Addressing Selection Bias When Comparing Two or More Treatment Groups Beth Ann Griffin Daniel McCaffrey 1 Acknowledgements This work has been generously supported by NIDA
More informationA COMPARISON OF IMPUTATION METHODS FOR MISSING DATA IN A MULTI-CENTER RANDOMIZED CLINICAL TRIAL: THE IMPACT STUDY
A COMPARISON OF IMPUTATION METHODS FOR MISSING DATA IN A MULTI-CENTER RANDOMIZED CLINICAL TRIAL: THE IMPACT STUDY Lingqi Tang 1, Thomas R. Belin 2, and Juwon Song 2 1 Center for Health Services Research,
More informationGraphical assessment of internal and external calibration of logistic regression models by using loess smoothers
Tutorial in Biostatistics Received 21 November 2012, Accepted 17 July 2013 Published online 23 August 2013 in Wiley Online Library (wileyonlinelibrary.com) DOI: 10.1002/sim.5941 Graphical assessment of
More informationLecture Outline. Biost 517 Applied Biostatistics I. Purpose of Descriptive Statistics. Purpose of Descriptive Statistics
Biost 517 Applied Biostatistics I Scott S. Emerson, M.D., Ph.D. Professor of Biostatistics University of Washington Lecture 3: Overview of Descriptive Statistics October 3, 2005 Lecture Outline Purpose
More informationChapter 21 Multilevel Propensity Score Methods for Estimating Causal Effects: A Latent Class Modeling Strategy
Chapter 21 Multilevel Propensity Score Methods for Estimating Causal Effects: A Latent Class Modeling Strategy Jee-Seon Kim and Peter M. Steiner Abstract Despite their appeal, randomized experiments cannot
More informationContext of Best Subset Regression
Estimation of the Squared Cross-Validity Coefficient in the Context of Best Subset Regression Eugene Kennedy South Carolina Department of Education A monte carlo study was conducted to examine the performance
More informationSmall Group Presentations
Admin Assignment 1 due next Tuesday at 3pm in the Psychology course centre. Matrix Quiz during the first hour of next lecture. Assignment 2 due 13 May at 10am. I will upload and distribute these at the
More informationImpact and adjustment of selection bias. in the assessment of measurement equivalence
Impact and adjustment of selection bias in the assessment of measurement equivalence Thomas Klausch, Joop Hox,& Barry Schouten Working Paper, Utrecht, December 2012 Corresponding author: Thomas Klausch,
More informationImputation approaches for potential outcomes in causal inference
Int. J. Epidemiol. Advance Access published July 25, 2015 International Journal of Epidemiology, 2015, 1 7 doi: 10.1093/ije/dyv135 Education Corner Education Corner Imputation approaches for potential
More informationImplementing double-robust estimators of causal effects
The Stata Journal (2008) 8, Number 3, pp. 334 353 Implementing double-robust estimators of causal effects Richard Emsley Biostatistics, Health Methodology Research Group The University of Manchester, UK
More information1 Introduction. st0020. The Stata Journal (2002) 2, Number 3, pp
The Stata Journal (22) 2, Number 3, pp. 28 289 Comparative assessment of three common algorithms for estimating the variance of the area under the nonparametric receiver operating characteristic curve
More informationA Guide to Quasi-Experimental Designs
Western Kentucky University From the SelectedWorks of Matt Bogard Fall 2013 A Guide to Quasi-Experimental Designs Matt Bogard, Western Kentucky University Available at: https://works.bepress.com/matt_bogard/24/
More informationSensitivity Analysis in Observational Research: Introducing the E-value
Sensitivity Analysis in Observational Research: Introducing the E-value Tyler J. VanderWeele Harvard T.H. Chan School of Public Health Departments of Epidemiology and Biostatistics 1 Plan of Presentation
More informationPropensity Score Matching with Limited Overlap. Abstract
Propensity Score Matching with Limited Overlap Onur Baser Thomson-Medstat Abstract In this article, we have demostrated the application of two newly proposed estimators which accounts for lack of overlap
More informationPeter C. Austin Institute for Clinical Evaluative Sciences and University of Toronto
Multivariate Behavioral Research, 46:119 151, 2011 Copyright Taylor & Francis Group, LLC ISSN: 0027-3171 print/1532-7906 online DOI: 10.1080/00273171.2011.540480 A Tutorial and Case Study in Propensity
More informationDraft Proof - Do not copy, post, or distribute
1 Overview of Propensity Score Analysis Learning Objectives z Describe the advantages of propensity score methods for reducing bias in treatment effect estimates from observational studies z Present Rubin
More informationUnit 1 Exploring and Understanding Data
Unit 1 Exploring and Understanding Data Area Principle Bar Chart Boxplot Conditional Distribution Dotplot Empirical Rule Five Number Summary Frequency Distribution Frequency Polygon Histogram Interquartile
More informationPharmaSUG Paper HA-04 Two Roads Diverged in a Narrow Dataset...When Coarsened Exact Matching is More Appropriate than Propensity Score Matching
PharmaSUG 207 - Paper HA-04 Two Roads Diverged in a Narrow Dataset...When Coarsened Exact Matching is More Appropriate than Propensity Score Matching Aran Canes, Cigna Corporation ABSTRACT Coarsened Exact
More informationBias in regression coefficient estimates when assumptions for handling missing data are violated: a simulation study
STATISTICAL METHODS Epidemiology Biostatistics and Public Health - 2016, Volume 13, Number 1 Bias in regression coefficient estimates when assumptions for handling missing data are violated: a simulation
More informationTitle: A robustness study of parametric and non-parametric tests in Model-Based Multifactor Dimensionality Reduction for epistasis detection
Author's response to reviews Title: A robustness study of parametric and non-parametric tests in Model-Based Multifactor Dimensionality Reduction for epistasis detection Authors: Jestinah M Mahachie John
More informationVersion No. 7 Date: July Please send comments or suggestions on this glossary to
Impact Evaluation Glossary Version No. 7 Date: July 2012 Please send comments or suggestions on this glossary to 3ie@3ieimpact.org. Recommended citation: 3ie (2012) 3ie impact evaluation glossary. International
More informationCurrent Directions in Mediation Analysis David P. MacKinnon 1 and Amanda J. Fairchild 2
CURRENT DIRECTIONS IN PSYCHOLOGICAL SCIENCE Current Directions in Mediation Analysis David P. MacKinnon 1 and Amanda J. Fairchild 2 1 Arizona State University and 2 University of South Carolina ABSTRACT
More informationPropensity scores: what, why and why not?
Propensity scores: what, why and why not? Rhian Daniel, Cardiff University @statnav Joint workshop S3RI & Wessex Institute University of Southampton, 22nd March 2018 Rhian Daniel @statnav/propensity scores:
More informationChapter 10. Considerations for Statistical Analysis
Chapter 10. Considerations for Statistical Analysis Patrick G. Arbogast, Ph.D. (deceased) Kaiser Permanente Northwest, Portland, OR Tyler J. VanderWeele, Ph.D. Harvard School of Public Health, Boston,
More informationCochrane Pregnancy and Childbirth Group Methodological Guidelines
Cochrane Pregnancy and Childbirth Group Methodological Guidelines [Prepared by Simon Gates: July 2009, updated July 2012] These guidelines are intended to aid quality and consistency across the reviews
More informationReview of Veterinary Epidemiologic Research by Dohoo, Martin, and Stryhn
The Stata Journal (2004) 4, Number 1, pp. 89 92 Review of Veterinary Epidemiologic Research by Dohoo, Martin, and Stryhn Laurent Audigé AO Foundation laurent.audige@aofoundation.org Abstract. The new book
More informationPropensity score analysis with the latest SAS/STAT procedures PSMATCH and CAUSALTRT
Propensity score analysis with the latest SAS/STAT procedures PSMATCH and CAUSALTRT Yuriy Chechulin, Statistician Modeling, Advanced Analytics What is SAS Global Forum One of the largest global analytical
More informationComplier Average Causal Effect (CACE)
Complier Average Causal Effect (CACE) Booil Jo Stanford University Methodological Advancement Meeting Innovative Directions in Estimating Impact Office of Planning, Research & Evaluation Administration
More informationGlossary From Running Randomized Evaluations: A Practical Guide, by Rachel Glennerster and Kudzai Takavarasha
Glossary From Running Randomized Evaluations: A Practical Guide, by Rachel Glennerster and Kudzai Takavarasha attrition: When data are missing because we are unable to measure the outcomes of some of the
More informationIdentifying Peer Influence Effects in Observational Social Network Data: An Evaluation of Propensity Score Methods
Identifying Peer Influence Effects in Observational Social Network Data: An Evaluation of Propensity Score Methods Dean Eckles Department of Communication Stanford University dean@deaneckles.com Abstract
More informationInsights into different results from different causal contrasts in the presence of effect-measure modification y
pharmacoepidemiology and drug safety 2006; 15: 698 709 Published online 10 March 2006 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/pds.1231 ORIGINAL REPORT Insights into different results
More informationTechnical Specifications
Technical Specifications In order to provide summary information across a set of exercises, all tests must employ some form of scoring models. The most familiar of these scoring models is the one typically
More informationOptimal full matching for survival outcomes: a method that merits more widespread use
Research Article Received 3 November 2014, Accepted 6 July 2015 Published online 6 August 2015 in Wiley Online Library (wileyonlinelibrary.com) DOI: 10.1002/sim.6602 Optimal full matching for survival
More informationRemarks on Bayesian Control Charts
Remarks on Bayesian Control Charts Amir Ahmadi-Javid * and Mohsen Ebadi Department of Industrial Engineering, Amirkabir University of Technology, Tehran, Iran * Corresponding author; email address: ahmadi_javid@aut.ac.ir
More informationAnalysis and Interpretation of Data Part 1
Analysis and Interpretation of Data Part 1 DATA ANALYSIS: PRELIMINARY STEPS 1. Editing Field Edit Completeness Legibility Comprehensibility Consistency Uniformity Central Office Edit 2. Coding Specifying
More informationPropensity Score Analysis: Its rationale & potential for applied social/behavioral research. Bob Pruzek University at Albany
Propensity Score Analysis: Its rationale & potential for applied social/behavioral research Bob Pruzek University at Albany Aims: First, to introduce key ideas that underpin propensity score (PS) methodology
More informationApplying Machine Learning Methods in Medical Research Studies
Applying Machine Learning Methods in Medical Research Studies Daniel Stahl Department of Biostatistics and Health Informatics Psychiatry, Psychology & Neuroscience (IoPPN), King s College London daniel.r.stahl@kcl.ac.uk
More informationBy: Mei-Jie Zhang, Ph.D.
Propensity Scores By: Mei-Jie Zhang, Ph.D. Medical College of Wisconsin, Division of Biostatistics Friday, March 29, 2013 12:00-1:00 pm The Medical College of Wisconsin is accredited by the Accreditation
More informationLecture Outline Biost 517 Applied Biostatistics I
Lecture Outline Biost 517 Applied Biostatistics I Scott S. Emerson, M.D., Ph.D. Professor of Biostatistics University of Washington Lecture 2: Statistical Classification of Scientific Questions Types of
More informationEpidemiologic Methods I & II Epidem 201AB Winter & Spring 2002
DETAILED COURSE OUTLINE Epidemiologic Methods I & II Epidem 201AB Winter & Spring 2002 Hal Morgenstern, Ph.D. Department of Epidemiology UCLA School of Public Health Page 1 I. THE NATURE OF EPIDEMIOLOGIC
More informationChapter 11: Advanced Remedial Measures. Weighted Least Squares (WLS)
Chapter : Advanced Remedial Measures Weighted Least Squares (WLS) When the error variance appears nonconstant, a transformation (of Y and/or X) is a quick remedy. But it may not solve the problem, or it
More informationAbstract Title Page. Authors and Affiliations: Chi Chang, Michigan State University. SREE Spring 2015 Conference Abstract Template
Abstract Title Page Title: Sensitivity Analysis for Multivalued Treatment Effects: An Example of a Crosscountry Study of Teacher Participation and Job Satisfaction Authors and Affiliations: Chi Chang,
More informationMediation Analysis With Principal Stratification
University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 3-30-009 Mediation Analysis With Principal Stratification Robert Gallop Dylan S. Small University of Pennsylvania
More informationThrowing the Baby Out With the Bathwater? The What Works Clearinghouse Criteria for Group Equivalence* A NIFDI White Paper.
Throwing the Baby Out With the Bathwater? The What Works Clearinghouse Criteria for Group Equivalence* A NIFDI White Paper December 13, 2013 Jean Stockard Professor Emerita, University of Oregon Director
More informationIntroduction to Applied Research in Economics
Introduction to Applied Research in Economics Dr. Kamiljon T. Akramov IFPRI, Washington, DC, USA Training Course on Introduction to Applied Econometric Analysis November 14, 2016, National Library, Dushanbe,
More informationModule 14: Missing Data Concepts
Module 14: Missing Data Concepts Jonathan Bartlett & James Carpenter London School of Hygiene & Tropical Medicine Supported by ESRC grant RES 189-25-0103 and MRC grant G0900724 Pre-requisites Module 3
More information