IMPROVING PROPENSITY SCORE METHODS IN PHARMACOEPIDEMIOLOGY


Mohammed Sanni Ali

Division of Pharmacoepidemiology and Clinical Pharmacology, Utrecht Institute for Pharmaceutical Sciences, University of Utrecht, Utrecht, the Netherlands

The work presented in this thesis was performed at the Division of Pharmacoepidemiology and Clinical Pharmacology, Utrecht Institute for Pharmaceutical Sciences, Utrecht University, Utrecht, the Netherlands.

The research presented in this thesis was conducted as part of the PROTECT consortium (Pharmacoepidemiological Research on Outcomes of Therapeutics by a European ConsorTium), which is a public-private partnership coordinated by the European Medicines Agency. The PROTECT project is supported by the Innovative Medicine Initiative Joint Undertaking ( under Grant Agreement no , resources of which are composed of financial contribution from the European Union's Seventh Framework Programme (FP7/ ) and EFPIA companies' in-kind contribution. The Division of Pharmacoepidemiology & Clinical Pharmacology, Utrecht University, also received a direct financial contribution from Pfizer.

Financial support for printing of this thesis was provided by Utrecht Institute for Pharmaceutical Sciences (UIPS), Koninklijke Nederlandse Maatschappij ter bevordering der Pharmacie (KNMP), the Dutch Heart Foundation and ChipSoft.

Layout and cover design: Off Page
Cover proposal: M. Sanni Ali and Ephrem Bekele
Printed by: Off Page

Sanni Ali, M. Improving Propensity Score Methods in Pharmacoepidemiology
ISBN/EAN:
Mohammed Sanni Ali

For articles published or accepted for publication, the copyright has been transferred to the respective publisher. No part of this thesis may be reproduced, stored in a retrieval system, or transmitted in any form or by any means without the permission of the author or, when appropriate, the publisher of the manuscript.

IMPROVING PROPENSITY SCORE METHODS IN PHARMACOEPIDEMIOLOGY

Verbeteren van propensity score methoden in de farmacoepidemiologie
(with a summary in Dutch)

THESIS

submitted for the degree of doctor at Utrecht University, on the authority of the rector magnificus, Prof. dr. G.J. van der Zwaan, in accordance with the decision of the Board for the Conferral of Doctoral Degrees, to be defended in public on Wednesday 1 October 2014 at 2.30 in the afternoon

by

Mohammed Sanni Ali

born on 10 May 1982 in Dessie, Ethiopia

Promotors: Prof. dr. A. de Boer, Prof. dr. A.W. Hoes
Co-promotors: Dr. O.H. Klungel, Dr. R.H.H. Groenwold

To Mom, Dad, Sofia, Salman, and Anna


Table of Contents

CHAPTER I: GENERAL INTRODUCTION

CHAPTER II: BALANCE MEASURES AND COVARIATE SELECTION IN PROPENSITY SCORE ANALYSIS
2.1 Assessment of Balance in Propensity Score Methods in the Medical Literature: a Systematic Review
    Journal of Clin Epidemiol 2014, In Press
2.2 Propensity Score Balance Measures in Pharmacoepidemiology: a Simulation Study
    Pharmacoepidemiol Drug Saf 2014; 25:
2.3 Improving Selection of Covariates and Caliper for Optimal Balance in Propensity Score Matching: a Simulation Study
    Submitted

CHAPTER III: APPLICATIONS OF PROPENSITY SCORE AND MARGINAL STRUCTURAL MODELS
3.1 Time-Dependent Propensity Score and Collider-Stratification Bias: the Example of Beta2-Agonist Use and the Risk of Coronary Heart Disease
    Eur J Epidemiol 2013; 28:
3.2 Methodological Comparison of Marginal Structural Model, Time-Varying Cox Regression and Propensity Score Methods: the Example of Antidepressant Use and the Risk of Hip Fracture
    Submitted

CHAPTER IV: PROPENSITY SCORE METHODS AND UNMEASURED CONFOUNDING
4.1 Propensity Score Methods and Unmeasured Confounding Imbalance
    Health Serv Res 2014; 49:
4.2 Quantitative Falsification of Instrumental Variables Assumption Using Balance Measures
    Epidemiology 2014; 25:

CHAPTER V: GENERAL DISCUSSION
Application of Propensity Score Methods to Quantify Treatment Effects: a Step-by-Step Approach for Applied Researchers
    Submitted

APPENDICES
A Summary
B Samenvatting
C Acknowledgement / ምስጋና / Dankwoord
D List of Co-authors
E List of Publications
F About the Author

CHAPTER I

GENERAL INTRODUCTION


Inferences about effects of treatment involve comparisons of potential prognostic outcomes: the effect a certain treatment would have had on a subject who, in fact, received some other (or no) treatment [1]. Intended effects of treatments are ideally investigated using randomized experiments [2,3], where randomization is expected to prevent confounding, i.e. it leads to comparability of treated and untreated groups with respect to all prognostic factors (both measured and unmeasured) except the treatment itself: one group receives the treatment of interest and the other does not. Randomized experiments, however, may not be feasible for ethical, financial, logistic, or even political reasons and may not reflect real-life situations due to a highly selected study population, limited sample size, and short duration of the treatment and/or the follow-up period [4].

In addition, randomization is not always necessary in safety research involving unintended effects, particularly when the adverse event is rare, unexpected and unpredictable, which means that it is usually not associated with the (contra-)indications for treatment [3,5,6]. However, when adverse events are related to the main effect of the intervention and are more or less predictable, prescribing will be guided by the prognosis of the patient and will take into account the potential for adverse effects [3,5,7]. As a consequence, treated and untreated groups are systematically different with respect to prognostic factors in non-randomized comparisons, and a situation similar to non-randomized studies of intended effects of therapies occurs. Consequently, confounding by (contra-)indication will threaten the validity of the study unless methods are applied to prevent confounding [5,7]. Hence, causal inferences from non-randomized (i.e. observational) studies require that the study is designed in analogy with the way randomized experiments are designed [8], or that the systematic differences are accounted for using statistical methods such as regression models.

The first half of the 20th century saw tremendous developments in methods, including randomized experiments, regression and matching methods to prevent or remove confounding, even though the term confounding was not used during that period. However, the theoretical basis of these methods, and in particular of stratification and matching, was not fully developed until the 1950s to early 1970s, when the papers by Mantel and Haenszel [9,10], Cochran [11,12], Cochran and Rubin [13], and Rubin [14] started discussions relating to stratification and matching for evaluations of treatment effects in observational studies. Importantly, dealing with multiple covariates using stratification as well as matching was a challenge at the time, due to both computational and data problems [15]. In 1970, Rubin proposed multivariate adjustment using discriminant matching, which applies linear functions of the covariates and was later discussed by himself [16] and by Cochran and Rubin [13]. In 1976, Miettinen introduced the multivariate confounder score, later called Miettinen's confounder score, as a means to avoid the complexity and inefficiency of multiple cross-classification for controlling confounding in etiologic research [17]. Rosenbaum and Rubin suggested an important advancement in 1983 by introducing propensity score techniques [1].

The propensity score is defined as the conditional probability of assignment to a particular treatment given a vector of measured covariates [1]. In essence, covariates are used to predict the prescription of the treatment in daily practice using multivariable regression techniques. Propensity score methods can be considered the observational-study analogues of randomization in experiments. Importantly, however, randomization is superior in achieving balance on all covariates, both measured and unmeasured, while propensity score methods attempt to balance only measured covariates. Propensity scores help researchers to design and analyze observational studies in a way that mimics randomized experiments, but under the assumption of ignorable treatment assignment (i.e., treatment assignment is independent of a patient's potential outcomes given a set of measured covariates, and both exposed and unexposed subjects exist at every combination of the values of the measured covariates in the population under study, i.e., positivity) [1]. Propensity score techniques are particularly useful in high-dimensional data (i.e., a large number of covariates) when events are rare and exposure is relatively common, which is typical of pharmacoepidemiologic studies [18].

Around the end of the 20th century, Robins extended propensity score applications to settings where treatment and confounders are time-varying. In several other seminal papers by Robins and colleagues, this topic was developed further. However, several aspects of propensity score methods, such as variable selection, balance assessment and reporting of results, have not been fully addressed. The aim of this thesis was to expand the knowledge on several of these crucial aspects of propensity score methods in pharmacoepidemiologic settings using simulations, empirical studies and literature reviews.
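The definition of the propensity score given above, the probability of treatment conditional on measured covariates, can be illustrated with a minimal sketch. It assumes a binary treatment, two hypothetical covariates and simulated data, and it fits a logistic regression by plain gradient ascent using only the Python standard library. All names and parameter values are illustrative assumptions and are not taken from the thesis.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_propensity_model(X, treated, lr=0.5, n_iter=800):
    """Fit P(treated = 1 | x) by logistic regression (batch gradient ascent).

    Returns [intercept, coef_1, ..., coef_p].
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * (p + 1)
    for _ in range(n_iter):
        grad = [0.0] * (p + 1)
        for x, z in zip(X, treated):
            fitted = sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], x)))
            resid = z - fitted
            grad[0] += resid
            for j, v in enumerate(x):
                grad[j + 1] += resid * v
        # Step along the average gradient of the log-likelihood.
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta


def propensity_scores(X, beta):
    # Predicted probability of treatment for each subject.
    return [sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], x))) for x in X]


# Simulated (hypothetical) cohort: one continuous and one binary covariate.
random.seed(1)
X = [[random.gauss(0, 1), float(random.random() < 0.4)] for _ in range(400)]
treated = [int(random.random() < sigmoid(-0.5 + 0.8 * x1 + 0.6 * x2)) for x1, x2 in X]

beta = fit_propensity_model(X, treated)
ps = propensity_scores(X, beta)
```

In applied work the model would normally be fitted with a statistical package; the hand-written fit is used here only to keep the sketch self-contained.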
THESIS OUTLINE

Chapter II addresses covariate selection, balance assessment and balance measures in propensity score analysis. Chapter 2.1 provides a descriptive overview of how covariate selection, balance assessment, and propensity score methods are applied and how important aspects of propensity score methods are reported in the medical literature. Chapter 2.2 evaluates different balance measures of propensity score methods using a simulation study with scenarios typical of pharmacoepidemiologic research. Chapter 2.3 assesses the impact of covariate selection and of assessing balance for different sets of covariates on the bias and precision of the treatment effect estimate. In addition, it demonstrates the usefulness of balance measures in determining the optimal caliper width in propensity score matching.

Chapter III focuses on applications of marginal structural models (MSMs), whose parameters of interest are estimated via inverse probability of treatment weights (using propensity scores) in a time-varying treatment setting. In addition, it compares results from the MSM approach with those of conventional time-varying Cox regression and propensity score methods, and describes the biases underlying the latter approaches. Chapter 3.1 describes the application of these methods in a study of beta2-agonist use and the risk of coronary heart disease, and Chapter 3.2 presents a study of antidepressant use and the risk of hip fracture.

Chapter IV deals with the assumptions underlying propensity score methods, particularly assumptions pertaining to unmeasured confounding. Moreover, Chapter 4.1 illustrates the added value of propensity score methods, compared to conventional regression analysis, in pharmacoepidemiologic research. Chapter 4.2 evaluates the usefulness of balance measures applied in propensity score methods to quantitatively falsify instrumental variable assumptions.

Chapter V provides a step-by-step description of the different aspects of propensity score methods, the interpretation of treatment effect estimates, and the strengths and limitations of different propensity score approaches compared to other available statistical methods. Finally, directions for future research are given.

REFERENCES

1. Rosenbaum PR & Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika 1983; 70:
2. Rosenbaum PR. Observational study. Encyclopedia of Statistics in Behavioral Science. John Wiley & Sons, Ltd;
3. Miettinen OS. The need for randomization in the study of intended effects. Stat Med 1983; 2:
4. Abbing-Karahagopian V et al. Bridging differences in outcomes of pharmacoepidemiological studies: design and first results of the PROTECT project. Curr Clin Pharmacol 2014; 9:
5. Vandenbroucke JP. When are observational studies as credible as randomised trials? The Lancet 2004; 363:
6. Stricker BH & Psaty BM. Detection, verification, and quantification of adverse drug reactions. BMJ 2004; 329:
7. Vandenbroucke JP. Observational research, randomised trials, and two views of medical science. PLoS Medicine 2008; 5: e
8. Rubin DB. On principles for modeling propensity scores in medical research. Pharmacoepidemiol Drug Saf 2004; 13:
9. Mantel N & Haenszel W. Statistical aspects of the analysis of data from retrospective studies of disease. J Natl Cancer Inst 1959; 22:
10. Mantel N. Chi-square tests with one degree of freedom; extensions of the Mantel-Haenszel procedure. JASA 1963; 58:
11. Cochran WG. Matching in analytical studies. Am J Public Health Nations Health 1953; 43:
12. Cochran WG. The effectiveness of adjustment by subclassification in removing bias in observational studies. Biometrics 1968; 24:
13. Cochran WG & Rubin DB. Controlling bias in observational studies: a review. Sankhyā A 1973; 35:
14. Rubin DB. Using multivariate matched sampling and regression adjustment to control bias in observational studies. JASA 1979; 74:
15. Stuart EA. Matching methods for causal inference: a review and a look forward. Stat Sci 2010; 25:
16. Rubin DB. Matching to remove bias in observational studies. Biometrics 1973; 29:
17. Miettinen OS. Stratification by a multivariate confounder score. Am J Epidemiol 1976; 104:
18. Glynn RJ, Schneeweiss S, Stürmer T. Indications for propensity scores and review of their use in pharmacoepidemiology. Basic Clin Pharmacol Toxicol 2006; 98:
19. Robins JM. Marginal structural models. In: 1997 Proceedings of the Section on Bayesian Statistical Science, Alexandria, VA: American Statistical Association, American Medical Association, 1998.
20. Robins JM. Causal inference from complex longitudinal data. In: Berkane M, ed. Latent variable modeling and applications to causality. New York: Springer-Verlag; 1997:
21. Robins JM, Greenland S & Hu F. Estimation of the causal effect of a time-varying exposure on the marginal mean of a repeated binary outcome. JASA 1999; 94:
22. Robins JM, Hernán MÁ & Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology 2000; 11:
23. Hernán MÁ, Brumback B & Robins JM. Marginal structural models to estimate the causal effect of zidovudine on the survival of HIV-positive men. Epidemiology 2000; 11:

CHAPTER II

BALANCE MEASURES AND COVARIATE SELECTION IN PROPENSITY SCORE ANALYSIS


CHAPTER 2.1

Assessment of Balance in Propensity Score Methods in the Medical Literature: a Systematic Review

M Sanni Ali, RHH Groenwold, WR Pestman, SV Belitser, KCB Roes, AW Hoes, A de Boer, OH Klungel

Journal of Clinical Epidemiology (In Press)


ABSTRACT

BACKGROUND
The current practice of selecting variables for the PS, choosing a specific propensity score method, and measuring and reporting covariate balance is not well documented. This study aimed to assess the current practice of propensity score (PS) analysis in the medical literature, particularly the assessment and reporting of balance on confounders.

METHODS
A PubMed search identified studies using PS methods from December 2011 through May 2012. For each article included in the review, information was extracted on important aspects of the PS such as the type of PS method used, variable selection for the PS model, and assessment of balance.

RESULTS
Among 296 articles included in the review, variable selection for the PS model was explicitly reported in 102 (34.4%) studies. Covariate balance was checked and reported in 177 (59.8%) studies. P-values were the most commonly used statistical tools to report balance (125/177, 70.6%). The standardized difference and graphical displays were reported in 45 (25.4%) and 11 (6.2%) articles, respectively. Matching on the PS was the most commonly used approach to control for confounding (68.9%), followed by PS adjustment (20.9%), PS stratification (13.9%) and inverse probability of treatment weighting (IPTW, 7.1%). Balance was more often checked in articles using PS matching and IPTW, 70.6% and 71.4%, respectively.

CONCLUSIONS
The execution and reporting of covariate selection and assessment of balance are far from optimal. Recommendations on reporting of PS analysis are provided to allow better appraisal of the validity of PS-based studies.

INTRODUCTION
In observational studies, treated and control subjects often differ systematically on prognostic factors, leading to treatment-selection bias or confounding in estimating the (adverse) effect of treatment on an outcome. Analytical tools such as propensity score (PS) methods are applied to correct for such confounding bias.
In their seminal paper, Rosenbaum and Rubin described the PS as a balancing score: treated and untreated subjects with the same PS tend to have similar distributions of measured characteristics [1]. In other words, assuming no unmeasured confounding and adequately measured confounders, conditioning on the PS allows one to obtain an unbiased estimate of the average treatment effect at that value of the PS. PS analysis involves two key steps: deriving the PS from the data and estimating the treatment effect by using the PS to control for confounding. The first step involves an iterative process of fitting a PS model (e.g. using logistic regression) on selected covariates until an optimal balance on those covariates is achieved [2]. Despite the growing popularity of PS methods in the epidemiologic literature, criteria for selecting variables for a PS model are not well developed compared to variable selection for conventional outcome models [3,4]. Once the propensity scores are derived, an intermediate step is to apply one of four possible methods (matching, stratification or subclassification, covariate adjustment, or inverse probability of treatment weighting using the PS) and to check the balance of the covariate distributions between treatment groups using an appropriate metric [2]. The choice of PS method affects the way balance on covariates is assessed and depends on the specific research question, the target population and the inferential goals of the study [5-7]. Finally, the effect of treatment on the outcome is estimated using the PS method chosen in the previous step.

Although the use of PS methods has shown a dramatic increase in the medical literature [8], previous literature reviews indicated that most authors do not adequately report information on propensity score model development [9,10] or on the balance of covariates between the treatment groups in PS analysis [8,9,11,12], and that those who do report often use inappropriate diagnostics [8,9,11]. In addition, researchers often omit explicit discussion of the PS estimate (estimand) and its relationship with their research question [5]. However, these reviews were limited to propensity score matching [8,11] or provided only limited detail on current practice. PS methodology has evolved over the last few years, during which researchers have proposed recommendations on variable selection for the PS [4] and statistical tools for checking balance and/or selecting the optimal PS model, and have advised against the use of some statistics, such as significance testing or pre-matching c-statistics, for evaluating covariate balance and the appropriateness of a PS model [17]. However, the current practice of selecting variables for the PS, choosing a specific propensity score method, and measuring and reporting covariate balance is not well documented. Therefore, the objective of this study was twofold. First, it aimed to systematically review the practice of variable selection and PS model building, with emphasis on the assessment and reporting of balance when using PS analysis in the medical literature. Second, it provided practical recommendations on the reporting of PS analysis.

METHODS

We performed a PubMed search to identify studies that employed different propensity score methods. The search was conducted on 5 June 2012 using the keywords "propensity score(s)" or "propensity matching" in all fields (title, abstract, body or references), identifying 2317 unduplicated references. To assess the current practice, we limited our search to six months (December 2011 to May 2012). Articles were excluded if they addressed only methodological or statistical aspects of PS, were unrelated to medical research, were published in languages other than English, or were reviews, editorials or letters.

All authors discussed which aspects of PS analysis data had to be collected on, but the extraction was performed by one of the investigators (MSA). From each article included in the review, we extracted information on the type of PS method used, the methods used to estimate the PS, how variables were selected for inclusion in the PS model, whether balance on confounders was checked, and the methods used for checking balance as well as the appropriateness of the PS model. When PS matching was used, we recorded whether the articles mentioned the matching algorithm applied, the treated:untreated matching ratios used, the size of the matched sample as well as the starting population, and whether matching was taken into account in the analysis. When stratification on the PS was applied, we extracted information on the quantiles of the PS used (deciles, quintiles, quartiles or tertiles). In addition, information on the impact factor (IF) of the journals [25] and the SCImago Journal Rank (SJR) indicator from Scopus, a measure of the quality of the journals, was extracted for articles included in the review to allow direct comparison of sources in different subject fields. The chi-squared test was used to compare the frequency of reporting balance assessment and the use of different balance metrics among quintiles of the IF as well as the SJR of the journals in which the reviewed articles were published.

RESULTS

The PubMed search identified 388 articles, of which 92 were excluded: methodological or statistical papers (n=20), articles unrelated to medical research (n=63), articles published in languages other than English (n=6), reviews (n=2), and editorials or letters (n=1) (Figure 1). This resulted in a review of 296 articles published in the medical literature between December 2011 and May 2012 that employed PS methods in empirical data. The articles included for analysis were related to cardiovascular research (148, 50.0%), cancer research (41, 13.9%), renal research (18, 6.1%), neurological research (16, 5.4%), respiratory research (15, 5.1%), and other fields of medical research (57, 19.2%).
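The chi-squared comparison across journal quintiles described in the Methods above can be sketched as follows. This is a minimal illustration using only the standard library, not the analysis code of the review: the counts are hypothetical, and the p-value function relies on the closed form of the chi-squared upper tail that holds for even degrees of freedom (a 2 x 5 table, as for quintiles, has 4 degrees of freedom).

```python
import math


def chi_squared_statistic(table):
    """Pearson chi-squared statistic for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def chi_squared_p_value(stat, df):
    """Upper-tail probability of the chi-squared distribution (even df only)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form implemented for positive even df only")
    half = stat / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


# Hypothetical counts: balance reported (row 1) or not (row 2) per journal quintile.
table = [
    [38, 35, 36, 34, 34],
    [21, 24, 23, 25, 25],
]
df = (len(table) - 1) * (len(table[0]) - 1)  # (2 - 1) * (5 - 1) = 4
stat = chi_squared_statistic(table)
p_value = chi_squared_p_value(stat, df)
```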
Surgical interventions and drug-related evaluation studies constituted the majority of the articles included in the review, 138 (46.6%) and 108 (36.5%), respectively (Table 1).

Figure 1. Flow chart of abstract or article extraction for the systematic review. Of 388 articles identified, 92 were excluded (63 non-clinical, 20 methodological, 6 other language, 2 systematic reviews, 1 editorial/letter), leaving 296 articles available for review: drug-related intervention, 108 (36.5%); clinical*, 50 (16.9%); surgical intervention, 138 (46.6%).
* Studies that did not involve drug-related or surgical interventions were classified as clinical.

Table 1. Classification of propensity score-based articles included in the review by body system and type of exposure (number and percentage)

Body system            Drug-related (n=108)   Surgical (n=138)   Clinical* (n=50)   Total (n=296)
Cardiovascular         46 (42.6)              85 (61.6)          17 (34.0)          148 (50.0)
Cancer                 11 (10.2)              27 (19.6)          3 (6.0)            41 (13.9)
Renal                  4 (3.7)                9 (6.5)            5 (10.0)           18 (6.1)
Respiratory            7 (6.5)                4 (2.9)            4 (8.0)            15 (5.1)
Neurological           10 (9.3)               2 (1.4)            4 (8.0)            16 (5.4)
Infection              9 (8.3)                1 (0.7)            5 (10.0)           15 (5.1)
General/Non-specific   11 (10.2)              4 (2.9)            9 (18.0)           24 (8.1)
Others**               10 (9.3)               6 (4.3)            3 (6.0)            18 (6.1)

* Studies which did not involve drug-related or surgical interventions were classified as clinical.
** Others include digestive, lymphatic, musculoskeletal and ophthalmic studies.

Variable Selection and PS Estimation

Most articles (194, 65.5%) did not explicitly mention how variables were selected for the PS model. Variables' association with treatment, with outcome, and with both treatment and outcome were considered and reported in 38 (12.8%), 39 (13.2%), and 30 (10.1%) studies, respectively. Background knowledge was specifically mentioned in 14 (4.7%) studies; only four of these articles explicitly reported that they took into account variables' association with the outcome and/or treatment. Inclusion of interaction or higher-order terms in the PS model was reported in 17 (5.7%) articles, but none of these articles mentioned any motivation for including them. Only seven (2.4%) articles reported the PS model itself and how the variables were modelled. Other methods considered include step-wise variable selection methods, c-statistics (n=41, 13.9%), Hosmer-Lemeshow goodness-of-fit tests (n=25, 8.4%), and balance measures (n=48, 16.2%). Almost all studies (283, 95.6%) reported the variables included in the PS model.

The majority of articles reported that they used binary logistic regression for estimating the PS (260, 87.5%). Four articles reported that they used multinomial logistic regression where the exposure was categorical with more than two levels, and two other articles reported that they used recursive partitioning for estimating the PS for binary exposures. Other methods reported include the probit model (n=1) and the high-dimensional PS model (n=3, 1%).

Balance Assessment and PS Methods

Balance of covariates between treatment groups was checked and reported in 177 (59.8%) of the articles, and the most commonly used statistical tool to report balance was the p-value (125/177, 70.6%) from hypothesis testing (e.g., chi-square test or t-test). The standardized difference was used in 45 (25.4%) of the studies where balance was reported, and 11 (6.2%) employed graphical displays such as standardized difference plots, PS boxplots, kernel plots and histograms to assess balance (Table 2). The Hosmer-Lemeshow goodness-of-fit test and the c-statistic of the PS model were reported in 26 (8.8%) and 39 (13.2%) of the reviewed studies, respectively. The frequency of balance assessment did not seem to differ among studies involving surgical interventions (61.6%), drug-related interventions (57.4%) and clinical studies (58%) (p > 0.05).

Table 2. The frequency of different methods used for checking balance of confounders between treatment groups among the different propensity score methods

Method                           Number of articles (n)*   Matching (n=204)   Covariate adjustment (n=62)   Stratification (n=41)   IPTW (n=21)
SDif (standardized difference)   45 (25.4)                 42 (20.6)          6 (9.7)                       1 (2.4)                 3 (14.3)
P-values                         125 (70.6)                105 (51.5)         15 (24.2)                     13 (31.7)               10 (47.6)
Graphical displays               11 (6.2)                  6 (2.9)            3 (4.8)                       2 (4.9)                 -
Eye-balling                      4 (2.3)                   3 (1.5)            1 (1.6)                       2 (4.9)                 -
Others**                         13 (7.3)                  10 (4.9)           5 (8.0)                       1 (2.4)                 1 (4.7)

* Number of studies includes those in which balance was checked and reported (n=177); the total does not add up to 177 since some of the articles may have used more than one type of PS method (matching, covariate adjustment, stratification or IPTW).
P-values: from hypothesis testing (chi-square test or Fisher's exact test, t-test, Wilcoxon's signed-rank test, McNemar's test, Mann-Whitney U test, the Kruskal-Wallis test, and logistic regression).
** Others include the Kolmogorov-Smirnov test, Lévy distance, overlapping coefficient, multivariable models and percent reduction in bias.

The PS can be used in several ways to control for confounding: matching, stratification, PS adjustment, or inverse probability weighting.
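Two of the approaches just listed, weighting and stratification, can be sketched in a few lines. The sketch below is illustrative only: the function names and example values are hypothetical, the weights shown are the usual ones for the average treatment effect in the whole population (other estimands use other weights), and strata are formed by rank so that each stratum holds roughly the same number of subjects.

```python
def iptw_weights(ps, treated):
    """Inverse probability of treatment weights: 1/PS if treated, 1/(1 - PS) if not."""
    return [1.0 / p if z == 1 else 1.0 / (1.0 - p) for p, z in zip(ps, treated)]


def ps_strata(ps, n_strata=5):
    """Assign each subject to a rank-based PS stratum, numbered 0 .. n_strata - 1."""
    order = sorted(range(len(ps)), key=lambda i: ps[i])
    strata = [0] * len(ps)
    for rank, i in enumerate(order):
        strata[i] = rank * n_strata // len(ps)
    return strata


# Hypothetical propensity scores and treatment indicators.
weights = iptw_weights([0.2, 0.5, 0.8, 0.4], [0, 1, 1, 0])

# Ten hypothetical subjects split into quintiles of the PS (two per stratum).
quintiles = ps_strata([0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6, 0.95], n_strata=5)
```

Subjects with a PS very close to 0 or 1 receive very large weights, which is one reason the overlap of PS distributions is usually inspected before weighting.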
Matching on the PS was the most commonly used approach (204/296, 68.9%), followed by PS adjustment (62/296, 20.9%) and stratification using the PS (41/296, 13.9%). Inverse probability weighting was applied in 21 (7.1%) articles. Three studies did not mention the PS methods used, and 26 (8.8%) articles used combinations of two or more of the PS methods. Ten studies reported that they performed sensitivity analysis, and three studies specifically indicated that the PS was used as a sensitivity analysis.

Among the studies that employed PS matching, one-to-one matching was the most frequently reported approach (118/204, 57.8%); the matching ratio was neither explicitly reported nor clear from the data in 73 (35.8%) of the studies employing PS matching (Table 3). The matching algorithm used to form the matched pairs was reported in only 67 (32.8%) studies that applied PS matching, and greedy matching with nearest-neighbour matches was most often reported (n=42). Other matching approaches reported include 5-to-1-digit greedy matching [6] (n=10), greedy matching on Mahalanobis distance (n=6), 8-to-1-digit greedy matching [6] (n=3), stratified matching (n=3), optimal matching (n=1), and exact matching (n=2). The reporting of caliper width was poor and inconsistent (ranging to 0.06 standard deviations on the logit of the PS); the most frequently used caliper was 0.02 standard deviations on the logit of the PS (n=20). Unmatched subjects were reported as excluded from the analysis in four studies and as retained in one study for efficiency reasons.

Table 3. The frequency of the different PS methods and balance assessment

PS method                        Number of articles (n)   Balance checked
PS matching*                     204 (68.9)               144 (70.6)
  1:1 matching                   118
  1:2 matching                   3                        2
  1:3 matching                   4                        3
  1:4 matching                   5                        3
Covariate adjustment using PS    62 (20.9)                25 (40.3)
Stratification using PS*         41 (13.9)                17 (41.5)
  Quintiles of PS                21
  Deciles of PS                  8                        1
  Quartiles of PS                3                        2
  Tertiles of PS                 5                        3
IPTW                             21 (7.1)                 15 (71.4)
Mixed**                          26 (8.8)                 18 (69.2)

* Some articles did not mention the ratio of treated:untreated patients used in matching or the quantiles used in stratification on the PS.
** Studies used a combination of two or more of the various PS methods (matching, covariate adjustment, stratification or IPTW).

Covariate balance was more often checked and reported in studies using PS matching (144/204, 70.6%) and inverse probability of treatment weighting (15/21, 71.4%) (Table 3). Balance before and after matching, stratification or weighting was compared in only 110 (37.2%) articles. Among the studies that applied stratification on the PS, stratification based on quintiles of the PS was the most common application (21/41, 51.2%). Four studies (9.8%) did not mention the PS quantiles that were used for stratification. Four studies reported that observations in the first and fifth quintiles of the PS were excluded due to lack of overlap (non-positivity).
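To make the matching and balance-checking steps discussed above concrete, the following minimal sketch performs greedy 1:1 nearest-neighbour matching without replacement within a caliper on the logit of the PS, and then compares the absolute standardized difference of one covariate before and after matching. It is an illustration under stated assumptions rather than the procedure of any reviewed study: the data are hypothetical, the caliper multiplier in the example is arbitrary, the standard deviation of the logit is taken over all subjects, and treated subjects are processed in the order given.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def mean(values):
    return sum(values) / len(values)


def sample_sd(values):
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def greedy_caliper_match(ps_treated, ps_control, caliper_sd):
    """Greedy 1:1 matching on logit(PS); returns (treated_index, control_index) pairs."""
    lt = [logit(p) for p in ps_treated]
    lc = [logit(p) for p in ps_control]
    caliper = caliper_sd * sample_sd(lt + lc)
    available = set(range(len(lc)))
    pairs = []
    for i, value in enumerate(lt):
        if not available:
            break
        # Closest unused control; ties are broken by the lower index.
        j = min(available, key=lambda k: (abs(lc[k] - value), k))
        if abs(lc[j] - value) <= caliper:
            pairs.append((i, j))
            available.remove(j)
    return pairs


def standardized_difference(x_treated, x_control):
    """Absolute difference in means divided by the pooled standard deviation."""
    pooled_sd = math.sqrt((sample_sd(x_treated) ** 2 + sample_sd(x_control) ** 2) / 2.0)
    return abs(mean(x_treated) - mean(x_control)) / pooled_sd


# Hypothetical propensity scores and one hypothetical covariate (age in years).
ps_treated, age_treated = [0.30, 0.55, 0.90], [60, 65, 80]
ps_control, age_control = [0.28, 0.50, 0.52, 0.10], [59, 66, 64, 40]

pairs = greedy_caliper_match(ps_treated, ps_control, caliper_sd=0.2)
before = standardized_difference(age_treated, age_control)
after = standardized_difference(
    [age_treated[i] for i, _ in pairs],
    [age_control[j] for _, j in pairs],
)
```

Unlike a p-value, the standardized difference does not depend directly on sample size, so it can be compared between the full and the (smaller) matched sample.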
Use and Reporting of Different Diagnostic Methods in Different Journals

The IF of the journals in which the reviewed articles were published ranged from 0.1 (the Korean Journal of Thoracic Cardiovascular Surgery) to 53.3 (The New England Journal of Medicine, NEJM); there was no association between a journal's IF and the frequency of reporting PS analysis (p > 0.05). The SJR indicator of the journals varied between 0.11 (Managed Care) and (NEJM). The frequency of balance checking was similar among studies within quintiles of the SJR indicator (64% and 57% in the first and fifth quintiles of the SJR indicator, respectively). However, the use and reporting of p-values from hypothesis testing, c-statistics and goodness-of-fit tests of the PS model were less common in studies published in journals from the fifth quintile of the SJR ranking compared to those published in journals from the first quintile (37.3% versus 49.2%, 10.2% versus 16.9%, and 6.8% versus 13.8% for p-values, c-statistics and goodness-of-fit tests, respectively; p > 0.05). In addition, the use of the absolute standardized difference for measuring and reporting balance was higher, although not significantly so, in studies published in journals within the higher quintiles of the SJR indicator (18.6% and 19.6% in the fourth and fifth quintiles versus 9.7% and 12.1% in the first and second quintiles of the SJR indicator; p > 0.05). Similarly, PS weighting for estimating the treatment effect was more often used in studies published in the highest quintile of the SJR indicator (16.9% in the fifth quintile versus 3.4% in the first quintile, p < 0.05).

DISCUSSION

The propensity score method has become a commonly used method for controlling confounding in observational studies. This systematic review reveals that the process of variable selection, the assessment and/or reporting of covariate balance, and the assessment of propensity score model fit are inconsistent. Moreover, only a limited number of studies reported critical aspects of propensity score model development or used appropriate statistical methods for checking balance.
In general, as with other observational studies, the conduct and reporting of propensity score analyses in the medical literature are poor, despite extensive methodological discussion of the topic in the last few years. 3-5,8,9,11,16,17,19,22,28-30 Major methodological contributions to propensity score methods are summarized in the appendix to this manuscript (Appendix 1). Our study is consistent with previous systematic reviews with respect to the low quality of reporting and/or conduct of propensity score analysis. 8,9,11,12 However, our review included a large number of recently published studies in order to evaluate the current status of conducting and reporting propensity score analysis. To our knowledge, this review is the first to specifically address the current practice of variable selection based on clinical knowledge and on treatment and/or outcome associations, apart from algorithmic methods, as well as the use of discrimination and goodness-of-fit tests of the model. In addition, much effort was put into extracting detailed information on other important aspects of propensity score methods that could help applied researchers as well as readers appraise the validity of a given propensity score analysis. This review is representative of the medical literature because the included studies were restricted neither to epidemiological journals nor to high-impact journals.

It is possible that authors performed detailed analyses but reported only limited information because of journal revision and editorial restrictions; 8, 9 however, this does not seem to be a strong justification for poor reporting of the results, for two main reasons. First, those studies that did report aspects of the propensity score analysis used inappropriate statistical methods such as significance testing or c-statistics. Second, inconsistency of reporting was observed irrespective of the impact factor of the journal in which the article was published, and even within a specific journal, which is in line with a previous systematic review. 12 The lack of well-established standards for conducting and reporting propensity score analysis may contribute to its inconsistent and poor execution and reporting, despite substantial advances in propensity score methodology in the last few years. 3,17,18,22,24,29,31 We therefore propose that critical items relating to propensity score analysis be incorporated into guidelines on the reporting of observational studies, such as the STROBE statement 32 and the ENCePP guide on methodological standards in pharmacoepidemiology, 33 to improve the quality of reporting.

Before coming to recommendations for the reporting of studies that utilize propensity score methods, we summarize important issues in the conduct of propensity score analysis.

First, variable selection for the propensity score model should be conducted with great care, since the choice of variables has a large effect on both the bias and the precision of the treatment effect estimate. 2-4,30,34,35 The choice of variables, interactions, and/or higher-order terms should be based primarily on prior clinical knowledge. 15,30 Confounders should always be included in the propensity score model. Selecting variables based on their association with the outcome, irrespective of the exposure, can improve the precision of an estimated treatment effect without increasing bias. 3,15 In contrast, variables that are strongly related to the treatment but not to the outcome (instrumental variables), or that are only weakly related to the outcome, should not be included in the propensity score model, since such variables can amplify bias in the presence of unmeasured confounding, particularly in non-linear models, and decrease the precision of the treatment effect estimate. 14,15,21,30,35

Second, the propensity score model is fitted and the propensity score values are extracted, using methods such as ordinary logistic regression or recursive partitioning. Machine learning techniques such as neural networks and classification and regression trees (CART) 36 have been shown to perform better in terms of bias reduction and to provide more consistent 95% confidence interval coverage than logistic regression approaches, particularly under conditions of both non-additivity and non-linearity. 37, 38

Third, an appropriate propensity score method is applied: matching, stratification, or inverse probability weighting. The choice of the propensity score method affects the way balance on covariates should be assessed (fourth step); for example, in propensity score matching and stratification on the propensity score, balance can easily be assessed by comparing the distribution of covariates between matched groups or within strata of the PS, respectively. It also dictates the treatment effect estimation (fifth step) and its interpretation (sixth step); hence, the choice of a specific propensity score method should be based on the inferential goal of the research. 5, 7, 39
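To make the weighting option in the third step concrete, the following is a minimal sketch of how inverse probability of treatment weights could be derived from already estimated propensity scores. It is an illustration under stated assumptions, not the procedure of any particular study reviewed here: the function name `iptw_weights`, the 0/1 coding of treatment, and the example values are hypothetical, and the stabilized variant simply uses the marginal probability of the received treatment as numerator, which is one common construction.

```python
from typing import List, Sequence


def iptw_weights(treated: Sequence[int], ps: Sequence[float],
                 stabilized: bool = False) -> List[float]:
    """Inverse probability of treatment weights from estimated propensity scores.

    treated: 1 for treated, 0 for untreated (hypothetical input coding).
    ps:      estimated probability of treatment, strictly between 0 and 1.
    """
    if len(treated) != len(ps):
        raise ValueError("treated and ps must have the same length")
    if any(not 0.0 < p < 1.0 for p in ps):
        raise ValueError("propensity scores must lie strictly between 0 and 1")

    # Marginal probability of treatment; numerator of the stabilized weights.
    p_treated = sum(treated) / len(treated)

    weights = []
    for t, p in zip(treated, ps):
        if t == 1:
            numerator = p_treated if stabilized else 1.0
            weights.append(numerator / p)          # weight 1/PS for treated
        else:
            numerator = (1.0 - p_treated) if stabilized else 1.0
            weights.append(numerator / (1.0 - p))  # weight 1/(1-PS) for untreated
    return weights


# Hypothetical toy data: two treated and two untreated patients.
treated = [1, 1, 0, 0]
ps = [0.8, 0.5, 0.5, 0.2]
plain = iptw_weights(treated, ps)
stable = iptw_weights(treated, ps, stabilized=True)
```

In this toy example the stabilized weights are the plain weights scaled by the marginal treatment probability of each group, which narrows their range without changing the relative weighting within a treatment group.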

Fourth, balance on measured baseline characteristics between treated and untreated patients is assessed in the (matched, stratified, or weighted) sample using appropriate balance diagnostics that are specific to a sample and not influenced by sample size. 40, 41 Accordingly, balance should be assessed on the selected set of covariates that are confounders and/or independent predictors of the outcome (first step). The use of the pre-matching c-statistic and goodness-of-fit tests for covariate selection and for assessing propensity score model fit should be avoided, since such methods do not provide information to detect confounders that are missing from the propensity score model. 22, 24 In our previous studies, we recommended the absolute standardized difference as the balance measure of choice, since it has shown superior performance over other balance measures, such as the overlapping coefficient, in both simulation and empirical studies. In addition, the absolute standardized difference is a familiar, easy-to-calculate, and reasonably well-understood tool for epidemiologists compared with other balance measures. Once the absolute standardized difference has been calculated for each covariate, covariate square, and important pairwise interaction, the covariate-specific absolute standardized differences can be pooled into a single measure using empirically derived weights based on the strength of the association between the covariates (or covariate terms) and the outcome, as suggested by Belitser et al. 19 Recently, the post-matching c-statistic of the propensity score model 42 has been suggested as an overall measure of the imbalance across covariates.
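The calculation of the absolute standardized difference described above can be sketched as follows. This is a minimal illustration under assumptions: the pooled standard deviation is taken as the square root of the average of the two group variances, the pooling across covariates is a simple weighted mean, and the function names, covariate names, and weights are hypothetical. It is not necessarily the exact weighting scheme of Belitser et al.

```python
from math import sqrt
from statistics import mean, variance
from typing import Dict, Sequence


def abs_std_diff(treated: Sequence[float], control: Sequence[float]) -> float:
    """Absolute difference in means divided by the pooled standard deviation."""
    m_t, m_c = mean(treated), mean(control)
    # Assumption: pooled SD = sqrt of the average of the two sample variances.
    pooled_sd = sqrt((variance(treated) + variance(control)) / 2.0)
    if pooled_sd == 0.0:
        return 0.0 if m_t == m_c else float("inf")
    return abs(m_t - m_c) / pooled_sd


def pooled_asd(asd_by_term: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean of covariate-specific differences (illustrative pooling)."""
    total = sum(weights[term] for term in asd_by_term)
    return sum(weights[term] * asd_by_term[term] for term in asd_by_term) / total


# Hypothetical matched sample with one continuous and one binary covariate.
asd = {
    "age": abs_std_diff([60, 65, 70, 75, 80], [55, 60, 65, 70, 75]),
    "diabetes": abs_std_diff([1, 1, 0, 0, 1], [0, 1, 0, 0, 1]),
}
# Hypothetical weights standing in for covariate-outcome association strength.
overall = pooled_asd(asd, {"age": 2.0, "diabetes": 1.0})
```

The same function could be applied to squared covariates and pairwise products to cover the higher-order terms mentioned in the text; each such term would then need its own weight in the pooled measure.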
The post-matching c-statistic has shown comparable performance with the absolute standardized difference in terms of indicating bias, and it is also simpler to compute; however, unlike with the absolute standardized difference, it is difficult to assess a covariate's potential for confounding (by checking balance on the covariate's scale) or to identify whether the imbalances are due to a set of related covariates. 21, 42, 43 An iterative process of fitting the propensity score model, checking balance on covariates, and respecifying the propensity score model has been suggested by Rosenbaum. 2 A comparison of different statistical methods for assessing and reporting balance and propensity score model fit is summarized in Table 4.

Table 4. Comparison of Different Statistical Methods for Assessing and Reporting PS Model Fit and/or Covariate Balance

Test of significance
Short description:
- Assesses evidence in favor of some claim about the population from which the sample has been drawn.
- Frequently used to compare the distribution of measured baseline covariates between treated and control subjects in the PS analysis.
Strengths:
- Easy to use
- Easy to interpret
- Can be derived from non-parametric tests
- Scale invariant
Limitations:
- Influenced by sample size
- Not a characteristic of a sample (relates to a hypothetical population)
- Arbitrary cut-off (threshold)
- Gives little or no information on whether the PS model has been correctly specified (hence, bias)

C-statistic
Short description:
- Refers to the ability of the PS model to accurately distinguish treated subjects from untreated ones.
- For binary exposure, it is identical to the area under the receiver operating characteristic (ROC) curve.
- The value ranges between 0.5 (classification no better than pure chance) and 1.0 (perfect classification).
Strengths:
- Easy to use
- Easy to interpret
- Gives information on the full model
- Scale invariant
- Non-parametric
Limitations:
- Gives no signal whether the PS model has been correctly specified or key confounders have been omitted from the PS model (hence, no indication of bias). 8, 22, 47
- Higher c-statistics may not necessarily indicate optimal balance. 8, 22, 24
- Arbitrary cut-off (threshold)
- Influenced by sample size

Goodness-of-fit test
Short description:
- A measure of the compatibility of the observed values from the data with the predicted values from the model in question, i.e., it shows how well the selected model describes the data. 22,46
Strengths:
- Easy to use
- Easy to interpret
- Semi-parametric as well as non-parametric alternatives
- Gives information on the full model
- Scale invariant
Limitations:
- Influenced by sample size
- Does not indicate bias
- Indicates only model fit, not balance of covariates
- Arbitrary cut-off (threshold)

Overlapping Coefficient (OVL)
Short description:
- The amount of overlap in the density of covariate distributions for treated and untreated subjects.
- For continuous covariates, a non-parametric kernel density can be used to estimate the OVL. 19, 48
- For dichotomous covariates, it is the proportion of overlap.
- Ranges between zero (no overlap, i.e. perfect imbalance) and one (complete overlap, i.e. perfect balance).
Strengths:
- Characteristic of a sample
- Graphical presentation of PS distribution
- Scale-invariant
- Non-parametric
- Good indicator of bias in large sample size 19, 21
Limitations:
- Influenced by sample size* 19, 21
- Estimation is complex
- Relies on densities
- Arbitrary cut-off (threshold)

Kolmogorov-Smirnov Distance (D)
Short description:
- The maximum vertical distance between two cumulative distribution functions of a certain covariate for treated and untreated subjects, expressed as relative frequencies. 19,49,50
- Ranges between zero (perfect balance) and one (complete imbalance).
Strengths:
- Characteristic of a sample
- Non-parametric
- Does not need densities
- Scale-invariant 49
- Clear interpretation
Limitations:
- Influenced by sample size*
- Estimation is complex
- Fails to capture convergence of distribution
- Arbitrary cut-off (threshold)

Lévy distance (L)
Short description:
- The side length of the largest square that can be inscribed between the cumulative distribution functions of a certain covariate for treated and untreated subjects, with sides parallel to the coordinate axes. 19, 50
- This distance can range between zero (perfect balance) and one (complete imbalance).
Strengths:
- Non-parametric
- Characteristic of a sample
- Does not need densities
- Captures convergence of distribution
Limitations:
- Not scale-invariant
- Estimation is complex
- Influenced by sample size*
- Interpretation is complex
- Arbitrary cut-off (threshold)

Absolute standardized difference
Short description:
- The absolute difference in means between the two treatment groups divided by an estimate of the common standard deviation of that variable in the two treatment groups, i.e. the pooled standard deviation. 17, 23, 51, 52
- Describes the observed bias in the means (or proportions) of covariates across treatment groups, expressed as a percentage of the pooled standard deviation. 43
Strengths:
- Easy to calculate
- Non-parametric
- Clear interpretation
- Scale-invariant
- Not influenced by sample size 19,
- Characteristic of a sample
Limitations:
- Arbitrary cut-off (threshold)

*The correlation with bias is influenced by sample size only when covariates are continuous or mixed binary and continuous. 19,21

Fifth, the treatment effect is estimated using statistical methods appropriate to the type of outcome (for example, a Cox proportional hazards model for time-to-event data or logistic regression for binary data). When propensity score matching is used, the matched nature of the data, or the lack of independence between observations, should be taken into account in the analysis, particularly when matching is done with replacement. 5, 8,11 In inverse probability of treatment weighting, the use of stabilized weights can help normalize the range of the inverse probabilities and increase the efficiency of the analysis. 39,53-56 Since there is no strong theory regarding when balance is close enough, examining the sensitivity of results to a range of propensity score specifications is recommended. 5,24

Last but not least, the treatment effect estimate (estimand) should be interpreted carefully, and the relationships among this estimand, the research question, and the target population should be explained. 5 For example, marginal structural models using inverse probability of treatment weighting estimate a marginal treatment effect, 53, 54 which on average is similar to the treatment effect in randomized studies; thus, the estimand can be directly interpreted as the average causal treatment effect (ATE) between treated and untreated patients. However, covariate adjustment and stratification using the propensity score give conditional treatment effect estimates, and their interpretation is not straightforward. 28 This is particularly the case when non-collapsible effect measures such as the odds ratio and hazard ratio are used, where the conditional and marginal effect estimates differ in the presence of a non-null treatment effect. 7, 39, 57, 58 On the other hand, propensity score matching typically focuses on either the effect of the treatment in the treated or the effect of the treatment in the untreated, not on the average treatment effect in the whole population. 5,6 It is important to note that exclusion of unmatched patients from the analysis not only affects the precision of the effect estimate but could also have consequences for the generalizability of results. 5,6 More sophisticated methods such as full matching or one-to-many matching can make use of all available data and may improve the performance of propensity score matching in terms of reducing bias. 59 In addition, the choice of an appropriate caliper and matching algorithm in propensity score matching deserves great attention for achieving good balance, thereby reducing bias in estimating a treatment effect. We refer to the literature for detailed aspects of propensity score matching. 6,31,

Figure 2. Flow chart summarizing relevant information to be reported when conducting PS analysis. The PS estimation can be repeated iteratively until an optimal balance on covariates is reached ( ). It is not relevant to report goodness-of-fit tests, the pre-matching c-statistic of the PS model, the actual PS values, the PS model itself, or P-values and model coefficients from the PS model.

Our systematic review of current reporting of propensity score methods in the medical literature shows that the quality of reporting of variable selection, assessment of covariate balance, and other important aspects of propensity score analysis is far from optimal. The conduct of studies that use propensity score methods can be split into six essential steps, each of which should be clearly reported. Recommendations for the reporting of propensity score methods are summarized in Figure 2. These recommendations may improve the quality of reporting of propensity score methods, which would allow for a better appraisal of propensity score based studies.
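As a brief illustration of why the caliper and the matching algorithm matter in propensity score matching, the following is a minimal sketch of greedy 1:1 nearest-neighbour matching without replacement. It is only one of many possible algorithms and is not the method of any study reviewed here. The function name `greedy_caliper_match`, the patient identifiers, the PS values, and the caliper of 0.05 applied on the raw PS scale are all hypothetical choices made for the example.

```python
from typing import Dict, List, Tuple


def greedy_caliper_match(ps_treated: Dict[str, float],
                         ps_control: Dict[str, float],
                         caliper: float) -> List[Tuple[str, str]]:
    """Greedy 1:1 nearest-neighbour matching on the PS, without replacement.

    Treated patients are processed in the order given; each takes the closest
    still-available control, provided the PS distance is within the caliper.
    """
    available = dict(ps_control)  # controls not yet used in a pair
    pairs: List[Tuple[str, str]] = []
    for t_id, t_ps in ps_treated.items():
        if not available:
            break
        # Closest remaining control on the propensity score.
        c_id = min(available, key=lambda c: abs(available[c] - t_ps))
        if abs(available[c_id] - t_ps) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]  # matching without replacement
    return pairs


# Hypothetical propensity scores for three treated and four untreated patients.
ps_treated = {"t1": 0.30, "t2": 0.62, "t3": 0.90}
ps_control = {"c1": 0.28, "c2": 0.33, "c3": 0.60, "c4": 0.45}
pairs = greedy_caliper_match(ps_treated, ps_control, caliper=0.05)
matched_treated = {t for t, _ in pairs}
unmatched = [t for t in ps_treated if t not in matched_treated]
```

In this toy example the treated patient with the highest propensity score has no control within the caliper and is therefore excluded, which illustrates the point made above that a narrower caliper improves closeness of matches at the cost of precision and, potentially, generalizability.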

REFERENCES

1. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika 1983; 70:
2. Rosenbaum PR, Rubin DB. Reducing bias in observational studies using subclassification on the propensity score. J Am Stat Assoc 1984; 79:
3. Brookhart MA, Schneeweiss S, Rothman KJ, et al. Variable selection for propensity score models. Am J Epidemiol 2006; 163:
4. Patrick AR, Schneeweiss S, Brookhart MA, et al. The implications of propensity score variable selection strategies in pharmacoepidemiology: an empirical illustration. Pharmacoepidemiol Drug Saf 2011; 20:
5. Hill J. Discussion of research using propensity-score matching: Comments on "A critical appraisal of propensity-score matching in the medical literature between 1996 and 2003" by Peter Austin, Statistics in Medicine. Stat Med 2008; 27:
6. Lunt M. Selecting an appropriate caliper can be essential for achieving good balance with propensity score matching. Am J Epidemiol 2014; 179:
7. Ali MS, Groenwold RH, Klungel OH. Propensity Score Methods and Unobserved Covariate Imbalance: Comments on "Squeezing the Balloon". Health Serv Res 2014; 49:
8. Austin PC. A critical appraisal of propensity-score matching in the medical literature between 1996 and 2003. Stat Med 2008; 27:
9. Weitzen S, Lapane KL, Toledano AY, et al. Principles for modeling propensity scores in medical research: a systematic literature review. Pharmacoepidemiol Drug Saf 2004; 13:
10. Shah BR, Laupacis A, Hux JE, Austin PC. Propensity score methods gave similar results to traditional regression modeling in observational studies: a systematic review. J Clin Epidemiol 2005; 58:
11. Austin PC. Propensity-score matching in the cardiovascular surgery literature from 2004 to 2006: a systematic review and suggestions for improvement. J Thorac Cardiovasc Surg 2007; 134:
12. D'Ascenzo F, Cavallero E, Biondi-Zoccai G, et al. Use and misuse of multivariable approaches in interventional cardiology studies on drug-eluting stents: a systematic review. J Interv Cardiol 2012; 25:
13. Brookhart MA, Stürmer T, Glynn RJ, et al. Confounding control in healthcare database research: challenges and potential approaches. Med Care 2010; 48: S114-S
14. Pearl J. On a class of bias-amplifying variables that endanger effect estimates. In: Grünwald P, Spirtes P, Eds. Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence (UAI 2010). Corvallis, OR: Association for Uncertainty in Artificial Intelligence; 2010:
15. Myers JA, Rassen JA, Gagne JJ, et al. Effects of adjusting for instrumental variables on bias and precision of effect estimates. Am J Epidemiol 2011; 174:
16. Pearl J. Invited commentary: understanding bias amplification. Am J Epidemiol 2011; 174:
17. Austin PC. Goodness-of-fit diagnostics for the propensity score model when estimating treatment effects using covariate adjustment with the propensity score. Pharmacoepidemiol Drug Saf 2008; 17:
18. Austin PC. Assessing balance in measured baseline covariates when using many-to-one matching on the propensity-score. Pharmacoepidemiol Drug Saf 2008; 17:
19. Belitser SV, Martens EP, Pestman WR, et al. Measuring balance and model selection in propensity score methods. Pharmacoepidemiol Drug Saf 2011; 20:

20. Groenwold RHH, Vries F, Boer A, et al. Balance measures for propensity score methods: a clinical example on beta-agonist use and the risk of myocardial infarction. Pharmacoepidemiol Drug Saf 2011; 20:
21. Ali MS, Groenwold RH, Pestman WR, et al. Propensity score balance measures in pharmacoepidemiology: a simulation study. Pharmacoepidemiol Drug Saf DOI: / pds
22. Weitzen S, Lapane KL, Toledano AY, et al. Weaknesses of goodness-of-fit tests for evaluating propensity score models: the case of the omitted confounder. Pharmacoepidemiol Drug Saf 2005; 14:
23. Austin PC. Balance diagnostics for comparing the distribution of baseline covariates between treatment groups in propensity-score matched samples. Stat Med 2009; 28:
24. Westreich D, Cole SR, Funk MJ, et al. The role of the c-statistic in variable selection for propensity score models. Pharmacoepidemiol Drug Saf 2011; 20:
25. Falagas ME, Kouranos VD, Arencibia-Jorge R, Karageorgopoulos DE. Comparison of SCImago journal rank indicator with journal impact factor. The FASEB J 2008; 22:
26. Gonzalez-Pereira B, Guerrero-Bote VP, Moya-Anegón F. A new approach to the metric of journals' scientific prestige: The SJR indicator. J Informetr 2010; 4:
27. Bornmann L, Marx W, Gasparyan AY, Kitas GD. Diversity, value and limitations of the journal impact factor and alternative metrics. Rheumatol Int 2012; 32:
28. Martens EP, Pestman WR, De Boer A, et al. Systematic differences in treatment effect estimates between propensity score methods and logistic regression. Int J Epidemiol 2008; 37:
29. Glynn RJ, Schneeweiss S, Stürmer T. Indications for propensity scores and review of their use in pharmacoepidemiology. Basic Clin Pharmacol Toxicol 2006; 98:
30. Myers JA, Rassen JA, Gagne JJ, et al. Myers et al. Respond to "Understanding Bias Amplification". Am J Epidemiol 2011; 174:
31. Stuart EA. Developing practical recommendations for the use of propensity scores: Discussion of "A critical appraisal of propensity score matching in the medical literature between 1996 and 2003" by Peter Austin, Statistics in Medicine. Stat Med 2008; 27:
32. von Elm E, Altman DG, Egger M, et al. The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. Prev Med 2007; 45:
33. ENCePP Guide on Methodological Standards in Pharmacoepidemiology. EMA/95098/2010. Available at: standardsandguidances. Accessed March 11,
34. Mortimer KM, Neugebauer R, Van Der Laan M, Tager IB. An application of model-fitting procedures for marginal structural models. Am J Epidemiol 2005; 162:
35. Pearl J. Invited commentary: Understanding bias amplification. Am J Epidemiol 2011; 174:
36. Westreich D, Lessler J, Funk MJ. Propensity score estimation: neural networks, support vector machines, decision trees (CART), and meta-classifiers as alternatives to logistic regression. J Clin Epidemiol 2010; 63:
37. Lee BK, Lessler J, Stuart EA. Weight trimming and propensity score weighting. PLoS One 2011; 6: e
38. Setoguchi S, Schneeweiss S, Brookhart MA, Glynn RJ, Cook EF. Evaluating uses of data mining techniques in propensity score estimation: a simulation study. Pharmacoepidemiol Drug Saf 2008; 17:
39. Ali MS, Groenwold RH, Pestman WR, et al. Time-dependent propensity score and collider-stratification bias: an example of beta2-agonist use and the risk of coronary heart disease. Eur J Epidemiol 2013; 28:

40. Ho DE, Imai K, King G, Stuart EA. Matching as nonparametric preprocessing for reducing model dependence in parametric causal inference. Political Analysis 2007; 15:
41. Imai K, King G, Stuart EA. Misunderstandings between experimentalists and observationalists about causal inference. J R Stat Soc: Series A (Statistics in Society) 2008; 171:
42. Franklin JM, Rassen JA, Ackermann D, et al. Metrics for covariate balance in cohort studies of causal effects. Stat Med Doi: /sim
43. Cohen J. Statistical Power Analysis for the Behavioral Sciences. 2nd edn. Hillsdale, NJ: Lawrence Erlbaum Associates Publishers;
44. Ash A, Shwartz M. R2: a useful measure of model performance when predicting a dichotomous outcome. Stat Med 1999; 18:
45. Hanley JA, McNeil BJ. A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology 1983; 148:
46. Hosmer Jr DW, Lemeshow S. Applied logistic regression. 2nd edn. New York: John Wiley & Sons;
47. Drake C. Effects of misspecification of the propensity score on estimators of treatment effect. Biometrics 1993; 49:
48. Silverman BW. Density estimation for statistics and data analysis. London: Chapman & Hall/CRC;
49. Stephens MA. Use of the Kolmogorov-Smirnov, Cramér-Von Mises and related statistics without extensive tables. J R Stat Soc: Series B (Methodological) 1970; 32:
50. Pestman WR. Mathematical Statistics: An Introduction. 2nd edn. Berlin: Walter De Gruyter Inc;
51. Fleiss JL, Levin B, Paik MC. Statistical methods for rates and proportions. 2nd edn. Hoboken, New Jersey: John Wiley & Sons;
52. Hartung J, Knapp G. Statistical inference in adaptive group sequential trials with the standardized mean difference as effect size. Sequential Analysis 2011; 30:
53. Hernán MÁ, Brumback B, Robins JM. Marginal structural models to estimate the causal effect of zidovudine on the survival of HIV-positive men. Epidemiology 2000; 11:
54. Robins JM, Hernán MÁ, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology 2000; 11:
55. Hernán MA, Hernández-Díaz S, Robins JM. A structural approach to selection bias. Epidemiology 2004; 15:
56. Kurth T, Walker AM, Glynn RJ, et al. Results of multivariable logistic regression, propensity matching, propensity adjustment, and propensity-based weighting under conditions of nonuniform effect. Am J Epidemiol 2006; 163:
57. Greenland S, Pearl J. Adjustments and their consequences: collapsibility analysis using graphical models. 2010; 79:
58. Austin PC, Grootendorst P, Anderson GM. A comparison of the ability of different propensity score models to balance measured variables between treated and untreated subjects: a Monte Carlo study. Stat Med 2007; 26:
59. Rassen JA, Shelat AA, Myers J, et al. One-to-many propensity score matching in cohort studies. Pharmacoepidemiol Drug Saf 2012; 21:
60. Austin PC. Optimal caliper widths for propensity-score matching when estimating differences in means and differences in proportions in observational studies. Pharmaceutical Statistics 2011; 10:
61. Austin PC. A comparison of 12 algorithms for matching on the propensity score. Stat Med 2013; 33:

APPENDICES

Appendix 1: Summary of Articles on Important Aspects of Propensity Score Analysis

Figure 1. Summary of articles on important aspects of propensity score analysis (list of references below)

First, the search term "propensity score" OR "propensity scores" in the title, abstract, or key terms was used in Scopus. Methodological studies with at least 80 citations in Scopus were extracted irrespective of subject area. In addition, within the above search results, the additional search terms "variable selection", "covariate balance", "diagnostics", "goodness-of-fit tests", or "C-statistics" in the title, abstract, or key terms were used to filter articles that contributed to variable selection and balance assessment in PS analysis. For the latter group, only methodological articles with a minimum of five citations in Scopus were selected and included in the reference tree below. The search was performed on the 15th Dec

List of References in the Figure

1. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika 1983; 70:
2. Rosenbaum PR, Rubin DB. Reducing bias in observational studies using subclassification on the propensity score. J Am Stat Assoc 1984; 79:
3. Rosenbaum PR, Rubin DB. Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. Am Statistician 1985; 39:
4. Rosenbaum PR, Rubin DB. The bias due to incomplete matching. Biometrics 1985; 14:
5. Drake C. Effects of misspecification of the propensity score on estimators of treatment effect. Biometrics 1993; 49:
6. Rosenbaum PR. Observational Studies. New York: Springer;
7. Rubin DB, Thomas N. Matching using estimated propensity score: relating theory to practice. Biometrics 1996; 52:
8. Rubin DB. Estimating causal effects from large data sets using the propensity score. Ann Intern Med 1997; 127:
9. D'Agostino RB, Jr. Propensity score methods for bias reduction in the comparison of a treatment to a non-randomized control group. Stat Med 1998; 17:
10. Joffe MM, Rosenbaum PR. Invited commentary: Propensity scores. Am J Epidemiol 1999; 150:
11. Dehejia RH, Wahba S. Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs. J Am Stat Assoc 1999; 94:
12. Perkins SM, Tu W, Underhill MG, et al. The use of propensity scores in pharmacoepidemiologic research. Pharmacoepidemiol Drug Saf 2000; 9:
13. Imbens GW. The role of the propensity score in estimating dose-response functions. Biometrika 2000; 87:
14. Robins JM, Hernán MÁ, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology 2000; 11:
15. Hernán MÁ, Brumback B, Robins JM. Marginal structural models to estimate the causal effect of zidovudine on the survival of HIV-positive men. Epidemiology 2000; 11:
16. Little RJ, Rubin DB. Causal effects in clinical and epidemiological studies via potential outcomes: Concepts and analytical approaches. Ann Rev Pub Health 2000; 21:
17. Rubin DB. Using propensity scores to help design observational studies: application to the tobacco litigation. Health Serv and Out Res Method 2001; 2:

18. Hirano K, Imbens GW. Estimation of causal effects using propensity score weighting: An application to data on right heart catheterization. Health Serv and Out Res Method 2001; 2:
19. Dehejia RH, Wahba S. Propensity score-matching methods for non-experimental causal studies. Rev Econ and Stat 2001; 84:
20. Braitman LE, Rosenbaum PR. Rare outcomes, common treatments: Analytic strategies using propensity scores. Ann Int Med 2001; 137:
21. Cepeda MS, Boston R, Farrar JT, Strom BL. Comparison of logistic regression versus propensity score when the number of events is low and there are multiple confounders. Am J Epidemiol 2003; 158:
22. Weitzen S, Lapane KL, Toledano AY, et al. Principles for modeling propensity scores in medical research: a systematic literature review. Pharmacoepidemiol Drug Saf 2004; 13:
23. Klungel OH, Martens EP, Psaty BM, et al. Methods to assess intended effects of drug treatment in observational studies are reviewed (Review). J Clin Epidemiol 2004; 54:
24. Imai K, Van Dyk DA. Causal inference with general treatment regimes: Generalizing the propensity score. J Am Stat Assoc 2004; 99:
25. Lunceford JK, Davidian M. Stratification and weighting via the propensity score in estimation of causal treatment effects: A comparative study. Stat Med 2004; 23:
26. McCaffrey DF, Ridgeway G, Morral AR. Propensity score estimation with boosted regression for evaluating causal effects in observational studies. Psychol Methods 2004; 9:
27. Rubin DB. On principles for modeling propensity scores in medical research. Pharmacoepidemiol Drug Saf 2004; 13:
28. Weitzen S, Lapane KL, Toledano AY, et al. Weaknesses of goodness-of-fit tests for evaluating propensity score models: the case of omitted confounders. Pharmacoepidemiol Drug Saf 2005; 14:
29. Shah BR, Laupacis A, Hux JE, Austin PC. Propensity score methods gave similar results to traditional regression modeling in observational studies: a systematic review. J Clin Epidemiol 2005; 58:
30. Luellen JK, Shadish WR, Clark MH. Propensity scores: An introduction and experimental test (Review). Eval Rev 2005; 29:
31. Stürmer T, Schneeweiss S, Avorn J, Glynn RJ. Adjusting effect estimates for unmeasured confounding with validation data using propensity score calibration. Am J Epidemiol 2005; 162:
32. Stürmer T, Schneeweiss S, Brookhart MA, et al. Analytic strategies to adjust confounding using exposure propensity scores and disease risk scores: Non-steroidal anti-inflammatory drugs and short-term mortality in the elderly. Am J Epidemiol 2005; 161:
33. Brookhart MA, Schneeweiss S, Rothman KJ, Glynn RJ, Avorn J, Stürmer T. Variable selection for propensity score models. Am J Epidemiol 2006; 163:
34. Stürmer T, Joshi M, Glynn RJ, et al. A review of the application of propensity score methods yielded increasing use, advantages in specific settings, but not substantially different estimates compared with conventional multivariable methods. J Clin Epidemiol 2006; 59:
35. Glynn RJ, Schneeweiss S, Stürmer T. Indications for propensity scores and review of their use in pharmacoepidemiology. Basic Clin Pharmacol Toxicol 2006; 98:
36. Austin PC, Mamdani MM. A comparison of propensity score methods: A case-study estimating the effectiveness of post-AMI statin use. Stat Med 2006; 25:
37. Kurth T, Walker AM, Glynn RJ, et al. Results of multivariable logistic regression, propensity matching, propensity adjustment, and propensity-based weighting under conditions of nonuniform effect. Am J Epidemiol 2006; 163:
38. Ho DE, Imai K, King G, Stuart EA. Matching as nonparametric preprocessing for reducing model dependence in parametric causal inference. Political Analysis 2007; 15:
39. Rubin DB. The design versus the analysis of observational studies for causal effects: Parallels with the design of randomized trials. Stat Med 2007; 26:
40. D'Agostino RBJ. Propensity scores in cardiovascular research (Review). Circul 2007; 115:
41. Austin PC, Grootendorst P, Anderson GM. A comparison of the ability of different propensity score models to balance measured variables between treated and untreated subjects: A Monte Carlo study. Stat Med 2007; 26:
42. D'Agostino RBJ, D'Agostino Sr RB. Estimating treatment effects using observational data. JAMA 2007; 297:
43. Austin PC. Goodness-of-fit diagnostics for the propensity score model when estimating treatment effects using covariate adjustment with the propensity score. Pharmacoepidemiol Drug Saf 2008; 17:
44. Imai K, King G, Stuart E. Misunderstandings among experimentalists and observationalists about causal inference. J the Royal Stat Soc, Series A 2008; 171(Part 2, Forthcoming):
45. Martens EP, Pestman WR, de Boer A, et al. Systematic differences in treatment effect estimates between propensity score methods and logistic regression. Int J Epidemiol 2008; 37:
46. Austin PC. A critical appraisal of propensity-score matching in the medical literature between 1996 and 2003. Stat Med 2008; 27:
47. Schneeweiss S, Rassen JA, Glynn RJ, et al. High-dimensional propensity score adjustment in studies of treatment effects using health care claims data. Epidemiol 2009; 20:
48. Rubin DB. Should observational studies be designed to allow lack of balance in covariate distributions across treatment groups? Stat Med 2009; 28:
49. Austin PC. The relative ability of different propensity score methods to balance measured covariates between treated and untreated subjects in observational studies. Medical Decision Making 2009; 29:
50. Austin PC. Balance diagnostics for comparing the distribution of baseline covariates between treatment groups in propensity-score matched samples. Stat Med 2009; 28:
51. Harder VS, Stuart EA, Anthony JC. Propensity score techniques and the assessment of measured covariate balance to test causal associations in psychological research. Psychol Methods 2010; 15:
52. Lee BK, Lessler J, Stuart EA. Improving propensity score weighting using machine learning. Stat Med 2010; 29:
53. Stuart EA. Matching Methods for Causal Inference: A review and a look forward. Stat Science 2010; 25:
54. Westreich D, Cole SR, Funk MJ, et al. The role of the c-statistic in variable selection for propensity score models. Pharmacoepidemiol Drug Saf 2011; 20:
55. Pearl J. Invited commentary: understanding bias amplification. Am J Epidemiol 2011; 174:
56. Austin PC. An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multi Behav Res 2011; 46:
57. Myers JA, Rassen JA, Gagne JJ, et al. Effects of adjusting for instrumental variables on bias and precision of effect estimates. Am J Epidemiol 2011; 174:
58. Rassen JA, Shelat AA, Myers J, et al. One-to-many propensity score matching in cohort studies. Pharmacoepidemiol Drug Saf 2012; 21:
59. Belitser SV, Martens EP, Pestman WR, et al. Measuring balance and model selection in propensity score methods. Pharmacoepidemiol Drug Saf 2011; 20:

39 60. Sekhon JS. Multivariate and propensity score matching software with automated balance optimization: The matching package for R. J Stat Soft 2011; 42: Patrick AR, Schneeweiss S, Brookhart MA, et al. The implications of propensity score variable selection strategies in pharmacoepidemiology: an empirical illustration. Pharmacoepidemiol Drug Saf 2011; 20: Sekhon JS, Grieve RD. A matching method for improving covariate balance in cost-effectiveness analyses. Health Econ 2012; 21: Rassen JA, Shelat AA, Myers J, et al. One-to-many propensity score matching in cohort studies. Pharmacoepidemiol Drug Saf 2012; 21: Vansteelandt S, Maarten B, Gerda C. On Model Selection and Model Misspecification in Causal Inference. Stat Methods Medic Res 2012; 21: Assessment of balance in propensity score methods in the medical literature 39

Appendix 2: R Codes for Balance Metrics

1. The absolute standardized difference (SDif) is defined as the absolute difference in means (or proportions) between the two treatment groups divided by an estimate of the common standard deviation of that variable in the two treatment groups, i.e. the pooled standard deviation.

R code to calculate the absolute standardized difference

    # a: covariate values in one treatment group, b: values in the other group
    # ldich = TRUE for a binary (0/1) covariate, FALSE for a continuous one
    SDiff <- function(a, b, ldich = FALSE) {
      if (ldich) {
        # Binary covariate
        if (length(table(c(a, b))) == 1) {
          d <- 0  # no variation in the covariate at all
        } else {
          pc <- mean(a)
          pt <- mean(b)
          d <- abs((pt - pc) / sqrt((pt * (1 - pt) + pc * (1 - pc)) / 2))
        }
      } else {
        # Continuous covariate
        xc <- mean(a)
        xt <- mean(b)
        sc <- sd(a)
        st <- sd(b)
        d <- abs((xt - xc) / sqrt((st^2 + sc^2) / 2))
      }
      d
    }

2. Calculating the overlapping coefficient (OVL) involves estimating two density functions evaluated at the same x values and then calculating their overlap. The function ovl needs two input vectors of observations on the covariate, one for each group (group1 and group0). We used the R built-in function density with the normal reference rule bandwidth.nrd. To calculate the overlap, we used Simpson's rule on a grid of 101 points. A plot of the two densities and the overlap is optional (plot = TRUE).

R code to calculate OVL

    library(MASS)  # provides bandwidth.nrd()

    ovl <- function(group0, group1, plot = FALSE) {
      wd1 <- bandwidth.nrd(group1)
      wd0 <- bandwidth.nrd(group0)
      # The numeric padding factor was lost in the source text;
      # 0.75 is an assumed value.
      pad  <- 0.75 * mean(c(wd1, wd0))
      from <- min(group1, group0) - pad
      to   <- max(group1, group0) + pad
      d1 <- density(group1, n = 101, width = wd1, from = from, to = to)
      d0 <- density(group0, n = 101, width = wd0, from = from, to = to)
      dmin <- pmin(d1$y, d0$y)
      # Simpson's rule on the 101-point grid
      n <- length(d1$x)
      h <- (d1$x[n] - d1$x[1]) / (n - 1)
      ovl <- (h / 3) * (4 * sum(dmin[seq(2, n, by = 2)]) +
                        2 * sum(dmin[seq(3, n - 1, by = 2)]) +
                        dmin[1] + dmin[n])
      if (plot) {
        maxy <- max(d0$y, d1$y)
        minx <- min(d0$x)
        plot(d1, type = "l", lty = 1, ylim = c(0, maxy), ylab = "Density", xlab = "")
        lines(d0, lty = 3)
        lines(d1$x, dmin, type = "h")
        text(minx, maxy, "OVL = ")
        text(minx + 0.085 * (max(d1$x) - minx), maxy, round(ovl, 3))
      }
      round(ovl, 3)
    }

    # Example
    treated   <- rnorm(100, 10, 3)
    untreated <- rnorm(100, 15, 5)
    ovl(untreated, treated, plot = TRUE)

3. The Kolmogorov-Smirnov distance (KSD)

R code to calculate the Kolmogorov-Smirnov distance with optional figure

    # using function ks2 used within function ks.gof
    ksdist <- function(group0, group1, plot = FALSE) {
      n0 <- length(group0)
      n1 <- length(group1)
      total <- sort(unique(c(group0, group1)))
      ma0 <- match(group0, total)
      ma1 <- match(group1, total)
      # Empirical cumulative distribution functions on the pooled sample points
      F0 <- cumsum(tabulate(ma0, length(total))) / n0
      F1 <- cumsum(tabulate(ma1, length(total))) / n1
      diff <- abs(F0 - F1)
      ks <- max(diff)
      if (plot) {
        x.ks <- order(ks - diff)[1]  # index of the largest vertical distance
        plot(F1, type = "l", lty = 1, ylab = "Cumulative density", xlab = "")
        lines(F0, lty = 3)
        lines(c(x.ks, x.ks), c(F0[x.ks], F1[x.ks]), lty = 2)
        text(0.08 * (n0 + n1), 1, "K-S distance = ")
        text(0.20 * (n0 + n1), 1, ks)
      }
      ks
    }

    # Example
    treated   <- rnorm(100, 10, 3)
    untreated <- rnorm(100, 15, 5)
    ksdist(untreated, treated, plot = TRUE)
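As a sanity check on the calculation above, the distance should agree with the two-sample D statistic reported by base R's ks.test. The sketch below is not part of the original appendix: it restates ksdist compactly (plot option dropped) so that the block runs on its own, and the seed is arbitrary.

```r
# Compact restatement of the appendix's ksdist (plot option omitted).
ksdist <- function(group0, group1) {
  total <- sort(unique(c(group0, group1)))
  F0 <- cumsum(tabulate(match(group0, total), length(total))) / length(group0)
  F1 <- cumsum(tabulate(match(group1, total), length(total))) / length(group1)
  max(abs(F0 - F1))
}

set.seed(1)  # arbitrary seed, for reproducibility only
treated   <- rnorm(100, 10, 3)
untreated <- rnorm(100, 15, 5)

d_manual <- ksdist(untreated, treated)
d_base   <- unname(ks.test(untreated, treated)$statistic)

# Both routes compute max |F0 - F1| over the pooled sample points
all.equal(d_manual, d_base)
```

Because ks.test also returns a p-value, it is worth remembering that the chapter uses the distance purely as a descriptive balance measure, not as a hypothesis test.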

4. The Lévy distance can be calculated using the following two functions, mecdf and Levy.

R code to calculate the Lévy distance

    # Minimum and maximum of F1 - F0 over the pooled sample points
    mecdf <- function(group0, group1) {
      n0 <- length(group0)
      n1 <- length(group1)
      total <- sort(unique(c(group0, group1)))
      ma0 <- match(group0, total)
      ma1 <- match(group1, total)
      F0 <- cumsum(tabulate(ma0, length(total))) / n0
      F1 <- cumsum(tabulate(ma1, length(total))) / n1
      c(min(F1 - F0), max(F1 - F0))
    }

    Levy <- function(u, v) {
      f <- function(s, u, v) mecdf(u, v - s)[1] + s
      g <- function(s, u, v) mecdf(u, v + s)[2] - s
      a <- min(c(u, v))
      b <- max(c(u, v))
      width <- b - a
      # The source prints "low=c, up=c" and an empty tol value; the search
      # interval [-width, width] is assumed here, and uniroot's default
      # tolerance is used because the original value was lost.
      z1 <- uniroot(f, lower = -width, upper = width, u = u, v = v)
      z2 <- uniroot(g, lower = -width, upper = width, u = u, v = v)
      max(z1$root, z2$root)
    }

    # Example
    treated   <- rnorm(100, 10, 3)
    untreated <- rnorm(100, 15, 5)
    mecdf(untreated, treated)
    Levy(untreated, treated)


Chapter 2.2
Propensity Score Balance Measures in Pharmacoepidemiology: a Simulation Study

M Sanni Ali, RHH Groenwold, WR Pestman, SV Belitser, KCB Roes, AW Hoes, A de Boer, OH Klungel

Pharmacoepidemiology and Drug Safety 2014; 25:


ABSTRACT

BACKGROUND
Conditional on the propensity score (PS), treated and untreated subjects have the same distribution of measured baseline characteristics when the PS model is appropriately specified. The performance of several PS balance measures in assessing the balance of confounders achieved by a specific PS model, and in selecting the optimal PS model, has been evaluated in simulation studies. However, these studies involved only normally distributed variables. Comparisons in binary or mixed covariate distributions with rare outcomes, typical of pharmacoepidemiological settings, are scarce.

METHODS
Monte-Carlo simulations were performed to examine the performance of different balance measures in terms of selecting an optimal PS model and thereby reducing bias. The balance of covariates between treatment groups was assessed using the absolute standardized difference, the Kolmogorov-Smirnov distance, the Lévy distance and the overlapping coefficient. Spearman's correlation coefficients (r) between each of these balance measures and bias were calculated.

RESULTS
In large sample sizes (n ≥ 1000), all balance measures were similarly correlated with bias (r ranging between ) irrespective of the strength of the treatment effect and the frequency of the outcome. In smaller sample sizes with mixed binary and continuous covariate distributions, these correlations were low for all balance measures (r ranging between ), except for the absolute standardized difference (r = 0.51).

CONCLUSIONS
The absolute standardized difference, which is an easy-to-calculate balance measure, displayed consistently better performance across different simulation scenarios. Therefore, it should be the balance measure of choice for measuring and reporting the amount of balance reached, as well as for selecting the final PS model.

INTRODUCTION
In observational studies, attributing causality remains a major challenge because treatment is not randomly assigned, which leads to systematic differences in covariate distributions between treated and untreated subjects. This may bias estimates of the treatment effect unless adequate statistical adjustments are made. Propensity score (PS) methods are a commonly used set of tools to correct for such differences. 1 The PS is a balancing score: conditional on the PS, prognostic patient characteristics tend to be balanced between treatment groups; that is, treatment received is independent of measured patient characteristics. 1,

A critical step in PS analysis, although often poorly described or even ignored, is the iterative derivation of an adequately specified PS model and the assessment of the balance reached on measured covariates between treatment groups. 1,3-5 A check for the appropriateness of such a PS model is the degree to which measured baseline characteristics are balanced between treated and untreated groups. 3 Selection of variables based on their association with the outcome has been recommended for constructing an optimal PS model in terms of the bias-variance trade-off. On the other hand, detailed knowledge of potential confounders and their association with both the treatment and the outcome of interest is usually incomplete in practice. 11 Hence, investigators may be forced to select variables for a PS model in data-driven ways. 7 However, sub-optimal procedures such as stepwise variable selection algorithms or c-statistics should not be chosen: such standard model-building tools are designed to create good predictive models of the exposure and will not necessarily lead to optimal PS models and good balance. 5-7

Others have proposed balance measures to assess the balance of covariate distributions between treatment groups reached by a specific PS model and to select an optimal PS model (in terms of achieved balance) among a variety of possible models. 12,13 Different balance measures, including the standardized difference, 3,4,14 the Kolmogorov-Smirnov distance (KSD), 15,16 and the Lévy distance (L), 16,17 have been evaluated in simulation studies. Mean-based measures such as the standardized difference were highly correlated with bias in the effect estimate irrespective of sample size, whereas KSD and L were highly correlated with bias only in large sample sizes. 12,13 The overlapping coefficient (OVL), 12,18 another balance measure, showed lower correlation with bias in large sample sizes. 12 However, these simulation studies considered only a limited range of scenarios that might not be generalizable to pharmacoepidemiologic studies, which typically deal with categorical patient characteristics and rare outcomes.
Therefore, we conducted a simulation study to assess the performance of different balance measures for PS models in various research settings typical of pharmacoepidemiologic research.

METHODS

Propensity Score Balance Measures
Different PS balance measures have been proposed. In this study, we evaluated four balance measures: the standardized difference, OVL, KSD, and L. These balance measures were chosen because they are characteristics of the sample and, unlike other statistics such as the t-statistic, are not part of hypothesis testing. 3,12,14 For a more detailed explanation of these balance measures, we refer to the literature. 3,4,12,14,19 Here, we only describe the key characteristics of these measures.

The absolute standardized difference (SDif) for binary confounders is the absolute difference in proportions of the confounder between treated and untreated subjects standardized to the variation in the confounding variable (i.e. the standard deviation). It has a minimum value of zero (perfect balance) but no maximum value. For continuous confounders, the SDif is the absolute difference in the means of confounders in treated and untreated subjects standardized to the variation.
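Written out, the two versions of the SDif take the following form. The pooled standard deviation matches the R code in Appendix 2 of Chapter 2.1; the subscripts t and c (treated and untreated) are notation introduced here for illustration.

```latex
% Binary confounder: \hat p_t, \hat p_c are the prevalences in treated and untreated subjects
\mathrm{SDif}_{\text{binary}}
  = \frac{\lvert \hat p_t - \hat p_c \rvert}
         {\sqrt{\dfrac{\hat p_t (1 - \hat p_t) + \hat p_c (1 - \hat p_c)}{2}}}

% Continuous confounder: \bar x and s are the group means and standard deviations
\mathrm{SDif}_{\text{continuous}}
  = \frac{\lvert \bar x_t - \bar x_c \rvert}
         {\sqrt{\dfrac{s_t^2 + s_c^2}{2}}}

% Worked example with hypothetical prevalences \hat p_t = 0.40 and \hat p_c = 0.30:
\mathrm{SDif}
  = \frac{0.10}{\sqrt{(0.24 + 0.21)/2}}
  = \frac{0.10}{\sqrt{0.225}}
  \approx 0.21
```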

The overlapping coefficient (OVL), also called the proportion of similar response (PSR), is the amount of overlap in the density of confounder distributions for treated and untreated subjects. The OVL can range between zero (no overlap, i.e. perfect imbalance) and one (complete overlap, i.e. perfect balance). For continuous confounders, we used non-parametric kernel density estimation to estimate the OVL. 12,20,21 For dichotomous confounders, we calculated the proportion of overlap.

The Kolmogorov-Smirnov distance (KSD) is defined as the maximum of all vertical distances between two cumulative distribution functions of a certain confounder for treated and untreated subjects, expressed as relative frequencies. 16 This distance can range between zero (perfect balance) and one (complete imbalance).

The Lévy distance (L), a variant of KSD, is defined intuitively as the side length of the largest square that can be inscribed between the curves of the cumulative distribution functions of a certain confounder for treated and untreated subjects, with sides parallel to the coordinate axes. 12,16 Unlike KSD, L takes the horizontal distance into account; hence, it is not scale invariant, meaning that it is sensitive to the unit of measurement of the confounder. Therefore, when different confounders are included in the estimation, one should standardize the confounders before these measures can be compared. 12 L can range between zero (perfect balance) and one (complete imbalance).

Simulation
We performed Monte-Carlo simulations to assess the performance of balance measures in selecting a PS model with respect to the amount of bias in the effect estimate. In addition, we determined how the inclusion of different covariates, their interactions, and higher-order terms in a PS model affects this bias. For the simulations, we used the framework of Austin and Belitser et al. 12,19,22,23

Figure 1. Causal diagram showing variables related to both treatment and outcome, i.e. confounders of the treatment-outcome pair (X1, X2, X4, and X5), variables related to treatment only (X7 and X8), variables related to outcome only (X3 and X6), and a variable related to neither treatment nor outcome (X9).
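In symbols, the three distribution-based balance measures described in this section can be written as below. This is a sketch in standard textbook notation rather than the authors' own formulas: f and F denote the density and cumulative distribution function of a confounder, with subscripts t and c for treated and untreated subjects. The dichotomous form of the OVL is an assumption about what "proportion of overlap" means.

```latex
% Overlapping coefficient, continuous confounder
\mathrm{OVL} = \int_{-\infty}^{\infty} \min\{ f_t(x),\, f_c(x) \}\, dx

% Overlapping coefficient, dichotomous confounder with prevalences p_t and p_c (assumed form)
\mathrm{OVL} = \min(p_t,\, p_c) + \min(1 - p_t,\, 1 - p_c)

% Kolmogorov-Smirnov distance
\mathrm{KSD} = \sup_{x}\, \lvert F_t(x) - F_c(x) \rvert

% Levy distance
L = \inf\{ \varepsilon > 0 :\;
      F_c(x - \varepsilon) - \varepsilon \le F_t(x) \le F_c(x + \varepsilon) + \varepsilon
      \ \text{for all } x \}
```

The additive role of ε on both the horizontal and the vertical axis in the last expression is what makes L depend on the unit of measurement, whereas KSD only compares heights.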

Data Generation
We simulated three different scenarios. In each scenario, we generated nine covariates: four confounders (X1, X2, X4, and X5), i.e. variables related to both treatment and outcome; two covariates (X7 and X8) that were related only to the treatment but not to the outcome; two covariates (X3 and X6) that were related only to the outcome but not to the treatment; and one covariate (X9) that was related neither to the treatment nor to the outcome (see Figure 1). In Scenario 1, all nine covariates (X1-X9) were binary. In Scenario 2, six of the covariates (X3 and X5-X9) were binary, and the other three (X1, X2, and X4) were continuous, following a gamma distribution (scale parameter = 2 and shape parameter = 1) to capture skewed covariates in our simulation. In Scenario 3, six of the covariates (X3 and X5-X9) were binary, and the other three (X1, X2, and X4) were continuous, following a standard normal distribution. To generate the binary covariates, we used a Bernoulli distribution with varying success rates: 0.30, 0.35, 0.40, 0.45, 0.50, and 0.60, which means that the prevalence of the binary covariates was on average 30%, 35%, 40%, 45%, 50%, and 60% in each of the simulated populations, respectively.
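The covariate-generation step for Scenario 2 can be sketched as follows. This is an illustration under stated assumptions, not the authors' simulation code: the text does not say which prevalence belongs to which binary covariate, so the mapping in prev is hypothetical, as are the seed and the object names.

```r
set.seed(2014)  # arbitrary seed
n <- 1000       # one of the sample sizes used in the study

# Assumed mapping of the six prevalences to the six binary covariates
prev <- c(X3 = 0.30, X5 = 0.35, X6 = 0.40, X7 = 0.45, X8 = 0.50, X9 = 0.60)
binary <- as.data.frame(lapply(prev, function(p) rbinom(n, size = 1, prob = p)))

# Scenario 2: X1, X2 and X4 follow a gamma distribution (shape 1, scale 2)
skewed <- data.frame(
  X1 = rgamma(n, shape = 1, scale = 2),
  X2 = rgamma(n, shape = 1, scale = 2),
  X4 = rgamma(n, shape = 1, scale = 2)
)

# Put the nine covariates in the order X1, ..., X9
covs <- cbind(skewed, binary)[, paste0("X", 1:9)]

# Observed prevalences of the binary covariates
round(colMeans(binary), 2)
```

For Scenario 3, the three rgamma calls would be replaced by rnorm(n), giving standard normal covariates.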
Next, we randomly generated binary treatment status (t) for each of the n subjects (n = sample size) according to the following logistic model, including quadratic and interaction terms:

$$\operatorname{logit}(p_t) = \alpha_0 + \alpha_1 x_1 + \alpha_2 x_2 + \alpha_3 x_4 + \alpha_4 x_5 + \alpha_5 x_7 + \alpha_6 x_8 + \alpha_7 x_2 x_4 + \alpha_8 x_2 x_7 + \alpha_9 x_7 x_8 + \alpha_{10} x_4 x_5 + \alpha_{11} x_1^2 + \alpha_{12} x_7^2 \tag{1}$$

Binary outcome status was then generated for each of the n subjects conditional on treatment status (t) and six of the covariates (X1-X6), their interactions, and higher-order terms, using the following Poisson model:

$$\log(p_y) = \beta_0 + \gamma t + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6 + \beta_7 x_2 x_4 + \beta_8 x_3 x_5 + \beta_9 x_3 x_6 + \beta_{10} x_4 x_5 + \beta_{11} x_1^2 + \beta_{12} x_6^2 \tag{2}$$

The dichotomous treatment and outcome were generated using a Bernoulli distribution with subject-specific probabilities of treatment assignment and outcome status, π = p_t and π = p_y, respectively. Strong and medium associations between covariates and treatment or outcome were induced by varying the values of the regression coefficients for the main terms, i.e., α1-α6 and β1-β6 in the treatment [1] and outcome [2] models, respectively. When a variable was binary, the values of α1-α6 and β1-β6 were set to log(3) and log(1.5) to induce strong and medium associations between a covariate and the treatment or outcome, respectively. This means that an association between a given variable and either treatment or outcome was considered strong or moderate when the presence of that variable independently increased the risk of treatment or outcome by a factor of 3 or 1.5, respectively. For continuous variables, the values of α1-α6 and β1-β6 were set to log(2) and log(1.35) to induce strong and medium associations between a covariate and the treatment or outcome, respectively. In this case, strong and moderate associations with treatment or outcome were defined as an increase in the risk of treatment or outcome by a factor of 2 and 1.35, respectively, per standard deviation increase in the covariate. No association was considered between a given covariate and treatment or outcome when the presence of a covariate did not have an independent impact on the risk of treatment or outcome. In addition, the following values for the regression coefficients of the interaction and square terms, i.e., α7-α12 and β7-β12, were used: log(1.2) for α7, α10, β7 and β10; log(1.4) for α8, α11, β8 and β11; and log(1.6) for α9, α12, β9 and β12.

Marginal treatment effects (RR) of 1 (γ = log 1), 1.5 (γ = log 1.5), and 2 (γ = log 2) were considered, and sample sizes were varied across simulations (n = 500, 1000, 2000, 5000). The values of β0 and α0 were set such that approximately half of the subjects were treated and an event occurred in approximately 15%, 10%, or 5% of the untreated individuals. We used 500 replications in each scenario.

Propensity Score Analysis
In each simulated dataset, forty different PS models containing different sets of covariates, higher-order terms, and interaction terms were considered. Although more than forty combinations of covariates or terms (hence, PS models) could be constructed from the nine covariates and their interaction or square terms, we chose forty PS models such that they included models containing only confounders, confounders and treatment-related covariates, confounders and outcome-related covariates, all covariates, and other combinations of covariates (Appendix 1 in the supporting information). The PS was estimated for all the PS models using ordinary logistic regression.
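A simplified version of the data-generating step and of the PS estimation can be sketched as follows. It is an illustration only, not the simulation code used in the study: all covariates are binary with a common prevalence of 0.4, only main effects with medium associations are included (interaction and square terms are omitted for brevity), and the intercepts alpha0 and beta0 are hand-picked assumptions rather than the values used by the authors.

```r
set.seed(42)  # arbitrary seed
n <- 2000

# Scenario 1-like covariates: nine independent binary variables
# (a common prevalence of 0.4 is an assumption made for brevity)
dat <- as.data.frame(matrix(rbinom(n * 9, 1, 0.4), nrow = n,
                            dimnames = list(NULL, paste0("X", 1:9))))

a      <- log(1.5)   # medium covariate-treatment association
b      <- log(1.5)   # medium covariate-outcome association
gamma  <- log(2)     # true treatment effect, RR = 2
alpha0 <- -1         # assumed intercept: roughly half of the subjects treated
beta0  <- log(0.03)  # assumed intercept: fairly rare outcome

# Treatment: logistic model in X1, X2, X4, X5 (confounders) and X7, X8
p_t   <- with(dat, plogis(alpha0 + a * (X1 + X2 + X4 + X5 + X7 + X8)))
dat$t <- rbinom(n, 1, p_t)

# Outcome: log-linear risk model in t and X1-X6, drawn as Bernoulli
p_y <- with(dat, exp(beta0 + gamma * t + b * (X1 + X2 + X3 + X4 + X5 + X6)))
stopifnot(max(p_y) <= 1)  # a log link does not bound probabilities by itself
dat$y <- rbinom(n, 1, p_y)

# One candidate PS model, fitted with ordinary logistic regression
ps_fit <- glm(t ~ X1 + X2 + X4 + X5 + X7 + X8, family = binomial, data = dat)
dat$ps <- fitted(ps_fit)

mean(dat$t)      # proportion treated
summary(dat$ps)  # distribution of the estimated PS
```

Because the outcome model uses a log link, the intercept has to be kept small enough that no subject-specific probability exceeds one; the explicit check makes that constraint visible.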
For each of the 40 PS models, the effect of the treatment on the outcome was estimated using the estimated PS as a covariate in a Poisson regression model (providing a risk ratio). We included the PS as a covariate in the regression model because it is one of the most commonly used approaches among the different PS methods. 24 Bias was calculated as the difference between the true marginal effect (used in the data generation process, that is, RR = 1.0, RR = 1.5, or RR = 2.0) and the effect estimate obtained from a Poisson regression model with the PS as a covariate.

Data were then stratified into quintiles of the PS, and the four balance measures were calculated within each PS quintile for each covariate separately. The balance measures were then pooled: first across the nine covariates within each stratum and then across the strata into a single (mean and median) measure. Because 40 different models were applied, 40 values for each balance measure and 40 values for the bias were obtained in each simulated dataset. Next, Spearman's correlation coefficient (r) was used to assess the association between the different balance measures and the bias across the 40 PS models, and was averaged over the total number of simulations (500) for a given sample size (n).

Finally, the four different balance measures were used for PS model selection. Hence, for each balance measure, the PS model that gave the best balance in each simulation
(i.e., the maximum of OVL and the minimum of KSD, SDif, and L) was selected. The performance of the different balance measures in terms of bias reduction (i.e., selecting the optimal PS model and hence the least biased treatment effect estimate) and the frequency of selection of a specific PS model or related PS models were evaluated. All analyses were performed in R. R code for the simulation is available from the authors upon request.

RESULTS

When all the covariates were binary (Scenario 1) and the prevalence of the outcome was 10%, the three balance measures OVL, SDif, and L had equal Spearman's correlation coefficients, and all balance measures showed a strong correlation with bias (e.g. for a sample size of 2000 and RR = 2, r = -0.54, 0.54, 0.54, and 0.53 for OVL, KSD, L, and SDif, respectively) (Table 1). In addition, the correlations with bias were higher for the means of the balance measures than for the medians.

In Scenario 2, when three of the confounders (X1, X2, and X4) were continuous and gamma distributed and the other covariates were binary, the SDif and the OVL showed a strong correlation with bias irrespective of sample size. However, in this scenario, KSD and L were weakly correlated with bias in small sample sizes (e.g., for a sample size of 500 and RR = 2, r = -0.43, 0.18, 0.11, and 0.51 for OVL, KSD, L, and SDif, respectively). Compared to the other balance measures, SDif was consistently strongly correlated with bias irrespective of covariate distributions. With lower prevalence of the outcome (e.g., 5% and 1%), these correlations were smaller in all scenarios. When the continuous covariates followed a standard normal distribution (Scenario 3) instead of a gamma distribution, the correlations were similar to those in Scenario 2 (data not shown).

Table 1. Average Spearman's correlation between bias and the mean and median of different balance measures for different sample sizes (n) (true treatment-outcome RR = 2.00)
Rows: overlapping coefficient, Kolmogorov-Smirnov distance, Lévy distance, and standardized mean difference (mean and median of each).
Columns: all covariates binary; X1, X2, X4 gamma and others binary.
Sample sizes (n) = 500, 1000, 2000, and 5000 (each 500 replications).

When balance measures were used to select the optimal PS model, estimates of the RR were similar for all balance measures in Scenario 1 (all covariates binary). The variation in the estimated RR decreased with increasing sample size (Table 2). In small sample sizes and in the case of mixed binary and continuous covariates (Scenarios 2 and 3), the PS models selected using the means of OVL and SDif yielded the least biased estimates. However, treatment effect estimates based on the means and medians of the three balance measures OVL, KSD, and L were more divergent than those based on SDif.

Table 3 shows the frequency (as a percentage) of selection of related PS models (the number of times PS models were chosen) by different balance measures in 500 replications for different sample sizes. The PS models that were most often selected were the ones containing variables related to the treatment and/or the outcome. PS models containing many of the covariates were most often selected, which was expected since balance measures were estimated over all nine covariates irrespective of their relation with the treatment or outcome. Given that all confounders were included in the PS model, PS models that included confounders and outcome-related variables were less often selected than PS models including confounders and only treatment-related variables. The frequency of selection of the true confounder model (the PS model containing only variables related to both the treatment and the outcome) was low compared to the full PS model (i.e., the PS model containing variables related to either the treatment or the outcome).

Omission of confounders from the PS model resulted in the most biased estimates. Inclusion of variables related only to the treatment but not to the outcome (X7 and X8) amplified the bias when measured confounders were omitted from the PS model (i.e., in the presence of unmeasured confounding) (PS models 37 and 39, Appendix 1; Table 4). PS models containing confounders and only outcome-related covariates provided the least biased effect estimates among the PS models considered, with reduced variation. This was the case irrespective of sample size, particularly when all covariates were binary (Scenario 1). The true PS model (the PS model used to generate treatment status, i.e., PS model 1) and related PS models (models 2-7, Appendix 1 in the supporting information) yielded slightly biased estimates with increased variation (Table 4). Including interaction and higher-order terms reduced the bias of the effect estimates. Furthermore, including a variable related neither to the treatment nor to the outcome affected neither the effect estimate nor the balance of confounders. When the treatment effect was set to RR = 1 or RR = 1.5, results were similar in all scenarios (results for RR = 1.5 and RR = 2 with mixed gamma and binary distributions are included in the supporting information).
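The balance-pooling and correlation steps described under Propensity Score Analysis (PS quintiles, SDif per covariate within each stratum, the mean across covariates and strata, and Spearman's correlation with bias across candidate PS models) can be sketched as follows. This is a toy illustration under assumptions, not the study's code: it uses four binary covariates and six hypothetical candidate models instead of the 40 models of Appendix 1, forms quintiles by rank so that ties in the PS do not break the stratification, and correlates balance with the absolute bias.

```r
set.seed(7)  # arbitrary seed
n <- 2000
dat <- as.data.frame(matrix(rbinom(n * 4, 1, 0.4), nrow = n,
                            dimnames = list(NULL, paste0("X", 1:4))))
# X1, X2: confounders; X3: treatment only; X4: outcome only (toy setup)
dat$t <- rbinom(n, 1, with(dat, plogis(-0.8 + log(3) * X1 + log(1.5) * (X2 + X3))))
dat$y <- rbinom(n, 1, with(dat, exp(log(0.05) + log(2) * t +
                                    log(3) * X1 + log(1.5) * (X2 + X4))))

# Absolute standardized difference for a binary covariate
sdif_bin <- function(x, t) {
  pt <- mean(x[t == 1]); pc <- mean(x[t == 0])
  s <- sqrt((pt * (1 - pt) + pc * (1 - pc)) / 2)
  # No variation (or an empty group): treated as balanced here, an assumption
  if (!is.finite(s) || s == 0) return(0)
  abs(pt - pc) / s
}

# Hypothetical candidate PS models (not those of Appendix 1)
models <- list(m1 = t ~ X1 + X2 + X3 + X4, m2 = t ~ X1 + X2 + X3,
               m3 = t ~ X1 + X2,           m4 = t ~ X2 + X3,
               m5 = t ~ X3 + X4,           m6 = t ~ X4)

evaluate <- function(f, d, covs, true_rr) {
  d$ps <- fitted(glm(f, family = binomial, data = d))
  # Rank-based quintiles (ties split arbitrarily), an assumption
  d$stratum <- ceiling(5 * rank(d$ps, ties.method = "first") / nrow(d))
  # Pool: mean over covariates within each stratum, then mean over strata
  per_stratum <- sapply(split(d, d$stratum), function(s)
    mean(sapply(covs, function(v) sdif_bin(s[[v]], s$t))))
  # Risk ratio from a Poisson regression with the PS as a covariate
  rr_hat <- exp(coef(glm(y ~ t + ps, family = poisson, data = d))[["t"]])
  c(balance = mean(per_stratum), abs_bias = abs(true_rr - rr_hat))
}

res <- t(sapply(models, evaluate, d = dat, covs = paste0("X", 1:4), true_rr = 2))
res
cor(res[, "balance"], res[, "abs_bias"], method = "spearman")
```

In the study this correlation was computed within each simulated dataset and then averaged over the 500 replications; a single dataset with six models, as here, gives only a noisy impression of it.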

Table 2. Median (and interquartile range) of the treatment-outcome risk ratio when using means and medians of different balance measures for PS model selection (true treatment-outcome RR = 2.00)

All covariates binary
Balance measure                         n = 500      n = 1000     n = 2000     n = 5000
Overlapping coefficient       Mean      1.94 (0.73)  1.94 (0.54)  1.96 (0.35)  1.95 (0.20)
                              Median    1.84 (0.86)  1.87 (0.56)  1.90 (0.41)  1.92 (0.22)
Kolmogorov-Smirnov distance   Mean      1.93 (0.74)  1.94 (0.54)  1.96 (0.35)  1.95 (0.20)
                              Median    1.81 (0.84)  1.87 (0.56)  1.90 (0.42)  1.92 (0.23)
Lévy distance                 Mean      1.93 (0.74)  1.94 (0.54)  1.96 (0.35)  1.95 (0.20)
                              Median    1.81 (0.84)  1.87 (0.56)  1.90 (0.42)  1.92 (0.23)
Standardized mean difference  Mean      1.94 (0.76)  1.94 (0.54)  1.96 (0.34)  1.95 (0.20)
                              Median    1.76 (0.89)  1.86 (0.58)  1.89 (0.39)  1.92 (0.22)

X1, X2, X4 gamma and others binary
Balance measure                         n = 500      n = 1000     n = 2000     n = 5000
Overlapping coefficient       Mean      2.06 (1.29)  1.97 (0.73)  2.03 (0.54)  1.98 (0.35)
                              Median    1.19 (1.04)  1.01 (0.85)  1.05 (0.92)  0.95 (0.90)
Kolmogorov-Smirnov distance   Mean      1.86 (1.33)  1.92 (0.70)  2.03 (0.54)  1.98 (0.34)
                              Median    0.95 (0.68)  0.91 (0.55)  0.97 (0.82)  0.94 (0.83)
Lévy distance                 Mean      1.82 (1.40)  1.90 (0.72)  2.03 (0.54)  1.98 (0.34)
                              Median    1.12 (0.95)  1.01 (0.83)  1.07 (0.90)  1.02 (0.99)
Standardized mean difference  Mean      2.07 (1.28)  1.98 (0.74)  2.03 (0.54)  1.98 (0.34)
                              Median    1.77 (1.36)  1.74 (0.89)  1.87 (0.66)  1.90 (0.53)

Sample sizes (n) = 500, 1000, 2000, and 5000 (each 500 replications).

Table 3. Selection frequency (in percentage) of related PS models (Models 1-7; Models 8, 9, or 10; and Models 11-18) by various balance measures (all covariates binary and true treatment-outcome RR = 2.00)
Columns, each by sample size (n): selection of PS models 1-7; selection of PS models 8, 9, 10; selection of PS models 11, 12, 14-16, 17, 18; selection of the rest of the PS models*.
Rows: overlapping coefficient, Kolmogorov-Smirnov distance, Lévy distance, and standardized mean difference (mean and median of each).
Models 1-7: true PS model (Model 1) and related PS models, i.e. PS models containing confounders and treatment-related covariates. Models 8-10: PS models including confounders and covariates related only to the outcome. Models 11-18: PS models containing covariates related either to the treatment or the outcome.
*The rest of the PS models (Models 19-40).
Total number of replications = 500 (hence, the percentage is calculated from the 500 possible selections) in each sample size.

Table 4. Median (and interquartile range) of the treatment-outcome risk ratio when using the 40 different PS models (all covariates binary, true treatment-outcome RR = 2.00, 500 replications for each sample size, n)

Variables in the model                        Model No.  n = 500      n = 1000     n = 2000     n = 5000
Confounders and treatment related variables   1               (0.75)  1.95 (0.53)  1.95 (0.38)  1.95 (0.19)
                                              2          1.94 (0.76)  1.95 (0.54)  1.95 (0.38)  1.96 (0.20)
                                              3          1.93 (0.75)  1.95 (0.53)  1.95 (0.38)  1.95 (0.19)
                                              4          1.93 (0.75)  1.94 (0.53)  1.94 (0.37)  1.95 (0.20)
                                              5          1.93 (0.75)  1.94 (0.53)  1.94 (0.37)  1.95 (0.20)
                                              6          1.93 (0.76)  1.94 (0.53)  1.95 (0.37)  1.95 (0.21)
                                              7          1.93 (0.74)  1.96 (0.54)  1.95 (0.37)  1.95 (0.10)
Confounders and outcome related variables     8               (0.72)  1.97 (0.48)  2.00 (0.29)  2.00 (0.20)
                                              9          2.01 (0.71)  1.97 (0.48)  2.00 (0.29)  2.00 (0.20)
                                              10         2.01 (0.70)  1.97 (0.50)  2.00 (0.36)  2.00 (0.19)
Treatment and/or outcome related variables    11              (0.74)  1.94 (0.52)  1.97 (0.36)  1.95 (0.21)
                                              12         1.97 (0.70)  1.95 (0.52)  1.97 (0.36)  1.95 (0.21)
                                              13         1.93 (0.73)  1.94 (0.52)  1.94 (0.37)  1.95 (0.21)
                                              14         1.98 (0.74)  1.95 (0.52)  1.96 (0.37)  1.95 (0.20)
                                              15         1.97 (0.74)  1.94 (0.53)  1.96 (0.33)  1.95 (0.21)
                                              16         1.97 (0.74)  1.94 (0.52)  1.97 (0.37)  1.95 (0.21)
                                              17         1.94 (0.73)  1.94 (0.51)  1.94 (0.20)  1.95 (0.21)
                                              18         1.97 (0.74)  1.94 (0.53)  1.96 (0.20)  1.95 (0.21)
Only confounding variables                    19         1.99 (0.73)  1.98 (0.51)  2.00 (0.36)  2.00 (0.20)
                                              20         2.00 (0.73)  1.99 (0.52)  1.99 (0.35)  2.00 (0.20)
                                              21         1.99 (0.73)  1.98 (0.51)  2.00 (0.18)  2.00 (0.20)
                                              22         1.99 (0.74)  1.98 (0.50)  1.99 (0.36)  1.99 (0.20)
                                              23         1.99 (0.74)  1.98 (0.50)  1.99 (0.34)  1.99 (0.20)
                                              24         1.98 (0.72)  1.97 (0.51)  1.99 (0.35)  1.99 (0.20)
Three/more of the confounders excluded        25         1.15 (0.46)  1.15 (0.30)  1.15 (0.35)  1.15 (0.14)
from the model                                26         1.10 (0.46)  1.10 (0.31)  1.10 (0.35)  1.10 (0.13)
                                              27         1.10 (0.46)  1.10 (0.31)  1.10 (0.35)  1.10 (0.13)
                                              28         1.12 (0.43)  1.12 (0.28)  1.12 (0.22)  1.12 (0.13)
                                              29         1.11 (0.43)  1.12 (0.28)  1.12 (0.31)  1.12 (0.13)
                                              30         1.11 (0.43)  1.11 (0.29)  1.08 (0.35)  1.12 (0.13)
                                              31         1.07 (0.41)  1.08 (0.27)  1.12 (0.18)  1.08 (0.12)
                                              32         0.99 (0.40)  0.98 (0.26)  0.99 (0.18)  0.99 (0.11)
The rest of the PS models*                    33         1.94 (0.76)  1.94 (0.54)  1.95 (0.35)  1.96 (0.20)
                                              34         1.94 (0.76)  1.94 (0.54)  1.95 (0.18)  1.96 (0.20)
                                              35         1.70 (0.66)  1.73 (0.44)  1.71 (0.19)  1.71 (0.18)
                                              36         1.76 (0.65)  1.77 (0.46)  1.78 (0.30)  1.78 (0.19)
                                              37         1.71 (0.63)  1.71 (0.42)  1.71 (0.35)  1.70 (0.19)
                                              38         1.81 (0.67)  1.78 (0.46)  1.78 (0.22)  1.79 (0.16)
                                              39         1.64 (0.63)  1.64 (0.43)  1.64 (0.34)  1.63 (0.17)
                                              40         1.12 (0.46)  1.11 (0.31)  1.12 (0.35)  1.13 (0.14)

*PS Models 33-40

DISCUSSION

This simulation study extends the findings of our previous study 12 on measures of balance, which considered scenarios with continuous, normally distributed covariates and frequent outcomes, to scenarios that include binary covariates, non-normally distributed continuous covariates, and rare outcomes. Such scenarios are typical of pharmacoepidemiological studies. We further confirmed the usefulness of balance measures as tools not only for measuring and reporting the amount of balance reached by a given PS model but also for selecting an optimal PS model in terms of bias and variance of the treatment effect. Among the balance measures compared, the absolute standardized difference (SDif) performed best in terms of bias reduction and variance across a wide range of sample sizes and covariate distributions. The performance of the Kolmogorov-Smirnov distance (KSD) and the Lévy distance (L) depended on sample size and covariate patterns. Including variables unrelated to the treatment but related to the outcome in the PS model reduced the variation of the effect estimate, whereas including variables related only to the treatment increased it.

The overlapping coefficient (OVL), the KSD, and the L gave similar results with respect to bias reduction when all the covariates followed a binomial distribution, and were comparable with SDif irrespective of sample size and the strength of the treatment effect. With continuous or mixed binary and continuous covariate distributions, the performance of OVL, KSD, and L was poor in small sample sizes, which is in line with our previous study, 12 whereas SDif performed consistently better in all scenarios.
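As a reminder of what the Kolmogorov-Smirnov distance measures, the following is a minimal Python sketch of the generic two-sample definition (the largest absolute gap between the empirical distribution functions of a covariate in the treated and untreated groups). It is an illustration only and not the code used in this study; the function name `ks_distance` and the example data are hypothetical.

```python
from bisect import bisect_right


def ks_distance(treated, untreated):
    """Two-sample Kolmogorov-Smirnov distance: the largest absolute gap
    between the empirical CDFs of a covariate in the two treatment groups."""
    a, b = sorted(treated), sorted(untreated)
    grid = sorted(set(a) | set(b))  # the gap can only change at observed values
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in grid
    )


# Hypothetical binary covariate: prevalence 0.75 among treated, 0.25 among untreated.
d_binary = ks_distance([1, 1, 1, 0], [1, 0, 0, 0])
# Hypothetical continuous covariate whose two groups do not overlap at all.
d_disjoint = ks_distance([1.2, 2.5], [3.1, 4.0])
```

For a binary covariate this distance reduces to the absolute difference in prevalence between the groups, so it carries no standardization by the covariate's variability, unlike the SDif.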
When calculating the parametric SDif, the difference in means of a continuous covariate between treatment groups within strata of the PS was standardized to the pooled variation, which might explain why the mean and median of SDif (and hence the bias when using the mean or median of SDif to select PS models) were relatively close to each other, compared to the mean and median of OVL, KSD, and L. Indeed, the OVL, KSD, and L are non-parametric, and we noted in our simulations that with mixed binary and continuous covariates, the values of the three balance measures were more divergent than those of SDif. Extreme imbalances in covariate distribution (outliers) that might exist in some of the PS strata (for example, in the first or fifth quintile of the PS) are captured by the mean rather than the median, which might explain why using the means of the different balance measures resulted in less biased estimates than using their medians.

Although our primary aim was to evaluate balance measures in PS model selection in terms of bias, we also looked at variable selection for inclusion in the PS model based on the variables' association with the treatment and/or the outcome. Including variables related only to the outcome but not to the treatment in the PS model reduced the variance of the effect estimate, which is consistent with previous studies. 6,8,9,12,26,27 In addition, including variables related only to the treatment but not to the outcome (instrumental variables, IVs) or variables strongly related to the treatment but only weakly related to the outcome (near-IVs) amplified the bias, which is in line with the findings of Myers et al. 8 and Pearl. 9,27 However, the bias amplification in our simulation was mild, particularly when covariates

were mixed binary and continuous. There are two possible explanations. First, the presence of extreme unmeasured confounding and a very strong instrument is a condition for an observable increase in bias. 8,28 In our simulation, however, we did not consider extreme unmeasured confounding, and the treatment was binary, which further constrains the extent to which an IV or a near-IV determines treatment status. No such constraint exists with a continuous exposure, for which the association between the IV and the exposure can be larger. 8,28 Second, our settings incorporated a large number of covariates as well as higher-order and interaction terms in the PS model. Pearl 9 has pointed out that, with multiple covariates, inclusion of IVs or near-IVs will always amplify confounding bias (if such exists) in linear models, whereas bias in non-linear systems (such as the one used in our simulations) could be either amplified or attenuated. On the other hand, in the absence of an unmeasured confounder, PS models including confounders and variables related only to the outcome consistently gave less biased treatment effect estimates in our study than PS models including confounders and variables related only to treatment (instrumental variables), but only when all the covariates were binary.

When balance measures were used to select a PS model, the PS models that were most often selected included covariates related only to the outcome in addition to confounders and treatment-related covariates. Although the choice of functional forms and possible interactions among covariates for the PS model may not seem straightforward, the use of balance measures in combination with prior clinical knowledge about the causal network that links exposure, outcome, and potential confounders is worth consideration.
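The bias amplification by instruments discussed above can be made concrete with a small linear example in the spirit of Pearl's commentary. The derivation below is a sketch under assumptions that are not part of our simulations: all variables are standardized to unit variance, the instrument Z and the unmeasured confounder U are independent, and the path coefficients c_0, c_1, c_2, c_3 are illustrative symbols only.

```latex
% Assumed linear structural model (illustrative, standardized variables):
%   X = c_3 Z + c_1 U + \varepsilon_X, \qquad Y = c_0 X + c_2 U + \varepsilon_Y,
%   Z \perp U, \quad \mathrm{Var}(X) = \mathrm{Var}(Z) = \mathrm{Var}(U) = 1 .
\begin{align*}
\mathrm{Cov}(X,Y) &= c_0 + c_1 c_2, \qquad
\mathrm{Cov}(X,Z) = c_3, \qquad
\mathrm{Cov}(Z,Y) = c_0 c_3, \\[4pt]
\beta_{YX} &= \mathrm{Cov}(X,Y) = c_0 + c_1 c_2
  && \text{(bias without adjusting for } Z\text{: } c_1 c_2\text{)}, \\[4pt]
\beta_{YX \cdot Z}
  &= \frac{\mathrm{Cov}(X,Y) - \mathrm{Cov}(X,Z)\,\mathrm{Cov}(Z,Y)}
          {1 - \mathrm{Cov}(X,Z)^2}
   = \frac{c_0\,(1 - c_3^2) + c_1 c_2}{1 - c_3^2}
   = c_0 + \frac{c_1 c_2}{1 - c_3^2}
  && \text{(bias after adjusting for } Z\text{)} .
\end{align*}
```

Under these assumptions the residual bias c_1 c_2 is inflated by the factor 1/(1 - c_3^2), so it grows with the strength of the instrument and vanishes when there is no unmeasured confounding (c_1 c_2 = 0). This linear result need not carry over to the non-linear models used in our simulations.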
When sample sizes are relatively small, using balance measures for model selection may require a large number of calculations per covariate and stratum of the PS and may provide unreliable estimates. 12 However, when matching on the PS is used, the number of calculations is greatly reduced (involving only the balance of covariates and/or interaction terms across matched groups), and hence, the reliability of estimates could be improved.

The current study captures a wide range of scenarios in relation to covariate distributions, inclusion of higher-order and/or interaction terms, and lower incidence of the outcome, which supports the generalizability of the study results to empirical pharmacoepidemiologic studies.

Possible limitations of the current study are as follows. First, we gave equal weights to all variables irrespective of their association with treatment and outcome in calculating the balance measures. Weighted balance measures based on the covariates' association with the outcome have been discussed previously, 12 but the choice of the weights is not straightforward, and they were therefore not included in our simulations. Second, we based our balance calculation on all covariates but not on interaction or higher-order terms, to simplify computations and interpretation, although Rubin suggested checking balance on covariates as well as on their interaction terms included in the PS model, when relevant. 29 On the other hand, several researchers advocate inclusion of outcome-related covariates in the PS model to reduce bias and improve precision of the treatment effect estimate; 9,10,26,27 hence, a balance check only on outcome-related covariates seems important. Third, we calculated the balance

measures per covariate within each PS stratum before pooling them into summary measures, while applying covariate adjustment using the PS in a regression model (and not stratification on the PS) to estimate the treatment effect. Although pooling across strata might lead to an underestimation of the amount of residual imbalance, its impact on the treatment effect estimate does not seem to be substantial. This can be seen in our simulation by looking at the bias that resulted when we used summary balance measures to select PS models. Standardized differences on the logit of the PS as well as variance ratios using residuals after regressing each covariate on the logit of the PS have been suggested elsewhere, 29,30 but we did not include this approach in our simulations. Last but not least, covariate adjustment using the PS in the regression model, which we used in estimating the treatment effect, relies on the assumption of a linear relationship between the estimated PS and the outcome. 29 However, noncollapsibility 31,32 is not an issue in our study because a Poisson model was employed for both data generation and treatment effect estimation. Noncollapsibility arises when the treatment-outcome association is estimated as an odds ratio (OR) or hazard ratio (HR): in the presence of a non-null treatment effect, the conditional OR or HR obtained by including covariates in the adjustment model does not equal the marginal OR or HR, even in the absence of confounding and effect modification.

The use of machine learning techniques such as neural networks and classification and regression trees 33 has been proposed for PS estimation, yet applications of these methods in PS analysis of pharmacoepidemiologic data are very limited. However, the methods were shown to have better performance in terms of bias reduction and more consistent 95% CI coverage than logistic regression approaches, particularly under conditions of both non-additivity and non-linearity. 34,35 Hence, evaluation of the different balance measures when using logistic regression and machine learning techniques for PS estimation is a possible next step.

In summary, we recommend that emphasis be given to correct specification of the PS model and suggest the absolute standardized difference as the balance measure of choice for measuring and reporting balance and for selecting the final PS model. Not only is the absolute standardized difference easy to estimate, independent of sample size, and scale-invariant, but it is also a very familiar and reasonably well-understood tool compared to other balance measures.
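To make the recommended measure concrete, the following Python sketch computes the absolute standardized difference of one covariate within strata of the PS and summarizes the stratum-specific values by their mean and median. It uses the commonly cited definition (difference in means or proportions divided by the square root of the average of the two group variances). The function names, the rank-based stratification, and the toy data are assumptions made for illustration; they are not the implementation used in our simulations.

```python
from math import sqrt
from statistics import mean, median, variance


def abs_std_diff(treated, untreated, binary=False):
    """Absolute standardized difference of one covariate between two groups."""
    m1, m0 = mean(treated), mean(untreated)
    if binary:
        v1, v0 = m1 * (1 - m1), m0 * (1 - m0)            # variance of a proportion
    else:
        v1, v0 = variance(treated), variance(untreated)  # sample variances
    pooled = sqrt((v1 + v0) / 2)
    if pooled == 0:  # no variability in either group
        return 0.0 if m1 == m0 else float("inf")
    return abs(m1 - m0) / pooled


def sdif_by_stratum(ps, treat, x, n_strata=5, binary=False):
    """SDif of covariate x within rank-based strata (quintiles by default) of the PS."""
    order = sorted(range(len(ps)), key=lambda i: ps[i])
    size = len(order) / n_strata
    values = []
    for s in range(n_strata):
        idx = order[round(s * size):round((s + 1) * size)]
        t = [x[i] for i in idx if treat[i] == 1]
        u = [x[i] for i in idx if treat[i] == 0]
        if len(t) >= 2 and len(u) >= 2:  # skip strata lacking enough subjects per group
            values.append(abs_std_diff(t, u, binary))
    return values


# Hypothetical toy data: estimated PS, treatment indicator, one continuous covariate.
ps = [0.10, 0.15, 0.20, 0.25, 0.60, 0.65, 0.70, 0.75]
treat = [0, 1, 0, 1, 0, 1, 0, 1]
age = [50, 52, 51, 55, 60, 63, 61, 66]

per_stratum = sdif_by_stratum(ps, treat, age, n_strata=2)
summary_mean, summary_median = mean(per_stratum), median(per_stratum)
```

In a model-selection setting, such summaries would be computed for every covariate under each candidate PS model, and the model with the smallest summary value would be preferred. Because the mean is more sensitive than the median to a single badly imbalanced stratum, the two summaries can rank models differently.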

REFERENCES

1. Rosenbaum PR, Rubin DB. Reducing bias in observational studies using subclassification on the propensity score. JASA 1984; 79:
2. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika 1983; 70:
3. Austin PC. Goodness-of-fit diagnostics for the propensity score model when estimating treatment effects using covariate adjustment with the propensity score. Pharmacoepidemiol Drug Saf 2008; 17:
4. Austin PC. Assessing balance in measured baseline covariates when using many-to-one matching on the propensity score. Pharmacoepidemiol Drug Saf 2008; 17:
5. Weitzen S, Lapane KL, Toledano AY, et al. Principles for modelling propensity scores in medical research: a systematic literature review. Pharmacoepidemiol Drug Saf 2004; 13:
6. Rubin DB, Thomas N. Matching using estimated propensity scores: relating theory to practice. Biometrics 1996; 52:
7. Brookhart MA, Schneeweiss S, Rothman KJ, et al. Variable selection for propensity score models. Am J Epidemiol 2006; 163:
8. Bhattacharya J, Vogt WB. Do instrumental variables belong in propensity scores? NBER Technical Working Paper No The National Bureau of Economic Research: Cambridge, MA,
9. Myers JA, Rassen JA, Gagne JJ, et al. Effects of adjusting for instrumental variables on bias and precision of effect estimates. Am J Epidemiol 2011; 174:
10. Pearl J. Invited commentary: Understanding bias amplification. Am J Epidemiol 2011; 174:
11. Pearl J. Causal diagrams for empirical research. Biometrika 1995; 82:
12. Belitser SV, Martens EP, Pestman WR, et al. Measuring balance and model selection in propensity score methods. Pharmacoepidemiol Drug Saf 2011; 20:
13. Groenwold RHH, Vries F, Boer A, et al. Balance measures for propensity score methods: a clinical example on beta-agonist use and the risk of myocardial infarction. Pharmacoepidemiol Drug Saf 2011; 20:
14. Austin PC. Balance diagnostics for comparing the distribution of baseline covariates between treatment groups in propensity-score matched samples. Stat Med 2009; 28:
15. Stephens MA. Use of the Kolmogorov-Smirnov, Cramér-Von Mises and related statistics without extensive tables. J R Stat Soc Series B (Methodological) 1970; 32:
16. Pestman WR. Mathematical Statistics. 2nd ed. Walter de Gruyter: Berlin,
17. Lévy P. Théorie de l'addition des variables aléatoires. Gauthier-Villars: Paris,
18. Stine RA, Heyse JF. Non-parametric estimates of overlap. Stat Med 2001; 20:
19. Austin PC, Grootendorst P, Anderson GM. A comparison of the ability of different propensity score models to balance measured variables between treated and untreated subjects: a Monte Carlo study. Stat Med 2007; 26:
20. Silverman BW. Density estimation for statistics and data analysis. Chapman & Hall/CRC,
21. Wand MP, Jones MC. Kernel smoothing. Chapman & Hall/CRC: London,
22. Austin PC. The performance of different propensity score methods for estimating marginal odds ratios. Stat Med 2007; 26:
23. Austin PC, Grootendorst P, Normand SLT, et al. Conditioning on the propensity score can result in biased estimation of common measures of treatment effect: a Monte Carlo study. Stat Med 2007; 26:

24. Shah BR, Laupacis A, Hux JE, et al. Propensity score methods gave similar results to traditional regression modeling in observational studies: a systematic review. J Clin Epidemiol 2005; 58:
25. R Development Core Team (2012). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN , URL
26. Patrick AR, Schneeweiss S, Brookhart MA, et al. The implications of propensity score variable selection strategies in pharmacoepidemiology: an empirical illustration. Pharmacoepidemiol Drug Saf 2011; 20:
27. Pearl J. On a class of bias-amplifying variables that endanger effect estimates. In: Grünwald P, Spirtes P, eds. Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence (UAI 2010). Corvallis, OR: Association for Uncertainty in Artificial Intelligence 2010;
28. Brookhart MA, Stürmer T, Glynn RJ, et al. Confounding control in healthcare database research: challenges and potential approaches. Med Care 2010; 48: S114-S
29. Rubin DB. Using propensity scores to help design observational studies: application to the tobacco litigation. Health Serv Outcomes Res Methodol 2001; 2:
30. Rubin DB. The design versus the analysis of observational studies for causal effects: Parallels with the design of randomized trials. Stat Med 2007; 26:
31. Greenland S, Robins JM, Pearl J. Confounding and collapsibility in causal inference. Stat Sci 1999; 14:
32. Gail MH, Wieand S, Piantadosi S. Biased estimates of treatment effect in randomized experiments with nonlinear regressions and omitted covariates. Biometrika 1984; 71:
33. Westreich D, Lessler J, Funk MJ. Propensity score estimation: neural networks, support vector machines, decision trees (CART), and meta-classifiers as alternatives to logistic regression. J Clin Epidemiol 2010; 63:
34. Lee BK, Lessler J, Stuart EA. Weight trimming and propensity score weighting. PLoS One 2011; 6: e
35. Setoguchi S, Schneeweiss S, Brookhart MA, et al. Evaluating uses of data mining techniques in propensity score estimation: a simulation study. Pharmacoepidemiol Drug Saf 2008; 17:

APPENDICES

Appendix 1: Description of the Propensity Score Models

Model 1   logit(t) ~ x1 + x2 + x4 + x5 + x7 + x8 + x2*x4 + x2*x7 + x7*x8 + x4*x5 + I(x1^2) + I(x7^2)
Model 2   logit(t) ~ x1 + x2 + x4 + x5 + x7 + x8 + x9 + x2*x4 + x2*x7 + x7*x8 + x4*x5 + I(x1^2) + I(x7^2)
Model 3   logit(t) ~ x1 + x2 + x4 + x5 + x7 + x8 + x2*x4 + x2*x7 + x7*x8 + x4*x5
Model 4   logit(t) ~ x1 + x2 + x4 + x5 + x7 + x8 + I(x1^2) + I(x7^2)
Model 5   logit(t) ~ x1 + x2 + x4 + x5 + x7 + x8
Model 6   logit(t) ~ x1 + x2 + x4 + x5 + x7 + x8 + x9
Model 7   logit(t) ~ x1 + x2 + x4 + x5 + x7 + x8 + x2*x4 + x2*x7 + x7*x8 + I(x1^2) + I(x7^2)
Model 8   logit(t) ~ x1 + x2 + x3 + x4 + x5 + x6 + x2*x4 + x4*x5 + I(x1^2) + x3*x5 + x3*x6 + I(x6^2)
Model 9   logit(t) ~ x1 + x2 + x3 + x4 + x5 + x6 + x9 + x2*x4 + x4*x5 + I(x1^2) + x3*x5 + x3*x6 + I(x6^2)
Model 10  logit(t) ~ x1 + x2 + x3 + x4 + x5 + x6 + x2*x4 + x4*x5 + I(x1^2)
Model 11  logit(t) ~ x1 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x2*x4 + x2*x7 + x7*x8 + x4*x5 + I(x1^2) + I(x7^2) + x3*x5 + x3*x6 + I(x6^2)
Model 12  logit(t) ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x2*x4 + x2*x7 + x7*x8 + x4*x5 + I(x1^2) + I(x7^2)
Model 13  logit(t) ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8
Model 14  logit(t) ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x2*x4 + x2*x7 + x7*x8 + x4*x5 + I(x1^2) + I(x7^2) + x3*x5 + x3*x6 + I(x6^2)
Model 15  logit(t) ~ x1 + x2 + x4 + x7 + x8 + x9 + x2*x7 + x7*x8 + I(x7^2) + x3 + x6 + x3*x5 + x3*x6 + I(x6^2)
Model 16  logit(t) ~ x1 + x2 + x3 + x6 + x4 + x5 + x7 + x8 + x9 + x2*x4 + x2*x7 + x7*x8 + x4*x5 + I(x1^2) + I(x7^2) + x3*x5 + x3*x6 + I(x6^2)
Model 17  logit(t) ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9
Model 18  logit(t) ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x2*x7 + x7*x8 + I(x1^2) + I(x7^2) + x3*x5 + x3*x6 + I(x6^2)
Model 19  logit(t) ~ x1 + x2 + x4 + x5 + x2*x4 + x4*x5 + I(x1^2)
Model 20  logit(t) ~ x1 + x2 + x4 + x5 + x9 + x2*x4 + x4*x5 + I(x1^2)
Model 21  logit(t) ~ x1 + x2 + x4 + x5 + x2*x4 + x4*x5
Model 22  logit(t) ~ x1 + x2 + x4 + x5 + I(x1^2)
Model 23  logit(t) ~ x1 + x2 + x4 + x5
Model 24  logit(t) ~ x1 + x2 + x4 + x5 + x9

Model 25  logit(t) ~ x2 + x3 + x6 + x7 + x8 + x9 + x2*x7 + x7*x8 + I(x7^2) + x3*x5 + x3*x6 + I(x6^2)
Model 26  logit(t) ~ x2 + x7 + x8 + x2*x7 + x7*x8 + I(x7^2)
Model 27  logit(t) ~ x7 + x8 + x2*x7 + x7*x8
Model 28  logit(t) ~ x3 + x6 + x5 + x9 + x3*x5 + x3*x6 + I(x6^2)
Model 29  logit(t) ~ x3 + x5 + x6 + x3*x5 + x3*x6 + I(x6^2)
Model 30  logit(t) ~ x3 + x5 + x6 + x9
Model 31  logit(t) ~ x3 + x6 + I(x6^2) + x9
Model 32  logit(t) ~ x3 + x6 + x7 + x8
Model 33  logit(t) ~ x1 + x3 + x4 + x5 + x7 + x8 + x2*x4 + x2*x7 + x7*x8 + x4*x5 + I(x1^2) + I(x7^2)
Model 34  logit(t) ~ x1 + x2 + x3 + x4 + x5 + x7 + x8 + x2*x4 + x2*x7 + x7*x8 + x4*x5 + I(x1^2) + I(x7^2)
Model 35  logit(t) ~ x1 + x2 + x5 + x7 + x8 + x2*x7 + x7*x8 + I(x1^2) + I(x7^2)
Model 36  logit(t) ~ x1 + x4 + x5
Model 37  logit(t) ~ x1 + x2 + x3 + x5 + x6 + x7 + x8
Model 38  logit(t) ~ x1 + x2 + x3 + x5 + x6
Model 39  logit(t) ~ x1 + x2 + x7 + x8 + x2*x7 + x7*x8 + I(x1^2) + I(x7^2)
Model 40  logit(t) ~ x3 + x4 + x5 + x6 + x7 + x8 + x7*x8 + x3*x5 + x3*x6 + I(x6^2)

Models 1-7: True PS and related models; PS models containing confounders and covariates related only to the treatment.
Models 8-10: Outcome PS models; PS models containing confounders and covariates related only to the outcome.
Models : Full PS models; PS models containing covariates related either to the treatment or the outcome or both.
Models : Confounder PS models; PS models containing confounders.
Models : The rest of the PS models; PS models containing a variable number of confounders or covariates.

64 2.2 Appendix 2 Table 1. Median (and Interquartile Range) of Treatment-Outcome Risk Ratio When Using the 40 Different PS Models (X 1, X 2, X 4 Gamma and Others Binary Covariates, True Treatment-Outcome RR =2.00, 500 Replications for Each Sample Size, n) Variables in the Model Model No. n = 500 n = 1000 n = 2000 n = 5000 Confounders and treatment related variables (1.20) 2.08 (1.22) 2.10 (1.20) 2.03 (1.24) 2.07 (1.25) 2.08 (1.24) 2.06 (1.21) 1.99 (0.72) 1.99 (0.72) 2.01 (0.75) 1.95 (0.71) 1.99 (0.74) 1.99 (0.74) 1.99 (0.74) 2.04 (0.56) 2.04 (0.56) 2.06 (0.57) 2.01 (0.455) 2.05 (0.56) 2.05 (0.55) 2.03 (0.56) 1.99 (0.34) 1.99 (0.34) 2.03 (0.35) 1.97 (0.36) 2.00 (0.36) 2.00 (0.37) 1.99 (0.34) Confounders and outcome related variables (0.11) 2.07 (0.09) 2.05 (0.07) 1.94 (0.64) 1.93 (0.64) 1.94 (0.63) 2.00 (0.48) 2.00 (0.47) 1.99 (0.47) 1.97 (0.29) 1.97 (0.30) 1.97 (0.29) Propensity score balance measures in pharmacoepidemiology 64 Treatment and/ or outcome related variables Only confounding variables Three/more of the confounders excluded from the model The rest of the PS models* (1.24) 2.08 (1.23) 2.10 (1.30) 2.08 (1.25) 2.14 (1.31) 2.11 (1.24) 2.11 (1.30) 2.09 (1.23) 2.04 (0.08) 2.05 (0.08) 2.05 (0.06) 2.05 (0.07) 2.04 (0.06) 2.05 (0.09) 0.87 (0.48) 0.84 (0.49) 0.84 (0.49) 0.85 (0.42) 0.85 (0.43) 0.85 (0.43) 0.82 (0.41) 0.72 (0.38) 2.06 (1.25) 2.06 (1.25) 1.54 (0.90) 1.69 (0.91) 1.57 (0.90) 1.60 (0.84) 1.46 (0.83) 0.85 (0.49) 1.98 (0.72) 1.99 (0.71) 1.98 (0.73) 1.99 (0.73) 2.02 (0.75) 1.98 (0.72) 1.99 (0.73) 1.98 (0.73) 1.95 (0.65) 1.95 (0.64) 1.94 (0.65) 1.95 (0.64) 1.94 (0.66) 1.95 (0.66) 0.84 (0.29) 0.80 (0.29) 0.80 (0.29) 0.80 (0.28) 0.80 (0.28) 0.80 (0.28) 0.77 (0.27) 0.68 (0.24) 1.98 (0.72) 1.98 (0.72) 1.47 (0.53) 1.61 (0.56) 1.47 (0.57) 1.54 (0.53) 1.41 (0.51) 0.82 (0.30) 2.04 (0.57) 2.03 (0.55) 2.04 (0.54) 2.03 (0.57) 2.06 (0.56) 2.04 (0.57) 2.04 (0.54) 2.03 (0.56) 1.98 (0.48) 1.98 (0.48) 1.97 (0.47) 1.97 (0.48) 1.97 (0.47) 1.97 (0.47) 0.86 (0.20) 0.83 
(0.19) 0.83 (0.19) 0.83 (0.19) 0.83 (0.19) 0.84 (0.19) 0.81 (0.18) 0.71 (0.17) 2.03 (0.55) 2.03 (0.55) 1.53 (0.39) 1.64 (0.40) 1.54 (0.39) 1.57 (0.37) 1.44 (0.37) 0.85 (0.21) 1.98 (0.36) 1.98 (0.35) 1.99 (0.36) 1.98 (0.36) 2.01 (0.36) 1.98 (0.36) 1.99 (0.36) 1.98 (0.36) 1.97 (0.30) 1.97 (0.31) 1.96 (0.30) 1.96 (0.31) 1.96 (0.30) 1.97 (0.31) 0.85 (0.14) 0.82 (0.14) 0.82 (0.14) 0.82 (0.13) 0.82 (0.12) 0.82 (0.13) 0.79 (0.12) 0.70 (0.12) 1.99 (0.35) 1.99 (0.35) 1.50 (0.27) 1.63 (0.26) 1.50 (0.27) 1.55 (0.24) 1.41 (0.25) 0.83 (0.14) *PS Models 33-40

65 Table 2. Selection Frequency of Related PS Models in Percentage ( Models 1-7, Models 8, 9, or 10, and Models ) by Various Balance Measures (X1, X2, X4 Gamma and Others Binary Covariates and True Treatment-Outcome RR = 2.00) Selection of PS models Selection of PS models Selection of PS models 1-7 8,9,10 11,12,14-16,17,18 Selection of the rest PS Models* Sample size (n) Sample size (n) Sample size (n) Sample size (n) Balance Measures Overlapping Coefficient Mean Median Kolmogorov-Smirnov distance Mean Median Lévy distance Mean Median Standardized mean difference Mean Median True PS (Model 1) and related PS models, PS models containing confounders and treatment related covariates (Models 1-7), PS models including confounders and covariates related only to the outcome (Models 8-10) PS models containing covariates related either to the treatment or the outcome (Models 11-18) *The rest PS models (Models 19-40) Total number of replications= 500 (hence, percentage is calculated from the 500 possible selections) in each sample size (n) 2.2 Propensity score balance measures in pharmacoepidemiology 65

66 2.2 Propensity score balance measures in pharmacoepidemiology Table 3. Average Spearman s Correlation Between Bias and Means and Medians of Different Balance Measures for Different Sample Sizes (n) (True Treatment-Outcome RR =1.5) Balance Measure Overlapping Coefficient Mean Median Kolmogorov- Smirnov Distance Mean Median Lévy Distance Mean Median Standardized Mean Difference Mean Median All Covariates Binary X 1, X 2, X 4 Gamma and Others Binary Covariates Sample Sizes (n) = 500, 1000, and 5000 (Each 500 Replications)

Table 4. Median (and Interquartile Range) of Treatment-Outcome Risk Ratio When Using Means and Medians of Different Balance Measures for PS Model Selection (True Treatment-Outcome RR = 1.50)

                                  All Covariates Binary                      X1, X2, X4 Gamma and Others Binary Covariates
Balance Measures                  n = 500       n = 1000      n = 5000       n = 500        n = 1000      n = 5000

Overlapping Coefficient
  Mean                            1.50 (0.57)   1.45 (0.45)   1.46 (0.20)    1.51 (0.97)    1.52 (0.53)   1.51 (0.22)
  Median                          1.41 (0.60)   1.38 (0.49)   1.45 (0.22)    0.84 (0.74)    0.86 (0.67)   0.74 (0.74)
Kolmogorov-Smirnov Distance
  Mean                            1.50 (0.58)   1.46 (0.45)   1.46 (0.20)    1.38 (0.1.01)  1.52 (0.51)   1.51 (0.22)
  Median                          1.40 (0.58)   1.39 (0.48)   1.45 (0.22)    0.72 (0.52)    0.76 (0.53)   0.73 (0.65)
Lévy Distance
  Mean                            1.50 (0.58)   1.45 (0.45)   1.46 (0.20)    1.31 (1.02)    1.50 (0.53)   1.51 (0.22)
  Median                          1.40 (0.58)   1.39 (0.48)   1.45 (0.22)    0.79 (0.67)    0.85 (0.61)   0.80 (0.81)
Standardized Mean Difference
  Mean                            1.49 (0.58)   1.45 (0.45)   1.46 (0.20)    1.51 (0.95)    1.52 (0.51)   1.52 (0.22)
  Median                          1.36 (0.60)   1.38 (0.48)   1.44 (0.23)    1.27 (1.05)    1.38 (0.63)   1.44 (0.37)

Sample Sizes (n) = 500, 1000, 2000, and 5000 (Each 500 Replications)

68 2.2 Table 5. Median (and Standard Deviation) of Treatment-Outcome Risk Ratio When Using the 40 Different PS Models (all Covariates Binary, True Treatment-Outcome RR =1.5, 500 Replications) Variables in the Model Confounders and treatment related variables Model No. n = 500 n = 1000 n = (0.60) 1.51 (0.60) 1.51 (0.60) 1.48 (0.58) 1.48 (0.58) 1.49 (0.59) 1.51 (0.59) 1.46 (0.46) 1.45 (0.46) 1.46 (0.46) 1.45 (0.46) 1.45 (0.46) 1.45 (0.47) 1.45 (0.47) 1.47 (0.19) 1.47 (0.19) 1.47 (0.19) 1.46 (0.19) 1.46 (0.19) 1.46 (0.20) 1.47 (0.19) Confounders and outcome related variables (0.56) 1.51 (0.58) 1.50 (0.58) 1.47 (0.42) 1.48 (0.41) 1.47 (0.41) 1.48 (0.18) 1.48 (0.18) 1.48 (0.18) Propensity score balance measures in pharmacoepidemiology Treatment and/ or outcome related variables Only confounding variables Three/more of the confounders excluded from the model The rest of the PS models* (0.61) 1.48 (0.60) 1.47 (0.59) 1.49 (0.61) 1.49 (0.59) 1.49 (0.61) 1.47 (0.60) 1.49 (0.59) 1.52 (0.56) 1.52 (0.57) 1.52 (0.56) 1.51 (0.57) 1.51 (0.57) 1.52 (0.55) 0.87 (0.34) 0.83 (0.31) 0.83 (0.31) 0.84 (0.32) 0.84 (0.31) 0.84 (0.32) 0.80 (0.30) 0.75 (0.28) 1.48 (0.60) 1.48 (0.60) 1.31 (0.50) 1.34 (0.56) 1.30 (0.50) 1.34 (0.51) 1.26 (0.48) 0.84 (0.35) 1.46 (0.43) 1.46 (0.44) 1.45 (0.45) 1.46 (0.43) 1.45 (0.44) 1.46 (0.43) 1.45 (0.45) 1.45 (0.44) 1.47 (0.43) 1.47 (0.43) 1.47 (0.43) 1.46 (0.43) 1.46 (0.43) 1.47 (0.42) 0.87 (0.24) 0.83 (0.24) 0.83 (0.24) 0.84 (0.23) 0.84 (0.22) 0.84 (0.22) 0.80 (0.22) 0.74 (0.22) 1.45 (0.44) 1.45 (0.44) 1.28 (0.38) 1.31 (0.40) 1.27 (0.38) 1.33 (0.37) 1.23 (0.34) 0.83 (0.25) 1.46 (0.20) 1.46 (0.20) 1.46 (0. 20) 1.47 (0. 20) 1.46 (0. 20) 1.46 (0. 20) 1.45 (0. 20) 1.46 (0. 20) 1.49 (0.19) 1.49 (0.19) 1.49 (0.19) 1.48 (0.18) 1.48 (0.18) 1.48 (0.18) 0.85 (0.10) 0.82 (0. 10) 0.82 (0. 10) 0.83 (0. 10) 0.83 (0. 10) 0.83 (0. 10) 0.80 (0. 10) 0.73 (0.09) 1.47 (0.20) 1.47 (0.20) 1.29 (0.17) 1.32 (0.16) 1.28 (0.16) 1.33 (0.17) 1.23 (0.17) 0.83 (0.10) 68 *PS Models 33-40

69 Table 6. Selection Frequency of Related PS Models in Percentage ( Models 1-7, Models 8, 9, or 10, and Models ) by Various Balance Measures (all Covariates Binary and True Treatment-Outcome RR = 1.5) Balance Measures Selection of PS models 1-7 Selection of PS models 8,9,10 Selection of PS models 11,12,14-16,17,18 Selection of the rest PS Models* Sample size (n) Sample size (n) Sample size (n) Sample size (n) Overlapping Coefficient Mean Median Kolmogorov- Smirnov distance Mean Median Lévy distance Mean Median Standardized mean difference Mean Median True PS (Model 1) and related PS models, PS models containing confounders and treatment related covariates (Models 1-7), PS models including confounders and covariates related only to the outcome (Models 8-10) PS models containing covariates related either to the treatment or the outcome (Models 11-18) *The rest PS models (Models 19-40) Total number of replications= 500 (hence, percentage is calculated from the 500 possible selections) in each sample size (n) 2.2 Propensity score balance measures in pharmacoepidemiology 69


Chapter 2.3

Improving Selection of Covariates and Caliper for Optimal Balance in Propensity Score Matching: a Simulation Study

M Sanni Ali, RHH Groenwold, SV Belitser, KCB Roes, AW Hoes, A de Boer, OH Klungel

Submitted


ABSTRACT

BACKGROUND
Selecting covariates and caliper width for propensity score (PS) matching is a trade-off between reducing confounding bias and possible amplification of residual bias. This study assesses the covariate balancing properties of PS matching with respect to unmeasured covariates (U) as well as the usefulness of balance measures for choosing the optimal caliper width in terms of bias and precision of treatment effect estimates.

METHODS
Simulation studies were conducted on binary covariates, treatment, and outcome data. In different scenarios, instrumental variables (IVs, i.e., variables related only to treatment and not to the outcome), risk factors (RFs, i.e., variables related only to the outcome), and confounding variables with various associations between them were considered. Treatment effects were estimated using Poisson models after matching on the PS with different calipers. Balance for each covariate was checked before and after matching using the absolute standardized difference (SDif). The SDif was used to choose combinations of PS model and caliper, and comparisons were made with respect to bias in the treatment effect, balance of (unobserved) covariates, and number of subjects matched.

RESULTS
PS matching improved balance of covariates included in the PS model but, compared to the full unmatched sample, increased imbalance of unobserved variables (U) unrelated to measured covariates. Inclusion of IVs that were independent of U increased the imbalance in U and amplified residual bias. However, including IVs that were associated with U improved the balance of U and reduced the bias. When the PS model included RFs, exclusion of IVs that were related to U improved the balance of U, thereby decreasing the bias.

CONCLUSIONS
In choosing covariates for a PS model, the association among covariates has a substantial impact on the balance of other covariates and on the bias in the treatment effect. Investigators should not rely only on a covariate's association with treatment and/or outcome but should also consider mutual associations among covariates. In addition, balance measures are valuable tools for choosing the optimal caliper width in PS matching.

INTRODUCTION

Propensity score (PS) matching is a popular approach in observational studies to estimate treatment effects owing to its conceptual simplicity 1 and its reduced level of model dependence. 1, 2 Rosenbaum and Rubin demonstrated that if treatment assignment is ignorable (i.e., there is no unmeasured confounding), conditioning on the propensity score allows one to obtain unbiased estimates of the average treatment effect, 1, 3 or, more specifically, the average treatment effect in the treated included in the propensity score

74 2.3 Improving selection of covariates and caliper for optimal balance in propensity score matching 74 matching. 4 In other words, treated and untreated subjects with the same PS will have similar distributions of measured baseline covariates. In PS matching, first a set of treated and untreated subjects with similar PS are matched and then treatment effect is estimated using this matched sample. 1, 5 Successful application of PS matching has two important goals: (1) to reduce imbalance of covariates between treated and untreated subjects (i.e., reduce confounding bias) and (2) to improve precision (i.e., reduce variability) of the treatment effect estimate, 6 Nevertheless, the existing matching algorithms optimize PS matching with respect to only one of these two aims. 2, 6 Hence, researchers face a trade-off between imbalance of covariates (and thus dependence on parametric models to adjust for these imbalances) and precision of the effect estimate (which is strongly related to the size of the matched data). 6-8 The bias-variance trade-off in estimating a treatment effect is driven by several factors: specification of the PS model, 9-14 the matching algorithm, and the choice of the matching distance, i.e., the caliper width. 8,15-17 In specifying the PS model, inclusion of variables related to both the treatment and the outcome in the PS model reduces both bias and variability in the effect estimate. Inclusion of variables related to the outcome but not to the treatment in the PS model increase the precision of the effect estimate without increasing bias in the effect estimate. 
9, On the other hand, inclusion of variables strongly related to the treatment but weakly related to the outcome increases the variability of the effect estimate, 13, 14 and in the presence of strong unmeasured confounding, it may even amplify residual bias, at least in linear models. However, the impact on bias and variance of associations among individual covariates, and between measured and unmeasured covariates, when such covariates are included in the PS model has not been well studied.

A wide range of matching algorithms has been proposed in the methodological literature, including exact matching, optimal matching, 5-to-1 matching, and nearest neighbor matching with or without a specified PS caliper. 4, 15, 18 Nearest neighbor matching within a specified PS caliper width is the most commonly used algorithm in the medical literature. However, the prevailing inconsistencies in the use of the matching algorithm and the choice of caliper in the medical literature are partly due to a lack of clear, evidence-based recommendations on different aspects of matching algorithms (such as the number of untreated matches per treated patient, whether matching should be conducted with or without replacement, and the order in which potential matches should be formed) and of formal tools to choose the optimal caliper width. Although a caliper width of 0.25 standard deviations of the PS has been considered as a standard in the literature, 8,16 several studies indicated that the caliper depends on the characteristics of the data, 4,18,22,23 such as the association between the outcome variable and the matching variable, 8 the size of the untreated reservoir available in the data, and the overlap of the two populations. Hence, the choice of caliper should be based on characteristics of the data. Balance measures, which are characteristic of a sample, have been proposed to assess the balance of covariate distributions between treatment groups reached by a specific PS model and to select an optimal PS model (in terms of achieved balance) among a variety of possible models; 9, 24, 25 however, their usefulness in choosing the caliper width for PS matching has not been studied.

The objective of this study is twofold. First, we assess the impact of associations among individual covariates, as well as between covariates and treatment and/or outcome, on both the bias and the variance of the treatment effect. Second, we examine the usefulness of balance measures (specifically the absolute standardized difference, SDif) for choosing the optimal caliper width in terms of both bias and variance. In addition, we evaluate the assessment of balance on different sets of covariates and its impact on both bias and variance.

METHODS

We performed Monte-Carlo simulations to assess how selection of covariates and caliper width for PS matching impacts the bias and variance of the estimated treatment effect. For the simulations, we used the framework previously described by Belitser et al. 24 and Ali et al. 9

Data Generation

We simulated three different scenarios, each in a total of 2000 subjects. In each scenario, we generated nine binary covariates: four confounding variables (X1, X2, X4, and X5), i.e., variables that were related to both treatment and outcome; two instrumental variables (X7 and X8), i.e., variables that were related only to the treatment and not, other than through the treatment, to the outcome; two risk factors (X3 and X6), i.e., variables that were related only to the outcome but not to the treatment; and one covariate (X9) that was related to neither the treatment nor the outcome (Figure 1).

Scenario 1: All nine covariates (X1 - X9) were independent.
Scenario 2: The correlation between X4 and X5 was set to 0.4 and all other covariates were considered independent of each other.
Scenario 3: The correlation between X4 and X7 was set to 0.4 and all other covariates were considered independent of each other.

To generate the binary covariates, we used Bernoulli distributions with success probabilities of 0.30, 0.35, 0.40, 0.45, 0.50, and 0.60, which means that the prevalence of the binary covariates was on average 30%, 35%, 40%, 45%, 50%, and 60% in each of the simulated populations (n=2000), respectively. In a sensitivity analysis, six additional scenarios with respect to covariate distributions (mixed binary and continuous following a standard normal distribution, as well as mixed binary and continuous following a gamma distribution with scale parameter = 2 and shape parameter = 1) were considered for each of the above three scenarios.
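The covariate generation for Scenarios 2 and 3 can be sketched as follows. This is a minimal illustration and not the authors' R code: it assumes that the stated correlation of 0.4 is a Pearson correlation between two binary covariates, and the function names `correlated_bernoulli_pair` and `simulate_pair` are hypothetical. Which prevalence belongs to which covariate is not stated in the text, so any prevalences passed in are illustrative choices from the listed range.

```python
import math
import random


def correlated_bernoulli_pair(p_a, p_b, rho, rng):
    """Draw two 0/1 values with prevalences p_a, p_b and Pearson correlation rho.

    Assumption: the correlation is induced through the joint probability
    P(a=1, b=1) = p_a*p_b + rho*sqrt(p_a*(1-p_a)*p_b*(1-p_b)).
    """
    p11 = p_a * p_b + rho * math.sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))
    # Not every rho is attainable for given prevalences; check the joint is valid.
    if not max(0.0, p_a + p_b - 1.0) <= p11 <= min(p_a, p_b):
        raise ValueError("rho is not attainable for these prevalences")
    a = 1 if rng.random() < p_a else 0
    # Conditional prevalence of b given the drawn value of a.
    p_b_given_a = p11 / p_a if a == 1 else (p_b - p11) / (1.0 - p_a)
    b = 1 if rng.random() < p_b_given_a else 0
    return a, b


def simulate_pair(n, p_a, p_b, rho, seed):
    """Simulate n subjects; returns two lists of 0/1 values."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    draws = [correlated_bernoulli_pair(p_a, p_b, rho, rng) for _ in range(n)]
    return [a for a, _ in draws], [b for _, b in draws]
```

For example, `simulate_pair(2000, 0.45, 0.50, 0.4, seed=1)` would give one hypothetical realization of a correlated covariate pair such as (X4, X5); the remaining covariates would be drawn as independent Bernoulli variables.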

Next, we randomly generated binary treatment status (t) for each of the 2000 subjects according to the following logistic model, including interaction and square terms:

\[
\operatorname{logit}(p_t) = \alpha_0 + \alpha_1 x_1 + \alpha_2 x_2 + \alpha_3 x_4 + \alpha_4 x_5 + \alpha_5 x_7 + \alpha_6 x_8 + \alpha_7 x_2 x_4 + \alpha_8 x_2 x_7 + \alpha_9 x_7 x_8 + \alpha_{10} x_4 x_5 + \alpha_{11} x_1^2 + \alpha_{12} x_7^2 \qquad [1]
\]

Then, binary outcome status was generated for each of the 2000 subjects conditional on treatment status (t), six of the covariates (X1 - X6), and their interaction and square terms, using the following Poisson model:

\[
\log(p_y) = \beta_0 + \gamma t + \sum_{j=1}^{6} \beta_j x_j + \sum_{k=7}^{12} \beta_k z_k \qquad [2]
\]

where z_7, ..., z_12 denote the interaction and square terms of the outcome related covariates X1 - X6.

The dichotomous treatment and outcome were generated using a Bernoulli distribution with subject-specific probabilities of treatment assignment and outcome status, π = p_t and π = p_y, respectively. Marginal treatment effects (RR) of 1 (γ = log 1), 1.5 (γ = log 1.5), and 2 (γ = log 2) were considered for the sample size of 2000. The values of β0 and α0 were set such that approximately half of the subjects were treated and the outcome event occurred in approximately 15%, 10%, or 5% of the untreated individuals. We used 500 replications in each scenario.

Table 1. List of Variables and Terms in the Treatment and Outcome Models and Their Association with the Treatment/Outcome
Confounding: variables X1, X2, X4, and X5, and their interaction and square terms.
Only treatment related (instrumental variables): variables X7 and X8; terms X7, X8, X2X7, X7X8, and X7².
Only outcome related (risk factors): variables X3 and X6, and their interaction and square terms.

Strong and medium associations between covariates and treatment/outcome were induced by varying the values of the regression coefficients for the main terms, i.e., α1 - α6 and β1 - β6 in the treatment [1] and outcome [2] models, respectively.

For binary variables, the values of α1 - α6 and β1 - β6 were set to log(3) and log(1.5) to induce strong and medium associations between a covariate and the treatment or outcome, respectively. This means that a strong or moderate association between a given variable and either treatment or outcome was considered present when that variable independently increased the risk of treatment or outcome by a factor of three or 1.5, respectively. For continuous variables, the values of α1 - α6 and β1 - β6 were set to log(2) and log(1.35) to induce strong and medium associations between a covariate and the treatment or outcome, respectively. In this case, strong and moderate associations with treatment or outcome were defined as an increase in the risk of treatment or outcome by a factor of two and 1.35, respectively, per standard deviation increase in the covariate. No association was considered between a given covariate and treatment or outcome when the covariate did not have an independent impact on the risk of treatment or outcome. In addition, the following values for the regression coefficients of interaction and square terms, i.e., α7 - α12 and β7 - β12, were used: log(1.2) for α7, α10, β7 and β10; log(1.4) for α8, α11, β8 and β11; log(1.6) for α9, α12, β9 and β12.
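The way the intercepts and the Bernoulli draws fit together can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: only main terms are included, all covariates are independent with a hypothetical prevalence of 0.4, the assignment of strong and medium coefficients to specific covariates is arbitrary, and bisection is one possible way to set α0 and β0 (the text does not say how they were chosen). The outcome intercept is calibrated on the mean risk with t = 0 over all subjects, which only approximates the risk among the untreated. Probabilities from the log-linear model are capped at 1, so the realized marginal RR can deviate slightly from the nominal value.

```python
import math
import random

STRONG, MEDIUM = math.log(3.0), math.log(1.5)  # coefficient values from the text


def expit(z):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-z))


def capped_exp(z):
    """Inverse of the log link, capped so that it remains a valid probability."""
    return min(1.0, math.exp(z))


def calibrate_intercept(linear_parts, target, inverse_link):
    """Bisection for the intercept that makes the mean probability equal target."""
    lo, hi = -20.0, 20.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        mean_p = sum(inverse_link(mid + lp) for lp in linear_parts) / len(linear_parts)
        lo, hi = (mid, hi) if mean_p < target else (lo, mid)
    return (lo + hi) / 2.0


def simulate(n=2000, rr=1.5, untreated_risk=0.15, seed=1):
    """Return covariates x (n rows of X1..X9), treatment t, and outcome y."""
    rng = random.Random(seed)
    # Hypothetical simplification: independent binary covariates, prevalence 0.4.
    x = [[1 if rng.random() < 0.4 else 0 for _ in range(9)] for _ in range(n)]
    # Treatment depends on X1, X2, X4, X5 (confounders) and X7, X8 (instruments).
    lp_t = [STRONG * (r[0] + r[3] + r[6]) + MEDIUM * (r[1] + r[4] + r[7]) for r in x]
    # Outcome depends on X1, X2, X4, X5 (confounders) and X3, X6 (risk factors).
    lp_y = [STRONG * (r[0] + r[3] + r[2]) + MEDIUM * (r[1] + r[4] + r[5]) for r in x]
    a0 = calibrate_intercept(lp_t, 0.5, expit)
    b0 = calibrate_intercept(lp_y, untreated_risk, capped_exp)
    t = [1 if rng.random() < expit(a0 + lp) else 0 for lp in lp_t]
    y = [1 if rng.random() < capped_exp(b0 + math.log(rr) * ti + lp) else 0
         for ti, lp in zip(t, lp_y)]
    return x, t, y
```

Calibrating the intercept on the simulated covariates, rather than fixing it in advance, keeps the proportion treated close to one half regardless of which coefficient values are used.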

Figure 1. Direct Causal Diagrams for the Data Generating Mechanism (diagram nodes: A, Y, and X1 - X9)

Propensity Score Models

Although it was possible to construct several PS models using the nine covariates and their interaction and square terms, in each simulated dataset we used only 30 different PS models. The PS models considered included models containing confounding variables and either instrumental variables or risk factors or their terms (PS models 1-5), only confounding variables (PS models 10-14), confounding and instrumental variables or terms (PS models 17-21), confounding variables and risk factors or terms (PS models 24-28), and other combinations of covariates (Appendix 1). To evaluate the impact of including instrumental variable(s) in the PS model on the bias in the estimated treatment effect, both in the presence and in the absence of unmeasured confounding, the covariate X4 was dropped from some of the PS models; hence, X4 was considered an unmeasured confounding variable when it was not included in the PS model (e.g., PS models 7, 8, 15, 22, and 29).

Propensity Score Matching Algorithm

We used one-to-one nearest neighbor matching within a specified caliper (NNMC) with and without replacement as the PS matching algorithm. These are variants of nearest neighbor matching (NNM); however, compared to the standard NNM method, NNMC avoids poor matches by imposing a caliper width in such a way that treated and untreated matches are selected only if their PS values lie within the caliper. 15 NNMC with and without replacement primarily vary in the number of patients that remain after matching and in the relative weights the patients receive (i.e., they differ in whether untreated patients can be used as matches for more than one treated patient).
In NNMC with replacement, we used the descending method 8, 23 : matching was started with the treated subject with the highest PS value, although the order in which matches are constructed does not matter in matching with replacement because the untreated reservoir is not depleted during matching. 15

In NNMC without replacement, once an untreated subject has been matched to a treated subject, that untreated subject is no longer eligible for consideration as a match for other treated subjects. 8, 15 As a result, there is a sequential depletion of untreated matches for treated subjects. Consequently, the order in which matches are constructed (e.g., selecting treated subjects from highest to lowest PS, from lowest to highest PS, or in a random order of PS) is important. 8, 15, 16 Figure 2 shows the principal steps in NNMC using the PS with and without replacement.

We used 10 different caliper widths: 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.5, and 0.6 standard deviations of the logit of the propensity score. We used a fixed random number seed so that the matched sample was reproducible in subsequent analyses.

Propensity Score Balance Measure

Although several PS balance measures have been proposed in the literature, 7, 9 in this study we used only the absolute standardized difference. The absolute standardized difference was chosen because it has displayed better performance with respect to bias reduction in different patterns of covariate distribution and sample size compared to other balance measures such as the Kolmogorov-Smirnov distance. 9, 24, 25
Figure 2. Steps in Nearest Neighbor Matching With Caliper With and Without Replacement
1. Arrange treated subjects in increasing, decreasing, or random order of their PS values.
2. Find the closest subject from the pool of untreated subjects.
3. If the distance between the treated and untreated subjects is within the specified caliper width, record the match.
4. NNM with replacement: remove the treated subject from the pool of treated subjects and return the untreated subject to the pool of untreated subjects. NNM without replacement: remove both the matched treated and untreated subjects from their pools.
5. Go back to the first step and repeat for every treated subject.
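The steps in Figure 2 can be sketched in code as follows. This is an illustrative greedy implementation under stated assumptions, not the authors' R code: the function name `nn_caliper_match` and its parameters are hypothetical, and the caliper is taken as a multiple of the sample standard deviation of the logit PS computed over all subjects (the text does not specify whether a pooled or overall standard deviation was used).

```python
import math
import random
import statistics


def logit(p):
    return math.log(p / (1.0 - p))


def nn_caliper_match(ps_treated, ps_untreated, caliper_sd=0.2,
                     replace=False, order="random", seed=1):
    """Return (treated_index, untreated_index) pairs from 1:1 NNMC matching."""
    lt = [logit(p) for p in ps_treated]
    lu = [logit(p) for p in ps_untreated]
    # Assumption: caliper = caliper_sd times the SD of the logit PS over all subjects.
    caliper = caliper_sd * statistics.stdev(lt + lu)

    # Step 1: arrange treated subjects in the requested order.
    order_idx = list(range(len(lt)))
    if order == "descending":
        order_idx.sort(key=lambda i: -lt[i])
    elif order == "ascending":
        order_idx.sort(key=lambda i: lt[i])
    else:
        random.Random(seed).shuffle(order_idx)  # fixed seed for reproducibility

    available = set(range(len(lu)))
    pairs = []
    for i in order_idx:
        if not available:
            break
        # Step 2: closest untreated subject; ties broken by the lower index.
        j = min(available, key=lambda k: (abs(lu[k] - lt[i]), k))
        # Step 3: record the match only if it lies within the caliper.
        if abs(lu[j] - lt[i]) <= caliper:
            pairs.append((i, j))
            if not replace:
                available.remove(j)  # without replacement: deplete the pool
    return pairs
```

With `replace=True` the untreated pool is never depleted, so the processing order does not change which untreated subject each treated subject receives; with `replace=False` the `order` argument can change the result, which is why the order of matching matters only in that case.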

The absolute standardized difference (SDif) for a binary confounding variable is the absolute difference in the prevalence of the variable between treated and untreated subjects, standardized to the variation in the variable (i.e., the standard deviation). It has a minimum value of zero (perfect balance) but no maximum value. For continuous confounding variables, the SDif is the absolute difference in the means of the variable between treated and untreated subjects, standardized to the pooled variation. 7, 9 For a more detailed explanation of this balance measure, we refer to the literature. 7, 9, 24, 26

Analysis

For each of the 30 PS models, the PS was estimated using ordinary logistic regression, and the effect of the treatment on the outcome was estimated in the PS matched dataset using a Poisson regression model (yielding a risk ratio). The PS matching was done using NNMC with and without replacement. In NNMC with replacement, weights were used in the outcome analysis to account for the fact that matched untreated subjects could be used more than once and were therefore no longer independent. 27 The absolute standardized difference of each covariate, interaction, and square term was calculated before and after matching for each caliper width and PS model combination. In addition, the number of subjects available for analysis was recorded for each matching algorithm (NNMC with and without replacement), caliper width, and PS model combination.

To assess how covariate selection for the PS model affects the bias and variance of the treatment effect in the presence of instrumental variables and unmeasured confounding, we compared, for different PS models, the balance of covariates (SDif) before and after matching and the bias in the estimated treatment effect. In this case, we chose a caliper width of 0.20 standard deviations of the logit of the propensity score.
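The SDif described above can be computed as in the following sketch. The exact standardization used in the study is given in the cited literature; the sketch assumes the commonly used form in which the difference is divided by the square root of the average of the two group variances, and the function names are hypothetical.

```python
import math
import statistics


def sdif_binary(treated, untreated):
    """Absolute standardized difference for a 0/1 covariate."""
    p1 = sum(treated) / len(treated)
    p0 = sum(untreated) / len(untreated)
    # Assumption: standardize by the SD averaged over the two groups.
    sd = math.sqrt((p1 * (1 - p1) + p0 * (1 - p0)) / 2.0)
    if sd == 0.0:
        # No variation in either group: balanced only if prevalences coincide.
        return 0.0 if p1 == p0 else math.inf
    return abs(p1 - p0) / sd


def sdif_continuous(treated, untreated):
    """Absolute standardized difference for a continuous covariate."""
    m1, m0 = statistics.mean(treated), statistics.mean(untreated)
    pooled_sd = math.sqrt(
        (statistics.variance(treated) + statistics.variance(untreated)) / 2.0
    )
    if pooled_sd == 0.0:
        return 0.0 if m1 == m0 else math.inf
    return abs(m1 - m0) / pooled_sd
```

Interaction and square terms can be handled by first computing the product or square for each subject and then passing the resulting values to the appropriate function.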
Next, using the same caliper width, balance was measured on four different categories of covariates, interaction terms, and square terms using the average of the SDif:
1. All variables and/or interaction or square terms (X1 - X8, X2X4, X4X5, X3X6, X2X7, X7X8, X1², X6², and X7²),
2. Outcome related covariates and/or interaction or square terms (X1 - X6, X2X4, X4X5, X3X6, X1², and X6²),
3. Confounding variables and/or interaction or square terms (X1, X2, X4, X5, X2X4, X4X5, and X1²), and
4. Treatment related covariates and/or interaction or square terms (X1, X2, X4, X5, X7, X8, X2X4, X4X5, X2X7, X7X8, X1², and X7²).

The pooled SDif was then used for PS model selection. Hence, the PS model that gave the best balance in each set of covariates (i.e., the minimum value of SDif) was selected. The balance of covariates (values of SDif), the bias in the treatment effect estimate, and the number of patients included in the analysis were compared for the four categories of variables included in the PS model.
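The selection rule described above (average the SDif over a chosen set of terms and keep the PS model with the minimum) can be sketched as follows. The term labels, function names, and data layout are hypothetical; the term sets are taken from categories 3 and 4 in the list above.

```python
# Term sets from the list above (labels are illustrative strings).
CONFOUNDER_TERMS = ("X1", "X2", "X4", "X5", "X2X4", "X4X5", "X1^2")
TREATMENT_TERMS = CONFOUNDER_TERMS + ("X7", "X8", "X2X7", "X7X8", "X7^2")


def mean_sdif(sdif_by_term, terms):
    """Average SDif over the terms used for the balance calculation."""
    return sum(sdif_by_term[term] for term in terms) / len(terms)


def select_ps_model(sdif_by_model, terms):
    """Return the key of the candidate with the smallest average SDif.

    sdif_by_model maps a candidate label to a dict {term: SDif after matching}.
    """
    return min(sdif_by_model, key=lambda label: mean_sdif(sdif_by_model[label], terms))
```

Because the keys of `sdif_by_model` are arbitrary labels, the same function could be applied to (PS model, caliper width) combinations, for example with tuple keys, to compare caliper widths on achieved balance.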

Finally, the performance of SDif was evaluated by comparing the amount of bias in the treatment effect estimate (selecting the optimal caliper width and PS model; hence, considering the least biased treatment effect estimate) and the precision of the estimate (i.e., the number of subjects retained for analysis after matching) using different calipers. Bias was calculated as the difference between the true marginal effect (used in the data generation process, i.e., RR = 1.0, RR = 1.5 or RR = 1.75) and the effect estimate obtained from a Poisson regression model in the PS matched dataset. All analyses were performed in R. R code for the simulation is available from the authors upon request.

RESULTS

When all the covariates were binary and independent (Scenario 1), PS matching reduced the imbalance only of the covariates that were included in the PS model. On the other hand, the imbalance of covariates that were not included in the PS model (as well as of unmeasured confounding that was independent of the measured covariates in the PS model, e.g., X4 in PS model 7) increased in the PS matched sample compared to the original unmatched data (Table 2). The inclusion of variables related only to the treatment (instrumental variables, IVs) in the PS model amplified the residual bias due to unmeasured confounding (by exacerbating the imbalance of the unmeasured confounder, X4, between treatment groups, Table 3) and increased the variation in the treatment effect. Inclusion of variables related only to the outcome (risk factors) reduced the variability of the treatment effect without increasing bias, even in the presence of unmeasured confounding. The results were similar with a lower incidence of the outcome (5%), different strengths of the treatment effect (RR = 1 and RR = 1.5), and different covariate distribution patterns (Appendix).
The balance of covariates before and after matching using different PS models, when the unmeasured confounding variable was correlated with another confounder (Scenario 2), is shown in Table 2. PS matching improved balance not only of the covariates that were included in the PS model but also of the unmeasured confounding variable (X4), which was related to a measured one. Inclusion of IVs in the PS model resulted in only a small increase in bias due to residual confounding (compared to Scenario 1) and increased the variation in the treatment effect (Table 3).

When an unmeasured confounder was associated with treatment related covariates (IVs or near IVs, i.e., covariates with no or only a weak correlation with the outcome, respectively; Scenario 3), inclusion of near IVs in the PS model improved balance not only of the covariates that were included in the PS model but also of the unmeasured confounding variable that was related to the near IVs (Table 2). As a result, the bias in the treatment effect due to unmeasured confounding was reduced at the cost of a small increase in the variance of the treatment effect. On the other hand, exclusion of these near IVs from the PS model resulted in increased bias compared to their inclusion in the PS model (Table 3).

Table 2. Balance of Covariates Using the Standardized Difference When Different Sets of Covariates Were Included in the Propensity Score
Panels: all covariates were independent; X4 and X5 were correlated; X4 and X7 were correlated.
Columns: Terms, Unmatched, PS2, PS18, PS25, PS11, PS15, PS29, PS22, PS8.
[Table values not recoverable.]
Data were unmatched* or matched using the Full PS model (PS2), True PS model (PS18), Risk Factor PS model (PS25), Confounder PS model (PS11), Confounder PS model omitting confounder X4 (PS15), Confounder PS model omitting confounder X4 but including risk factors (PS29), Confounder PS model omitting confounder X4 but including instrumental variables (PS22), and Confounder PS model omitting confounder X4 but including risk factors and instrumental variables (PS8).

Table 3. Median (Interquartile Range) of Estimated Treatment Effect Using Different Propensity Score Models in Different Scenarios (True RR = 1.75, 500 Simulations)
Columns: Model; All covariates were independent, HR (IQR); X4 and X5 were correlated, HR (IQR); X4 and X7 were correlated, HR (IQR).
Rows: Crude; FullPSModel; TruePSModel; OutcomePSModel; ConfPSModel; OmitConfPSModel 1; OmitConfPSModel 2; OmitConfPSModel 3; OmitConfPSModel 4.
[Table values not recoverable.]
In all the OmitConfPSModels 1-4, the confounder X4 was omitted.
OmitConfPSModel 1: other confounders but no risk factors or instrumental variables were included (PS Model 15).
OmitConfPSModel 2: confounders and risk factors but no instrumental variables were included (PS Model 29).
OmitConfPSModel 3: confounders and (near) instrumental variables but no risk factors were included (PS Model 22).
OmitConfPSModel 4: confounders and (near) instrumental variables as well as risk factors were included (PS Model 8).

Checking balance on different sets of covariates and their interaction or square terms resulted in the selection of different PS models and influenced both the bias and the variance of the treatment effect. When only confounding variables and/or their interaction and square terms were used in the balance calculation, the bias was reduced and the precision was improved compared to balance calculations that involved all covariates or confounding variables and treatment related covariates (Table 4). Results from balance calculations based on outcome related covariates (and interaction as well as square terms) were similar to those based only on confounding variables (and interaction as well as square terms). When balance calculations involved confounding factors alone or confounding factors plus risk factors, effect estimates were similar, although larger numbers of subjects were retained for analysis when the balance calculation involved only main terms and not interaction or square terms (Table 5).
On the other hand, when balance calculations involved all covariates or confounding terms and only treatment related covariates, the least biased estimates were obtained when only main terms were included in balance calculations. However, inclusion of interaction and square terms did not impact the number of subjects retained for analysis. When PS model selection was based on the achieved balance on confounding terms only, the confounder PS was the most frequently selected PS model followed by the outcome PS model. The full PS model and true PS models were more often selected when balance calculation involved all covariates or treatment related covariates. Results were similar across different caliper widths and covariate distributions (Data not shown).

Table 4. Median (Interquartile Range) Treatment Effect When the Propensity Score Model Was Chosen Based on Balance Calculations on Different Sets of Covariates (True RR = 1.75, 500 Simulations)
Columns: Caliper; All Covariates; Confounding Factors; Confounding and Risk Factors; Confounding and Treatment Related Factors.
Rows: caliper widths (starting at 0.05), listed once for balance calculation * and once for balance calculation **.
[Table values not recoverable.]
*Balance calculation involved covariates and their interaction or square terms
**Balance calculation involved only main terms (covariates) and not their interaction or square terms

Figure 3 shows the impact of caliper width on bias (3a) and on the number of subjects included in the matched analysis (3b) for different propensity score models under the three different scenarios. In general, the bias in the estimated treatment effect as well as the number of subjects included in the matched analysis increased with increasing caliper width. However, when covariates were correlated, the number of subjects matched did not seem to change dramatically with increasing caliper width. In addition, once the propensity score models were chosen based on balance measures, the impact of caliper width was minimal.

Table 5. Median Number of Subjects Matched When the Propensity Score Model Was Chosen Based on Balance Calculations on Different Sets of Covariates (True RR = 1.75, 500 Simulations)
Columns: Caliper; All Covariates; Confounding Factors; Confounding and Risk Factors; Confounding and Treatment Related Factors.
Rows: caliper widths (starting at 0.05), listed once for balance calculation * and once for balance calculation **.
[Table values not recoverable.]
*Balance calculation involved covariates and their interaction or square terms
**Balance calculation involved only main terms (covariates) and not their interaction or square terms

DISCUSSION

This study demonstrated that, in choosing covariates for a propensity score model, the association among covariates has a substantial impact on the balance of other covariates and on the bias in the treatment effect estimate. Inclusion of instrumental variables in the propensity score model amplified bias due to unmeasured confounding that was independent of other confounders, by exacerbating the imbalance in unmeasured confounders. Reduction in bias with optimal precision in the treatment effect estimates can be achieved by improving the balance of confounding variables (and interaction and square terms) only; attempting to balance all covariates, including even outcome related variables, will produce more bias and lower precision. Our results further illustrate the usefulness of balance measures in choosing the optimal caliper width with respect to both bias and precision of the estimated treatment effect.

Figure 3. Caliper width and percent reduction in bias (a and c) and number of matched subjects (b and d). FullPSModel (included treatment and/or outcome related covariates or terms), TruePSModel (included confounding terms and treatment related covariates or terms), RiskFactorPSModel (included confounding terms and risk factors, i.e., outcome related covariates or terms), ConfPSModel (included only confounding terms), PSModelUC (PS model with confounding terms omitted from the model), PSModelUCRF (PS model with confounding terms omitted from the model but risk factors included), PSModelUCIV (PS model with confounding terms omitted from the model but instrumental or near instrumental variables included). Matching was done with replacement and covariates were all binary, True RR = 1.75 and sample size (n) = 2000. All covariates were independent (a and b), and X4 and X7 were correlated (c and d).
Axes: x-axis, Caliper Width on the Logit of the Propensity Score; y-axis, Percent Reduction in Bias (a and c) or Number of Subjects Matched (b and d).

Bias amplification due to instrumental variables has been widely accepted among epidemiologic researchers, 10, 12, 21 shifting the paradigm in covariate selection towards selection based on association with the outcome variable. Although the current study is consistent with previous findings in that instrumental variables should not be included in propensity score models (or outcome regression models), researchers who exclude them implicitly assume that the instrumental variables are independent of unmeasured confounding that cannot be controlled for. However, the last few years have demonstrated the difficulties of finding perfect instrumental variables in pharmacoepidemiology that fully satisfy the key assumptions: strong association with the treatment; no association with the outcome, except through its association with the treatment; and independence from confounding variables (both measured and unmeasured). Although one can check the association between the instrumental variable and the treatment variable using data, the last two assumptions are empirically untestable. If a variable really seems to fulfill all three assumptions, one should consider application of the instrumental variable method, because this approach not only avoids the risk of bias amplification typical of propensity score methods, but also controls for residual bias due to unmeasured confounding, which cannot be addressed by either the propensity score method or multivariable regression analysis. Importantly, when the third assumption is violated, i.e., when the instrumental variable is associated with other confounding variables (measured or unmeasured), inclusion of such a variable in the propensity score or regression model still reduces bias in the effect estimate compared to its exclusion.
Bias amplification due to inclusion of an instrumental variable in the propensity score model, on the other hand, requires not only conditioning on a strong instrumental variable, but also the presence of strong unmeasured confounding that cannot be controlled for. 12, 21, 34 In practical situations, most pharmacoepidemiologic studies utilize electronic health care databases that contain a large number of variables as well as proxies for both measured and unmeasured confounding factors. Hence, through extensive listing of potential confounding factors, researchers could minimize the likelihood of unmeasured confounding to the extent that conditioning on instrumental variables carries a minimal risk of amplifying bias and offers the potential for reducing bias. Therefore, in selecting covariates for a propensity score (or outcome regression) model, one should consider mutual associations of covariates, in addition to their association with treatment and outcome variables. Moreover, when researchers encounter covariates having instrumental variable properties, falsification tests from instrumental variable methods 32, 35 can be employed to assess their association with other covariates or the outcome and thus their potential for bias amplification. Because inclusion of outcome related covariates in the propensity score (or outcome regression) model improves precision of the effect estimates without any risk of amplifying residual bias, these should always be included in the model.

Although variable selection for the propensity score model and balance assessment have been a focus of research on PS methodology, there is a paucity of research regarding which covariates or terms should be included in balance calculations. Rubin advised checking balance on covariates included in the PS model and relevant interaction terms. 36 On the other hand, variable selection based on association with the outcome and treatment implies that balance should be checked not only on confounding variables but also on covariates only related to the outcome. 21, 26 Our study indicated that assessment of balance based only on confounding variables (and their interaction or square terms) is equivalent, if not superior, in terms of both reducing bias and improving precision of the treatment effect, to assessment of balance based on outcome related covariates. Nevertheless, the propensity score model itself may include additional outcome or treatment related covariates and interaction or higher order terms while balance is assessed only on confounding variables (and their interaction or square terms).

In PS matching, the choice of caliper width not only affects the balance of covariates and the precision of the effect estimate, but also alters the interpretation of this estimate. 8, 15, 37 Although the use of 0.25 standard deviations on the logit of the PS is prevalent in the medical literature, as suggested by Rubin, the choice of caliper to produce high quality matches in terms of both bias (covariate balance) and precision of the treatment effect depends on the structure of the data at hand. 4, 8, 16 For example, in a given dataset two different calipers could result in a small difference in precision but a large variation in bias, or vice versa. Using balance measures can help researchers find a better trade-off between balance (i.e., bias) and precision. In other words, balance measures can be used in a sensitivity analysis to show the impact of caliper choice on both the balance of covariates and the number of patients retained for analysis.
Matching with replacement minimizes the propensity score distance between the matched treated and untreated subjects: each treated subject can be matched to the nearest untreated subject, even if an untreated subject is matched more than once, thereby enabling optimal use of the treated subject population. This is beneficial in terms of bias reduction and direct interpretation of the effect estimate as the average treatment effect in the treated (ATT). 23 In contrast, when matching without replacement and only few untreated subjects are similar to the treated subjects, researchers may be forced to match treated subjects to untreated subjects that are quite different in terms of the estimated propensity scores. This will increase the bias but could improve the precision of the estimates compared to matching with replacement. 15 An additional complication of matching without replacement is that the results are potentially sensitive to the order in which the treated subjects are matched. 15 In both cases, untreated matches may not be found for all treated subjects, for example due to depletion of the untreated pool, resulting in exclusion of unmatched treated subjects from the analysis. In such cases, the estimated treatment effect is not simply the average treatment effect in the treated, but rather the average treatment effect in the treated for whom untreated matches were found. 8, 37

Our study has several strengths. First, comparisons were made in a setting similar to most pharmacoepidemiologic studies in terms of covariate distributions, caliper widths and magnitude of treatment effect estimates. Second, we considered not only covariates but also their interaction and square terms, both in modeling propensity scores and in evaluating balance. Third, we assessed the sensitivity of our findings in various scenarios: matching with

and without replacement, covariate patterns, and strength of treatment effect. Limitations of our study include the fact that we did not compare different options of matching without replacement ("high-to-low", "low-to-high", and "random"). Our focus on matching without replacement with random selection of treated subjects to match with untreated subjects was based on the better performance of this matching option compared to the high-to-low and the low-to-high order matching. 8 Second, overall balance was assessed based on the average of covariate-specific standardized differences, and the strength of the covariates' association with treatment and outcome was not accounted for. Although the use of weights based on the covariates' association with the outcome has been previously proposed 24 and applied, it was not superior to the approach we used and hence was not considered in this study. 24, 26 On the other hand, the post-matching c-statistic, 26 a recently proposed balance metric, enables researchers to assess multivariate balance, unlike our approach. However, the advantage of univariate balance measures is that imbalance on specific covariates can easily be identified and their balance can be improved by including interactions or squares in the propensity score model. Nonetheless, multivariate balance metrics could outperform univariate ones, particularly in high-dimensional data where a large number of computations per covariate is needed.

We conclude that in selecting variables for the PS model, emphasis should be given not only to the covariates' association with the treatment or outcome but also to the correlation between covariates. Decisions on whether to include or exclude a variable with instrumental variable properties in a propensity score model could be supported by falsification tests from instrumental variable analysis.
In addition, researchers should place more emphasis on obtaining an extensive list of potential confounding factors or proxies for unmeasured confounding to minimize the risk of residual confounding and its amplification through inclusion of instrumental variables, rather than relying on exclusion of any potential instrumental variable. Although all potential confounding factors and interactions or square terms should be included in propensity score models, balance calculations based on the main terms only are sufficient and reduce the number of calculations, in particular with a large number of covariates. Furthermore, balance measures are valuable tools for choosing the optimal caliper width in PS matching; hence, we suggest that researchers start the PS matching using several calipers, including 0.20 SD on the logit of the PS, and choose the one that results in the best balance (i.e., the lowest standardized difference on covariates) and in an acceptable number of matched groups.
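The suggestion to try several calipers and compare balance against the number of matched groups can be sketched as follows. This is a minimal Python illustration under stated assumptions: greedy 1:1 nearest-neighbour matching without replacement with treated subjects taken in random order, a caliper expressed as a multiple of the SD of the logit PS pooled over both groups, and a single covariate for the balance check. All function names and data are hypothetical and do not reproduce the simulation code of the study.

```python
import random
from math import log, sqrt
from statistics import mean, stdev, variance


def logit(p):
    return log(p / (1 - p))


def caliper_match(ps_treated, ps_untreated, caliper_sd, seed=0):
    """Greedy 1:1 nearest-neighbour matching without replacement.

    A pair is kept only if the logit-PS distance is within
    caliper_sd * SD(logit PS). Returns (treated_idx, untreated_idx) pairs.
    """
    lt = [logit(p) for p in ps_treated]
    lu = [logit(p) for p in ps_untreated]
    width = caliper_sd * stdev(lt + lu)
    order = list(range(len(lt)))
    random.Random(seed).shuffle(order)  # random order of treated subjects
    available = set(range(len(lu)))
    pairs = []
    for i in order:
        if not available:
            break
        # Nearest still-available untreated subject; ties broken by index.
        j = min(available, key=lambda u: (abs(lt[i] - lu[u]), u))
        if abs(lt[i] - lu[j]) <= width:
            pairs.append((i, j))
            available.remove(j)
    return pairs


def std_diff(a, b):
    """Absolute standardized difference (pooled-SD form, an assumption)."""
    sd = sqrt((variance(a) + variance(b)) / 2)
    gap = abs(mean(a) - mean(b))
    if sd == 0:
        return 0.0 if gap == 0 else float("inf")
    return gap / sd


def caliper_sweep(ps_t, ps_u, x_t, x_u, calipers):
    """For each caliper: number of matched pairs and covariate balance."""
    results = []
    for c in calipers:
        pairs = caliper_match(ps_t, ps_u, c)
        balance = None  # undefined with fewer than two pairs
        if len(pairs) >= 2:
            balance = std_diff([x_t[i] for i, _ in pairs],
                               [x_u[j] for _, j in pairs])
        results.append({"caliper": c, "pairs": len(pairs), "std_diff": balance})
    return results


# Hypothetical propensity scores and one covariate per subject.
ps_t = [0.30, 0.45, 0.60, 0.75, 0.85]
ps_u = [0.10, 0.28, 0.33, 0.50, 0.58, 0.62]
x_t = [1.2, 0.8, 1.9, 2.4, 3.0]
x_u = [0.2, 1.0, 1.1, 1.5, 1.8, 2.1]
for row in caliper_sweep(ps_t, ps_u, x_t, x_u, [0.05, 0.2, 0.5, 1.0]):
    print(row)
```

Reading the output side by side shows the trade-off described above: wider calipers retain more matched pairs, while narrower ones tend to give closer matches.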

REFERENCES

1. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika 1983; 70:
2. Ho DE, Imai K, King G, Stuart EA. Matching as nonparametric preprocessing for reducing model dependence in parametric causal inference. Political Analysis 2007; 15:
3. Rosenbaum PR, Rubin DB. Reducing bias in observational studies using subclassification on the propensity score. JASA 1984; 79:
4. Cochran WG, Rubin DB. Controlling bias in observational studies: a review. Sankhyā: Series A 1973; 35:
5. Rubin DB. On principles for modeling propensity scores in medical research. Pharmacoepidemiol Drug Saf 2004; 13:
6. King G, Nielsen R, Coberley C, Pope JE, Wells A. Comparative effectiveness of matching methods for causal inference. Unpublished manuscript 2011;
7. Austin PC. Balance diagnostics for comparing the distribution of baseline covariates between treatment groups in propensity-score matched samples. Stat Med 2009; 28:
8. Lunt M. Selecting an appropriate caliper can be essential for achieving good balance with propensity score matching. Am J Epidemiol 2014; 179:
9. Ali MS, Groenwold RHH, Pestman WR, et al. Propensity score balance measures in pharmacoepidemiology: a simulation study. Pharmacoepidemiol Drug Saf doi: / pds
10. Pearl J. Invited commentary: understanding bias amplification. Am J Epidemiol 2011; 174:
11. Pearl J. On a class of bias-amplifying variables that endanger effect estimates. 2010.
12. Myers JA, Rassen JA, Gagne JJ, et al. Effects of adjusting for instrumental variables on bias and precision of effect estimates. Am J Epidemiol 2011; 174:
13. Brookhart MA, Schneeweiss S, Rothman KJ, et al. Variable selection for propensity score models. Am J Epidemiol 2006; 163:
14. Patrick AR, Schneeweiss S, Brookhart MA, et al. The implications of propensity score variable selection strategies in pharmacoepidemiology: an empirical illustration. Pharmacoepidemiol Drug Saf 2011; 20:
15. Stuart EA. Matching methods for causal inference: a review and a look forward. Stat Sci 2010; 25:
16. Austin PC. A comparison of 12 algorithms for matching on the propensity score. Stat Med 2014; 33:
17. Austin PC. Optimal caliper widths for propensity-score matching when estimating differences in means and differences in proportions in observational studies. Pharm Stat 2011; 10:
18. Rosenbaum PR, Rubin DB. Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. Am Stat 1985; 39:
19. Austin PC. A critical appraisal of propensity-score matching in the medical literature between 1996 and 2003. Stat Med 2008; 27:
20. Austin PC. Propensity-score matching in the cardiovascular surgery literature from 2004 to 2006: a systematic review and suggestions for improvement. J Thorac Cardiovasc Surg 2007; 134:
21. Ali MS, Groenwold RHH, Belitser SV, et al. Covariate selection and assessment of balance in propensity score analysis in the medical literature: a systematic review. Accepted, J Clin Epidemiol.
22. Rubin DB, Thomas N. Combining propensity score matching with additional adjustments for prognostic covariates. JASA 2000; 95:
23. Dehejia R, Wahba S. Propensity score-matching methods for nonexperimental causal studies. Rev Econ Stat 2002; 84:
24. Belitser SV, Martens EP, Pestman WR, et al. Measuring balance and model selection in propensity score methods. Pharmacoepidemiol Drug Saf 2011; 20:
25. Groenwold RHH, Vries F, Boer A, et al. Balance measures for propensity score methods: a clinical example on beta-agonist use and the risk of myocardial infarction. Pharmacoepidemiol Drug Saf 2011; 20:
26. Franklin JM, Rassen JA, Ackermann D, et al. Metrics for covariate balance in cohort studies of causal effects. Stat Med 2014; 33:
27. Stuart EA. Developing practical recommendations for the use of propensity scores: Discussion of "A critical appraisal of propensity score matching in the medical literature between 1996 and 2003" by Peter Austin, Statistics in Medicine. Stat Med 2008; 27:
28. R Development Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
29. Angrist JD, Imbens GW, Rubin DB. Identification of causal effects using instrumental variables. JASA 1996; 91:
30. Martens EP, Pestman WR, de Boer A, Belitser SV, Klungel OH. Instrumental variables: application and limitations. Epidemiology 2006; 17:
31. Hernan MA, Robins JM. Instruments for causal inference: an epidemiologist's dream? Epidemiology 2006; 17:
32. Uddin M, Groenwold RHH, de Boer A, et al. Performance of instrumental variable methods in cohort and nested case control studies: a simulation study. Pharmacoepidemiol Drug Saf 2014; 23:
33. Brookhart MA, Rassen JA, Schneeweiss S. Instrumental variable methods in comparative safety and effectiveness research. Pharmacoepidemiol Drug Saf 2010; 19:
34. Myers JA, Rassen JA, Gagne JJ, et al. Myers et al. respond to "Understanding bias amplification". Am J Epidemiol 2011; 174:
35. Ali MS, Uddin M, Groenwold RHH, et al. Quantitative falsification of instrumental variable assumptions using balance measures. Accepted, Epidemiol 2014;
36. Rubin DB. Using propensity scores to help design observational studies: application to the tobacco litigation. Health Serv Outcomes Res 2001; 2:
37. Hill J. Discussion of research using propensity-score matching: Comments on "A critical appraisal of propensity-score matching in the medical literature between 1996 and 2003" by Peter Austin, Statistics in Medicine. Stat Med 2008; 27:

APPENDICES

Appendix 1. Description of the Propensity Score Models

PS Model 1: x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x2*x4 + x2*x7 + x3*x5 + x4*x5 + x3*x6 + x7*x8 + I(x1^2) + I(x7^2) + I(x6^2) + x9
PS Model 2: x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x2*x4 + x2*x7 + x3*x5 + x4*x5 + x3*x6 + x7*x8 + I(x1^2) + I(x7^2) + I(x6^2)
PS Model 3: x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x2*x4 + x2*x7 + x3*x5 + x4*x5 + x3*x6 + x7*x8
PS Model 4: x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + I(x1^2) + I(x7^2) + I(x6^2)
PS Model 5: x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8
PS Model 6: x1 + x2 + x3 + x5 + x6 + x7 + x8 + I(x1^2) + I(x7^2) + I(x6^2)
PS Model 7: x1 + x2 + x3 + x5 + x6 + x7 + x8
PS Model 8: x1 + x2 + x3 + x5 + x6 + x2*x7 + x3*x5 + x3*x6 + x7*x8 + I(x6^2)
PS Model 9: x1 + x2 + x3 + x5 + x6 + x3*x5 + x3*x6 + I(x1^2) + I(x6^2)
PS Model 10: x1 + x2 + x4 + x5 + x2*x4 + x4*x5 + I(x1^2) + x9
PS Model 11: x1 + x2 + x4 + x5 + x2*x4 + x4*x5 + I(x1^2)
PS Model 12: x1 + x2 + x4 + x5 + x2*x4 + x4*x5
PS Model 13: x1 + x2 + x4 + x5 + I(x1^2)
PS Model 14: x1 + x2 + x4 + x5
PS Model 15: x1 + x2 + x5 + I(x1^2)
PS Model 16: x1 + x2 + x4
PS Model 17: x1 + x2 + x4 + x5 + x7 + x8 + x2*x4 + x2*x7 + x7*x8 + x4*x5 + I(x1^2) + I(x7^2) + x9
PS Model 18: x1 + x2 + x4 + x5 + x7 + x8 + x2*x4 + x2*x7 + x7*x8 + x4*x5 + I(x1^2) + I(x7^2)
PS Model 19: x1 + x2 + x4 + x5 + x7 + x8 + x2*x4 + x2*x7 + x7*x8 + x4*x5
PS Model 20: x1 + x2 + x4 + x5 + x7 + x8 + I(x1^2) + I(x7^2)
PS Model 21: x1 + x2 + x4 + x5 + x7 + x8
PS Model 22: x1 + x2 + x5 + x7 + x8 + I(x1^2) + I(x7^2)
PS Model 23: x1 + x2 + x4 + x7 + x8
PS Model 24: x1 + x2 + x3 + x4 + x5 + x6 + x2*x4 + x3*x5 + x3*x6 + x4*x5 + I(x1^2) + I(x6^2) + x9
PS Model 25: x1 + x2 + x3 + x4 + x5 + x6 + x2*x4 + x3*x5 + x3*x6 + x4*x5 + I(x1^2) + I(x6^2)
PS Model 26: x1 + x2 + x3 + x4 + x5 + x6 + x2*x4 + x3*x5 + x3*x6 + x4*x5
PS Model 27: x1 + x2 + x3 + x4 + x5 + x6 + I(x1^2) + I(x6^2)
PS Model 28: x1 + x2 + x3 + x4 + x5 + x6
PS Model 29: x1 + x2 + x3 + x5 + x6 + I(x1^2) + I(x6^2)
PS Model 30: x1 + x2 + x3 + x5 + x6

Appendix

[Figure 1, panels a, b and c. Line plots; x-axis: Covariates; y-axis: Absolute Standardized Difference; lines: Unmatched, FullPSModel, CFPSModel, TruePSModel, RiskFactorPSModel, PSModelUC, PSModelUCIV, PSModelUCRF.]

Figure 1. Balance of Covariates, Interaction and Square Terms (Measured as Absolute Standardized Difference) When Different PS Models are Used for Matching with Replacement (Caliper = 0.2 SD on the Logit of the PS, True RR = 1.75, 500 Replications). Absolute standardized difference of different covariates (listed on the X-axis in the order: X1, X2, X3, X4, X5, X6, X7, X8, X9, X1^2, X6^2, X7^2, X2X4, X2X7, X3X5, X3X6, X4X5, and X7X8) in the unmatched data, black line; matching using the full PS model (FullPSModel), violet line; confounder PS model (CFPSModel), purple line; true PS model (TruePSModel), blue line; risk factor PS model (RiskFactorPSModel), yellow line; PS model with missing confounders (PSModelUC), green line; PS model with missing confounders but including instrumental variables (PSModelUCIV); PS model with missing confounders but including outcome risk factors (PSModelUCRF). All covariates were independent (a); instrumental variable was correlated with confounders (b); near IV was correlated with unmeasured confounding (c).

[Figure 2, panels a, b and c. Line plots; x-axis: Caliper Width; y-axis: Percent Reduction In Bias; lines: FullPSModel, CFPSModel, RiskFactorPSModel, TruePSModel.]

Figure 2. Percent Reduction in Bias for Different Matching Calipers When the PS Model is Chosen Based on Balance Achieved on Different Sets of Covariates. Balance was based on all covariates, black line; only on confounding terms, blue line; on outcome related (and confounding) terms, purple line; and treatment related (and confounding) terms. Matching was done with replacement and all covariates were independent (a); instrumental variable was correlated with confounders (b); near IV was correlated with unmeasured confounding (c).

[Figure 3, panels a, b and c. Line plots; x-axis: Caliper Width: SD of logit PS; y-axes: Percentage of Reduction in Bias (a), Coverage of 95% CIs (b), Number of Matched Subjects for Analysis (c); lines: TruePS.Model, OutcomePS.Model, FullPS.Model, ConfonderPS.Model, omitted ConfPS.Model.]

Figure 3. Caliper Width and Percent Reduction in Bias (a), Coverage of 95% Confidence Interval (b), and Number of Matched Subjects (c). True PS model (confounders and treatment related variables), outcome PS model (confounders and outcome related variables), full PS model (treatment and/or outcome related variables), confounder PS model (only confounders), and omitted conf. PS model (few confounders omitted from the model). Matching was done without replacement, covariates were all binary, true RR = 2.0, sample size (n) = 2000, and 500 replications.

CHAPTER III APPLICATIONS OF PROPENSITY SCORE AND MARGINAL STRUCTURAL MODELS


CHAPTER 3.1

Time-Dependent Propensity Score and Collider-Stratification Bias: the Example of Beta2-Agonist Use and the Risk of Coronary Heart Disease

M. Sanni Ali, RHH Groenwold, WR Pestman, SV Belitser, AW Hoes, A de Boer, OH Klungel

European Journal of Epidemiology 2013; 28:


ABSTRACT

BACKGROUND
Stratification and conditioning on time-varying confounders that are also intermediates can induce collider-stratification bias and adjust away the (indirect) effect of exposure. Similar bias could be expected when one conditions on a time-dependent PS. We explored collider-stratification and confounding bias due to conditioning or stratifying on a time-dependent PS using a clinical example on the effect of inhaled short- and long-acting beta2-agonist use (SABA and LABA, respectively) on coronary heart disease (CHD).

METHODS
In an electronic general practice database we selected a cohort of patients with an indication for SABA and/or LABA use and ascertained potential confounders and SABA/LABA use per three-month interval. Hazard ratios (HR) were estimated using PS stratification as well as covariate adjustment and compared with those of Marginal Structural Models (MSMs) for SABA and LABA use separately. In MSMs, censoring was accounted for by including inverse probability of censoring weights.

RESULTS
The crude HR of CHD was 0.90 [95 % CI: 0.63, 1.28] and 1.55 [95 % CI: 1.06, 2.62] in SABA and LABA users, respectively. When PS stratification, covariate adjustment using PS, and MSMs were used, the HRs were 1.09 [95 % CI: 0.74, 1.61], 1.07 [95 % CI: 0.72, 1.60], and 0.86 [95 % CI: 0.55, 1.34] for SABA, and 1.09 [95 % CI: 0.74, 1.62], 1.13 [95 % CI: 0.76, 1.67], and 0.77 [95 % CI: 0.45, 1.33] for LABA, respectively. Results were similar for different PS methods, but higher than those of MSMs.

DISCUSSIONS
When treatment and confounders vary during follow-up, conditioning or stratification on a time-dependent PS could induce substantial collider-stratification or confounding bias; hence, other methods such as MSMs are recommended.

INTRODUCTION
Propensity score (PS) methods 1 have become a commonly used approach in observational studies to control for confounding bias in estimation of causal treatment effects. 2,3 These methods are used to balance patient characteristics between treatment groups in point-treatment studies, where confounders and treatment are constant over time. However, in follow-up studies, an overly simple (yes or no) treatment ascertainment may result in non-differential misclassification in the presence of non-compliance and/or treatment switching. 4,5 In addition, treatment groups may quickly become less comparable over the course of the study, resulting in biased estimates of treatment effects unless comparisons can be balanced over follow-up. 5 With the increased availability of prospectively gathered and computerized information on medical diagnoses and medication use, treatment status can be assessed on a daily basis,

rendering time-varying analysis of drug use possible. 5 A time-dependent Cox proportional hazards model is the common approach in estimating the effect of time-varying treatment. 6 However, this approach may be biased in the presence of time-varying confounders that are affected by prior treatment, 6,7 because adjustment for such confounders may adjust away part of the treatment effect and induce collider-stratification bias (Figure 1). The use of PS in such longitudinal analysis of exposure is limited to using them as inverse probability of treatment weights (IPTW) in marginal structural models (MSMs). 12 Collider-stratification bias is a bias that is introduced by conditioning or stratifying on a collider, a variable that is a common effect of two or more variables in a causal pathway. 10,11 The inclusion of collider(s) in a PS model and subsequent stratification or regression adjustment using this PS in such a time-varying analysis of treatment and covariates could lead to similar bias, although it is not specifically assessed in clinical research and its impact is uncertain. 13 For example, in the causal diagram presented in Figure 1, we consider the PS to be a summary of time-dependent confounders. It can be speculated that the time-varying propensity score at time t (PS_t), which includes long-acting beta2-agonist (LABA) and anticholinergic use that are proxies for severity of COPD, is predicted by previous SABA use (SABA_{t-1}) and itself predicts subsequent SABA use (SABA_t) and the risk of coronary heart disease (CHD_t). Hence, PS_t is a collider (a summary of one or more colliders and confounders) on the causal path from previous SABA use (SABA_{t-1}) to CHD_t through PS_t (SABA_{t-1} → PS_t → SABA_t → CHD_t).
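The general mechanism, that conditioning on a common effect of two otherwise independent causes induces an association between them, can be illustrated with a small simulation. The Python sketch below is hypothetical throughout: `a_prev` stands in for previous SABA use, `u` for an unmeasured factor, and `c` for a collider such as a time-dependent PS stratum; the probabilities are arbitrary and are not estimated from the study data.

```python
import random


def simulate(n=20000, seed=1):
    """Simulate two independent binary causes and their common effect.

    a_prev: previous exposure (stands in for SABA use at t-1)
    u:      unmeasured factor
    c:      collider, more likely when either cause is present
    All probabilities are hypothetical.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        a_prev = int(rng.random() < 0.5)
        u = int(rng.random() < 0.5)
        c = int(rng.random() < 0.1 + 0.4 * a_prev + 0.4 * u)
        data.append((a_prev, u, c))
    return data


def risk_difference(data):
    """P(u = 1 | a_prev = 1) - P(u = 1 | a_prev = 0)."""
    exposed = [u for a, u, _ in data if a == 1]
    unexposed = [u for a, u, _ in data if a == 0]
    return sum(exposed) / len(exposed) - sum(unexposed) / len(unexposed)


data = simulate()
crude = risk_difference(data)
within_stratum = risk_difference([row for row in data if row[2] == 1])
print(f"crude: {crude:.3f}; within collider stratum: {within_stratum:.3f}")
```

In the full sample the exposure carries essentially no information about the unmeasured factor, whereas within a stratum of the collider the two become associated. If that factor also affects the outcome, the stratified exposure-outcome comparison is biased.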
By conditioning on the time-dependent PS, a spurious path will be opened from the exposure (SABA_t) to the outcome (CHD_t) via the unmeasured common causes (U) of confounders and outcome (SABA_t ← PS_t ← U → CHD_t), hence inducing collider-stratification bias.

Figure 1. Directed acyclic graphs (DAGs) representing possible causal associations among a summary of time-varying potential confounders (PS), inhaled beta2-agonist use (SABA), coronary heart disease (CHD) and potentially unmeasured factors (U). t, current time; t-1, previous time. Potential confounders at time t (PS_t) are independent prognostic factors for coronary heart disease (CHD_t), predictors of subsequent inhaled beta2-agonist use (SABA_t) and intermediates on the path SABA_{t-1} → PS_t → CHD_t.

It will

also adjust away the (indirect) effect of previous SABA use (SABA_{t-1}) via later confounders (PS_t), which are also intermediates, resulting in a biased estimate of the net effect of SABA use on CHD. 9,10

The objective of this study was to illustrate collider-stratification bias associated with time-dependent PS methods in a cohort of patients using inhaled short- and long-acting beta2-agonists (SABA and LABA) and the risk of non-fatal coronary heart disease (CHD). Effect estimates from time-dependent PS methods were compared with those of Robins' MSMs 6-8 to control for time-varying confounders. The example of inhaled beta2-agonist use and the risk of CHD was chosen because of the contradicting results in the literature. In the literature, explanations for these discrepancies included lack of statistical adjustment for confounders such as severity of COPD, 19 among others. However, severity of COPD (or its proxies such as LABA or anticholinergic use) could be a time-dependent confounder of the causal association between SABA use and CHD; hence, conditioning on it could lead to collider-stratification bias.

METHODS

Data, Design and Study Population
As an illustrative example, we used data from the Netherlands University Medical Centre Utrecht General Practitioner Research Network. This is a computerized medical database that includes cumulative information on approximately 60,000 patients. Medical diagnoses are registered according to the International Classification of Primary Care (ICPC) system. Prescriptions are registered using the Anatomical Therapeutic Chemical (ATC) classification system. We used information from the period [...]. A cohort study was conducted in adults with an indication for inhaled beta2-agonist use (patients with a diagnosis of incident bronchitis, asthma, or COPD: ICPC codes R78/R91, R96, and R95, respectively).
To obtain detailed information on baseline characteristics, only patients experiencing incident bronchitis, asthma or COPD from 01 April 1995 onwards were included. Follow-up began on the first day of diagnosis of bronchitis, asthma, or COPD and ended at the occurrence of either non-fatal CHD, death, loss to follow-up (unregistered with the GP), or end of the study (31 December 2005), whichever occurred first. Patients were excluded if they had any history of myocardial infarction (ICPC code K75) or angina pectoris (ICPC code K74) prior to or at the start of follow-up.

Outcome, Exposure, and Confounder Assessment
The primary outcome was defined as the first diagnosis of non-fatal myocardial infarction (MI; ICPC code K75) or angina pectoris (AP; ICPC code K74) and is referred to as coronary heart disease (CHD) in the rest of the manuscript. If both events were observed in the same patient, the earlier date of diagnosis was considered. Patients who died (possibly due to a fatal MI) were first excluded from the analysis because cause of death was not routinely

registered, 19 and later included as censored observations or combined with the end point (thus as events) in two separate sensitivity analyses to check whether the exclusion had any impact on the effect estimate.

Using data on prescription dispensing dates, we ascertained exposure status (inhaled beta2-agonist use) for every patient in terms of binary indicators in each three-month interval. A patient was considered a user (exposed) if he/she had filled at least one inhaled beta2-agonist prescription in the three-month interval (and a nonuser or unexposed otherwise). The choice of a three-month interval was based on the fact that Dutch health insurance policies cover the dispensing of the majority of drugs for three months. 19 In this study, we considered both inhaled short-acting beta-agonist (SABA) use [names (ATC codes): Salbutamol (R03AC02), Terbutaline (R03AC03), Fenoterol (R03AC04), Rimiterol (R03AC05), Fenoterol and other drugs for obstructive airway diseases (R03AK03) or Salbutamol and other drugs for obstructive airway diseases (R03AK04)] and long-acting beta-agonist (LABA) use [names (ATC codes): Salmeterol (R03AC12), Formoterol (R03AC13), Salmeterol and other drugs for obstructive airway diseases (R03AK06) or Formoterol and other drugs for obstructive airway diseases (R03AK07)] as time-varying treatment in separate analyses. To evaluate this treatment classification, a standard risk-set analysis was performed whereby a risk-set was constructed each time an event (CHD) occurred. At each of those time-points, treatment status as well as covariate values were ascertained for all patients at risk in the cohort. More details on the risk-set approach are included in the Appendix.

Information on potential confounders was available at baseline and during follow-up.
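The exposure definition used here (at least one dispensing within a three-month interval) can be sketched in Python as follows. This is only an illustration under stated assumptions: a three-month interval is approximated as 91 days counted from the start of follow-up, and the function name and the dates are hypothetical rather than taken from the study database.

```python
from datetime import date, timedelta

INTERVAL_DAYS = 91  # assumption: three months approximated as 91 days


def exposure_by_interval(start, end, dispensing_dates,
                         interval_days=INTERVAL_DAYS):
    """Split follow-up [start, end] into consecutive intervals.

    Returns a list of (interval_start, interval_end, exposed) tuples, where
    exposed is 1 if at least one dispensing date falls in the interval.
    """
    rows = []
    lo = start
    while lo <= end:
        hi = min(lo + timedelta(days=interval_days - 1), end)
        exposed = int(any(lo <= d <= hi for d in dispensing_dates))
        rows.append((lo, hi, exposed))
        lo = hi + timedelta(days=1)
    return rows


# Hypothetical patient: one dispensing in the first interval only.
rows = exposure_by_interval(
    start=date(2000, 1, 1),
    end=date(2000, 6, 30),
    dispensing_dates=[date(2000, 2, 10)],
)
for interval_start, interval_end, exposed in rows:
    print(interval_start, interval_end, exposed)
```

Calendar-quarter boundaries would be an equally defensible choice; the fixed-length interval is used only to keep the sketch short.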
The following potential confounders were available for analysis: age, gender, cardiovascular disease status (hypertension, heart failure, atrial fibrillation, paroxysmal tachycardia, cardiac arrhythmia, heart/atrial murmurs, pulmonary heart disease, heart valve diseases, other heart disease), presence of COPD, diabetes, inhalation glucocorticoids, anticholinergics, systemic corticosteroids, cardiovascular drugs (antithrombotic drugs, cardiac therapy, diuretics, agents acting on the renin-angiotensin system), beta-blockers, statins, antidiabetics, and SABA and LABA use. In addition, cardiovascular drugs were pooled into a single binary variable indicating cardiovascular medication use. For chronic diseases such as COPD and diabetes, patients were classified as having the disease from the first date of diagnosis through follow-up.

Analysis
We conducted three sets of analyses. In the first set of analyses, treatment was considered time-varying over the three-month intervals and all other covariates were considered constant from baseline onwards. In this case, the PS was estimated as the probability of inhaled SABA/LABA use in one or more three-month intervals in the follow-up period (i.e., ever vs. never use) conditional on observed baseline covariates. Hence, the PS was considered constant during follow-up. Then, treatment effects were estimated using the PS as a covariate and as a stratifying variable in a Cox model. In addition, the IPTW approach

was used, in which the estimated PS was used to assign weights to all observations (person-times). This weighting creates an altered composition of the study population in which the probability of receiving SABA/LABA at each three-month interval is unrelated to confounders. 6,9 The weight for each patient was the inverse of the probability that the patient had the treatment that he or she actually received. Hence, the weight for treated observations was 1/PS and for untreated observations 1/(1-PS). Finally, a marginal Cox model was fitted on the altered study population using inhaled SABA/LABA use as the only covariate. The marginal frequency of SABA/LABA use was used in the numerator of the IPTW (instead of 1.0) to stabilize the weights. 6-8 In both PS and IPTW approaches, the outcome model included a time-varying binary variable for treatment, which indicated treatment status during a three-month interval. This approach adjusted for baseline confounders in the presence of time-varying treatment; hence, collider-stratification bias is not an issue here, but inadequate adjustment for confounding may invalidate the results.

Second, both treatment and covariates were considered time-varying at intervals of three months. In this approach, the data were restructured such that the longitudinal information of each patient was split up into person-moments of three-month intervals, which included start and end dates, exposure status during that period, as well as covariates and censoring or event status (indicator values, i.e. yes = 1 or no = 0, were used for covariate, treatment, censoring and event status in each three-month interval). Both exposure and covariates were assumed to be constant during the intervals, and covariate histories were actualized so that their values are temporally prior to treatment. Crude and adjusted risks of CHD associated with inhaled SABA or LABA use were estimated using a (multivariable) Cox proportional hazards model.
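The inverse probability of treatment weights used in the first set of analyses can be written out in a few lines. The Python sketch below is illustrative only: it assumes that the stabilizing numerator is the marginal frequency of use for treated observations and the marginal frequency of non-use for untreated observations, and the function name and the toy propensity scores are hypothetical.

```python
def iptw_weights(treated, ps, stabilized=True):
    """Inverse probability of treatment weights for binary treatment.

    treated: list of 0/1 treatment indicators
    ps:      list of estimated propensity scores, each strictly in (0, 1)
    Unstabilized: 1/PS (treated), 1/(1-PS) (untreated).
    Stabilized:   numerator replaced by the marginal frequency of the
                  treatment actually received (an assumption of this sketch).
    """
    if len(treated) != len(ps):
        raise ValueError("treated and ps must have the same length")
    p_treat = sum(treated) / len(treated)
    weights = []
    for a, p in zip(treated, ps):
        if not 0 < p < 1:
            raise ValueError("propensity scores must lie strictly in (0, 1)")
        if a == 1:
            numerator = p_treat if stabilized else 1.0
            weights.append(numerator / p)
        else:
            numerator = (1 - p_treat) if stabilized else 1.0
            weights.append(numerator / (1 - p))
    return weights


# Hypothetical treatment indicators and propensity scores.
treated = [1, 0, 1, 0]
ps = [0.8, 0.2, 0.5, 0.5]
print(iptw_weights(treated, ps, stabilized=False))
print(iptw_weights(treated, ps, stabilized=True))
```

Stabilization leaves the relative weighting within each treatment group unchanged but shrinks the overall spread of the weights, which is the usual motivation for preferring it.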
In addition, the propensity for inhaled SABA/LABA use was estimated for each three-month interval by fitting a logistic regression model that included the thirteen demographic and clinical variables listed in Table 1. The PS was defined as the probability of exposure to inhaled SABA/LABA during a three-month interval, conditional on observed covariates and exposure in the previous three-month interval. Hence, for each patient the PS can differ between consecutive three-month intervals. Then, treatment effects were estimated by fitting a Cox model that either included the PS as a continuous covariate or stratified person-times on quintiles or deciles of the PS. Separate analyses were conducted for SABA and LABA use, and we only considered a simple PS model without any interaction or higher-order terms. These methods could lead to biased estimates in the presence of time-dependent confounders that are affected by prior treatment. As a sensitivity analysis to assess the impact of using the three-month interval approach for classification of treatment status and potential confounders on the effect estimate, a risk-set analysis was performed. More details on the risk-set analysis can be found in the Appendix.

Third, similar to the previous set of analyses, both treatment and covariates were considered time-varying at three-month intervals, but marginal structural models were used. Two MSMs (one with only treatment weights and the second with combined treatment and censoring weights) were fitted. Stabilized treatment weights (sw_i) and censoring weights (cw_i) were

calculated using the method described by Hernán et al. 7 The inverse probability of treatment weight at each time t was defined as

$$
sw_i(t) = \prod_{k=0}^{t} \frac{\Pr\left(A(k) = a_i(k) \mid \bar{A}(k-1) = \bar{a}_i(k-1)\right)}{\Pr\left(A(k) = a_i(k) \mid \bar{A}(k-1) = \bar{a}_i(k-1),\ \bar{L}(k) = \bar{l}_i(k)\right)} \tag{1}
$$

where the numerator and denominator represent the probability of SABA/LABA use ($A(k)$) for each patient $i$ at each three-month interval $k$ ($A(k) = a_i(k)$) given previous SABA or LABA use, $\bar{A}(k-1)$, without and with conditioning on time-varying covariates, $\bar{L}(k)$, respectively. Inverse probability of censoring weights were estimated in the same way, except that the numerator and denominator represent the probability of remaining uncensored ($C(k) = 0$) up to time $t$ given past SABA/LABA use, $\bar{A}(k-1)$, without and with conditioning on time-varying covariates, $\bar{L}(k)$, respectively:

$$
cw_i(t) = \prod_{k=0}^{\mathrm{int}(t)} \frac{\Pr\left(C(k) = 0 \mid \bar{C}(k-1) = 0,\ \bar{A}(k-1) = \bar{a}_i(k-1)\right)}{\Pr\left(C(k) = 0 \mid \bar{C}(k-1) = 0,\ \bar{A}(k-1) = \bar{a}_i(k-1),\ \bar{L}(k) = \bar{l}_i(k)\right)} \tag{2}
$$

Separate logistic regression models were fitted for the numerator and denominator. Treatment and censoring weights were then multiplied to obtain the overall weight in each three-month interval, $sw_i(t) \times cw_i(t)$. Informally, the denominator of $sw_i(t) \times cw_i(t)$ is the probability that a subject had the observed history of SABA/LABA use and censoring up to time interval $t$. All analyses were performed in R, version […], and correlation between observations was taken into account in both PS and Cox analyses using the cluster function. MSMs theoretically do not suffer from collider-stratification bias, since the confounding effect of time-dependent confounders that are affected by prior treatment is controlled by weighting instead of conditioning. In all analytic methods, we assumed exchangeability (i.e.
no unobserved confounding and no informative censoring), consistency (an individual's potential outcome under his/her observed treatment history is precisely his/her observed outcome), positivity (i.e. at every level of the confounders, individuals in the population have a non-zero probability of receiving every level of treatment, which implies that the average causal effect of the treatment can be estimated in each subset of the population defined by the confounders), and correct model specification. 1, 6, 7, 23 For further details on these assumptions, we refer to the literature.

RESULTS

In total, 8,099 patients met the inclusion criteria specified in the methods section, and data on these subjects were used for analysis. A total of 337 (4.2%) patients experienced CHD during a mean follow-up of 4.5 years. Males comprised 42.8% of the cohort and the mean

age at start of follow-up was 49.6 (SD = 19.1) years. At some point in time during follow-up, 31% and 15.6% of the patients used inhaled SABA and LABA, respectively. Baseline characteristics of patients included in the analysis are summarized in Table 1.

Table 1. Baseline Characteristics of Patients by Beta2-agonist Use Through Follow-up

| Characteristics | SABA: Ever Users (N=3160) | SABA: Never Users (N=4939) | LABA: Ever Users (N=1264) | LABA: Never Users (N=6835) |
|---|---|---|---|---|
| Mean (SD) age in years | 44.6 (17.8) | 52.8 (19.2)* | 51.9 (18.4) | 49.2 (19.2)* |
| Male gender | 1297 (41.0) | 2169 (43.9) | 579 (45.8) | 2887 (42.2) |
| Co-morbidities | | | | |
| COPD | 662 (20.9) | 556 (11.3)* | 595 (47.1) | 623 (9.1)* |
| DM | 241 (7.6) | 711 (14.4) | 114 (9.0) | 597 (8.7) |
| CVD** | 835 (26.4) | 1729 (35.0) | 474 (37.5) | 2090 (30.6)* |
| Co-medications | | | | |
| Anti-diabetics | 191 (6.0) | 410 (8.3)* | 103 (8.1) | 498 (7.3) |
| CV medications | 951 (30.1) | 1865 (37.8)* | 573 (45.3) | 2243 (32.8)* |
| Beta-blockers | 571 (18.1) | 1170 (23.7)* | 267 (21.1) | 1474 (21.6) |
| Statins | 90 (2.8) | 229 (4.6)* | 38 (3.0) | 281 (4.1) |
| Corticosteroids | 451 (14.3) | 558 (11.3)* | 249 (19.7) | 760 (11.1)* |
| Anticholinergics | 709 (22.4) | 945 (19.1)* | 579 (45.8) | 1075 (15.7)* |
| Glucocorticoids | 1916 (60.6) | 1025 (20.8)* | 835 (67.5) | 2088 (30.5)* |
| SABA | | | 834 (66.0) | 2326 (34.0)* |
| LABA | 834 (26.4) | 430 (8.7)* | | |

*P value < 0.05; P values were calculated using the t-test for the continuous variable (age) and the chi-square test for categorical variables. All variables are expressed as number of patients (percentage) except for age.
**CVD: cardiovascular diseases (hypertension, heart failure, atrial fibrillation, cardiac arrhythmia, paroxysmal tachycardia, heart/atrial murmurs, pulmonary heart disease, heart valve diseases, other heart disease). CV medications: antithrombotic drugs, cardiac therapy, diuretics, and agents acting on the renin-angiotensin system.
Table 2 shows the results of the different PS and multivariable Cox analyses when only treatment was considered time-varying in three-month intervals and adjustment was made for covariates only at baseline (time-fixed covariates). There was sufficient overlap in the PS distribution between treated and untreated subjects, except in the lower and upper quintiles or deciles of the PS (data not shown). Results from PS methods and multivariable Cox models were comparable (Table 2).

Table 2. Adjusted Estimates of Hazard Ratio for CHD Associated With Use of Inhaled SABA and LABA Using Different PS (at Baseline) Methods With the Three-Month Interval Approach

| Methods | SABA use: HR | SABA use: 95% CI | LABA use: HR | LABA use: 95% CI |
|---|---|---|---|---|
| PS stratification: quintiles of PS* | … | … | … | … |
| PS stratification: deciles of PS** | … | … | … | …, 1.76 |
| PS covariate adjustment*** | … | … | … | …, 1.70 |
| IPTW† | … | … | … | …, 2.35 |
| Multivariable Cox model | … | … | … | …, 1.72 |

PS was estimated as the probability of SABA/LABA use in one or more of the three-month intervals during follow-up given patient characteristics only at baseline. In all cases, covariates (PS) and weights were considered constant over time (time-fixed covariates).
*Stratification based on quintiles of PS in the Cox model.
**Stratification based on deciles of PS in the Cox model.
***PS were included as covariate in the Cox model.
†PS were used to assign (stabilized) weights in the marginal Cox model.

Table 3 shows the crude and adjusted hazard ratios (HRs) for CHD associated with the use of inhaled SABA and LABA, when treatment and confounders were defined in the three-month intervals and adjustment was made for time-dependent confounders. The crude HR for inhaled SABA use was close to unity and not significant (HR: 0.90 [95% CI: 0.63, 1.28]), but for inhaled LABA use it was significant (HR: 1.55 [95% CI: 1.06, 2.26]). Once age was included in the model, further adjustments for other covariates in SABA use did not materially alter the HR. Similar results were obtained when treatment classification was based on the risk-set approach (crude HR: 0.91 [95% CI: 0.56, 1.38] vs. 0.90 [95% CI: 0.63, 1.28] for inhaled SABA and 1.74 [95% CI: 1.13, 2.66] vs. 1.55 [95% CI: 1.06, 2.26] for inhaled LABA). Additional results from the risk-set approach are included in the Appendix.

Table 3. (Un)adjusted Estimates of Hazard Ratio (HR) for CHD Associated With Use of Inhaled SABA and LABA Using the Three-Month Interval (Exposure Classification) Approach

| Adjusted for | SABA use: HR | SABA use: 95% CI | LABA use: HR | LABA use: 95% CI |
|---|---|---|---|---|
| None (crude) | 0.90 | 0.63, 1.28 | 1.55 | 1.06, 2.26 |
| Age | … | … | … | …, 1.92 |
| Age + Gender | … | … | … | …, 1.82 |
| Age + Gender + CVD | … | … | … | …, 1.81 |
| Age + Gender + CVD + DM | … | … | … | …, 1.81 |
| Age + Gender + CVD + DM + COPD | … | … | … | …, 1.59 |
| Fully adjusted model* | … | … | … | …, 1.43 |

In all cases, covariates were considered time-varying. Abbreviations: HR, hazard ratio; CI, confidence interval.
*Confounders included in the model: age, gender, CVD, DM, COPD, inhalation glucocorticoids, anticholinergics, systemic corticosteroids, cardiovascular medications, beta-blockers, statins, antidiabetics, previous SABA/LABA use, and LABA in case of SABA use / SABA in case of LABA use.

Table 4 shows the results of the different time-dependent PS-based Cox analyses and MSMs using the three-month interval approach. There was good overlap in the PS distribution between treated and untreated patients (data not shown). The means of the stabilized weights for both treatment and censoring were close to one for both SABA and LABA use. The stabilized treatment weights ranged from 0.02 to 8.79 for SABA use and from 0.12 to 3.79 for LABA use. Similarly, the stabilized weight for censoring ranged from 0.29 to […]. The adjusted HR for CHD was 1.07 [95% CI: 0.72, 1.60] and 1.13 [95% CI: 0.76, 1.67] on quintile stratification of the PS in inhaled SABA and LABA users, respectively. There was no difference in the estimated effect when deciles of the PS were used instead of quintiles. Effect estimates from covariate adjustment using the PS were similar to those from quintile stratification on the PS (HR: 1.09 [95% CI: 0.74, 1.61] vs. 1.07 [95% CI: 0.72, 1.60] for inhaled SABA, and 1.09 [95% CI: 0.74, 1.62] vs. 1.13 [95% CI: 0.76, 1.67] for inhaled LABA). Estimates from MSMs using combined treatment and censoring weights were lower than those using only treatment weights (HR 0.86 [0.55, 1.34] vs. […] [0.60, 1.41] for inhaled SABA and 0.77 [0.45, 1.33] vs. […] [0.53, 1.50] for inhaled LABA, respectively). Inclusion of patients who died (possibly due to a fatal MI) did not affect the result (data not shown).

Table 4. Estimates of Hazard Ratio for CHD Associated With Use of Inhaled SABA and LABA Using Different Time-dependent PS Methods and MSMs With the Three-Month Interval Approach

| Methods | SABA use: HR | SABA use: 95% CI | LABA use: HR | LABA use: 95% CI |
|---|---|---|---|---|
| PS stratification: quintiles of PS* | 1.07 | 0.72, 1.60 | 1.13 | 0.76, 1.67 |
| PS stratification: deciles of PS** | … | … | … | …, 1.57 |
| PS covariate adjustment*** | 1.09 | 0.74, 1.61 | 1.09 | 0.74, 1.62 |
| MSMs-model (stabilized treatment weights only) | … | 0.60, 1.41 | … | 0.53, 1.50 |
| MSMs-model (stabilized treatment and censoring weights) | 0.86 | 0.55, 1.34 | 0.77 | 0.45, 1.33 |

PS was estimated as the probability of SABA/LABA use in each of the three-month intervals during follow-up given patient characteristics in the previous three-month interval. In all cases, covariates except gender (PS) and weights were considered time-varying.
*Stratification based on quintiles of PS in the Cox model.
**Stratification based on deciles of PS in the Cox model.
***PS were included as covariate in the Cox model.

DISCUSSION

Our goal was to illustrate the collider-stratification and confounding bias associated with the use of time-dependent PS methods for the analysis of time-varying treatment in observational data in the presence of time-dependent confounders. In empirical data, effect estimates from time-dependent PS methods and MSMs of the association between inhaled SABA/LABA and the risk of CHD were different, suggesting an impact of conditioning on time-dependent confounders that are also affected by prior treatment. Substantial confounding of the association between SABA use and CHD could not be demonstrated in our data set, as shown by similar effect estimates from crude and multivariable Cox models, time-varying PS methods and MSMs. However, there are important differences to note for LABA use and the risk of CHD. Estimates from PS models are higher than those from MSMs and in the opposite direction, although the confidence intervals (CIs) overlapped. These differences could in part be explained by the fact that the time-dependent PS_t (a function of time-varying anticholinergics or time-varying beta2-agonist use other than the treatment of interest, i.e. severity of COPD) is a collider of prior treatment and a possible unobserved risk factor (U) for CHD. Conditioning or stratification on this PS, as in the time-dependent Cox model, may induce collider-stratification bias and also adjust away the (indirect) effect of previous treatment via time-dependent confounders (PS_t), which are also intermediates. 9,10 Another possible explanation for the differences in treatment effect estimates is non-collapsibility 24,25 of the HR. The conditional treatment effect estimate from Cox models that include the PS could be different from the marginal treatment effect from MSMs.
However, the impact of non-collapsibility in our study is probably limited since the incidence of the outcome during follow-up was relatively low (i.e., 4.2%). 26 In addition, PS methods give, in general, treatment effect estimates that are closer to the true marginal treatment effect than a conventional regression model in which all confounders are separately included in the adjusted model. 26 It could be argued that estimates obtained by conditioning or stratification on time-dependent confounders or PS_t represent the direct effect of SABA or LABA use (SABA_{t-1}) on CHD_t and only adjust away its indirect effect through intermediates (PS_t). However, this does not hold true in the presence of unmeasured common causes of confounders (PS_t) and outcome (CHD_t), even in the absence of unmeasured confounding of SABA_t use and CHD_t, 1, 6, 7, 27, 28 which is the underlying basic assumption in both PS methods and MSMs. In our empirical study, we could not assess whether confounding or collider stratification contributed more to the bias. Bias induced by adjusting for a collider could be comparable in size or could result in estimates in the opposite direction from the true effect, thereby altering conclusions and not just the strength of an association. 10, 11 Both for SABA and LABA use, estimates from multivariable models were closer to those from MSMs than estimates from PS methods, which is in line with findings from simulation studies of point treatment settings. Furthermore, we used only a simple PS model (all observed confounders included, without interactions or higher

order terms) which may not result in the optimal balance of covariates between treatment groups. Thus, both confounding and collider stratification may bias the observed effect estimates. Another possible source of bias is violation of the assumptions underlying our analyses, namely that the outcome model and the PS models for both treatment and censoring are correctly specified. Furthermore, if loss to follow-up was related to treatment (beta2-agonist use) and outcome (CHD), results from time-dependent PS methods would be more biased due to selective loss to follow-up. However, including censoring weights in MSMs can help restore one of the untestable exchangeability assumptions of the standard Cox model, namely that censoring is non-informative, again under the assumptions of no unmeasured confounding for treatment and of censoring depending only on observed patient characteristics. 6, 7 In our study, censoring could be non-random since results from MSMs fitted with only treatment weights differed from those fitted with both treatment and censoring weights. However, it is difficult to draw general conclusions from a single study. In both cases, we used stabilized weights to normalize the range of these inverse probabilities and increase the efficiency of the analysis. 6, 33, 34 The convergence of the mean of the stabilized weights to unity and the overlap (common support) of the PS of the two treatment groups provide indirect support that the positivity assumption holds in our example. We did not consider weight truncation to reduce the effect of influential observations and the variance of the treatment effect estimate 23, 35 since it could introduce residual confounding. Our study has both strengths and limitations. We think that the three-month interval approach for treatment ascertainment in this study minimizes treatment misclassification.
A similar treatment ascertainment approach was also used in a US case-control study that reported an increased risk of unstable angina or myocardial infarction associated with beta2-agonist use 16 and in a Dutch case-control study that indicated no increased risk to users. 19 Moreover, results were similar when compared with the risk-set approach, which does not restrict the time axis to discrete three-month intervals. Nonetheless, residual treatment misclassification seems likely, since we used computerized records of prescriptions, which may not reflect actual patient adherence. In addition, misclassification of clinical end points might be possible since the diagnosis of CHD (MI) is usually made at hospitals and we used only GP data. Although our study addressed several potential confounders in a time-varying pattern, we did not have complete information on important potential confounders such as body mass index, smoking, and severity of co-morbidities, which may still bias the estimated effect. Notice that these potential confounders are very likely to change over time (i.e., could be time-varying) and they may be affected by prior treatment (e.g., severity of COPD may be affected by LABA use). Hence, residual confounding due to unmeasured (time-dependent) confounders may still remain. However, the aim of this study was only to illustrate the use of time-dependent PS methods and their potential consequences, and not to answer the clinical research question of the effects of beta-agonist use on the risk of CHD. Therefore, the results should be interpreted with caution.

In conclusion, in the presence of time-varying confounders that are affected by prior treatment, the use of time-dependent PS stratification or covariate adjustment, like the conventional time-dependent Cox model, can induce bias through collider stratification and can adjust away the effect of treatment that acts through intermediates. In such settings, other methods such as MSMs are more appropriate.
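The collider-stratification mechanism summarized in this conclusion can be illustrated with a small simulation. This is a hedged sketch rather than an analysis from the study: the variable roles follow the text (prior treatment A, unmeasured risk factor U, time-dependent confounder L as a common effect of both, outcome Y), but every probability below is an invented assumption, treatment is randomised for simplicity, and the true effect of A on Y is set to zero.

```python
import random


def simulate(n=200_000, seed=2024):
    """Simulate (A, L, Y) where A has no effect on Y and L is a collider of A and U."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        a = int(rng.random() < 0.5)                       # prior treatment (randomised here)
        u = int(rng.random() < 0.5)                       # unmeasured risk factor
        l = int(rng.random() < 0.1 + 0.4 * a + 0.4 * u)   # collider: affected by A and U
        y = int(rng.random() < 0.1 + 0.3 * u)             # outcome: depends on U only
        records.append((a, l, y))
    return records


def risk_difference(records):
    """Risk of Y among the treated minus risk of Y among the untreated."""
    treated = [y for a, _, y in records if a == 1]
    untreated = [y for a, _, y in records if a == 0]
    return sum(treated) / len(treated) - sum(untreated) / len(untreated)


records = simulate()
marginal_rd = risk_difference(records)                             # no conditioning on L
stratum_rd = risk_difference([r for r in records if r[1] == 1])    # stratified on L = 1
print(f"marginal risk difference:     {marginal_rd:+.3f}")
print(f"risk difference within L = 1: {stratum_rd:+.3f}")
```

Under these assumed parameters the marginal contrast is close to zero, while the contrast within the stratum L = 1 is clearly negative (about -0.06 in expectation), even though treatment has no effect on the outcome. Stratifying on L makes treated patients less likely to carry the unmeasured risk factor, which is the non-causal association the chapter describes.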

REFERENCES

1. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika 1983; 70:
2. Shah BR, Laupacis A, Hux JE, Austin PC. Propensity score methods gave similar results to traditional regression modeling in observational studies: a systematic review. J Clin Epidemiol 2005; 58:
3. Stürmer T, Joshi M, Glynn RJ, et al. A review of the application of propensity score methods yielded increasing use, advantages in specific settings, but not substantially different estimates compared with conventional multivariable methods. J Clin Epidemiol 2006; 59:
4. Marcus SM, Siddique J, Ten Have TR, et al. Balancing treatment comparisons in longitudinal studies. Psychiatric Annals 2008; 38:
5. Stricker BHC, Stijnen T. Analysis of individual drug use as a time-varying determinant of exposure in prospective population-based cohort studies. Eur J Epidemiol 2010; 25:
6. Robins JM, Hernán MÁ, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology 2000; 11:
7. Hernán MÁ, Brumback B, Robins JM. Marginal structural models to estimate the causal effect of zidovudine on the survival of HIV-positive men. Epidemiology 2000; 11:
8. Robins JM. Marginal structural models. Proceedings of the American Statistical Association, Section on Bayesian Statistical Science. 1997;
9. Robins JM, Greenland S. Identifiability and exchangeability for direct and indirect effects. Epidemiology 1992; 3:
10. Greenland S. Quantifying biases in causal models: classical confounding vs collider-stratification bias. Epidemiology 2003; 14:
11. Whitcomb BW, Schisterman EF, Perkins NJ, Platt RW. Quantification of collider-stratification bias and the birth weight paradox. Paediatr Perinat Epidemiol 2009; 23:
12. Segal JB, Griswold M, Achy-Brou A, et al. Using propensity scores subclassification to estimate effects of longitudinal treatments: an example using a new diabetes medication. Med Care 2007; 45: S149-S
13. Westreich D, Cole SR, Funk MJ, Brookhart MA, Stürmer T. The role of the c-statistic in variable selection for propensity score models. Pharmacoepidemiol Drug Saf 2011; 20:
14. Spitzer WO, Suissa S, Ernst P, et al. The use of β-agonists and the risk of death and near death from asthma. N Engl J Med 1992; 326:
15. Au DH, Curtis JR, Every NR, McDonell MB, Fihn SD. Association between inhaled β-agonists and the risk of unstable angina and myocardial infarction. Chest 2002; 121:
16. Au DH, Lemaitre RN, Randall Curtis J, Smith NL, Psaty BM. The risk of myocardial infarction associated with inhaled beta-adrenoceptor agonists. Am J Respir Crit Care Med 2000; 161:
17. Suissa S, Assimes T, Ernst P. Inhaled short acting β agonist use in COPD and the risk of acute myocardial infarction. Thorax 2003; 58:
18. Salpeter SR, Ormiston TM, Salpeter EE. Cardiovascular effects of β-agonists in patients with asthma and COPD. Chest 2004; 125:
19. De Vries F, Pouwels S, Bracke M, et al. Use of β2 agonists and risk of acute myocardial infarction in patients with hypertension. Br J Clin Pharmacol 2008; 65:
20. Sears MR. Safety of long-acting β-agonists. Chest 2009; 136:
21. Zhang B, de Vries F, Setakis E, van Staa TP. The pattern of risk of myocardial infarction in patients taking asthma medication: a study with the General Practice Research Database. J Hypertens 2009; 27:
22. R Development Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing. Vienna, Austria. 2011; ISBN , URL org/.
23. Cole SR, Hernán MA. Constructing inverse probability weights for marginal structural models. Am J Epidemiol 2008; 168:
24. Greenland S, Robins JM. Identifiability, exchangeability, and epidemiological confounding. Int J Epidemiol 1986; 15:
25. Greenland S, Robins JM, Pearl J. Confounding and collapsibility in causal inference. Stat Sci 1999; 14:
26. Martens EP, Pestman WR, de Boer A, et al. Systematic differences in treatment effect estimates between propensity score methods and logistic regression. Int J Epidemiol 2008; 37:
27. Rosenbaum PR, Rubin DB. Reducing bias in observational studies using subclassification on the propensity score. JASA 1984; 79:
28. Rubin DB. Estimating causal effects from large data sets using propensity scores. Ann Intern Med 1997; 127:
29. Peduzzi P, Concato J, Feinstein AR, Holford TR. Importance of events per independent variable in proportional hazards regression analysis II. Accuracy and precision of regression estimates. J Clin Epidemiol 1995; 48:
30. Peduzzi P, Concato J, Kemper E, Holford TR, Feinstein AR. A simulation study of the number of events per variable in logistic regression analysis. J Clin Epidemiol 1996; 49:
31. Cepeda MS, Boston R, Farrar JT, Strom BL. Comparison of logistic regression versus propensity score when the number of events is low and there are multiple confounders. Am J Epidemiol 2003; 158:
32. Vittinghoff E, McCulloch CE. Relaxing the rule of ten events per variable in logistic and Cox regression. Am J Epidemiol 2007; 165:
33. Kurth T, Walker AM, Glynn RJ, et al. Results of multivariable logistic regression, propensity matching, propensity adjustment, and propensity-based weighting under conditions of nonuniform effect. Am J Epidemiol 2006; 163:
34. Hernán MA, Hernández-Díaz S, Robins JM. A structural approach to selection bias. Epidemiology 2004; 15:
35. Lee BK, Lessler J, Stuart EA. Weight trimming and propensity score weighting. PLoS One 2011; 6: e doi: /journal.pone

Chapter 3.2

Methodological Comparison of Marginal Structural Model, Time-Varying Cox Regression and Propensity Score Methods: the Example of Antidepressant Use and the Risk of Hip Fracture

M. Sanni Ali, Rolf HH Groenwold, Svetlana V Belitser, Patrick C Souverein, Helga Gardarsdottir, Nicole Gatto, Consuelo Huerta, Elisa Martín, Kit C.B. Roes, Arno W Hoes, Antonius de Boer, Olaf H Klungel

Submitted


ABSTRACT

BACKGROUND
Observational studies of (time-varying) treatment are prone to confounding. We compared time-varying Cox regression analysis, propensity score (PS) methods, and marginal structural models (MSM) in a study of antidepressant (selective serotonin reuptake inhibitors, SSRI) use and the risk of hip fracture (HF).

METHODS
A cohort of patients with a first prescription for antidepressants (SSRI or tricyclic antidepressants, TCA) was extracted from the Dutch Mondriaan and Spanish BIFAP general practice (GP) databases for the period […]. Effects of SSRI vs. no SSRI were estimated using time-varying Cox regression, PS stratification and regression analysis, and MSM. In MSM, censoring was accounted for by inverse probability of censoring weights.

RESULTS
The crude HR of SSRI use on HF was 1.75 [95% CI: 1.12, 2.72] in Mondriaan and 2.09 [1.89, 2.32] in BIFAP. After confounding adjustment using time-varying Cox regression, PS stratification and PS regression, HRs increased in Mondriaan: 2.59 [1.63, 4.12], 2.64 [1.63, 4.25], and 2.82 [1.63, 4.25], respectively, and decreased in BIFAP: 1.56 [1.40, 1.73], 1.54 [1.39, 1.71], and 1.61 [1.45, 1.78], respectively. MSMs with stabilized weights yielded HR 2.15 [1.30, 3.55] in Mondriaan and 1.63 [1.28, 2.07] in BIFAP when accounting for censoring, and 2.13 [1.32, 3.45] in Mondriaan and 1.66 [1.30, 2.12] in BIFAP without accounting for censoring.

CONCLUSIONS
In this empirical study, differences between the methods used to control for time-dependent confounding were small. The observed differences in treatment effect estimates between the databases are likely attributable to different confounding information in the datasets, clearly illustrating that adequate information on (time-varying) confounding is crucial to prevent bias.

INTRODUCTION

Antidepressants, and notably selective serotonin-reuptake inhibitors (SSRIs), have been associated with an increased risk of femur or hip fracture. 1,2 In observational studies, patients' exposure to SSRI medication may change over time, i.e., physicians may stop SSRI therapy or switch to other classes of antidepressants due to adverse effects such as sexual dysfunction and drowsiness, and patients may (temporarily) not adhere to the prescribed drug regimen. 3 In addition, the severity of the depression, co-medication use (e.g., benzodiazepines) and the presence of other co-morbidities (e.g., anxiety) might change over time, and these may influence the association between the use of antidepressants and the outcome of interest. As a result, unbiased estimation of the effect of SSRI use on the risk of hip fracture requires that the time-varying nature of both SSRI treatment and confounders is accounted for.

Cox proportional hazards models with time-varying coefficients are capable of addressing the time-varying nature of both treatment and covariates. 4,5 However, when potential confounders are themselves affected by previous SSRI use, the time-varying Cox model can no longer provide unbiased estimates of the treatment effect. 6,7 This is because the time-dependent confounding factors are also intermediates on the causal path from treatment to outcome, and conditioning on such factors will artificially dilute (or adjust away part of) the effect of treatment (SSRI use). 6-9 In addition, when such time-dependent confounders are common effects (i.e., colliders) of previous treatment and unmeasured factors that are also predictors of outcome (here, the risk of hip fracture), the time-varying Cox model 6,7 and propensity score methods that condition on time-dependent confounders (colliders) induce a non-causal association between previous treatment and unmeasured risk factors, thereby introducing collider-stratification bias. 6,7,10 The potential associations between treatment (SSRI use), covariates, and outcome (hip fracture), and the impact of adjustment methods, are depicted using causal diagrams in the Appendix (Appendix 1).

In inverse probability of treatment weighted (IPTW) estimation of marginal structural models, a weight is assigned to each observation that is proportional to the inverse of the probability of the treatment received given time-dependent confounders and previous treatment. These methods provide unbiased estimates of causal treatment effects under the assumptions of (1) exchangeability, i.e. no unmeasured confounding or informative censoring; (2) positivity, i.e.
both treated and untreated subjects exist at each level of the confounders; (3) correct model specification; and (4) consistency, which states that an individual's potential outcome under his or her observed treatment history is precisely his or her observed outcome. 6,7,11 Moreover, MSMs enable investigators to control for bias due to informative censoring, i.e., non-random or systematic loss to follow-up. 6,7

Our primary objective was to assess the sensitivity of the estimated effect of SSRI use on the risk of hip fracture to different approaches to controlling time-dependent confounding, including the time-varying Cox model, propensity score methods, and MSMs using inverse probability of treatment and/or censoring weights. The second objective was to assess whether this sensitivity differed between two observational databases with varying information on covariates.

METHODS

Data Sources and Study Population

The Mondriaan databases include the Netherlands Primary Care Research Database (NPCRD) and the Almere Health Care (AHC) database. 12 The BIFAP (Base de datos para la Investigación Farmacoepidemiológica en Atención Primaria) database, a computerized database of medical records of primary care, is a non-profit research department operated by the Spanish Medicines Agency (AEMPS). 12 The BIFAP database includes clinical and prescription data from around 3.1 million patients covering around 6.8% of the Spanish population. In both

databases, the International Classification in Primary Care (ICPC) is used for coding diagnoses and the Anatomical Therapeutic Chemical classification system (ATC) for coding drugs. Further details of the Mondriaan and BIFAP databases can be found elsewhere. 12

Exposure, Outcome, and Potential Confounders

Only patients who received at least one prescription of antidepressants, either SSRIs or TCAs, were included in this study. For each patient, all prescriptions for an SSRI or TCA were identified and treatment episodes were constructed. A treatment episode was defined as a series of subsequent prescriptions, irrespective of changes in dosage regimen or switching between antidepressants (SSRI or TCA). In BIFAP, the theoretical duration of each prescription was estimated based on the number of tablets prescribed and the prescribed dosage regimen. In Mondriaan, prescription length was set at 90 days as information on the dosage regimen was not available. The choice of the 90-day prescription length was based on the maximum allowed duration of an antidepressant (AD) prescription issued by GPs in the Netherlands. 10,13 Patients were considered to have discontinued therapy if 30 days or more elapsed between the theoretical end date of an SSRI prescription and the subsequent SSRI prescription. In the original cohort, the study was designed in such a way that both SSRI and TCA were exposure variables; however, in the current study, the exposure of interest is SSRI. TCA use, apart from determining inclusion of patients into the cohort, was considered a confounding variable. Exposure was further divided into episodes of current, recent and past use. Current use was defined as the calculated treatment episode plus 30 days after the estimated theoretical end date of the last prescription, to account for carry-over effects. Recent use included the period 1-60 days after the period of current use.
Past use included the period following recent use until a new SSRI prescription was filled or until the end of follow-up. In the cohort for this study, episodes of recent and past use were combined into a reference group, non-current SSRI use. Each patient was followed from the first prescription until the occurrence of the first hip fracture or loss to follow-up (due to deregistration with the GP or death), or until the end of data collection, whichever date came first. Hip fractures were identified by ICPC-2 codes and specific string texts in BIFAP, and by ICPC-2 codes in Mondriaan. 12 Hip fractures (HF) were manually reviewed in BIFAP but not in Mondriaan. Potential confounders (including co-medications and co-morbidities) were measured at baseline and updated whenever patients switched between exposure episodes (Table 1, footnote). When a patient remained in the same exposure episode (current or past use), confounding factors were updated every 6 months (i.e., 182 days). The status of co-medication use as a confounding variable for the current time period was defined as use in the prior 182 days. Only chronic diseases were considered as co-morbidities. Co-morbidities were considered present when recorded during the current period or ever before in the patient history. More details on study design, exposure, confounding, and outcome are available online at the ENCePP e-register of studies.
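The exposure definitions in this section (a 30-day permissible gap, current use extended by 30 days for carry-over, recent use in the following 1-60 days, past use thereafter) can be sketched as below. This is an illustrative reading of the rules and not the study's code: the function names are hypothetical, and measuring the gap from the running end of the merged episode (rather than from each single prescription) is an assumption made to handle overlapping prescriptions.

```python
from datetime import date, timedelta

GAP_DAYS = 30          # 30 days or more without a new prescription ends an episode
CARRY_OVER_DAYS = 30   # current use extends 30 days past the theoretical end date
RECENT_DAYS = 60       # recent use: 1-60 days after the period of current use


def build_episodes(prescriptions):
    """Merge (start_date, duration_days) prescriptions into treatment episodes."""
    episodes = []
    for start, duration in sorted(prescriptions):
        end = start + timedelta(days=duration)  # theoretical end date
        if episodes and (start - episodes[-1][1]).days < GAP_DAYS:
            episodes[-1][1] = max(episodes[-1][1], end)  # continue the running episode
        else:
            episodes.append([start, end])                # discontinuation: new episode
    return [tuple(episode) for episode in episodes]


def classify(day, episodes):
    """Return 'current', 'recent' or 'past' for a day on or after the first prescription."""
    status = "past"
    for start, end in episodes:
        current_end = end + timedelta(days=CARRY_OVER_DAYS)
        if start <= day <= current_end:
            return "current"
        if current_end < day <= current_end + timedelta(days=RECENT_DAYS):
            status = "recent"
    return status


# Hypothetical patient with three 90-day prescriptions (Mondriaan-style fixed length).
episodes = build_episodes([(date(2005, 1, 1), 90),
                           (date(2005, 4, 10), 90),
                           (date(2005, 12, 1), 90)])
for day in (date(2005, 7, 20), date(2005, 9, 1), date(2005, 11, 1), date(2005, 12, 15)):
    print(day, classify(day, episodes))
```

In the analysis described here, the "recent" and "past" categories would then be collapsed into the single reference group of non-current SSRI use.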

Statistical Analysis

Three methods were applied to estimate the risk of hip fracture associated with current use of SSRI: 1) time-varying Cox regression models; 2) propensity score methods; and 3) marginal structural models. Time-varying Cox models were applied with and without adjustment for the demographic and clinical variables listed in Table 1. Two propensity score techniques were applied: regression on the propensity score (i.e., including the PS as an independent variable in the regression model) and propensity score stratification. The propensity score was estimated using ordinary logistic regression including the demographic and clinical variables listed in Table 1. The propensity score was defined as the probability of exposure to SSRI in a specific time period, conditional on measured covariates in the previous time period. Hence, for each patient, the propensity score could change over time (i.e., it was considered time-varying). We considered a propensity score model in which all measured covariates were included as main terms without any interaction or higher-order terms. Propensity score stratification was based on quintiles as well as deciles of the propensity score. The interaction between SSRI use and propensity score strata was tested in order to compare treatment effects between strata. All propensity score methods used a Cox model as the model for effect estimation. Propensity scores were also used to construct inverse probability of treatment weights for the MSM. In the inverse probability of treatment weighting (IPTW) approach, the estimated propensity score was used to assign weights to all observations, resulting in an altered composition of the study population, also referred to as a pseudo-population.
The weight for each patient was the inverse of the probability that the patient received the treatment that (s)he actually received, given a set of time-fixed and time-dependent covariates as well as previous treatment. Hence, the weight for current SSRI users was 1/PS and for non-current SSRI users 1/(1-PS). Finally, a Cox model was fitted in the pseudo-population using current SSRI use as the only covariate. To assess the possible impact of informative censoring, MSMs with and without censoring weights were applied. In addition, stabilized weights were estimated by replacing the numerator of the IPTW by the probability of SSRI use conditional on previous SSRI use. Stabilized treatment weights (STW_i) and censoring weights (SCW_i) were calculated using the method described by Hernán et al. 7 For details on constructing the inverse probability of treatment and censoring weights, we refer to Appendix 2. Additional analyses adjusting for baseline TCA use, benzodiazepine use, and other covariates were also performed. In MSMs, 95% confidence intervals were estimated using 10,000 bootstrap samples. Furthermore, an analysis using (treatment and censoring) weights truncated at the 0.5th and 99.5th percentiles was performed. All analyses were performed in R, version and correlation between observations was taken into account in both propensity score and Cox analyses using the cluster function in R. 15 In all analyses, we assumed exchangeability, consistency, positivity, and correct model specification. For further details on these assumptions, we refer to the literature. 6,7,9,11,16
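As a rough illustration of the weighting and truncation steps just described, the sketch below computes unstabilized and stabilized weights for a single observation and truncates a list of weights at chosen percentiles. It is written in Python for brevity, whereas the analyses in this chapter were run in R. The function names, the example weights, and the nearest-rank percentile rule are assumptions introduced here, not details taken from the study.

```python
import math


def iptw_weight(treated, ps, p_numerator=None):
    """Inverse probability of treatment weight for one observation.

    ps           -- estimated probability of current SSRI use (propensity score)
    p_numerator  -- optional probability of SSRI use given previous SSRI use only;
                    when supplied, the weight is stabilized
    """
    p_received = ps if treated else 1.0 - ps
    if p_numerator is None:
        numerator = 1.0                      # unstabilized: 1/PS or 1/(1-PS)
    else:
        numerator = p_numerator if treated else 1.0 - p_numerator
    return numerator / p_received


def truncate_weights(weights, lower_pct, upper_pct):
    """Reset weights outside the given percentiles to the percentile values.

    Percentiles use the nearest-rank rule, which is an assumption of this sketch.
    """
    ordered = sorted(weights)
    n = len(ordered)

    def nearest_rank(pct):
        rank = max(1, math.ceil(pct * n / 100))
        return ordered[rank - 1]

    low, high = nearest_rank(lower_pct), nearest_rank(upper_pct)
    return [min(max(w, low), high) for w in weights]


print(iptw_weight(True, 0.25))         # current user, PS = 0.25
print(iptw_weight(True, 0.25, 0.5))    # same observation, stabilized
example = [0.1, 0.5, 0.9, 1.0, 1.1, 1.2, 1.5, 2.0, 5.0, 40.0]  # hypothetical weights
print(truncate_weights(example, 10, 90))
```

With the study's thresholds, the call would use `truncate_weights(weights, 0.5, 99.5)`; the wider 10th and 90th percentiles are used above only so that the effect is visible on ten values.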

RESULTS

There were 22,903 new AD users included in the Mondriaan cohort and 251,884 in the BIFAP cohort (Table 1). The mean ages were 49.4 (± 16.27) in Mondriaan and 52.6 (± 17.0) in BIFAP. The proportion of patients using SSRIs was 65.5% in the Mondriaan cohort and 84.8% in the BIFAP cohort. The baseline characteristics of the two cohorts are shown in Table 1.

Table 1. Baseline characteristics of patients stratified by current SSRI treatment status and cohort.

Columns: Mondriaan cohort, N = 22,903 (169,948 person moments): Current SSRI Users, Non-current SSRI Users. BIFAP cohort, N = 251,884 (2,332,487 person moments): Current SSRI Users, Non-current SSRI Users.

Number of person moments* 46, , , 922 1,586,565
Number of cases (femur/hip fracture), N (%) 35 (%) 47 (%) 756 (%) 763 (%)
Age (years), mean ± SD, range 47.1 ± ± ± ±
Males, N (%) 16,009 (34.7) 44,850 (36.2) 181,875 (24.4) 408,568 (25.8)
TCA use, %
Benzodiazepine use, %
Bone related medications**, %
Anti-inflammatory medications***, %
Gastrointestinal medications, %
Cardiovascular morbidities, %
Neurological co-morbidities, %
Respiratory co-morbidities, %
Previous history of fractures, %

* Person moments refer to the number of observation times contributed by the patients in the cohort (a single patient may contribute several person moments).
** Bone related medications: previous use of bisphosphonate or any of the other bone protecting drugs: raloxifene, strontium ranelate, parathyroid hormone, calcium and vitamin D, calcitonin, calcitriol, thyroid hormones, antithyroid drugs.
*** Anti-inflammatory medications: inhaled glucocorticoids, non-steroidal anti-inflammatory drugs (NSAIDs), disease-modifying anti-rheumatic drugs (DMARDs).
Cardiovascular morbidities and medications: antihypertensive drugs (including ACE inhibitors, angiotensin II antagonists, beta blocking agents, calcium channel blockers, other antihypertensives), diuretics, antiarrhythmics, statins, ischaemic heart disease.
Neurological co-morbidities and medications: mental disorders and dementia and/or Alzheimer, seizures, syncope, cerebrovascular disease, malignant neoplasms, medications such as anti-parkinson drugs, antipsychotics/lithium, anticonvulsants, sedating antihistamines.
Respiratory co-morbidities and medications: COPD, bronchodilators (including beta-2-adrenoceptor agonists and anticholinergics).
Previous history of fractures and history of other bone diseases (Paget's disease, osteogenesis imperfecta).
Gastrointestinal related medications and morbidities: proton pump inhibitors, antiemetics (metoclopramide), inflammatory bowel disease, liver disease.

Table 2 shows the crude and adjusted hazard ratios (HRs) for hip fracture associated with current SSRI use. The crude HRs were 1.75 [95% CI: 1.12, 2.72] in Mondriaan and 2.09 [1.89, 2.32] in BIFAP. After adjustment for gender and age using the time-varying Cox model, the HR increased to 2.36 [1.52, 3.69] in Mondriaan, but decreased to 1.52 [1.37, 1.69] in BIFAP. Additional adjustment for baseline TCA use only marginally changed the risk of hip fracture associated with SSRI use: HR 2.59 [1.63, 4.12] in Mondriaan and 1.56 [1.40, 1.73] in BIFAP. In both cohorts, further adjustment for other covariates did not materially alter the HR: when fully adjusted for all covariates in Table 1, the HRs were 2.62 [1.63, 4.19] and 1.52 [1.37, 1.69] in Mondriaan and BIFAP, respectively. As we previously reported, 17 there seems to be an indication of interaction between SSRI use and age in Mondriaan (p-value for the interaction term was 0.11, and the effect of current SSRI use taking the interaction into account was 1.22 [0.42, 3.58]) but not in BIFAP (p-value for the interaction term was 0.77, and the effect of current SSRI use taking the interaction into account was 1.74 [1.56, 1.94]). Table 3 shows the results of the time-dependent propensity score based Cox analyses. Using propensity score adjustment, the adjusted HR for hip fracture of current SSRI use versus non-current SSRI use was 2.82 [95% CI: 1.63, 4.25] and 1.61 [1.45, 1.78] in Mondriaan and BIFAP, respectively. When quintile and decile stratification on the propensity score were used, the HRs were 2.64 [1.63, 4.25] and 2.72 [1.63, 4.54] in Mondriaan, and 1.54 [1.39, 1.71] and 1.53 [1.38, 1.70] in BIFAP, respectively. In the IPTW approach, the means of the stabilized weights for current SSRI use were centered close to unity. The resulting HR was 2.08 [1.33, 3.25] in Mondriaan and 1.73 [1.56, 1.91] in BIFAP.
Table 4 shows results from MSMs with and without accounting for potential informative censoring. The mean (range) of stabilized treatment weights was 0.97 (0.02 to 218) in Mondriaan and 0.96 (0.06 to 110) in BIFAP. Similarly, the mean (range) of stabilized weights for censoring was 0.99 (0.05 to 111) in Mondriaan and 0.98 (0.21 to 8.24) in BIFAP. Estimates from MSMs using combined treatment and censoring weights were similar to those using only treatment weights (HR: 2.15 [95% CI: 1.30, 3.55] versus 2.13 [95% CI: 1.32, 3.45] in Mondriaan and 1.63 [95% CI: 1.28, 2.07] versus 1.66 [95% CI: 1.30, 2.12] in BIFAP, respectively). When stabilized weights were trimmed at the 0.5th and 99.5th percentiles, the treatment effect estimates in Mondriaan changed, particularly when adjustment was made for confounders. On the other hand, weight truncation in BIFAP resulted in improved precision of the effect estimate without a substantial change in the point estimates, even after adjustment was made for additional confounders (Table 5). The ranges of the weights after truncation were narrower: 0 to 80 in Mondriaan and 0.33 to 1.65 in BIFAP. When weights were trimmed at the 2.5th and 97.5th percentiles or at the 1st and 99th percentiles, results were similar in both Mondriaan and BIFAP.

Table 2. Associations Between Current SSRI Use and the Risk of Hip Fracture Using Time-Varying Cox Models

Columns: Adjusted for; Mondriaan HR, 95% CI; BIFAP HR, 95% CI

None (Crude) , , 2.32
Gender , , 2.30
Gender + Age , , 1.68
Gender + Age + TCA t , , 1.73
Gender + Age + TCA t + Benzo , , 1.71
All Confounders* , , 1.69

* Age, gender, TCA t use, benzodiazepine use (Benzo), bone related medications, anti-inflammatory medications, cardiovascular co-morbidities, neurological co-morbidities, respiratory co-morbidities, previous history of fractures, and gastrointestinal medications as listed in the Table 1 footnote.

Table 3. Associations Between Current SSRI Use and the Risk of Hip Fracture Using Propensity Score Based Cox Analyses

Columns: Adjusted for; Mondriaan HR, 95% CI; BIFAP HR, 95% CI

PS adjustment , , 1.78
PS stratification, quintiles , , 1.71
PS stratification, deciles , , 1.70
IPTW (unstabilized) , , 1.83
IPTW (stabilized)* , , 1.91

* These IPTW weights differ from the IPW of the MSM in that they are not cumulative over observation periods.

Table 4. Association Between Current SSRI Use and the Risk of Hip Fracture Using IPW Estimation of Marginal Structural Models

Columns: Adjusted for; Mondriaan HR, 95% CI; BIFAP HR, 95% CI

Not accounting for informative censoring*
None (Crude) , , 2.96
Gender , , 2.94
Gender + Age , , 2.12

Accounting for informative censoring**
None (Crude) , , 2.95
Gender , , 2.93
Gender + Age , , 2.07

* Only inverse probability of treatment weights were used.
** Combined inverse probability of treatment and censoring weights were used.
Both treatment and censoring weights were stabilized.

Table 5. Association Between Current SSRI Use and the Risk of Hip Fracture Using Trimmed IPW Estimation of Marginal Structural Models Without and With Accounting for Censoring

Columns: Adjusted for; Mondriaan HR, 95% CI; BIFAP HR, 95% CI

Not accounting for informative censoring*
None (Crude) , , 2.39
Gender , , 2.38
Gender + Age , , 1.72

Accounting for informative censoring**
None (Crude) , , 2.30
Gender , , 2.28
Gender + Age , , 1.70

* Only inverse probability of treatment weights were used.
** Combined inverse probability of treatment and censoring weights were used.
Both treatment and censoring weights were stabilized and trimmed at the 0.5th and 99.5th percentiles.

Figure 1 summarizes the observed associations between current SSRI use and the risk of hip fracture based on the different time-varying analyses in Mondriaan and BIFAP.

Figure 1. Hazard ratios and 95% confidence intervals for current SSRI use and the risk of hip fracture using different models in Mondriaan and BIFAP. Crude, unadjusted; Cox Reg., time-varying Cox model; PS.Strat, time-varying propensity score stratification (on quintiles of the propensity score); MSM.Iptw, MSMs using inverse probability of treatment weights without trimming; MSM.Iptcw, MSMs using inverse probability of treatment and censoring weights without trimming; MSM.TIptw, MSMs using inverse probability of treatment weights with trimming; MSM.TIptcw, MSMs using inverse probability of treatment and censoring weights with trimming.

DISCUSSION

Our study shows that current use of selective serotonin reuptake inhibitors (SSRIs) is associated with an increased risk of hip fracture in both cohorts (Mondriaan and BIFAP) when compared with non-current SSRI use. In addition, this increased risk was consistently found with different analytical approaches: time-varying Cox models, propensity score methods, and marginal structural models. However, the magnitude of the association clearly differed between Mondriaan and BIFAP. For example, HRs were 2.61 versus 1.52 using a time-varying Cox model and 2.13 versus 1.58 when applying marginal structural models, in Mondriaan and BIFAP respectively. In Mondriaan, estimates from time-varying Cox and propensity score models differed from those of marginal structural models. In BIFAP, on the other hand, estimates from MSMs, time-varying Cox, and propensity score models were similar. Differences in treatment effect estimates between the two cohorts are unlikely to be due to the applied methods, for two reasons. First, the performance of the different methods to control for confounding was very similar within cohorts, but the impact of confounding adjustment was in opposite directions in the two cohorts. Second, substantial effort has been made to harmonize the design of the study, protocol, and data specifications. 12,14 Possible explanations for the observed differences in confounding adjustment across databases include substantial differences in time-dependent confounding 6,7,10 and/or collider-stratification bias, 6,7,10,18 non-collapsibility of the hazard ratio, 10,19,20 differences in the quality of confounder information between the datasets, or the small number of events in Mondriaan leading to unstable estimates.
With regard to the first possible explanation, time-varying Cox models and propensity score models condition on time-dependent covariates (for example, severity of depression) which may also be intermediates in the causal pathway between treatment (current SSRI use) and hip fracture, whereas MSMs reweight the original population without conditioning on such time-dependent covariates. When such time-dependent confounders are affected by unmeasured factors that predict hip fracture risk (for example alcohol consumption, on which we had no information in Mondriaan), the analytic approaches considered, except MSMs, may induce collider-stratification bias. Although collider-stratification bias tends to be a less substantial source of bias than confounding, under certain circumstances it could result in a dramatic change not only in the magnitude of the effect estimate but also in the direction of the effect. 20,21 Regarding the second explanation, the magnitude of non-collapsibility increases with the effect of the covariate on the outcome, the baseline risk, and the strength of the treatment effect (the latter seems more likely in Mondriaan than in BIFAP, since non-collapsibility causes the estimate to move away from the null effect, HR = 1.0). 20,22 However, non-collapsibility often results in an effect estimate away from the null 19,20 and its impact in practice is often difficult to quantify. Both time-dependent confounding and collider-stratification bias, but not non-collapsibility, can best be identified using causal diagrams. 18

Inverse probability weighted estimation of MSMs not only controls for time-dependent confounding without any risk of collider-stratification bias, but can also account for bias due to competing risks (i.e., informative censoring). 6,7 However, the impact of informative censoring seems minimal, if any, in both cohorts (assuming the models for censoring are correctly specified and all predictors of informative censoring were observed). This was demonstrated by comparable treatment effect estimates with and without the use of censoring weights.

Weight truncation reduced the variability of the weights and improved precision, as has been described before. 24 However, treatment effect estimates were sensitive to weight truncation, particularly in Mondriaan, which could in part be due to a strong covariate-treatment association and/or the small number of events in Mondriaan. Although the optimal level of truncation is difficult to determine, it is important to explore the sensitivity of effect estimates and precision to progressive weight truncation. Alternatively, other approaches proposed to deal with violations of the positivity assumption could be employed. 25 Importantly, investigators should focus on the procedures leading to the generation of weights (i.e., proper specification of the propensity score model) rather than relying on ad hoc methods such as weight truncation. 24

We conducted this study in two large cohorts with a reasonably long follow-up time. In addition, detailed information was collected on exposure, co-morbidities, co-medications, and the outcome (hip fracture). However, the level of detail on important information such as co-morbidities limited optimal adjustment and comparison. 12,14,17 Although there might still be unmeasured confounding due to patient characteristics not recorded in both databases, such as body mass index and alcohol consumption, we previously demonstrated that additional adjustment for these variables had limited impact on effect estimates. 17

In conclusion, this study indicates an increased risk of hip fracture associated with current SSRI use, which was consistently observed using different analytical approaches in two large electronic healthcare record databases. Although differences between methods to control for time-dependent confounding were small, relevant differences in treatment effect estimates between the two datasets were observed. These are possibly attributable to different confounder information in the datasets.

REFERENCES

1. Ginzburg R, Rosero E. Risk of fractures with selective serotonin-reuptake inhibitors or tricyclic antidepressants. Ann Pharmacother 2009; 43:
2. Van den Brand MWM, Samson MM, Pouwels S, et al. Use of anti-depressants and the risk of fracture of the hip or femur. Osteoporosis Int 2009; 20:
3. Hu XH, Bull SA, Hunkeler EM, et al. Incidence and duration of side effects and those rated as bothersome with selective serotonin reuptake inhibitor treatment for depression: Patient report versus physician estimate. J Clin Psychiatry 2004; 65:
4. Cox DR. Regression models and life tables. J R Statist Soc Series B 1972; 34:
5. Fisher LD, Lin DY. Time-dependent covariates in the Cox proportional-hazards regression model. Annu Rev Public Health 1999; 20:
6. Robins JM, Hernán MÁ, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology 2000; 11:
7. Hernán MÁ, Brumback B, Robins JM. Marginal structural models to estimate the causal effect of zidovudine on the survival of HIV-positive men. Epidemiology 2000; 11:
8. Robins JM, Greenland S. Identifiability and exchangeability for direct and indirect effects. Epidemiology 1992; 3:
9. Robins JM. Marginal structural models. Proceedings of the American Statistical Association, Section on Bayesian Statistical Science. 1997:
10. Ali MS, Groenwold RHH, Pestman WR, et al. Time-dependent propensity score and collider-stratification bias: An example of beta2-agonist use and the risk of coronary heart disease. Eur J Epidemiol 2013; 28:
11. Cole SR, Hernán MA. Constructing inverse probability weights for marginal structural models. Am J Epidemiol 2008; 168:
12. Abbing-Karahagopian V, Kurz X, de Vries F, et al. Bridging differences in outcomes of pharmacoepidemiological studies: Design and first results of the PROTECT project. Curr Clin Pharmacol 2014; 9:
13. De Vries F, Pouwels S, Bracke M, et al. Use of β2 agonists and risk of acute myocardial infarction in patients with hypertension. Br J Clin Pharmacol 2008; 65:
14. ENCePP Guide on Methodological Standards in Pharmacoepidemiology. EMA/95098/2010. Available at: standardsandguidances. Accessed March 21,
15. R Development Core Team (2012). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN , URL
16. Robins JM. Causal inference from complex longitudinal data. In: Berkane M, eds. Latent variable modeling and applications to causality. New York: Springer-Verlag; 1997:
17. Abbing-Karahagopian V SP, Martin E, Huerta C, et al. Understanding inconsistency in the results from observational pharmacoepidemiological studies: The case of antidepressant use and risk of hip fracture. Submitted
18. Hernán MA, Hernández-Díaz S, Robins JM. A structural approach to selection bias. Epidemiology 2004; 15:
19. Greenland S. Quantifying biases in causal models: Classical confounding vs collider-stratification bias. Epidemiology 2003; 14:
20. Greenland S, Pearl J. Adjustments and their consequences: collapsibility analysis using graphical models. Intl Stat Rev 2011; 79:
21. Whitcomb BW, Schisterman EF, Perkins NJ, Platt RW. Quantification of collider-stratification bias and the birthweight paradox. Paediatr Perinat Epidemiol 2009; 23:
22. Pang M, Kaufman JS, Platt RW. Studying noncollapsibility of the odds ratio with marginal structural and logistic regression models. Stat Methods Med Res 2013; doi: /
23. Greenland S, Pearl J, Robins JM. Causal diagrams for epidemiologic research. Epidemiology 1999; 10:
24. Lee BK, Lessler J, Stuart EA. Weight trimming and propensity score weighting. PLoS One 2011; 6: e
25. Xu S, Shetterly S, Raebel MA, et al. Estimating the effects of time-varying exposures in observational studies using Cox models with stabilized weights adjustment. Pharmacoepidemiol Drug Saf 2014; doi: /pds

Appendices

Appendix 1

Using causal diagrams, both sources of bias (confounding and collider stratification) can be easily identified and described, although the diagrams do not provide any quantitative information regarding causal associations. In Figure 1(a), the directed acyclic graph indicates that both SSRI use and confounders (e.g., benzodiazepine use) are time-varying, but the time-dependent confounder at time t (Benzo_t) is not affected by SSRI use at the previous time point t-1 (SSRI_{t-1}). Hence, conventional time-varying Cox regression or propensity score (PS) methods could be used to adjust for the time-dependent confounder, Benzo_t. In Figure 1(b), the time-dependent confounder, benzodiazepine use at time t (Benzo_t), is affected by previous SSRI use at time t-1 (SSRI_{t-1}); hence, Benzo_t is not only a confounder in the association between SSRI_t and hip fracture (HF) but also an intermediate in the causal pathway from SSRI_{t-1} to HF. In Figure 1(c), Benzo_t is not only a confounder in the association between SSRI_t and HF and an intermediate in the causal pathway from SSRI_{t-1} to HF, as in Figure 1(b), but also shares an unmeasured common cause (U) with the outcome (HF) and is therefore a collider on the path SSRI_{t-1} → Benzo_t ← U → HF. Conditioning on Benzo_t in both Figures 1(b) and 1(c) will block the open path from SSRI_{t-1} to HF, SSRI_{t-1} → Benzo_t → HF. On the other hand, conditioning on Benzo_t in Figure 1(c) will also open the closed path, SSRI_{t-1} → Benzo_t ← U → HF, inducing a non-causal association between SSRI_{t-1} and HF. As a result, treatment effect estimates from conventional time-varying Cox regression or PS methods adjusting for the time-varying confounder (Benzo_t) are biased.
The net effect of SSRI use on hip fracture can be split into its direct effect (indicated by the directed path SSRI_{t-1} → SSRI_t → HF) and its indirect effect mediated by time-dependent confounders that are themselves affected by previous SSRI treatment, here Benzo_t (indicated by the directed path SSRI_{t-1} → Benzo_t → HF). However, time-dependent Cox or propensity score methods conditioning on Benzo_t, indicated by a box on Benzo_t, block the path SSRI_{t-1} → Benzo_t → HF, thereby adjusting away the indirect effect of SSRI treatment on the risk of hip fracture; hence, the net effect estimate is biased. In the presence of a common cause (U) of Benzo_t and HF, as in Figure 1(c), time-dependent Cox or propensity score methods conditioning on the collider, Benzo_t, open the closed path SSRI_{t-1} → Benzo_t ← U → HF, inducing collider-stratification bias. In contrast, marginal structural models (MSMs), whose parameters can be estimated by inverse probability of treatment weighting, do not condition on Benzo_t but remove the paths Benzo_{t-1} → SSRI_{t-1} and Benzo_t → SSRI_t, thereby controlling for confounding without adjusting away the indirect effect or introducing any collider-stratification bias. Furthermore, using inverse probability of censoring weights, one can account for informative censoring in MSMs.
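The path-blocking rules used in this appendix can be made concrete with a small sketch. The Python function below is an illustration only; its name and the list-based encoding of a path are assumptions introduced here. It also simplifies the full d-separation criterion by ignoring conditioning on descendants of colliders.

```python
def path_is_open(nodes, arrows, conditioned=frozenset()):
    """Return True if a single path is open given a conditioning set.

    nodes   -- node names along the path, e.g. ["SSRI_prev", "Benzo", "HF"]
    arrows  -- arrows[i] is "->" or "<-" for the edge between nodes[i] and nodes[i+1]
    A collider blocks the path unless it is conditioned on; any other interior
    node blocks the path when it is conditioned on.
    """
    for i in range(1, len(nodes) - 1):
        is_collider = arrows[i - 1] == "->" and arrows[i] == "<-"
        is_conditioned = nodes[i] in conditioned
        if is_collider != is_conditioned:
            return False
    return True


# Indirect (mediated) path of Figure 1(b): SSRI_{t-1} -> Benzo_t -> HF
mediated = (["SSRI_prev", "Benzo", "HF"], ["->", "->"])
# Collider path of Figure 1(c): SSRI_{t-1} -> Benzo_t <- U -> HF
collider = (["SSRI_prev", "Benzo", "U", "HF"], ["->", "<-", "->"])

for label, (nodes, arrows) in [("mediated", mediated), ("collider", collider)]:
    print(label, path_is_open(nodes, arrows), path_is_open(nodes, arrows, {"Benzo"}))
```

Conditioning on Benzo_t closes the mediated path and opens the collider path, which are the two problems the appendix attributes to conventional time-varying adjustment.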

[Figure 1, panels (a), (b), and (c): nodes Benzo_{t-1}, SSRI_{t-1}, Benzo_t, SSRI_t, and HF, with the unmeasured factor U in panel (c).]

Figure 1. Directed acyclic graphs (DAGs) representing possible causal associations among selective serotonin-reuptake inhibitors (SSRIs), time-varying confounders (such as benzodiazepines), and hip fracture (HF). U, unmeasured factors; t, current time; t-1, previous time. Time-varying confounders at time t (Benzo_t) are independent prognostic factors for hip fracture (HF), predictors of subsequent SSRI use (SSRI_t), intermediates on the path SSRI_{t-1} → Benzo_t → HF, and colliders on the path SSRI_{t-1} → Benzo_t ← U → HF.

Appendix 2

The inverse probability of treatment weight at each time t was defined as

$$STW_i(t) = \prod_{k=0}^{t} \frac{\Pr\left(A(k) = a_i(k) \mid A(k-1) = a_i(k-1)\right)}{\Pr\left(A(k) = a_i(k) \mid A(k-1) = a_i(k-1),\ L(k) = l_i(k)\right)}, \qquad (1)$$

where the numerator represents the probability of current SSRI use for each patient i at time period k ($A(k) = a_i(k)$) conditional on previous SSRI use and baseline covariates (in this case gender and age), and the denominator represents the probability of current SSRI use for each patient i at time period k conditional on previous SSRI use, baseline covariates (gender and age), and time-varying covariates ($L(k) = l_i(k)$). Inverse probability of censoring weights were estimated in the same way, except that both the numerator and the denominator represent the probability of remaining uncensored (C=1) up to time t given previous SSRI use and baseline covariates (gender and age), without and with conditioning on time-varying covariates, respectively:

$$SCW_i(t) = \prod_{k=0}^{t} \frac{\Pr\left(C(k) = 1 \mid C(k-1) = 0,\ A(k-1) = a_i(k-1)\right)}{\Pr\left(C(k) = 1 \mid C(k-1) = 0,\ A(k-1) = a_i(k-1),\ L(k) = l_i(k)\right)}. \qquad (2)$$

Separate pooled logistic regression models were fitted for the numerator and the denominator. Treatment and censoring weights were then multiplied to obtain the overall weight in each time period, $SW_i(t) = STW_i(t) \times SCW_i(t)$. Informally, the denominator of $SW_i(t)$ is the probability that patient i had his/her observed history of current SSRI use and censoring up to time t.
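The cumulative products in equations (1) and (2) can be illustrated numerically. The sketch below is in Python and assumes that the period-specific probabilities have already been predicted from the pooled logistic models; the function name `cumulative_stabilized_weights` and all probabilities shown are hypothetical.

```python
def cumulative_stabilized_weights(numerator_probs, denominator_probs):
    """Running product over periods k = 0..t of numerator/denominator.

    Each list holds, per period, the predicted probability of what was actually
    observed for one patient (treatment received, or remaining uncensored).
    Returns one weight per period, i.e. the weight at every time t.
    """
    weights, running = [], 1.0
    for num, den in zip(numerator_probs, denominator_probs):
        running *= num / den
        weights.append(running)
    return weights


# Hypothetical predictions for one patient over three periods
stw = cumulative_stabilized_weights([0.5, 0.8, 0.8], [0.25, 0.8, 0.4])   # treatment
scw = cumulative_stabilized_weights([0.9, 0.9, 0.9], [0.9, 0.9, 0.45])   # censoring
sw = [t * c for t, c in zip(stw, scw)]                                   # SW_i(t)
print(stw, scw, sw)
```

Because the weights are products over all earlier periods, a single small denominator early in follow-up inflates every later weight for that patient, which is one reason the weight ranges reported in the Results can become very wide.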


CHAPTER IV

PROPENSITY SCORE METHODS AND UNMEASURED CONFOUNDING


CHAPTER 4.1

Propensity Score Methods and Unmeasured Confounding Imbalance

M Sanni Ali, RHH Groenwold, OH Klungel

Health Services Research 2014; 49:


ABSTRACT

In their recent Health Services Research article, Brooks and Ohsfeldt addressed an important topic: the balancing property of the propensity score (PS) with respect to unmeasured covariates. They concluded that PS methods that balance measured covariates between treated and untreated subjects exacerbate imbalance in unmeasured covariates that are unrelated to measured covariates. Furthermore, they emphasized that, for PS algorithms, an imbalance in unmeasured covariates between treated and untreated subjects is a necessary condition for achieving balance in measured covariates between the groups. We argue that these conclusions result from their assumptions about the mechanism of treatment allocation. In addition, we discuss the underlying assumptions of PS methods, their advantages compared with multivariate regression methods, and the interpretation of effect estimates from PS methods.

INTRODUCTION

The use of propensity score (PS) methods in observational studies of medical treatments to adjust for measured confounding has increased substantially during the last decade. 1 The PS is defined as a subject's probability of treatment given his or her pre-treatment characteristics. For groups of subjects with the same PS, the measured covariates that were used to construct the score tend to be balanced across treatment groups. 2 However, unlike with random assignment of treatments in a randomized trial, covariates that were not measured (and thus not included in the PS model) will not necessarily be balanced when conditioning on the PS. Hence, imbalances in unmeasured covariates are not addressed by propensity score methods. In a recent study, Brooks and Ohsfeldt 3 assessed the balancing property of propensity scores with respect to unmeasured covariates, and asked why subjects with the same PS may receive different treatments. Essentially, they stated that for two subjects with the same propensity score, e.g. a propensity score of 0.65 (i.e., a probability of 0.65 of receiving treatment), of whom one received treatment and the other did not, there must be a reason for the difference in treatment. They argued that, apparently, this reason was not included in the propensity score model (and thus consists of unmeasured covariates, potentially confounders). They then concluded that propensity score methods balance measured confounders at the cost of exacerbating any imbalance in unmeasured covariates that are independent of the measured covariates. By extension, if these unmeasured covariates are confounders (related to both treatment and outcome), propensity score methods can exacerbate the bias in treatment effect estimates. We do not agree with their main conclusions, for reasons outlined below. In the following paragraphs, we focus on three topics: (1) the assumptions underlying propensity score methods; (2) the conceptual advantage of propensity score methods over classical regression techniques; and (3) the estimand (treatment effect estimate) obtained using multivariable regression and the different propensity score approaches.

THE ASSUMPTIONS UNDERLYING PS METHODS

Two main assumptions underlying propensity score methods are relevant for interpreting the findings of Brooks and Ohsfeldt 3 : exchangeability and positivity. As described in the original paper by Rosenbaum and Rubin, 2 propensity score methods rely on the assumption of strong ignorability or exchangeability, which is a stronger version of the ignorable mechanisms coined by Rubin. 4, 5 It can be stated formally as $\{Y(0), Y(1)\} \perp T \mid X$, where X denotes a vector of (measured) covariates, T denotes treatment assignment (Yes/No), and Y(0) and Y(1) are the potential outcomes under control and treatment conditions, respectively. It means that, conditional on the (measured) covariates (X), treatment assignment (T) is independent of the potential outcomes (Y(0), Y(1)). Intuitively, this assumption is equivalent to the colloquial notion of all confounders having been measured. 6 This assumption of no systematic, unmeasured, pre-treatment differences between treated and untreated subjects that are related to the outcome under study is needed to obtain an unbiased treatment effect estimate not only with propensity score methods but also with ordinary methods to adjust for confounding (e.g., multivariate regression methods). In a large randomized clinical trial (RCT) where subjects are assigned to treatment or control by the flip of a fair coin, the probability of being assigned to treatment (i.e., the propensity score) is 0.5. This means that approximately half of the subjects will be treated while the other half will not. The randomization implies that, on average, measured as well as unmeasured covariates will be balanced between the two treatment groups. In contrast, in observational studies, treatment assignment is a non-random process.
Nevertheless, propensity score methods help researchers to mimic randomization by creating a sample of subjects receiving the treatment that is comparable on all observed covariates to a sample of subjects not receiving the treatment. 7-9 Hence, randomization in an RCT and mimicking random assignment of treatment in propensity score analysis (e.g. by propensity score matching) are each sufficient to generate groups of subjects with the same propensity score yet receiving different treatment modalities. This does not require a difference in unmeasured covariates. However, whereas randomization implies that unmeasured covariates are balanced as well, propensity score methods do not guarantee balance of unmeasured covariates. Brooks and Ohsfeldt 3 considered a deterministic treatment assignment model rather than a random treatment assignment model given measured and unmeasured covariates. In particular, in their simulation set-up, treatment status depends on a number of covariates. Given that the values of these covariates are known, the treatment status is fixed, i.e., it no longer depends on any random process. Thus, if the treatment status of a subject as well as the values of certain covariates (e.g., three out of four covariates) are known, this will restrict the values of the other covariate(s). In fact, the set-up of the data generation implies that among the treated subjects, those who have low values for certain covariates will have high values for the other covariates (and vice versa). Consequently, by design, balancing part of the covariates between the treatment groups (by propensity score matching) will result in an imbalance in the other covariates. Brooks and Ohsfeldt 3 demonstrated an exacerbation in the imbalance of unmeasured covariates when using propensity score methods to balance measured covariates compared to the full unweighted sample. This finding could in part be explained by the fact that they not only assumed unmeasured covariate variation that is unrelated to measured covariates as a requirement for propensity score methods but also imposed this assumption when generating their data (in particular, in the simulation of treatment status). Moreover, in their simulations, Brooks and Ohsfeldt 3 demonstrated that the exacerbation in imbalance of unmeasured confounders was only detected when the measured and unmeasured covariates are uncorrelated. Hence, if one attempts to measure as many covariates as possible, it seems more likely that risk factors correlated with unmeasured covariates, or proxies for unmeasured covariates, will be included in the propensity score model (for example, by using the high-dimensional propensity score, hdPS, method), which could lessen the exacerbation of the imbalance or even improve the balance of unmeasured covariates.

Apart from the assumption of no unmeasured covariates (confounders), another requirement for identifying a causal effect is that both treated and untreated subjects exist at all levels of the confounders in the population under study, commonly known as the positivity assumption. 13 In terms of the propensity score, this would mean that there is sufficient overlap of the propensity score distributions between treated and untreated subjects. For example, in the case of propensity score matching (where treated and untreated subjects with the same propensity scores are matched), this requirement is definitely met. The absence of sufficient overlap of the propensity score between treatment groups (i.e., violation of the positivity assumption) can increase both the variance and the bias of causal effect estimates. 14 Therefore, processing data, for example by using propensity score matching to generate two groups of subjects with the same propensity score receiving different treatment modalities, assures validity of positivity without requiring systematic differences in an unmeasured covariate that is unrelated to measured covariates.

ADVANTAGES OF PROPENSITY SCORE METHODS

Brooks and Ohsfeldt 3 further claimed that the conceptual advantage of propensity score-based methods relative to standard regression appears to hinge on the assumption that balancing measured covariates between treated and non-treated subjects leads to unmeasured covariate balance between treated and non-treated subjects. This thought is shared by several researchers applying propensity score methods. 1 However, the primary goal of propensity score methods is to create balance of measured covariates between treatment groups, and propensity score methods may not be superior to multivariable regression methods with respect to adjustment for unmeasured covariates. Nonetheless, Rosenbaum argued that propensity score matching can lead to a reduction in both sample variability and the sensitivity of the estimated treatment effect to potential omitted variables. 15 Furthermore, confounding by variables unmeasured in the main study can be addressed using variables measured in another validation study via propensity score calibration.

It is worth mentioning the potential advantages of propensity score methods compared to conventional regression methods, although both methods give similar effect estimates, as demonstrated in most published empirical and simulation studies. 1,16 Propensity score methods provide an effective way of controlling for covariates when the number of events is limited, thereby overcoming the dimensionality problem in which each new balancing covariate introduced into the regression model increases the minimum necessary number of observations (outcome events) in the sample. 17,18 This situation is particularly common in pharmacoepidemiology, where outcomes are often rare relative to the large number of covariates available for estimating (adverse) drug effects. 18 Cepeda et al. proposed a helpful guideline on when PS methods effectively improve estimation (fewer than eight outcomes per included covariate). 17 Propensity score methods in general, and propensity score matching in particular, also help to reduce the dependence of causal inference on hard-to-justify but commonly made statistical modelling assumptions and allow for a simple, transparent analysis. 18,19 In addition, unlike most multivariable regression models, assumptions regarding model specification, variable functional form, normality, and linear projections beyond the observed data are not required, particularly when matching on the propensity score. 18,20

THE TREATMENT EFFECT ESTIMATE (THE ESTIMAND)

When addressing the bias in the treatment effect estimate, Brooks and Ohsfeldt 3 compared effect estimates from different approaches, including regression and several propensity score methods. Furthermore, they concluded that propensity score methods can exacerbate the bias in treatment effect estimates when there are unmeasured confounders related to measured confounders.
We agree that propensity score methods may not reduce bias from unmeasured confounders, but would like to point out a potential drawback of directly comparing estimates from propensity score methods with those from regression analysis in which covariates are included in the adjustment model. In studies in which non-linear models (such as logistic regression or the Cox proportional hazards model) are applied, effect estimates may differ between PS methods and regression adjustment methods not only because of differential adjustment for confounding but also because of non-collapsibility. 21 Non-collapsibility is the phenomenon that, when the treatment-outcome association is estimated using an odds ratio (OR) or hazard ratio (HR), the conditional OR or HR does not equal the marginal OR or HR in the presence of a non-null treatment effect, even in the absence of confounding and effect modification. 7,21,22 Adjusting for covariates that are predictors of the outcome will change the treatment effect estimate, even when there is no confounding present. Obviously, the number of covariates included in the adjustment model can be very different when comparing propensity score matching with regression analysis in which covariates are included in the adjustment model, and thus a direct comparison may be flawed.
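Non-collapsibility of the odds ratio can be illustrated with a small numerical sketch. The set-up below is an assumption made purely for illustration and is not taken from the studies discussed: a binary covariate C with 50% prevalence that predicts the outcome but is independent of treatment T (so it is not a confounder), and an outcome model logit P(Y = 1) = b0 + b_t T + b_c C with hypothetical coefficients.

```python
import math


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def odds(p):
    return p / (1.0 - p)


def conditional_and_marginal_or(b0, b_t, b_c, p_c=0.5):
    """Odds ratios for treatment T under logit P(Y=1) = b0 + b_t*T + b_c*C.

    C is binary with prevalence p_c and independent of T, so C predicts
    the outcome but does not confound the treatment effect.
    """
    def risk(t, c):
        return expit(b0 + b_t * t + b_c * c)

    # Stratum-specific (conditional) odds ratios; both equal exp(b_t).
    conditional = {c: odds(risk(1, c)) / odds(risk(0, c)) for c in (0, 1)}

    # Marginal risks: average the stratum risks over the distribution of C.
    def marginal_risk(t):
        return (1 - p_c) * risk(t, 0) + p_c * risk(t, 1)

    marginal = odds(marginal_risk(1)) / odds(marginal_risk(0))
    return conditional, marginal


# Hypothetical coefficients: conditional OR of 2, strong outcome predictor C.
conditional, marginal = conditional_and_marginal_or(
    b0=-1.0, b_t=math.log(2), b_c=2.0
)
print(conditional)  # odds ratio of 2 in both strata (up to rounding)
print(marginal)     # closer to 1 than the conditional odds ratio
```

In this sketch the marginal odds ratio is smaller than the common stratum-specific odds ratio even though C is balanced between treatment groups, which is why a marginal estimate (e.g. from matching) and a covariate-adjusted conditional estimate need not agree in the absence of any bias.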

In addition, while propensity score and regression methods may provide different effect estimates, the inferential goal of the research question determines which estimand is appropriate. For example, in the absence of non-collapsibility (e.g. with linear models), marginal structural modelling (MSM) using inverse probability of treatment weighting (IPTW), multivariable regression, covariate adjustment using the propensity score, and propensity score stratification estimate the average treatment effect in the population (ATE), which is the treatment effect estimated in RCTs (i.e., the average causal treatment effect if everyone in the population were treated versus if everyone in the population were untreated). On the other hand, propensity score matching typically focuses on either the average treatment effect in the treated (ATT) or the average treatment effect in the untreated (ATU), not on the ATE; for the ATT, the target of the causal contrast is the population that is going to receive the treatment. 6,26 This distinction is particularly important when there is treatment-effect modification, regardless of the presence of confounding. 25,27 In the study by Brooks and Ohsfeldt, 3 exacerbation of bias was not clearly evident when using the propensity score to balance measured covariates compared with regression estimates, even in the presence of independent variation in the unmeasured covariate (for example, Scenarios 1 and 2 of Table 3), despite inappropriate comparisons of treatment effect estimates from different models.

CONCLUDING REMARKS

The authors should be commended for raising an important topic in propensity score methodology, namely the implications of balancing measured covariates using the propensity score for the (im)balance of unmeasured covariates. However, we disagree with Brooks and Ohsfeldt's 3 statement that systematic differences in the unmeasured covariates are required for propensity algorithms to balance measured covariates between treated and untreated subjects. In addition, the impact of unmeasured covariate imbalance on the bias of estimated treatment effects, if any, cannot be inferred from their study, because the estimands are different and a direct comparison may not be appropriate. Therefore, the findings of Brooks and Ohsfeldt 3 should be interpreted with caution, and further research is needed to evaluate the balancing properties of propensity algorithms with respect to unmeasured confounding using properly designed simulation settings.

REFERENCES

1. Shah BR, Laupacis A, Hux JE, Austin PC. Propensity score methods gave similar results to traditional regression modeling in observational studies: a systematic review. J Clin Epidemiol 2005; 58.
2. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika 1983; 70.
3. Brooks JM, Ohsfeldt RL. Squeezing the balloon: propensity scores and unmeasured covariate balance. Health Serv Res 2013; 48.
4. Rubin DB. Inference and missing data. Biometrika 1976; 63.
5. Rubin DB. Assignment to treatment group on the basis of a covariate. J Educ Behav Stat 1977; 2.
6. Hill J. Discussion of research using propensity-score matching: Comments on "A critical appraisal of propensity-score matching in the medical literature between 1996 and 2003" by Peter Austin, Statistics in Medicine. Stat Med 2008; 27.
7. Austin PC. The performance of different propensity score methods for estimating marginal odds ratios. Stat Med 2007; 26.
8. Austin PC. An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behav Res 2011; 46.
9. Jo B, Stuart EA. On the use of propensity scores in principal causal effect estimation. Stat Med 2009; 28.
10. Rassen JA, Schneeweiss S. Using high-dimensional propensity scores to automate confounding control in a distributed medical product safety surveillance system. Pharmacoepidemiol Drug Saf 2012; 21.
11. Rassen JA, Glynn RJ, Brookhart MA, Schneeweiss S. Covariate selection in high-dimensional propensity score analyses of treatment effects in small samples. Am J Epidemiol 2011; 173.
12. Schneeweiss S, Rassen JA, Glynn RJ, et al. High-dimensional propensity score adjustment in studies of treatment effects using health care claims data. Epidemiology 2009; 20.
13. Cole SR, Hernán MA. Constructing inverse probability weights for marginal structural models. Am J Epidemiol 2008; 168.
14. Petersen ML, Porter K, Gruber S, et al. Diagnosing and responding to violations in the positivity assumption. U.C. Berkeley Division of Biostatistics Working Paper Series 269 (2010).
15. Rosenbaum PR. In: Encyclopedia of Statistics in Behavioral Science. DOI: / bsa454 (Wiley Online Library, 2005).
16. Stürmer T, Schneeweiss S, Avorn J, Glynn RJ. Adjusting effect estimates for unmeasured confounding with validation data using propensity score calibration. Am J Epidemiol 2005; 162.
17. Cepeda MS, Boston R, Farrar JT, Strom BL. Comparison of logistic regression versus propensity score when the number of events is low and there are multiple confounders. Am J Epidemiol 2003; 158.
18. Glynn RJ, Schneeweiss S, Stürmer T. Indications for propensity scores and review of their use in pharmacoepidemiology. Basic Clin Pharmacol Toxicol 2006; 98.
19. Ho DE, Imai K, King G, Stuart EA. Matching as nonparametric preprocessing for reducing model dependence in parametric causal inference. Political Analysis 2007; 15.
20. Zanutto EL. A comparison of propensity score and linear regression analysis of complex survey data. JDS 2006; 4.
21. Greenland S, Robins JM, Pearl J. Confounding and collapsibility in causal inference. Statist Sci 1999; 14.
22. Martens EP, Pestman WR, De Boer A, et al. Systematic differences in treatment effect estimates between propensity score methods and logistic regression. Int J Epidemiol 2008; 37.
23. Robins JM, Hernán MÁ, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology 2000; 11.
24. Hernán MÁ, Brumback B, Robins JM. Marginal structural models to estimate the causal effect of zidovudine on the survival of HIV-positive men. Epidemiology 2000; 11.
25. Fang G, Brooks JM, Chrischilles EA. Apples and oranges? Interpretations of risk adjustment and instrumental variable estimates of intended treatment effects using observational data. Am J Epidemiol 2012; 175.
26. Rassen JA, Shelat AA, Myers J, et al. One-to-many propensity score matching in cohort studies. Pharmacoepidemiol Drug Saf 2012; 21.
27. Stürmer T, Joshi M, Glynn RJ, et al. A review of the application of propensity score methods yielded increasing use, advantages in specific settings, but not substantially different estimates compared with conventional multivariable methods. J Clin Epidemiol 2006; 59.


CHAPTER 4.2
Quantitative Falsification of Instrumental Variables Assumption Using Balance Measures
M Sanni Ali, MJ Uddin, RHH Groenwold, WR Pestman, SV Belitser, AW Hoes, A de Boer, KCB Roes, OH Klungel
Both authors have contributed equally
Epidemiology 2014; 25


ABSTRACT

Background
Instrumental variable (IV) analysis can in theory control for unmeasured confounding in non-randomized studies. One of the assumptions is that the IV is independent of confounders. We conducted a simulation study to assess the performance of balance measures commonly used in propensity score methods (i.e., the standardized difference) to quantitatively falsify this assumption.

Methods
We simulated cohorts of varying sample sizes, with a binary IV and exposure, a continuous outcome, and several confounders. Data were analyzed using the two-stage least squares method. The balance of confounders across IV levels was assessed using the standardized difference.

Results
Bias of IV estimates increased with weaker IVs (i.e., weaker association between IV and exposure) and with increasing values of the standardized difference (i.e., decreasing balance of confounders across IV levels). IV estimates were more biased than conventional regression estimates with increasing values of the standardized difference, and a weak IV amplified this bias.

Conclusion
Balance measures that are commonly used in propensity score methods can be useful tools to falsify an important assumption underlying IV analysis, i.e., that the IV is independent of confounders. However, these balance measures only quantify the balance of measured confounders, and researchers should complement them with theoretical justifications for balance on unmeasured confounders. If measured confounders are imbalanced between IV categories, such imbalance is likely to exist in unmeasured confounders as well, and adjusting for measured confounders in the IV models may result in a more biased estimate than the conventional regression estimate; hence, researchers should consider refraining from IV analysis.

INTRODUCTION

Instrumental variable (IV) analysis might be an attractive method to adjust for confounding in non-randomized studies, since it potentially controls for both measured and unmeasured confounding. 1,2 An IV is a variable that is associated with the exposure under study and related to the outcome only through the exposure. 1,2 This implies that an IV should satisfy three basic assumptions: i) the IV is (strongly) associated with the exposure under study; ii) the IV affects the outcome only through the exposure; 2-4 and iii) the IV is independent of confounders. 2,3 If these assumptions are satisfied, then with an additional assumption (homogeneous treatment effect, see later) IV analysis may provide a consistent estimate of the exposure effect on the outcome. 2 However, if one of the basic assumptions is violated, the IV estimate can be severely biased and inconsistent. 1,2

To check the first assumption, statistical tools such as the F-statistic, 1,5-7 R-squared, 8 pseudo-R-squared, 9 and the odds ratio 9,10 have been used. There is no well-established method for checking the second and third assumptions, and several authors 1,9,11,12 have argued that these assumptions are unverifiable or not directly testable from data, as they involve unmeasured confounders. 2,13 On the other hand, Glymour et al. 14 proposed several methods that, while not fail-safe and requiring additional assumptions, are useful for evaluating the validity of an IV. Moreover, several other authors provided supportive evidence for the third assumption by describing the balance of measured patient characteristics between IV categories. Conversely, an imbalance in measured patient characteristics can falsify the third assumption and help researchers assess whether it is appropriate to proceed with IV analysis. 26 In this article, we propose familiar and easy-to-apply methods that will help researchers to falsify the third assumption by assessing the independence between the IV and measured confounders. These methods are based on balance measures commonly used in propensity score (PS) methods. In PS methods, 30 balance measures such as the standardized difference, the Kolmogorov-Smirnov distance, and the overlapping coefficient 28,29 can be used to quantify the balance of measured confounder distributions between exposure groups. These balance measures were chosen because they are robust with respect to sample size and well known among epidemiologists. 28,29,31 If measured confounders are insufficiently balanced between the IV categories, the IV and measured confounders are not independent, which means that the third assumption is violated; hence, IV analysis is not appropriate.
However, if measured confounders are balanced, investigators should rely on substantial background knowledge to argue that such balance could be carried over to unmeasured confounders, although this cannot be verified from the data. 15,25 The objective of this study was to explore the usefulness of balance measures to quantitatively falsify the assumption that the IV should be independent of confounders. In addition, we illustrated this method using an empirical example on the relation between inhaled long-acting beta2-adrenoceptor agonists and the risk of acute myocardial infarction (AMI).

MATERIALS AND METHODS

Balance Measures for Measured Confounders

Balance measures have been used in PS methods to assess the balance of confounders between treatment groups. 28,29 For a detailed explanation of the balance measures, we refer to the literature. We used the standardized difference (SDif), the Kolmogorov-Smirnov (KS) distance, and the overlapping coefficient (OVL) as balance measures, but we only report results for the SDif owing to its common application in the medical literature, its better performance in various scenarios (e.g., covariate distribution and sample size), and its simplicity of calculation compared to other balance measures. 32,33 For binary IVs and binary confounders, the SDif is the difference in proportions of the confounder between IV categories, standardized to the variation in the confounding variable (i.e. the standard deviation). For binary IVs and continuous confounders, the SDif is the difference in the means of the confounder between IV categories, standardized to the variation. The SDif has a minimum value of zero (perfect balance) but no maximum value. An imbalance in measured confounders between IV categories indicated by the balance measure (e.g. SDif > 0.10, an arbitrary cut-off) 34 means that the third assumption is violated.

Simulation Setting

We used Monte Carlo simulations to assess the third assumption of IV analysis using the SDif. The scenarios we considered included a binary IV, a binary exposure, a continuous outcome, and continuous confounders. We used the following notation: Y denotes the outcome, X denotes the exposure, Z denotes the IV, and C and U denote the sets of measured and unmeasured confounding variables, respectively. We used the statistical software R (for Windows) for simulations and analyses. 35

Data Generation

First, we generated four continuous confounders ($C_1$, $C_2$, $C_3$, and $C_4$) using the multivariate normal distribution (MVND) with mean 0 and variance 1. The correlation coefficient between confounders was varied between 0 and 0.4. A binary IV was generated based on the following logistic model (equation 1):

$\mathrm{logit}[\Pr(Z_i = 1 \mid C)] = \alpha_0 + \alpha_1 C_{1i} + \alpha_2 C_{2i} + \alpha_3 C_{3i} + \alpha_4 C_{4i}$   [1]

The values of $\alpha_1$ to $\alpha_4$ were varied between 0.0 and 0.60 to induce different associations between the IV and the confounders. $\alpha_0$ was set in the logistic model (equation 1) so as to achieve a 40% prevalence of the binary IV. Next, a binary exposure was generated based on the logistic model (equation 2):

$\mathrm{logit}[\Pr(X_i = 1 \mid Z, C)] = \beta_0 + \beta_z Z_i + \beta_1 C_{1i} + \beta_2 C_{2i} + \beta_3 C_{3i} + \beta_4 C_{4i}$   [2]

The value of $\beta_0$ was set to -1.5 so that nearly 50% of the subjects were treated, the value of $\beta_z$ was varied between 0.20 and 2.50 to induce different associations between the IV and the exposure, and $\beta_1$ through $\beta_4$ were set to 1.5 in different scenarios. A continuous outcome, Y, was generated using the following model (equation 3):

$Y_i = \delta_0 + \delta_x X_i + \delta_1 C_{1i} + \delta_2 C_{2i} + \delta_3 C_{3i} + \delta_4 C_{4i} + \varepsilon_i$, for $i = 1, 2, \ldots, n$   [3]

where X indicates the exposure variable generated previously (equation 2) and the variables $C_1$, $C_2$, $C_3$, and $C_4$ denote the confounding factors. $\delta_0$, $\delta_x$, and $\delta_1$ to $\delta_4$ denote the intercept (set to 1.0), the true exposure effect (set to 1.0), and the effects of the confounders on the outcome (set to 1.5), respectively. $\varepsilon$ is the error term for the outcome, which follows a normal distribution with mean zero and unit variance.
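The data-generating models (equations 1 to 3) and the standardized difference can be sketched in a few lines of standard-library Python. This is an illustrative sketch and not the R code used in the study. Several choices are assumptions: the intercept `alpha0 = -0.4` is a hypothetical value picked to give roughly 40% IV prevalence, equal coefficients are used for all four confounders, the confounders are made equicorrelated through a shared normal factor rather than an explicit multivariate normal draw, and the SDif uses the pooled standard deviation of the two IV groups as the standardizing quantity.

```python
import math
import random
import statistics


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def simulate_cohort(n, alpha, beta_z, rho=0.0, alpha0=-0.4, beta0=-1.5,
                    beta_c=1.5, delta0=1.0, delta_x=1.0, delta_c=1.5, seed=1):
    """Return a list of (z, x, y, [c1, c2, c3, c4]) tuples.

    alpha  : common coefficient of the confounders in the IV model [1]
    beta_z : coefficient of the IV in the exposure model [2]
    rho    : pairwise correlation between confounders (0 <= rho < 1)
    alpha0 : hypothetical intercept, chosen for roughly 40% IV prevalence
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        shared = rng.gauss(0, 1)
        # Each confounder has mean 0, variance 1, pairwise correlation rho.
        c = [math.sqrt(rho) * shared + math.sqrt(1 - rho) * rng.gauss(0, 1)
             for _ in range(4)]
        s = sum(c)
        z = 1 if rng.random() < expit(alpha0 + alpha * s) else 0             # [1]
        x = 1 if rng.random() < expit(beta0 + beta_z * z + beta_c * s) else 0  # [2]
        y = delta0 + delta_x * x + delta_c * s + rng.gauss(0, 1)             # [3]
        rows.append((z, x, y, c))
    return rows


def standardized_difference(values, group):
    """Absolute difference in means between group 1 and group 0,
    divided by the pooled standard deviation (an assumed convention)."""
    g1 = [v for v, g in zip(values, group) if g == 1]
    g0 = [v for v, g in zip(values, group) if g == 0]
    pooled_sd = math.sqrt((statistics.variance(g1) + statistics.variance(g0)) / 2)
    return abs(statistics.mean(g1) - statistics.mean(g0)) / pooled_sd


def sdif_per_confounder(rows):
    z = [r[0] for r in rows]
    return [standardized_difference([r[3][j] for r in rows], z) for j in range(4)]


balanced = simulate_cohort(5000, alpha=0.0, beta_z=1.5)
imbalanced = simulate_cohort(5000, alpha=0.6, beta_z=1.5)
print(sdif_per_confounder(balanced))    # small values when alpha = 0
print(sdif_per_confounder(imbalanced))  # clearly larger values when alpha = 0.6
```

Setting `alpha` to zero makes the IV independent of the confounders, so the SDif values differ from zero only by sampling error; increasing `alpha` raises them, which is the mechanism the simulation uses to create violations of the third assumption.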

We distinguished three scenarios, which are schematically presented using causal diagrams in Figures 1a-b, 2a-b, and 3a-b.

Scenario 1: All confounders are measured (no unmeasured confounding) (Figure 1a-b).
Scenario 2: One of the confounders ($C_4 = U$) is unmeasured; this unmeasured confounder has no association with the measured confounders ($C_1$, $C_2$, and $C_3$) (Figure 2a-b).
Scenario 3: One of the confounders ($C_4 = U$) is unmeasured; however, this confounder is associated with the measured confounders ($C_1$, $C_2$, and $C_3$) (Figure 3a-b).

In scenarios 2 and 3, where there is unmeasured confounding (U), the confounder $C_4$ was considered to be unmeasured at the analysis stage. Hence, no assessment was made of the balance of the distribution of this variable between IV categories. All scenarios were evaluated for a sample size of 10,000, and each scenario was replicated 10,000 times. In order to identify the average exposure effect in the study population, we assumed that the effect of exposure on the outcome was the same for all subjects. 2

Analysis of Simulated Data

In all three scenarios, the balance of measured confounders between IV groups was assessed using the SDif. In the presence of unmeasured confounders, balance was only assessed on measured confounders. We analyzed data using the two-stage least squares method. The first-stage model was a linear regression model in which the exposure was the dependent variable and the IV was the independent variable. 36 The second-stage model was also a linear regression model, in which the outcome was regressed on the predicted exposure (i.e., the predicted value of exposure status based on equation 4) rather than the actual exposure. These models can be summarized as:

First-stage model: $X_i = \gamma_0 + \gamma_1 Z_i + \varepsilon_{i1}$, for $i = 1, 2, \ldots, n$   [4]

Second-stage model: $Y_i = \theta_0 + \theta_x \hat{X}_i + \varepsilon_{i2}$, for $i = 1, 2, \ldots, n$   [5]

where X and Z are the exposure and the IV, respectively; $\hat{X}$ denotes the value of the exposure predicted from equation [4]; $\varepsilon_1$ and $\varepsilon_2$ are the error terms, which follow a normal distribution with mean zero and constant variance $\sigma^2$; and $\gamma_0$ and $\theta_0$ denote the intercepts in the first- and second-stage models, respectively. The estimate of the parameter $\theta_x$ in equation [5] is the IV estimate of the exposure effect on the outcome.

In IV analysis, the measured confounders can be included in both the first- and the second-stage models, since the conditional independence and exclusion restrictions underlying IV estimation are more likely to be valid after conditioning on covariates 6 and the precision of the estimates can be improved. 6,25,37,38 In addition, Brookhart et al. 39 suggested reporting an unadjusted IV estimate and exploring the sensitivity of the results to the inclusion/exclusion of covariates, particularly if there is not a strong theoretical reason to believe that they confound the instrument-outcome association. To evaluate this approach, we performed additional analyses including all measured confounders in both stage models, equations [4] and [5]. In addition, we used conventional multivariable linear regression models adjusting for all measured confounders to estimate the exposure effect on the outcome and compared the results with the IV estimates.

Bias was defined as the difference between the average of the estimated effects and the true exposure effect (i.e., 1.0). Confidence intervals (CIs) were estimated using the standard error of the mean of the estimates (i.e., the standard deviation of the IV estimates divided by the square root of the number of simulations) to quantify the precision with which the bias of the IV estimates was estimated.

Empirical Example

To illustrate the method, we used data from a pharmacoepidemiologic study on the relation between inhaled long-acting beta2-adrenoceptor agonists and the risk of acute myocardial infarction (AMI). For this follow-up study, data from the Dutch Mondriaan database were used, which comprises general practitioner (GP) data complemented by pharmacy dispensing data and linkages to survey data on about 1.4 million patients. For this example we used GP data from adult patients with a diagnosis of asthma and/or COPD and at least one prescription of inhaled beta2-agonists (long acting or short acting, LABA/SABA) or inhaled muscarinic antagonists (MA) (n = 27,459). The index date is defined for each individual patient as the date of the first prescription of an inhaled SABA/LABA or an inhaled MA after the start of valid data collection. The observation period for each patient lasts from the index date (from 1 January 2002 onwards) to the end of data collection (31 December 2009), the date of the first AMI, or the date of death, whichever occurs first.
A patient can switch between current, recent and past periods and between the treatment classes. A patient is a current user from the beginning of the prescription up to the calculated end date of the prescription, a recent user during the 91 days following the calculated end date of the prescription, and a past user after 91 days following the calculated end date of the prescription. The choice of 91 days for the calculation interval was based on the fact that Dutch health insurance policies cover the dispensing of the majority of drugs for three months. 40,41 The period of past use ends if the patient becomes a new user or at the end of follow-up. If a patient switches between the treatment classes, a new treatment period starts at the date of the prescription of the new drug. For our analysis, we considered two groups for comparison: current LABA users and non-LABA users. Non-LABA users could be SABA/MA users or recent/past LABA users.

The outcome status (myocardial infarction) was based on GP records (International Classification of Primary Care, ICPC code K75). Patients were excluded if they had any history of myocardial infarction prior to or at the start of follow-up.
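The current/recent/past classification described above can be sketched as a small function. This is a minimal illustration for a single prescription, not the study's actual data-processing code; the function name, the date-based interface, and the treatment of boundary days (the calculated end date counts as current use, and day 91 after it counts as recent use) are assumptions.

```python
from datetime import date, timedelta

RECENT_WINDOW_DAYS = 91  # three months of dispensing, as stated in the text


def exposure_status(day, rx_start, rx_end):
    """Classify `day` relative to one prescription period.

    rx_start : first day of the prescription
    rx_end   : calculated end date of the prescription (assumed inclusive)
    """
    if day < rx_start:
        raise ValueError("day precedes the start of the prescription")
    if day <= rx_end:
        return "current"
    if day <= rx_end + timedelta(days=RECENT_WINDOW_DAYS):
        return "recent"
    return "past"


# Hypothetical prescription running from 1 January to 31 March 2005.
start, end = date(2005, 1, 1), date(2005, 3, 31)
print(exposure_status(date(2005, 2, 15), start, end))  # within the prescription
print(exposure_status(date(2005, 5, 1), start, end))   # within 91 days after the end
print(exposure_status(date(2005, 9, 1), start, end))   # more than 91 days after the end
```

In the comparison used for the analysis, only the "current" state of a LABA prescription would count as LABA exposure; "recent" and "past" LABA periods fall into the non-LABA group.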

Co-morbidities and co-medications were assessed from one year prior to study entry until the end of the study period, every time exposure changed (current, recent, past) and every six months (if the patient did not change his/her exposure status); hence, they were considered time-varying confounders. For co-morbidities, patients were classified as having the disease from the first date of diagnosis through follow-up. Physician's prescribing preference, measured as the proportion of time for LABA prescriptions (PTLP) per GP centre, was used as an IV in two-stage IV analysis. 25 The PTLP is the ratio of the follow-up time contributed by patients who were current users of LABA under a GP to the total follow-up time contributed by patients under the same GP. Hence, PTLP is a continuous variable ranging between 0 and 1. Although dichotomizing a continuous variable (either an IV or any other variable) is generally advised against, for illustration purposes we dichotomized the continuous IV (PTLP), using the median of the PTLP as a cut-off to create a binary IV. The first-stage model of the IV analysis was a linear regression model and the second-stage model was a Cox proportional hazards model. The 95% CIs were estimated using bootstrapping (number of bootstraps = 1000). The strength of the relation between the IV and the exposure (current LABA users versus non-LABA users) was quantified by the odds ratio. The SDif between IV categories was calculated for each potential (time-varying) confounder to assess the balance of measured confounders between IV categories. In addition, crude and adjusted hazard ratios were estimated using a conventional Cox proportional hazards model in which confounders, except gender, were considered time-varying.

RESULTS

Figure 1 shows the relation between the SDif and bias of the IV estimate for the scenario without unmeasured confounding (i.e.
all confounders measured).the magnitude of the bias increased when the balance on confounders between IV categories decreased (as indicated by an increase in the SDif ). When IV was independent of measured confounders (indicated by intersection point in the plot that corresponds to the zero-point of the mean SDif ), unadjusted IV estimates were unbiased. However, when the IV was associated with measured confounders, unadjusted IV estimates were biased even for stronger IV (e.g. α i = 1.5 in equation 1 with the corresponding value of SDif was 0.6 and β z = 2.5 in equation 2, the bias in the IV estimate was as high as 5.5). When IV was associated with measured confounders, the magnitude of the bias was also influenced by the strength of the IV (i.e. the association between IV and exposure, β z ). For example, for two IVs with β z = 0.5 and 2.5, when the SDif of 0.6, the corresponding bias were close to 9 and 5.5, respectively. In the same Figure, the conventional multivariable linear regression estimates were unbiased where as those of unadjusted IV estimates were not except when IV is perfect, i.e. IV is independent of confounders corresponding to SDif of zero. However, results from the adjusted IV models (models that also included the measured confounders both in first and second stage models) were unbiased. The strength of association between the exposure and confounders influenced the magnitude of bias of IV estimate but not the balance of

151 a b Bias of IV estimate c Mean Standarized Difference measured confounders between IV categories. We therefore only presented results for an association between the exposure and the confounders of β i = 1.5. Figure 2 shows the relation between the SDif and bias for the scenarios with unmeasured confounding which was independent of measured confounders. In this situation, both unadjusted IV method and the conventional multivariable regression method provided biased estimates (Figure 2c), due to association between IV and confounders (except when IV is perfect, the starting point of BetaU.Z line in the plot corresponding to the nearly zero- SDif ), and the presence of unmeasured confounding, respectively. Moreover, in the case of an IV that was associated with confounders (e.g. SDif = 0.05 to 0.80), the results obtained from the conventional linear regression model were less biased than those of unadjusted IV models. Again, the magnitude of bias increased with increasing SDif. In situations where the IV was independent of the measured confounders but not of unmeasured confounders, IV estimates were still biased even though the SDif was close to zero, which is due to the fact that the SDif was determined based only on the measured confounders. When measured confounders were included as covariates in the IV models, the bias of IV estimates was close to zero when the IV was independent of the unmeasured confounders (Figure 2d). In addition, the bias for adjusted IV estimates was smaller than that of unadjusted IV estimates. Beta.Z=0.50 Beta.Z=1.50 Beta.Z=2.50 Conventional Regression Figure 1. Directed acyclic graphs (1a and 1b, where Z= Instrumental variable, X= exposure, Y=outcome, C=measured confounder) and plot of mean standardized difference vs. bias of IV (for different strength of the IV, Beta.Z (β z ) =regression coefficient of Z) and conventional regression estimates in the absence of unmeasured confounding (1c). 
Z is independent of C (1a) and Z is associated to C (1b). The different points on the lines indicate different correlations between IV and measured confounders (For example, at the intersection point of the three lines corresponding to the zero SDif, the IV is independent of measured confounders) 4.2 Quantitative Falsification of Instrumental Variables Assumption Using Balance Measures 151
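The qualitative pattern in Figure 1 (bias of the unadjusted IV estimate growing with the SDif and shrinking with the strength of the IV) can be reproduced with a minimal Monte Carlo sketch. The data-generating model below is a simplified assumption with a single measured confounder; it is not the set of equations used in the thesis, and the parameter names (`alpha`, `beta_z`, `beta_c`, `gamma`, `theta`) are illustrative only.

```python
import random
from statistics import mean, variance


def simulate_iv(alpha, beta_z, n=20000, theta=1.0, beta_c=1.5, gamma=2.0, seed=1):
    """Return (SDif of C between IV groups, bias of the unadjusted IV estimate).

    Hypothetical data-generating model (an assumption, not the thesis equations):
      Z ~ Bernoulli(0.5)                    binary IV
      C = alpha * Z + noise                 measured confounder, tied to Z via alpha
      X = beta_z * Z + beta_c * C + noise   exposure
      Y = theta * X + gamma * C + noise     outcome, true effect = theta
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = 1 if rng.random() < 0.5 else 0
        c = alpha * z + rng.gauss(0, 1)
        x = beta_z * z + beta_c * c + rng.gauss(0, 1)
        y = theta * x + gamma * c + rng.gauss(0, 1)
        rows.append((z, c, x, y))

    def group(col, level):
        return [r[col] for r in rows if r[0] == level]

    c1, c0 = group(1, 1), group(1, 0)
    sdif = (mean(c1) - mean(c0)) / ((variance(c1) + variance(c0)) / 2) ** 0.5
    # With one binary IV and no covariates, two-stage least squares
    # reduces to the Wald ratio of mean differences.
    wald = (mean(group(3, 1)) - mean(group(3, 0))) / (mean(group(2, 1)) - mean(group(2, 0)))
    return sdif, wald - theta


for beta_z in (0.5, 2.5):
    for alpha in (0.0, 0.3, 0.6):
        sdif, bias = simulate_iv(alpha, beta_z)
        print(f"beta_z={beta_z:.1f}  SDif={sdif:.2f}  bias={bias:.2f}")
```

Under this assumed model the large-sample bias of the Wald estimator is `gamma * alpha / (beta_z + beta_c * alpha)`, so it is zero when the IV is independent of the confounder (`alpha = 0`) and, for a given imbalance, larger for a weaker IV.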

Importantly, when the IV was strong but related to the confounders (both measured and unmeasured), estimates from adjusted IV models were more biased than those of conventional multivariable regression.
When the unmeasured confounder was associated with measured confounders (Scenario 3, Figure 3a-b), the results showed a similar pattern as described above, with the exception that the conventional multivariable regression model was less biased (Figure 3c). Also in this scenario, the magnitude of the bias increased with increasing SDif, and estimates from adjusted IV models were biased when the IV was associated with the unmeasured confounder (Figure 3d).

Figure 2. Directed acyclic graphs (2a and 2b, where U = unmeasured confounder) and plots of mean standardized difference vs. bias of IV (for different associations between Z and U, Beta.U.Z ($\beta_4$) = regression coefficient of U) and conventional regression estimates in the presence of U that is independent of C (2c and 2d). Z is independent of U (2a) and Z is associated with U (2b). Standard IV/regression estimates (2c) and adjusted IV/regression estimates (2d). In both plots: x-axis, mean standardized difference; y-axis, bias of IV estimate; lines for Beta.U.Z = 0.0, 0.30, 0.60 and conventional regression.

Figure 3. Directed acyclic graphs (3a and 3b) and plots of mean standardized difference vs. bias of IV (for different associations between Z and U, Beta.U.Z ($\beta_4$) = regression coefficient of U) and conventional regression estimates in the presence of U that is associated with C (3c and 3d). Z is independent of U (3a) and Z is associated with U (3b). Standard IV/regression estimates (3c) and adjusted IV/regression estimates (3d). In both plots: x-axis, mean standardized difference; y-axis, bias of IV estimate; lines for Beta.U.Z = 0.0, 0.30, 0.60 and conventional regression.

Empirical Example: LABA use and the Risk of Myocardial Infarction
The total follow-up time was 110,146 person-years and the proportion of time for LABA prescriptions per GP ranged between 0 and 0.56 (median 0.29). The mean age of patients was 52.3 (SD = 17.8) years and 447 of the patients experienced an MI during follow-up. The odds ratio of the relation between the IV (PTLP) and exposure (current LABA use) was [...].

Table 1. Balance of confounders between exposure and instrumental variable categories
Column groups: Exposure (current LABA versus non-LABA) and IV* (PTLP=1 versus PTLP=0). Within each group, the columns are the mean or frequency (%) in each of the two categories, followed by the standardized difference**.
Confounders (rows): Age (mean); Gender (male); Diabetes mellitus (DM); DM medication; Beta-blocker; Lipid-modifying agents; Stroke; Ischaemic heart disease; Antithrombotic agents; Diuretics; Agents acting on the renin-angiotensin system; Oral corticosteroid; Inhaled corticosteroid; Disease (COPD and/or asthma)
*IV (PTLP): proportion of time for LABA prescription, binarized at its median.
**Standardized difference in bold indicates confounders that are not balanced between exposure groups as well as IV groups.
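The two quantities behind Table 1, the PTLP per GP binarized at its median and the standardized difference of a confounder between two groups, can be sketched as follows. The standardized difference uses the usual pooled-standard-deviation form for continuous and binary covariates; the function names, the record layout and the handling of ties at the median are assumptions made for illustration.

```python
from math import sqrt
from statistics import mean, median, variance


def sdif_continuous(group1, group0):
    """Standardized difference in means for a continuous covariate."""
    pooled_sd = sqrt((variance(group1) + variance(group0)) / 2)
    return (mean(group1) - mean(group0)) / pooled_sd


def sdif_binary(p1, p0):
    """Standardized difference for a binary covariate with prevalences p1 and p0."""
    return (p1 - p0) / sqrt((p1 * (1 - p1) + p0 * (1 - p0)) / 2)


def ptlp_by_gp(records):
    """PTLP per GP from (gp_id, time as current LABA user, total follow-up time) records."""
    laba, total = {}, {}
    for gp, laba_time, total_time in records:
        laba[gp] = laba.get(gp, 0.0) + laba_time
        total[gp] = total.get(gp, 0.0) + total_time
    return {gp: laba[gp] / total[gp] for gp in total}


def binarize_at_median(ptlp):
    """Binary IV: 1 if the GP's PTLP is above the median, else 0 (ties go to 0 by assumption)."""
    cut = median(ptlp.values())
    return {gp: int(value > cut) for gp, value in ptlp.items()}


# Hypothetical follow-up records: two patients under gp_a, one each under gp_b and gp_c.
records = [("gp_a", 2.0, 10.0), ("gp_a", 1.0, 10.0), ("gp_b", 6.0, 10.0), ("gp_c", 3.0, 10.0)]
iv = binarize_at_median(ptlp_by_gp(records))
print(iv)

# Standardized difference (in %) for a binary confounder present in 60% vs 40% of patients.
print(round(100 * sdif_binary(0.6, 0.4), 1))
```

Reporting the value multiplied by 100 matches the percentage scale used in the text (for example, the 10% threshold).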

The distribution of several patient characteristics differed systematically between exposure groups; hence, there was a large potential for confounding of the exposure-outcome relation (Table 1). Indeed, the crude and adjusted hazard ratios (HR) from the conventional analyses differed considerably: HR 1.39 [ ] and 0.90 [ ], respectively. Some of the measured confounders were not balanced between IV categories (SDif values for age, oral corticosteroid, and disease (COPD, asthma, or both) were 16%, 11% and 16%, respectively); hence, the third IV assumption did not seem to hold. The effect estimates from unadjusted and adjusted IV analyses were HR 1.69 [ ] and 0.19 [ ], respectively.

DISCUSSION
Our simulation study shows that balance measures can be used to falsify the third assumption of IV analysis, i.e. the assumption that the IV is independent of confounders. The standardized difference (SDif), a measure of the degree of balance on measured confounders between IV categories, was strongly correlated with the bias of IV estimates. Values of the SDif close to zero indicate that at least the measured confounders are balanced between IV categories. When this assumption is violated, IV analysis may result in more biased estimates than conventional regression analysis. The magnitude of bias was associated with the strength of the IV and the balance of measured confounders between IV categories. A higher value of the SDif (e.g., larger than 10%), 32 i.e. a strong association between IV and measured confounders, indicates a violation of the third assumption and is associated with highly biased estimates. This bias can be remedied by including measured confounders in the IV model, under the assumption that there is no unmeasured confounding or that unmeasured confounders are independent of the IV. However, IV analysis is mainly considered in settings where unmeasured confounding cannot be ruled out.
An imbalance in measured confounders as indicated by the SDif would, therefore, suggest that IV analysis with or without inclusion of measured confounders would yield biased estimates. Interestingly, in the presence of unmeasured confounding that was related to the IV, conventional multivariable regression analysis yielded less biased estimates than IV analysis in our simulations, even in the presence of a strong IV. The bias and variation in IV estimates increased considerably when the association between IV and exposure was weak (i.e., weak IV), which is in line with previous studies. 1,2 In those cases, estimates from IV analysis were more biased than those from conventional regression analysis, even when measured confounders were included in both the first- and second-stage IV models. However, when the IV was strong (e.g. $\beta_z = 2.5$), including measured confounders in the IV models provided essentially unbiased estimates, like linear regression in the absence of unmeasured confounding, despite the poor performance of IV methods in finite samples. 8
In our empirical example, the adjusted estimate from conventional regression was not significant and was close to the finding of no relevant differences between treatment groups in a meta-analysis of RCTs. 45 On the other hand, the estimates from unadjusted and adjusted IV analyses were divergent, which could be due to imbalance of measured confounders (age, oral corticosteroids, and disease) between IV categories, i.e. violation of the third assumption. Moreover, a weak association between IV and exposure (current LABA use) was evident, as reflected by the wider confidence interval of the unadjusted IV estimate. Although adjustment for measured confounders in IV models improved the precision of the estimate, the IV estimate is far from the estimates in meta-analyses of RCTs on this topic. 45 This difference could be due to the imbalance in measured confounders between IV groups, and it seems plausible that such imbalance could also exist in unmeasured confounders. Hence, when there is an imbalance in unmeasured confounders between IV categories, adjustment for measured confounders in the IV analysis could result in more biased estimates than conventional regression methods. Therefore, IV analysis is inappropriate in such cases. On the other hand, the non-collapsibility 46 of the hazard ratio could in part explain the difference between adjusted and unadjusted (IV or conventional regression) estimates.
This study has several strengths. First, we explored the usefulness of balance measures (SDif), which are easy to apply, in several realistic settings to falsify the third assumption of IV analysis. The SDif has several desirable properties compared to other tests of independence (e.g. the t-test), including independence of sample size. 28 Second, we considered a wide range of scenarios for associations among IV, exposure, and confounders (both measured and unmeasured). A limitation of our simulation study is that we restricted the simulations to a continuous outcome with a linear model.
We chose this approach because IV estimates are reported to be biased in the case of a binary outcome with a non-linear model. 10,47 Future research could extend the simulations to settings with a binary outcome. In addition, although the different confounders in the simulations had different associations with the exposure and/or outcome, we gave equal weights to all confounders in estimating the standardized difference. Nevertheless, the choice of the weights is not straightforward and its impact on the bias is not substantial. 28 Furthermore, we used the SDif only for a binary IV; however, a similar approach can be used for a continuous IV, i.e. assessing balance of measured confounders between quintiles of the continuous IV.
Using Monte Carlo simulations, we demonstrated that balance measures such as the standardized difference can help researchers in assessing the third assumption of IV analysis with respect to measured confounders. Although balance on measured confounders between IV categories does not guarantee balance on unmeasured confounders, 14,48 despite such claims being prevalent in the medical literature on IV analysis, 15,16,19,21,23,26,49 our study indicated that balance measures seem to be useful for falsification of the IV assumption. Hence, when balance in measured confounders is achieved, investigators should rely on theoretical justifications with regard to balance of unmeasured confounders 26,50 for the validity of the IV analysis. If balance measures indicate that the confounders are imbalanced between IV categories, and thus falsify the third IV assumption, researchers should consider refraining from IV analysis, even one that adjusts for measured confounders.

REFERENCES
1. Martens EP, Pestman WR, De Boer A, Belitser SV, Klungel OH. Instrumental variables: Application and limitations. Epidemiology 2006; 17:
2. Hernán MA, Robins JM. Instruments for causal inference: an epidemiologist's dream? Epidemiology 2006; 17:
3. Greenland S. An introduction to instrumental variables for epidemiologists. Int J Epidemiol 2000; 29:
4. Davies NM, Smith GD, Windmeijer F, Martin RM. Issues in the reporting and conduct of instrumental variable studies: A systematic review. Epidemiology 2013; 24:
5. Stock JH, Wright JH, Yogo M. A survey of weak instruments and weak identification in generalized method of moments. J Bus Econ Stat 2002; 20:
6. Angrist JD, Pischke JS. Mostly harmless econometrics: An empiricist's companion. Princeton Univ Pr;
7. Staiger D, Stock JH. Instrumental variables regression with weak instruments. Econometrica 1997; 65:
8. Bound J, Jaeger DA, Baker RM. Problems with instrumental variables estimation when the correlation between the instruments and the endogeneous explanatory variable is weak. J Am Stat Assoc 1995; 90:
9. Yoo BK, Frick KD. The instrumental variable method to study self-selection mechanism: A case of influenza vaccination. Value Health 2006; 9:
10. Uddin MJ, Groenwold RHH, de Boer A, et al. Performance of instrumental variable methods in cohort and nested case control studies: a simulation study. Pharmacoepidemiol Drug Saf 2013; 23:
11. Didelez V, Meng S, Sheehan NA. Assumptions of IV methods for observational epidemiology. Stat Sci 2010; 25:
12. Chen Y, Briesacher BA. Use of instrumental variable in prescription drug research with observational data: a systematic review. J Clin Epidemiol 2011; 64:
13. Didelez V, Sheehan N. Mendelian randomization as an instrumental variable approach to causal inference. Stat Methods Med Res 2007; 16:
14. Glymour MM, Tchetgen EJT, Robins JM. Credible mendelian randomization studies: Approaches for evaluating the instrumental variable assumptions. Am J Epidemiol 2012; 175:
15. Schneeweiss S, Solomon DH, Wang PS, Rassen J, Brookhart MA. Simultaneous assessment of short-term gastrointestinal benefits and cardiovascular risks of selective cyclooxygenase 2 inhibitors and nonselective nonsteroidal antiinflammatory drugs: An instrumental variable analysis. Arthritis Rheum 2006; 54:
16. Schneeweiss S, Setoguchi S, Brookhart A, Dormuth C, Wang PS. Risk of death associated with the use of conventional versus atypical antipsychotic drugs among elderly patients. Can Med Assoc J 2007; 176:
17. Earle CC, Tsai JS, Gelber RD, et al. Effectiveness of chemotherapy for advanced lung cancer in the elderly: Instrumental variable and propensity analysis. J Clin Oncol 2001; 19:
18. Wang PS, Schneeweiss S, Avorn J, et al. Risk of death in elderly users of conventional vs. atypical antipsychotic medications. N Engl J Med 2005; 353:
19. Lu-Yao GL, Albertsen PC, Moore DF, et al. Survival following primary androgen deprivation therapy among men with localized prostate cancer. JAMA 2008; 300:
20. Stuart BC, Doshi JA, Terza JV. Assessing the impact of drug use on hospital costs. Health Serv Res 2009; 44:
21. Zhang Y. Cost-saving effects of olanzapine as long-term treatment for bipolar disorder. J Ment Health Policy Econ 2008; 11:
22. Schneeweiss S, Seeger JD, Landon J, Walker AM. Aprotinin during coronary-artery bypass grafting and risk of death. N Engl J Med 2008; 358:
23. Park T-, Brooks JM, Chrischilles EA, Bergus G. Estimating the effect of treatment rate changes when treatment benefits are heterogeneous: Antibiotics and otitis media. Value in Health 2008; 11:
24. Ramirez SPB, Albert JM, Blayney MJ, et al. Rosiglitazone is associated with mortality in chronic hemodialysis patients. Journal of the American Society of Nephrology 2009; 20:
25. Brookhart MA, Wang PS, Solomon DH, Schneeweiss S. Evaluating short-term drug effects using a physician-specific prescribing preference as an instrumental variable. Epidemiology 2006; 17:
26. Swanson SA, Hernán MA. Commentary: How to report instrumental variable analyses (suggestions welcome). Epidemiology 2013; 24:
27. Austin PC. The relative ability of different propensity score methods to balance measured covariates between treated and untreated subjects in observational studies. Medical Decision Making 2009; 29:
28. Belitser SV, Martens EP, Pestman WR, Groenwold RH, de Boer A, Klungel OH. Measuring balance and model selection in propensity score methods. Pharmacoepidemiol Drug Saf 2011; 20:
29. Groenwold RHH, de Vries F, de Boer A, et al. Balance measures for propensity score methods: A clinical example on beta-agonist use and the risk of myocardial infarction. Pharmacoepidemiol Drug Saf 2011; 20:
30. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika 1983; 70:
31. Austin PC. Balance diagnostics for comparing the distribution of baseline covariates between treatment groups in propensity-score matched samples. Stat Med 2009; 28:
32. Austin PC. A critical appraisal of propensity-score matching in the medical literature between 1996 and Stat Med 2008; 27:
33. Austin PC. Propensity-score matching in the cardiovascular surgery literature from 2004 to 2006: A systematic review and suggestions for improvement. J Thorac Cardiovasc Surg 2007; 134:
34. Austin PC. A critical appraisal of propensity-score matching in the medical literature between 1996 and Stat Med 2008; 27:
35. R Development Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing. Vienna, Austria 2010; ISBN: Available online at
36. Angrist JD, Krueger AB. Instrumental variables and the search for identification: From supply and demand to natural experiments. J Econ Perspect 2001; 15:
37. Burgess S. Statistical issues in Mendelian randomization: use of genetic instrumental variables for assessing causal associations. Diss. University of Cambridge,
38. Burgess S, Thompson SG, Andrews G, et al. Bayesian methods for meta-analysis of causal relationships estimated using genetic instrumental variables. Stat Med 2010; 29:
39. Brookhart MA, Rassen JA, Schneeweiss S. Instrumental variable methods in comparative safety and effectiveness research. Pharmacoepidemiol Drug Saf 2010; 19:
40. Ali MS, Groenwold RH, Pestman WR, et al. Time-dependent propensity score and collider-stratification bias: an example of beta2-agonist use and the risk of coronary heart disease. Eur J Epidemiol 2013; 28:
41. De Vries F, Pouwels S, Bracke M, et al. Use of β2 agonists and risk of acute myocardial infarction in patients with hypertension. Br J Clin Pharmacol 2008; 65:
42. Schmoor C, Caputo A, Schumacher M. Evidence from nonrandomized studies: A case study on the estimation of causal effects. Am J Epidemiol 2008; 167:
43. Bosco JLF, Silliman RA, Thwin SS, et al. A most stubborn bias: no adjustment method fully resolves confounding by indication in observational studies. J Clin Epidemiol 2010; 63:
44. Rascati KL, Johnsrud MT, Crismon ML, Lage MJ, Barber BL. Olanzapine versus risperidone in the treatment of schizophrenia: A comparison of costs among Texas Medicaid recipients. Pharmacoeconomics 2003; 21:
45. Ferguson GT, Funck-Brentano C, Fischer T, Darken P, Reisner C. Cardiovascular safety of salmeterol in COPD. Chest 2003; 123:
46. Greenland S, Robins JM, Pearl J. Confounding and collapsibility in causal inference. Stat Sci 1999;
47. Palmer TM, Thompson JR, Tobin MD, Sheehan NA, Burton PR. Adjusting for bias and unmeasured confounding in Mendelian randomization studies with binary responses. Int J Epidemiol 2008; 37:
48. Ward A, Johnson PJ. Addressing confounding errors when using non-experimental, observational data to make causal claims. Synthese 2008; 163:
49. Tentori F, Albert JM, Young EW, et al. The survival advantage for haemodialysis patients taking vitamin D is questioned: Findings from the Dialysis Outcomes and Practice Patterns Study. Nephrology Dialysis Transplantation 2009; 24:
50. Burgess S. Re: credible mendelian randomization studies: Approaches for evaluating the instrumental variable assumptions. Am J Epidemiol 2012; 176:


CHAPTER V GENERAL DISCUSSION


CHAPTER 5.1
Application of Propensity Score Methods to Quantify Treatment Effects: a Step-by-Step Approach for Applied Researchers
M Sanni Ali, RHH Groenwold, SV Belitser, AW Hoes, A de Boer, OH Klungel
Manuscript in preparation


This chapter aims to explain the concept of the propensity score to applied researchers and to provide step-by-step guidance on how propensity score based methods can be applied to address confounding when estimating causal treatment effects using observational data. In doing so, we combine the findings in this thesis with what is already available in the literature. Section 1 (Causal framework) includes a short introduction to the assessment of causal treatment effects. Section 2 (Propensity scores) introduces the concept of propensity scores. In section 3 (Covariate selection for propensity score model), various variable selection methods for the propensity score model and their implications are discussed. Section 4 (Propensity score estimation) describes various methods to estimate the propensity score. In section 5 (Propensity score methods to control for confounding), different types of propensity score methods are introduced. Section 6 (Assessment of balance achieved by propensity score model) gives an overview of methods to assess balance on covariates, i.e., whether the propensity score model has been adequately specified. In section 7 (Estimation and interpretation of treatment effects), we elaborate on estimation and interpretation of causal treatment effects using the different propensity score methods introduced in section 5. Section 8 (Reporting of propensity score analysis) highlights reporting of propensity score based results. Section 9 summarizes the extended applications, strengths and limitations of propensity score methods. Section 10 concludes the discussion with indications for future research.

Causal Framework
Inferences about the causal effects of a treatment on an outcome, whether resulting from a randomized experiment or a non-experimental (i.e. non-randomized or observational) study, involve speculations about the effect the treatment would have had on a subject who, in fact, received no treatment or some other treatment. 1,2 Rubin 2 extended Neyman's representation of potential outcomes in randomized trials (the outcomes, say $Y_i^{T=1}$ and $Y_i^{T=0}$, that would have occurred if subject i had received the treatment ($T_i = 1$) or had not received it ($T_i = 0$), respectively, for a binary treatment T), also called counterfactual outcomes, 3,4 to non-randomized studies. He formalized the causal effect for subject i as the comparison of subject i's outcome had (s)he received the treatment ($Y_i^{T=1}$, the potential outcome under treatment) with subject i's outcome had (s)he not received the treatment ($Y_i^{T=0}$, the potential outcome under no treatment or alternative treatment). This is sometimes called the Neyman-Rubin causal model. 5 Each subject at a certain point in time can only be either treated or untreated (or can receive an alternative treatment). Hence, one can only observe one of the potential outcomes for subject i; the other potential outcome is missing. This was noted as the fundamental problem of causal inference by Holland. 6 Rubin 7 suggested that estimation of causal effects could be thought of as a missing data problem in which the interest lies in predicting the unobserved potential outcome. Hence, efficient causal inference involves proper estimation of unobserved potential outcomes by comparing two groups that are very similar in all characteristics except the treatment they received, one group being treated and the other untreated. 8 The causal model relies on no interference between subjects, 9 i.e., the potential outcomes for subject i are not allowed to depend on the treatment received by other subjects, which was called the stable unit treatment value assumption (SUTVA) by Rubin.

Causal Treatment Effects
Causal treatment effects are ideally investigated in experiments (randomized controlled trials, RCTs) in which randomization assigns subjects to treatment or to no treatment (control) or alternative treatment groups. 2 Randomization uses chance (e.g. the flip of a coin) to construct groups, assigning one subject to treatment and another to the control group; typically, it does not use measured baseline (pretreatment) characteristics of the subjects to drive treatment allocation. 11 As a consequence, the groups tend to be, in expectation, comparable or exchangeable with respect to both measured and unmeasured characteristics, 11,12 which also means that treatment assignment is ignorable, i.e., treatment is independent of baseline characteristics and potential outcomes (or a subject's prognosis): 1 $T \perp (Y_i^{T=1}, Y_i^{T=0})$. Ignorable treatment assignment is the key to identifying causal parameters via direct comparison of outcomes between treated and untreated subjects. 1,3 In observational studies, in contrast, treatment assignment is not random at all. In daily practice, the decision to initiate a specific treatment critically depends on patient characteristics (such as the severity of the symptoms, prior history, and many other prognostic factors). Thus, it is very likely that subjects receiving treatment systematically differ from subjects not receiving the treatment with respect to baseline characteristics (covariates).
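The contrast between randomized and observational treatment assignment can be illustrated with a small simulation in which both potential outcomes are generated for every subject but only one is observed. The model below is purely hypothetical (a single prognostic factor called `severity`, a constant treatment effect, and a logistic rule for confounding by indication); it is a sketch of the idea, not an analysis from this thesis.

```python
import random
from math import exp
from statistics import mean


def naive_difference(assignment, n=20000, effect=-1.0, seed=7):
    """Difference in mean observed outcome, treated minus untreated.

    Hypothetical setup: a higher outcome is worse, severity worsens the
    outcome, and treatment shifts every subject's outcome by `effect`.
    assignment is 'randomized' or 'by_indication'.
    """
    rng = random.Random(seed)
    treated, untreated = [], []
    for _ in range(n):
        severity = rng.gauss(0, 1)
        y0 = 2.0 * severity + rng.gauss(0, 1)  # potential outcome without treatment
        y1 = y0 + effect                       # potential outcome with treatment
        if assignment == "randomized":
            p_treat = 0.5                      # coin flip, ignores severity
        else:
            p_treat = 1 / (1 + exp(-1.5 * severity))  # sicker subjects treated more often
        if rng.random() < p_treat:
            treated.append(y1)                 # only y1 is observed
        else:
            untreated.append(y0)               # only y0 is observed
    return mean(treated) - mean(untreated)


print("true effect:          ", -1.0)
print("randomized assignment:", round(naive_difference("randomized"), 2))
print("by indication:        ", round(naive_difference("by_indication"), 2))
```

Under randomization the direct comparison recovers the true effect up to sampling error, whereas under assignment by indication the treated group is sicker at baseline and the same comparison can even suggest harm.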
Hence, direct comparison of outcomes between treated and untreated subjects may be misleading. As a result, causal inference in observational studies involves preprocessing the data or using adjustment methods so that treated and untreated groups are balanced with respect to important baseline characteristics. One can consistently estimate a causal treatment effect only under conditions of ignorable treatment assignment, 1 which implies that treatment assignment is independent of potential outcomes given a set of covariates (X), $T \perp (Y_i^{T=1}, Y_i^{T=0}) \mid X$, and that there are treated and untreated subjects for all values of the covariates X (i.e., there is a non-zero probability of receiving each treatment for all values of X: $0 < P(T=1 \mid X) < 1$ for all X, where T is the treatment under study). The independence of treatment and potential outcomes given the set of covariates (X) is often called conditional exchangeability, or no unmeasured confounding. Rubin and Rosenbaum 1 proposed propensity score methods as tools to balance treatment groups with respect to measured characteristics so that an unbiased treatment effect can be estimated under the assumption of ignorable treatment assignment given the set of covariates included in the propensity score model (i.e. no unmeasured confounding).

Propensity Scores
The propensity score (PS) is the conditional probability (between 0 and 1) of receiving the treatment (T=1) versus control (T=0) given a vector of observed covariates (X): $e_i(X_i) = P(T_i = 1 \mid X_i = X)$. 1 Intuitively, the PS is a measure of the likelihood that a subject would have been treated based only on his/her covariate values. It is a scalar summary of all pretreatment covariates included in the model, with three important features. First, it is a balancing score, meaning that at each value of the PS, the distribution of the covariates X defining the PS is similar in the treated and untreated groups. Second, if treatment assignment is ignorable given the set of covariates X, then treatment assignment is also ignorable given the PS. Third, estimated propensity scores are better than true propensity scores at removing bias because they also remove some chance imbalances in measured covariates in addition to systematic differences. 13 The implications of these properties will be discussed later. Unlike in randomized experiments, the true propensity score is unknown in observational studies and therefore needs to be estimated from the data, for instance, using logistic regression of the binary treatment category (treatment vs no treatment or alternative treatment) on the measured covariates. Applied researchers who have not used the PS before may ask why one should estimate the probability that a subject receives a certain treatment, while the data clearly show whether an individual received the treatment. A brief answer to this question is: to create a quasi-randomized experiment by using the probability that a subject would have been treated (i.e., the propensity score) as a summary score of all potential (and measurable) confounders, to enable appropriate adjustment of the estimate of the treatment effect. 14 This explains one of the key properties of the PS mentioned earlier: if we find two subjects, one in the treated group and one in the untreated group, with the same PS, we could imagine that these two subjects were more or less randomly assigned to each group in the sense of being equally likely to be treated or not.
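As a minimal sketch of how a propensity score might be estimated and why it balances covariates, the example below fits a logistic regression of treatment on a single simulated covariate by Newton-Raphson and then compares the covariate between treated and untreated subjects, crudely and within quintiles of the estimated PS. The data, the coefficient values, the function names and the use of one covariate are all assumptions made to keep the example self-contained; in practice a statistical package and a multivariable model would be used.

```python
import random
from math import exp
from statistics import mean


def expit(v):
    return 1.0 / (1.0 + exp(-v))


def fit_logistic(x, t, iterations=25):
    """Newton-Raphson fit of P(T=1 | x) = expit(b0 + b1 * x) for one covariate."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, ti in zip(x, t):
            p = expit(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += ti - p            # score (gradient) components
            g1 += (ti - p) * xi
            h00 += w                # information matrix components
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def mean_diff(x, t):
    """Mean of x among treated minus mean of x among untreated."""
    return (mean(xi for xi, ti in zip(x, t) if ti == 1)
            - mean(xi for xi, ti in zip(x, t) if ti == 0))


def stratified_mean_diff(x, t, ps, strata=5):
    """Average within-stratum covariate difference across PS quantile strata."""
    order = sorted(range(len(x)), key=lambda i: ps[i])
    size = len(order) // strata
    diffs = []
    for s in range(strata):
        idx = order[s * size:(s + 1) * size] if s < strata - 1 else order[s * size:]
        diffs.append(mean_diff([x[i] for i in idx], [t[i] for i in idx]))
    return mean(diffs)


# Simulated data (assumed): treatment is more likely at higher covariate values.
rng = random.Random(42)
x = [rng.gauss(0, 1) for _ in range(5000)]
t = [1 if rng.random() < expit(-0.5 + 1.0 * xi) else 0 for xi in x]

b0, b1 = fit_logistic(x, t)
ps = [expit(b0 + b1 * xi) for xi in x]  # estimated propensity scores
crude = mean_diff(x, t)
stratified = stratified_mean_diff(x, t, ps)
print(f"estimated coefficients: b0={b0:.2f}, b1={b1:.2f}")
print(f"covariate difference, crude: {crude:.2f}; within PS quintiles: {stratified:.2f}")
```

Within strata of similar PS, treated and untreated subjects have much more similar covariate values than in the crude comparison, which is the balancing property described above; some residual imbalance remains because quintiles are coarse.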
In a RCT, randomization, which assigns subjects to the treated and untreated (control) or alternative treatment groups, is better than balancing on PS because randomization attempts to balance both measured and unmeasured covariates. 1,14,15 Although PS based results are conditional on the measured covariates only, one can be more confident that the effect estimates are unbiased, when (almost) all covariates believed to be related to the treatment assignment and prognosis (and thus could act as confounders) are measured in a valid manner. 15 Covariate Selection for Propensity Score MODEL In many practical settings, particularly in pharmacoepidemiology, investigators encounter high dimensional data (i.e., large number of covariates) with common exposure and relatively few outcome events (notably in studies of rare adverse events). In an attempt to estimate an unbiased causal treatment effect, selection of important covariates should be made before or during model fitting, particularly when conventional regression methods are being used, to avoid problems such as over-fitting. 16,17 In such settings, the PS methods are invaluable tools for reducing high-dimensionality in covariates. Despite the popularity of PS methods, there are no well-developed tools for variable selection in PS models, and as a consequence applied researchers often use methods that were developed for conventional regression models. Different types of variables can be distinguished based on their relationship with treatment (T), outcome (Y) and other variables. These include: confounding variables (variables that 5.1 Application of Propensity Score Methods to Quantify Treatment Effects 167

determine the probability of initiation of treatment and are also related to the outcome, for example, X in Figure 1a); instrumental variables (IVs, variables that are strongly related to the treatment status but are not independent predictors of the outcome, other than through their relationship with the treatment, IV in Figure 1b); risk factors (variables that are related to the outcome but unrelated to treatment status, R in Figure 1c); intermediate variables (post-treatment variables that are influenced by treatment and lie in the causal pathway from treatment to outcome, I in Figure 1d); colliders (variables that are a common effect of two independent causes, C in Figure 1e and X_{t-1} in Figure 1f); and time-varying confounding variables (variables that vary over time and confound the association between treatment status and the outcome at different time points with or without being affected by previous treatment status, X_t and X_{t-1} in Figure 1f). Conditioning on confounding variables using propensity score or regression models generally reduces confounding, whereas controlling for risk factors of the outcome improves the precision of causal effect estimates without increasing (or reducing) bias. However, conditioning on other types of variables is unnecessary and may introduce bias. 20,21,24,25

Figure 1 (panels a to f). Causal diagrams depicting different associations between (time-varying) treatment (T), outcome (Y), confounding variables (X), risk factors of the outcome (R), instrumental variables (V), intermediate variables (I), common effects, also called colliders (C), time-dependent confounders at different time points t and t-1 (X_t, X_{t-1}), and unmeasured common causes (U).

Consequently, Rubin and Thomas 26 suggested that the PS model should include all pretreatment variables thought to be related to the outcome, whether or not they are related to the treatment variable (thus, both confounding variables and risk factors of the outcome), to reduce bias and improve precision. 26 Later, Rubin 17,27 argued that if variables strongly related to treatment have even a weak effect on the outcome (and thus are near-IVs and not risk factors), their bias-reducing potential outweighs any loss of efficiency in a reasonably sized study. Bhattacharya and Vogt, 28 Wooldridge, 29 and Pearl 22 independently found that variables that are strongly related to treatment and weakly related to outcome have bias-amplifying properties in the presence of unmeasured confounding, which led to Pearl's 21 recommendation to condition only on outcome-related covariates and to exclude strong predictors of treatment. Myers et al., 30 on the other hand, suggested that, in practice, bias amplification should be a secondary concern compared to residual confounding (which might arise from excluding such variables) and favored erring on the side of inclusion rather than exclusion of potential confounders. Simulation studies demonstrated that bias amplification is due to exacerbation of the imbalance in unmeasured confounders that are independent of measured factors, including instrumental variables. 23,31,32 In chapter 2.3 of this thesis, it was shown that when there is an association between such an instrumental variable and measured or unmeasured confounding variables, inclusion of such a variable will optimize bias reduction, supporting the views expressed by Rubin 17 and Myers et al. 30 This underlines the importance of evaluating associations among covariates, in addition to their association with treatment and outcome. Furthermore, when such a variable fulfills the three assumptions underlying instrumental variable methods: i.e.
(1) strong association with the treatment, (2) no association with the outcome conditional on the treatment, and (3) independence from measured and unmeasured confounding factors, investigators may be better off employing instrumental variable analyses rather than using regression and propensity score methods excluding such a variable. Although one can empirically assess the first assumption (i.e., whether a variable is strongly associated with the treatment), the last two cannot be tested using the data since they involve unmeasured confounding. In chapter 4.2, we described how the propensity score balance measures described in section 4 of this chapter, such as the standardized difference, can also be employed to falsify the third assumption. When the assumptions are violated (or when there is reasonable doubt whether they are fulfilled), exclusion of such a variable carries a higher risk of inducing bias than inclusion of the variable (with the possibility of bias amplification). This was discussed in chapter 2.3. Although the goal of model development in propensity score methods differs from that of regression modeling, it is not uncommon to see variable selection techniques specifically developed for regression methods (Table 1) being applied in propensity score approaches. 37 Prior studies demonstrated that such techniques failed to detect variables that should not be adjusted for, such as colliders, intermediate variables and instrumental variables. Table 1 summarizes different variable selection techniques and their advantages and disadvantages with respect to the goal of PS methods.

Table 1. Commonly Used Variable Selection Methods, Their Advantages and Disadvantages in View of Propensity Score Objectives

Method: Conventional stepwise regression
Description: Forward selection and backward elimination based on statistical significance (P-value)
Advantages:
- Straightforward and easy to apply
Disadvantages:
- Does not take all available prior knowledge into account unless combined with prior knowledge
- Only takes into account variable-treatment associations and not variable-outcome associations
- Cannot distinguish between confounder and collider
- Arbitrary cut-point (p-value)
- Focuses on strong confounders and ignores the possible combined effect of multiple weak confounders
- Influenced by the order in which confounders are included or deleted from the model
- Influenced by sample size (p-values)

Method: Change-in-estimate methods
Description: Forward selection and backward elimination based on relative or absolute change in the estimated treatment effect
Advantages:
- Straightforward and easy to apply
Disadvantages:
- Does not take advantage of all prior information unless combined with prior knowledge
- Cannot distinguish between confounders and colliders
- Arbitrary cut-point (e.g. p-value, or percentage change)
- Focuses on strong confounders and ignores the possible combined effect of multiple weak confounders
- Influenced by the order in which confounders are included or deleted from the model

Method: Goodness-of-fit tests
Description: Based on e.g., AIC or Hosmer-Lemeshow goodness-of-fit test statistics
Advantages:
- Straightforward and easy to apply
Disadvantages:
- Does not take advantage of all prior information unless combined with prior knowledge
- Cannot distinguish between confounder and collider
- Focuses on strong confounders and ignores the possible combined effect of weak confounders
- Focuses on strength of association with treatment, not with the outcome
- Arbitrary cut-point (p-value)

Table 1. Commonly Used Variable Selection Methods, Their Advantages and Disadvantages in View of Propensity Score Objectives (Continued)

Method: Model discrimination test (c-statistics)
Description: Based on how the model discriminates between treated and untreated subjects (measured as receiver operating characteristic, ROC, curve or C-statistic)
Advantages:
- Straightforward and easy to apply
Disadvantages:
- Does not take advantage of all prior information unless combined with prior knowledge
- Cannot distinguish between confounder and collider
- Focuses on strength of association with treatment, not with the outcome
- Arbitrary cut-off (P-value)
- Relies on the data at hand

Method: Background knowledge
Description: Based on clinical knowledge and/or the use of causal diagrams
Advantages:
- Straightforward and easy to apply
- Distinguishes between confounder and collider
- Optimal confounder identification
- Doesn't rely on the data at hand
- Takes (all) prior information into account
- Can be pre-specified before conducting the data analysis
Disadvantages:
- Arbitrary or subjective (since it is based on prior knowledge)

* In propensity score model, the dependent variable is the treatment status as a function of a set of covariates.

Propensity Score Estimation

Propensity scores are most frequently estimated using logistic regression, although several data mining techniques such as neural networks, classification trees, meta-classifiers and support vector machines have also been suggested. 14,38-40 Previous simulation studies compared logistic regression with machine learning techniques, including classification and regression trees (CARTs and their variants, pruned CART and bagged CART) as well as neural networks. 38,39 Neural networks and CARTs were shown to perform better than logistic regression in terms of bias reduction and consistent 95 percent confidence interval coverage under conditions of both moderate non-additivity and moderate non-linearity. 38,39 However, both studies considered only main effects, without interaction and higher order terms, in the logistic regression models (thus potentially underestimating the performance of these models), whereas machine learning techniques model such terms naturally without the need to specify interactions and square terms. 40 Inclusion of carefully chosen interactions and square terms in the logistic regression models, as proposed by Rubin, 41 although not commonly employed in practice, 37 might change the conclusions based on these simulation studies. Logistic regression has several advantages: it is a familiar, reasonably well-understood statistical tool for investigators and is easy to implement in most statistical packages. 40 In Table 2, different propensity score estimation methods are compared. For detailed information on machine learning techniques we refer to the literature.

Propensity Score Methods to Control for Confounding

The basic idea of propensity score methods is to collapse a set of confounding variables into a single function of these covariates, the propensity score, which can be used as if it were the only confounding variable.
Once the propensity score is estimated (section 4) and optimal balance is achieved (see next section for details on how to check balance), the propensity score can be used for (i) creating a matched sample of treated and untreated subjects with similar propensity scores; (ii) stratifying subjects on their propensity scores and estimating the treatment effect within propensity score strata; (iii) covariance adjustment using the propensity score; or (iv) inverse probability weighting as used in marginal structural models. 1,42-44 At this point, it might seem puzzling why this section precedes the section on assessment of covariate balance. The answer is straightforward: the choice of the propensity score method (which depends in part on the research question in mind, i.e., the inferential goal of the research) determines how balance of covariates or specification of the propensity score model should be evaluated. It also dictates estimation and interpretation of the treatment effect (Section 7, Estimation and interpretation of treatment effects). 8,37

Propensity Score Matching

Propensity score matching is the most commonly used propensity score method to control for confounding in the medical literature. 37,46 It involves constructing matched pairs of

Table 2. Comparison of Different Methods for Estimating Propensity Score

Estimation method: Logistic Regression
Description: A dichotomous linear classifier that utilizes regression equation
Advantages:
- Fairly well-understood by investigators
- Software is available for different statistical packages (SAS, STATA, and R) 40
Disadvantages:
- Makes a parametric assumption of a logistic relation between covariates and exposure
- Sensitive to interaction and higher order terms, particularly with high-dimensional data

Estimation method: Neural Networks
Description: Takes input values and transforms them according to weights on its directed edges, and then outputs values such as probability of class membership
Advantages:
- Designed to deal with high-dimensional data
- Can approximate any smooth polynomial function regardless of the order of the polynomial and the number of interaction terms 40
- Nonparametric 40
- Software is available for different statistical packages (SAS, STATA, and R)
Disadvantages:
- Black box to the data analyst 38
- Training of the neural networks is still not very well developed; requires expertise in tuning the learning algorithm 38,40

Estimation method: Classification and Regression Trees (CARTs)
Description: Classification algorithms which specify a tree of cut points that minimize some measure of diversity in the final nodes once the tree is complete 38,40
Advantages:
- Designed to deal with high-dimensional data 40
- Easy to interpret and implement
- Nonparametric
- Software is available for standard statistical packages (SAS, STATA, and R)
Disadvantages:
- Performance is dependent on the amount of pruning (removal of highly specific nodes)

Estimation method: Support Vector Machines
Description: Makes classification decisions based on a linear combination of the features (covariates) of the data points
Advantages:
- Designed to deal with high-dimensional data
- Semi-parametric
Disadvantages:
- Requires expertise in tuning the learning algorithm
- Software is experimental in some (e.g., SAS)

Estimation method: Meta Classifiers (Boosting)
Description: Combines the results of many weak classifiers (i.e., classifiers with performance only slightly better than chance) to form a single strong classifier 40
Advantages:
- Designed to deal with high-dimensional data
- Requires less expertise
- Less susceptible to over-fitting than single classifier techniques
- Non-parametric 40
- Software is available for standard statistical packages (SAS, STATA, and R)
Disadvantages:
- Does not provide interpretable coefficients

treated and untreated subjects who have the same or similar values of the propensity score. 1,41 One-to-one matching, which pairs a treated and an untreated subject having similar propensity scores, is the most commonly applied propensity score matching method. 37,46 However, several alternative approaches are available, including fixed ratio matching and variable ratio matching, which match a fixed or variable number of untreated subjects to each treated subject, respectively. 26,47-49 A more sophisticated form of matching, full matching, involves creating a series of matched sets each containing at least one treated subject and at least one untreated subject, with a possibility to have many more from either group. 50,51 The propensity score matching methods described above can also be classified based on the closeness of the propensity scores of the matched treated and untreated subjects: 8,52 exact matching, 53 nearest neighbor matching, and nearest neighbor matching within a specified caliper distance defined by the propensity score. 54 Exact matching, which selects for matching treated and untreated subjects with exactly the same propensity score values, can be very restrictive when it involves one-to-one pairing, particularly when the data set is not very large. It can lead to changes in the estimated treatment effect (the average treatment effect in the treated, ATT) and/or loss of precision when many treated subjects have to be discarded because a sufficient number of exact matches cannot be found. 54 The precision of such an analysis can be increased when all exact untreated matches are used (i.e., variable ratio matching) for each treated unit, resulting in a further reduction in variance without any increase in bias.
8 Nearest neighbor matching, on the other hand, selects for a given treated subject an untreated match whose propensity score is closest to that of the treated subject, without any restrictions being placed upon the maximum acceptable difference between the propensity scores of two matched subjects. In nearest neighbor matching, the order in which potential matches are made can influence the quality or closeness of the matches. The descending matching method starts with the treated subject with the highest propensity score and proceeds in decreasing order of the propensity score, whereas the ascending method starts with the treated subject with the lowest propensity score and moves upward, and the random order method matches the treated subjects in random order of their propensity score. 55,56 When multiple untreated subjects have propensity scores that are equally close to that of the treated subject, one of these untreated subjects is selected at random. The three methods are also called greedy matching methods since at each step in the matching process, the untreated subject nearest to the given treated subject is selected for matching, even if it would be a better match for a subsequent treated subject. 11,52 The performance of these methods, in terms of bias reduction and improvement in precision, varies depending on the data structure, and in particular on the separation between treated and untreated subjects in terms of the propensity scores. 55 Optimal matching, which takes into account the overall set of matches when selecting an untreated subject as a match for a treated subject, thereby minimizing the global within-pair difference of the propensity score, can

be an alternative solution to avoid the impact of matching order, particularly when there is high competition for untreated matches. 57 Nearest neighbor matching can lead to poor matches, particularly when combined with (variable) ratio matching, if no restriction is imposed on the permitted difference in propensity scores between matched individuals. Nearest neighbor matching within a specified caliper distance defined by the propensity score selects an untreated match for a treated subject only if the propensity score of the untreated subject lies within the caliper distance of that of the treated subject, thereby avoiding poor matches. 54 The choice of caliper influences the closeness of matches that can be achieved (i.e., the balance and, therefore, the potential to reduce bias) as well as the precision of the treatment effect estimate. A tighter caliper generally leads to closer matches and less bias but leaves some subjects unmatched, the implication of which is discussed in section 6. With a wider caliper, on the other hand, a larger number of subjects can be matched, thereby improving the precision of the effect estimate at the cost of introducing bias in the effect estimate. Cochran and Rubin 58 examined the reduction in bias due to a single normally distributed confounding variable by matching on this confounding variable using calipers which were proportional to the standard deviation of the confounding variable. Based on these findings, Rosenbaum and Rubin 54 suggested matching on the logit of the propensity score, which is more likely to be normally distributed than the propensity score itself, using a caliper width that is a proportion of the standard deviation of the logit of the propensity score. Although the appropriate caliper width depends on the structure of the data set as well as the association between the matching variable and the outcome variable, a caliper width of 0.25 standard deviations of the logit of the propensity score is a widely used recommendation.
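The greedy nearest neighbor procedure within a caliper can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `caliper_match`, the toy propensity scores, and the use of the standard deviation of the logit pooled over all subjects are choices made here for illustration. Treated subjects are processed in descending order of their propensity score, matching is one-to-one without replacement, and ties (which the text resolves at random) are broken arbitrarily.

```python
import math
import statistics

def logit(p):
    return math.log(p / (1.0 - p))

def caliper_match(ps_treated, ps_untreated, caliper_sd=0.25):
    """Greedy 1:1 nearest neighbor matching without replacement.

    Matching is done on the logit of the propensity score, within a
    caliper of caliper_sd standard deviations of that logit. The SD is
    taken over all subjects pooled (an assumption of this sketch).
    Returns (pairs, caliper), where pairs holds (treated_index,
    untreated_index) tuples.
    """
    lt = [logit(p) for p in ps_treated]
    lu = [logit(p) for p in ps_untreated]
    caliper = caliper_sd * statistics.stdev(lt + lu)

    available = set(range(len(lu)))
    pairs = []
    # Descending method: start with the highest treated propensity score.
    for i in sorted(range(len(lt)), key=lambda k: lt[k], reverse=True):
        if not available:
            break
        # Nearest available untreated subject (ties broken arbitrarily).
        j = min(available, key=lambda k: abs(lu[k] - lt[i]))
        if abs(lu[j] - lt[i]) <= caliper:
            pairs.append((i, j))
            available.remove(j)  # without replacement
        # Otherwise the treated subject stays unmatched.
    return pairs, caliper

# Hypothetical propensity scores for four treated and five untreated subjects.
pairs, caliper = caliper_match([0.80, 0.60, 0.55, 0.30],
                               [0.58, 0.52, 0.31, 0.20, 0.10])
print(pairs, round(caliper, 3))
```

In this toy example the treated subject with the highest score has no untreated subject within the caliper and is left unmatched, which illustrates how a caliper trades sample size for closeness of matches.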
Although this caliper width has been shown to be more robust in several studies, 55 considering several calipers might improve the trade-off between balance and the variance of the treatment effect. The utility of balance metrics for choosing the optimal caliper width has been discussed in chapter 2.3. Alternative matching methods include best-first matching, 55 5-to-1-digit matching, 8-to-1-digit matching, 46,59 and kernel density matching.

Matching with and without replacement

Another key issue regarding matching methods is whether untreated subjects can be used as matches for more than one treated subject, i.e., whether matching can be done with replacement or without replacement. 11,60 When using matching without replacement, once an untreated subject has been matched to a given treated subject, the untreated subject is no longer available for consideration as a potential match for subsequent treated subjects. As a result, each untreated subject is included in at most one matched set. In contrast, matching with replacement allows a given untreated subject to be included in more than one matched set. Matching with replacement often decreases bias by improving balance between matched sets and can be helpful when the data contain few untreated subjects compared to the treated pool. Unlike matching without replacement, the order in which matches are formed does not matter when matching with replacement is used;

however, inference and variance estimates become more complex with matching with replacement due to the fact that the same untreated subject may be in multiple matched sets. 61 Methods to account for such lack of independence are discussed in a separate section. An attractive feature of propensity score matching is that it separates the study design phase (i.e., nonparametric preprocessing of the data without the need to use outcome data, which may eliminate or reduce the relationship between the treatment variable and pretreatment covariates) from the data analysis phase, thereby minimizing model dependence in estimating the treatment effect. 53

Propensity Score Stratification

Stratification on the propensity score (also called subclassification) involves grouping subjects into mutually exclusive strata defined by their estimated propensity score. 1 It assumes that, given that the propensity score model is correctly specified, treated and untreated subjects within each propensity score stratum will have a similar distribution of measured baseline covariates. The most common approach is to divide subjects, after sorting them according to their propensity scores, into five approximately equal-sized subclasses using the quintiles of the estimated propensity score. 37,41 Rosenbaum and Rubin 41 demonstrated that subclassification on quintiles of the propensity score eliminates approximately 90 percent of the bias due to the measured confounders. This work was based on previous findings of Cochran, 62 where subclassification on five subclasses of a continuous confounding variable eliminated approximately 90 percent or more of the bias attributable to that variable. The decision on the number of classes should be made based on the sample size as well as empirical assessment of bias reduction.
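As an illustration of the quintile approach described above, the following sketch assigns subjects to five rank-based strata and pools the stratum-specific differences in mean outcome, weighting each stratum by its size (which targets an overall, ATE-type effect). The function names and the toy data are hypothetical; strata that lack either treated or untreated subjects are simply skipped here, which is one of several possible choices.

```python
from statistics import mean

def assign_strata(ps, n_strata=5):
    """Rank-based assignment to approximately equal-sized strata."""
    order = sorted(range(len(ps)), key=lambda i: ps[i])
    strata = [0] * len(ps)
    for rank, i in enumerate(order):
        strata[i] = rank * n_strata // len(ps)
    return strata

def stratified_effect(ps, t, y, n_strata=5):
    """Size-weighted average of within-stratum mean differences."""
    strata = assign_strata(ps, n_strata)
    weighted_sum, n_used = 0.0, 0
    for s in range(n_strata):
        y1 = [yi for yi, ti, si in zip(y, t, strata) if si == s and ti == 1]
        y0 = [yi for yi, ti, si in zip(y, t, strata) if si == s and ti == 0]
        if not y1 or not y0:
            continue  # stratum without both groups contributes nothing
        n_s = len(y1) + len(y0)
        weighted_sum += n_s * (mean(y1) - mean(y0))
        n_used += n_s
    return weighted_sum / n_used

# Hypothetical toy data: 5 strata of 4 subjects; the number of treated
# subjects and the baseline outcome both increase with the stratum,
# and the true treatment effect is 2 for everyone.
treated_per_stratum = [1, 1, 2, 3, 3]
ps, t, y = [], [], []
for s, n_treated in enumerate(treated_per_stratum):
    for k in range(4):
        ti = 1 if k < n_treated else 0
        ps.append(0.05 + 0.2 * s + 0.01 * k)
        t.append(ti)
        y.append(s + 2 * ti)

crude = (mean(yi for yi, ti in zip(y, t) if ti == 1)
         - mean(yi for yi, ti in zip(y, t) if ti == 0))
adjusted = stratified_effect(ps, t, y)
print(crude, adjusted)
```

In this constructed example the crude comparison is confounded by the stratum-level baseline, whereas the stratified estimate recovers the effect built into the toy data.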
With a large dataset it might be desirable to use more than five strata to optimize bias reduction, 62,63 whereas in a small dataset it may not be feasible to form more than three strata and still have adequate balance of the covariate distributions. 63,64 Like propensity score matching, stratification on the propensity score separates the design and analysis steps, thereby reducing model dependence. 16 In addition, it is both understandable and intuitively attractive to researchers (and readers) with limited statistical background. 1

Regression Adjustment Using the Propensity Score

Another common application of the propensity score is covariate adjustment using the propensity score as a covariate. It involves regression modeling of the outcome on treatment, adjusting for the estimated propensity score. 1 This approach is very similar to conventional regression methods except that the set of measured covariates in the outcome model is replaced with a single composite measure, the propensity score. In contrast to propensity score matching and stratification, regression adjustment using the propensity score directly incorporates the estimated propensity scores in the outcome model; as a result, as with conventional regression models, correct specification of the regression model relating the outcome to the treatment and the covariate is required.

Inverse Probability of Treatment Weighting Using the Propensity Score

Inverse probability of treatment weighting (IPTW) is a weighting method in which the inverse of the estimated propensity score is used as the weight for treated subjects and the inverse of one minus the propensity score as the weight for untreated subjects. 65 The weighting results in a pseudo-population where treatment is independent of potential outcomes and of the measured covariates included in estimating the propensity score, given correct specification of the model relating treatment and covariates. Rosenbaum proposed IPTW as a form of model-based direct standardization, and its use is similar to applications in survey sampling that assign weights to survey samples so that the samples are representative of a specific population. 66 This method can be viewed as an extension of stratification as the number of observations and subclasses tends to infinity, or as a method of analysis based on the ideas of Horvitz-Thompson estimators. 67,68 It belongs to a large family of causal models called marginal structural models. 43,44 They are marginal models because they model the marginal distribution of counterfactual random variables rather than their joint distribution. They are called structural models because they model the probabilities of counterfactual variables (models for counterfactuals, in the social science and econometric literatures, are often referred to as structural). 44 Like regression adjustment using the propensity score, IPTW using the propensity score requires correct specification of the propensity score model as well as the outcome model. The four propensity score methods, their description, advantages and disadvantages are summarized in Table 3.
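The weights described above can be written as w = T/e(X) + (1 - T)/(1 - e(X)), where e(X) is the estimated propensity score. The sketch below computes these weights and a weighted difference in mean outcome; the normalized (ratio-type) weighted means, the function names, and the toy data are assumptions made for illustration, and variance estimation is deliberately left out.

```python
def iptw_weights(ps, t):
    """Weight 1/e for treated subjects and 1/(1 - e) for untreated subjects."""
    return [1.0 / p if ti == 1 else 1.0 / (1.0 - p) for p, ti in zip(ps, t)]

def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def iptw_effect(ps, t, y):
    """Difference in weighted mean outcome in the pseudo-population."""
    w = iptw_weights(ps, t)
    y1 = [(yi, wi) for yi, wi, ti in zip(y, w, t) if ti == 1]
    y0 = [(yi, wi) for yi, wi, ti in zip(y, w, t) if ti == 0]
    return (weighted_mean([a for a, _ in y1], [b for _, b in y1])
            - weighted_mean([a for a, _ in y0], [b for _, b in y0]))

# Hypothetical toy data: one binary confounder x with propensity score
# 0.25 when x = 0 and 0.75 when x = 1; outcome y = 3x + 2t.
x = [0, 0, 0, 0, 1, 1, 1, 1]
t = [1, 0, 0, 0, 1, 1, 1, 0]
ps = [0.25 if xi == 0 else 0.75 for xi in x]
y = [3 * xi + 2 * ti for xi, ti in zip(x, t)]

n1, n0 = sum(t), len(t) - sum(t)
crude = (sum(yi for yi, ti in zip(y, t) if ti == 1) / n1
         - sum(yi for yi, ti in zip(y, t) if ti == 0) / n0)
weighted = iptw_effect(ps, t, y)
print(crude, weighted)
```

Because subjects with propensity scores close to 0 or 1 receive very large weights, a sketch like this is sensitive to extreme weights, in line with the disadvantages listed in Table 3.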
Assessment of Balance Achieved by Propensity Score Model

Propensity score model specification should be assessed based on its performance in creating balance on covariates and not on how well the PS model discriminates between treated and untreated subjects, i.e., whether the treatment process is correctly modeled 16 or whether the eventual treatment-effect estimates are larger or smaller than expected. 16,41,54,69 Rubin 41 described propensity score model fitting as an iterative step where the propensity score is re-estimated by including different interactions or higher order terms (in the propensity score model) until an acceptable balance on covariates is achieved. Although there exists no threshold below which the level of imbalance is always acceptable, 70 the use of arbitrary cut-offs for balance diagnostics (for example, below 10% for the absolute standardized difference) is prevalent in the medical literature. 37 Several balance diagnostics such as the standardized difference have been proposed in the literature. 23,59,67,71-73 Although hypothesis testing statistics such as significance tests are commonly used to decide whether the propensity score model is correctly specified and are reported by applied researchers, statisticians and/or methodologists have cautioned against their use. Imai, King and Stuart 70 argued that hypothesis tests such as the t-test are functions not

Table 3. Comparison of the Different Propensity Score Methods, Their Advantages and Disadvantages

Method: Propensity Score Matching
Description:
- Construct treated and untreated matched groups having similar propensity score
- Primarily estimates ATT but ATE can also be estimated with slight modification
Advantages:
- Straightforward and easy to apply
- Minimizes model dependence
- Separates the study design and data analysis stage of a study
- Easy to check improvements on covariate balance
Disadvantages:
- No consensus on variance estimations
- Interpretations could be complicated particularly when some observations are excluded

Method: Propensity Score Stratification
Description:
- Construct strata of treated and untreated subjects having closer propensity score
- Can estimate ATT or ATE
Advantages:
- Straightforward and easy to apply
- Separates the study design and data analysis stage of a study
- Minimizes model dependence
- Easy to check improvements on covariate balance
Disadvantages:
- Interpretations can be complicated

Method: Regression Adjustment Using Propensity Score
Description:
- Propensity score is used as a single summary of all covariates (included in propensity score model) in regression model
- Estimates ATE
Advantages:
- Straightforward and easy to apply
Disadvantages:
- Checking improvements on covariate balance is not straightforward
- Requires correct specification of propensity score model
- Mixes up the design and analysis stage of a study and focuses on the analysis stage rather than design
- Relies on the assumption of linear relationships between the propensity score and outcome
- Interpretations could be complicated when non-collapsible effect measures such as odds ratios are used
- Extrapolates even when there is non-positivity*

Method: Inverse Probability of Treatment Weighting Using Propensity Score
Description:
- Propensity scores are used as weights to create a pseudo-population in which treatment becomes independent of measured covariates included in the treatment (propensity) score model
- Estimates either ATE or ATT
Advantages:
- Straightforward and easy to apply
- Extends to time-varying treatment and confounding setting
Disadvantages:
- Requires correct specification of propensity score and outcome model
- Focuses on the analysis stage rather than design
- Sensitive to observations with extreme weights and non-positivity

only of balance but also of power (i.e., sample size) and, unlike balance, such tests do not describe a property of a particular sample but refer to a hypothetical super-population; hence, they should not be used as stopping rules for maximizing balance. In PS matching, for example, nonsignificant p-values in a matched sample compared to the original unmatched population might be considered as indicators of improved balance even when they are due to a reduced sample size (power) as a result of excluding unmatched subjects, whereas the actual balance may improve, remain the same or even get worse. 70,72 Further, they suggested that a balance measure should not be influenced by sample size and should be characteristic of a sample and not of a hypothetical population. 70 Both goodness-of-fit tests (e.g., the Hosmer-Lemeshow goodness-of-fit test) and discrimination tests of a model (c-statistics or the area under the receiver operating characteristic (ROC) curve of the propensity score model) give no indication of whether an important confounding variable has been omitted from the propensity score model, 69,74 nor are they related to the degree of covariate balance after conditioning on the PS. 75 For example, one can improve the c-statistic of a model by including instrumental variables, which might, on the other hand, result in amplification of residual bias due to unmeasured confounding as well as reduced overlap of propensity score distributions, thereby decreasing the precision of the treatment effect estimate. Hence, the use of goodness-of-fit statistics or measures of model discrimination for developing and evaluating propensity score models should not be advocated. 69,74,75 Decisions about whether a PS model is correctly specified should be made based only on examining patient characteristics measured before the collection of any outcome measures.
67 The only way to assess whether unmeasured characteristics are balanced is to collect data on as many characteristics as possible including proxies for unmeasured factors and to examine the balance on measured covariates to which they are related. Additional balance diagnostics include Kolmogorov-Smirnov distance, Levy distance, nonparametric overlapping coefficient, L1 distance, and Mahalanobis distance, among others. Several simulation and empirical studies demonstrated that the absolute standardized difference is the most robust in terms of covariate distributions and sample size requirements compared to other non-parametric methods such as overlapping coefficients. 23,71 In addition, the absolute standardized difference is a well understood and easy to calculate statistical tool; hence, it is a recommend methods for checking and reporting covariate balance in propensity score methods. A limitation of the absolute standardized difference, as of most balance measures such as the Kolmogorov-Smirnov distance, is that it measures (im)balance on one covariate at a time and one has to average (possibly weighted) covariate specific standardized differences to obtain a summary balance measure as introduced by Belitser et al 71 and later applied, with a slight modification, by Franklin et al. 76 Such an approach requires outcome information to specify the weights, which may seem to contradict with an important feature of PS matching and subclassification: execution of the study design stage separately from the data analysis stage. 67 When investigators have some additional knowledge on which covariates are prognostically important, weighted balance measures may be preferred because better balance is more important for strong predictors of outcome than for weak predictors. 53 On the other hand, the definition of weights is not straightforward. 5.1 Application of Propensity Score Methods to Quantify Treatment Effects 179
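As an illustration of the balance measure discussed above, the following is a minimal sketch, not the implementation used in the cited studies. It computes the absolute standardized difference of a single covariate using a pooled standard deviation, and then a (possibly weighted) average across covariates as a summary measure. The function names `abs_std_diff` and `summary_balance`, the `importance` argument, and the toy data are assumptions made for this example only; in practice the importance weights would have to come from prior knowledge about prognostic strength.

```python
import numpy as np


def abs_std_diff(x, treated):
    """Absolute standardized difference of covariate x between treatment groups.

    Assumption: the pooled SD is sqrt((s1^2 + s0^2) / 2) with sample variances.
    """
    x = np.asarray(x, dtype=float)
    treated = np.asarray(treated, dtype=bool)
    x1, x0 = x[treated], x[~treated]
    diff = abs(x1.mean() - x0.mean())
    pooled_sd = np.sqrt((x1.var(ddof=1) + x0.var(ddof=1)) / 2.0)
    if pooled_sd == 0.0:
        # Covariate is constant within both groups.
        return 0.0 if diff == 0.0 else float("inf")
    return diff / pooled_sd


def summary_balance(covariates, treated, importance=None):
    """(Weighted) average of covariate-specific absolute standardized differences.

    `importance` is a hypothetical list of weights, one per covariate, in the
    same order as `covariates`; None gives a simple unweighted average.
    """
    d = np.array([abs_std_diff(x, treated) for x in covariates.values()])
    return float(np.average(d, weights=importance))


# Toy data (illustrative only): x1 is imbalanced, x2 is balanced.
treated = [1, 1, 1, 0, 0, 0]
covariates = {"x1": [1, 2, 3, 4, 5, 6], "x2": [1, 2, 3, 1, 2, 3]}
for name, values in covariates.items():
    print(name, abs_std_diff(values, treated))
print("summary:", summary_balance(covariates, treated))
```

Because the measure is computed from sample means and variances only, it is a property of the sample at hand rather than of a hypothetical super-population, which is the feature argued for in the text.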

Recently, Franklin et al 76 proposed the post-matching, post-stratification, and post-weighting c-statistic as a measure of balance in propensity score matching, stratification, and inverse probability weighting, respectively. Although, in their simulation setting, the post-matching c-statistic did not outperform the (weighted) average absolute standardized difference, it has the advantage of evaluating balance on several covariates, interactions, and square terms simultaneously and, unlike the pre-matching c-statistic discussed earlier, it has shown a consistently strong association with the amount of bias. 76 This metric, however, may require a large sample size, particularly when propensity score stratification is used, where the balance metric has to be estimated within each stratum of the propensity score, and its application with adjustment for the propensity score in the regression analysis is not straightforward. In the same study, the performance of other multivariate balance metrics (the Mahalanobis balance 57 and the L1 balance metric 77) was poor in terms of bias reduction. 76 A comparison of the different balance measures discussed so far can be found elsewhere. 37

All balance metrics should be calculated in a way similar to how the outcome analysis will be conducted: between matched groups when using propensity score matching, within strata of the PS when using stratification on the PS, and between treated and untreated subjects in the weighted population when using inverse probability of treatment weights. When regression adjustment using the PS is used, balance could be assessed using the standardized difference on the logit of the propensity scores or variance ratios of the residuals of the covariate after adjusting for the logit of the propensity scores. 67

The use of graphical methods such as quantile-quantile plots, side-by-side (weighted) box plots, plots of standardized differences of means, and empirical density plots for comparing the distributions of continuous baseline covariates can provide a quick overview of whether balance has improved for individual covariates. 59,72 Importantly, examining the distribution of propensity scores using histograms or density plots facilitates subjective judgment on whether there is sufficient overlap between the two propensity score distributions, commonly called the common support. It can also guide the choice of matching algorithms in propensity score matching. 60 When, for example, the overlap in the propensity scores is not substantial, meaning that treated and untreated subjects are somewhat different, matching with replacement can be a better option since it will be difficult to find sufficient numbers of matches for treated subjects. When the overlap is too limited, investigators should be aware that the database, no matter how large, cannot support any causal conclusion about the effect of treatment. 64,70,78 The implication of propensity score overlap or common support for the treatment effect estimate is further discussed in the next two sections.

Estimation and Interpretation of Treatment Effects

Matching Using Propensity Score
Once propensity score matching has achieved acceptable balance on important covariates and interaction or square terms, causal treatment effects can be estimated via direct comparison of outcomes between matched groups using the difference in means or proportions without the need to rely on parametric models, as in randomized controlled trials. 64,67 Whether to account for the matched nature of the data (i.e., lack of independence between observations) in the analysis is still an ongoing debate. Some argue that the treated and untreated subjects can be regarded as independent and that the same analysis that would have been done using the original unmatched data, but using the matched data instead, is appropriate. 8,45,51,53 In contrast, Austin argued that, within the same data set matched on propensity scores, treated and untreated subjects are not independent observations since their baseline covariates, which might be related to outcomes, come from the same multivariate distribution. As a result, the likelihood of having similar outcomes is higher for matched subjects than for randomly selected subjects, suggesting the need to account for lack of independence when estimating the variance of the treatment effect; a view shared by several others. 83 When matching is done with replacement or with a variable matching ratio, weights for untreated subjects reflecting, respectively, the number of times they are selected as a match to treated subjects and the number of untreated subjects matched to the treated individual need to be incorporated in the analysis. 8,60,84 Regression adjustment using propensity score matched samples has double robustness properties in the sense that if either the matching or the parametric model is correct, but not necessarily both, causal estimates will still be consistent. 85 That is, if the parametric model is misspecified but the matching is correct, or if the matching is inadequate but the parametric model is correctly specified, estimates will still be consistent. 70

Nearest neighbor matching handles the common support problem (limited overlap in propensity scores) in some ways, in that only the closest untreated matches are selected for treated subjects and unmatched controls are excluded from the analysis. Hence, it focuses, almost always, on estimating the average treatment effect in the treated (ATT), 83,84 not the average treatment effect in the entire sample (ATE). When there is limited common support, treated and untreated subjects in the non-overlapping region should be excluded because one cannot infer treatment effects in this region without extensive extrapolation. It is important to note that exclusion of unmatched subjects from the analysis not only affects the precision of the effect estimate but also has consequences for the generalizability of the findings, even for the ATT. For example, exclusion of treated subjects due to lack of close untreated matches changes the estimand from the ATT to the average effect of treatment in those treated subjects for whom untreated matches can be found. 84 For the ATT, it is sufficient to ensure that potential untreated matches are available for treated subjects; the ATE additionally requires the existence of potential treated matches for untreated subjects. One can also estimate the average treatment effect in the sample with modifications of the matching algorithms. For example, full matching, which uses all subjects for analysis, can estimate either the ATT or the ATE. Although matching, in general, discards some data (i.e., unmatched subjects), it can actually increase the efficiency of treatment effect estimates. 8,
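To make the matching estimator described above concrete, the following is a minimal sketch under stated assumptions, not a definitive implementation. It performs greedy 1:1 nearest neighbor matching without replacement on the logit of the propensity score, and then estimates the effect as the mean of within-pair outcome differences. The caliper of 0.2 standard deviations of the logit is an assumption chosen here as a commonly cited default; the function names `greedy_caliper_match` and `att_from_pairs` are hypothetical, the propensity scores are assumed to be already estimated and strictly between 0 and 1, and treated subjects are processed in index order, so results can depend on that order.

```python
import numpy as np


def greedy_caliper_match(ps, treated, caliper_sd=0.2):
    """Greedy 1:1 nearest neighbor matching without replacement.

    Matches on the logit of the propensity score; a treated subject stays
    unmatched if its closest remaining control lies outside the caliper.
    Returns a list of (treated_index, control_index) pairs.
    """
    ps = np.asarray(ps, dtype=float)
    treated = np.asarray(treated, dtype=bool)
    logit = np.log(ps / (1.0 - ps))
    caliper = caliper_sd * logit.std(ddof=1)
    controls = [int(j) for j in np.flatnonzero(~treated)]
    pairs = []
    for i in np.flatnonzero(treated):
        if not controls:
            break  # no controls left to match
        dist = np.abs(logit[controls] - logit[i])
        k = int(np.argmin(dist))
        if dist[k] <= caliper:
            pairs.append((int(i), controls.pop(k)))
    return pairs


def att_from_pairs(y, pairs):
    """Mean within-pair outcome difference and its paired standard error.

    The paired SE is one way to account for the matched nature of the data.
    """
    y = np.asarray(y, dtype=float)
    d = np.array([y[i] - y[j] for i, j in pairs])
    return float(d.mean()), float(d.std(ddof=1) / np.sqrt(len(d)))


# Toy data (illustrative only).
ps = [0.5, 0.6, 0.5, 0.6, 0.9]
treated = [1, 1, 0, 0, 0]
y = [3.0, 5.0, 1.0, 2.0, 10.0]
pairs = greedy_caliper_match(ps, treated)
print(pairs, att_from_pairs(y, pairs))
```

If some treated subjects remain unmatched because of the caliper, the estimate applies only to the treated subjects for whom matches were found, which is the change in estimand noted in the text.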

Stratification Using the Propensity Score
Propensity score stratification, on the other hand, can be conceptualized as a meta-analysis of a set of quasi-randomized controlled trials. 82 Within the strata formed by the propensity score, measured covariates are assumed to be balanced between treatment groups; hence, the treatment effect can be estimated via direct comparison of outcomes between treated and untreated subjects. 41,82 The stratum-specific treatment effects can then be aggregated across subclasses to obtain an overall measure of treatment effect. 41 Stratification can estimate either the stratum-specific or the overall ATT or ATE, depending on how the subclass estimates are weighted. Weighting stratum-specific estimates by the number of treated subjects in each stratum yields the ATT, whereas weighting by the total number of subjects in each stratum yields the ATE. 8,83 Similarly, pooling stratum-specific variances provides an estimate of the variance of the pooled ATT or ATE estimate. Like propensity score matching, balanced stratification can be combined with model-based adjustment to provide improved estimates of stratum-specific treatment effects by reducing residual imbalances on covariates between treatment groups within strata. 41 Sometimes it may not be feasible to estimate stratum-specific treatment effects, particularly when using regression modeling, due to a very small number of observations within strata. In such cases, one can fit a joint regression model with strata and treatment indicators as independent variables. 8 The estimate from such an approach is more complicated to interpret when using a non-collapsible effect measure such as the odds ratio or hazard ratio, where the conditional (aggregated stratum-specific) and marginal effects differ in the presence of a non-null treatment effect even if there is no confounding. 3,25

Regression Adjustment Using the Propensity Score
In covariance adjustment using the propensity score, the outcome variable is regressed on the treatment variable and the estimated propensity score. Although it is easy to apply, it is considered the least optimal application of propensity scores for at least three reasons. First, treatment effect estimation is highly model-dependent because it mingles the study design and data analysis stages, hence requiring correct specification of the propensity score model. 16,17 Second, it makes additional assumptions unique to regression adjustment, namely, that the relationship between the outcome and the estimated propensity score is linear and that there is no interaction between treatment and propensity score. 1,67,82 Third, although it generally allows for estimation of the average treatment effect (ATE), the interpretation is complicated in the case of non-continuous outcomes where the estimate of interest is non-collapsible (e.g., odds ratio or hazard ratio). 3,23,24,86

Inverse Probability of Treatment Weighting Using Propensity Score
Inverse probability of treatment weighting, like propensity score matching and stratification, can be viewed as preprocessing the data using weights to create an artificial population (a pseudo-population) in which treatment becomes independent of measured covariates. As a consequence, one can estimate the treatment effect by direct comparison of outcomes between treated and untreated subjects in the weighted population. Alternatively, the weights can be used directly in the regression models. Although this method focuses on estimating the average effect in the population (ATE), modification of the weights allows one to estimate the ATT. 66 Most importantly, variance estimation should take into account the weighted nature of the artificial population, for example, by using the sample weights in robust variance estimation. 87 Alternatively, bootstrapping can be used to account for the weighted nature of the population. The downside of this approach is that when some subjects have probabilities of receiving the treatment close to 0 or 1, the weights for such subjects can become extreme and unstable. To address this problem, stabilized weights have been proposed to help normalize the range of the inverse probabilities and increase the efficiency of the analysis. 43,44,88 The application of IPTW in settings where there is time-varying treatment in the presence of time-dependent confounding has been discussed in chapters 3.1 and 3.2 66 and will be further summarized in section 9 (Extended applications, strengths, and limitations of PS methods).

Reporting of Propensity Score Analysis
Propensity score methods are invaluable tools in observational studies. However, as with regression analysis, the quality of the results obtained from propensity score analysis depends on state-of-the-art conduct of the consecutive steps involved. The reader relies on the information provided for a critical appraisal of the validity of the results. Despite substantial developments in and common applications of propensity score methods, reporting of aspects of the propensity score analysis is generally poor or inconsistent in the medical literature. 37,46,80,89,90 This could be due, in part, to a lack of standards for the conduct as well as the reporting of propensity score methods in guidelines on the reporting of observational studies, such as the STROBE statement. 91,92 Details on important aspects of propensity score analysis that should be reported are included in Figure 2 of chapter

Extended Applications, Strengths, and Limitations of Propensity Score Methods

Time-varying Treatment
PS methods are mainly used to balance baseline characteristics of treatment groups in point-treatment studies, i.e., where confounders and treatment are not time-varying. However, in the presence of non-compliance or treatment switching, an overly simple (yes/no) classification of treatment may result in biased estimation of the treatment effect. On the other hand, results from time-varying analysis of treatment with the standard propensity score approaches, like time-dependent regression methods, can still be biased in some settings. Such settings include the presence of time-varying covariates that are themselves affected by prior treatment and predict subsequent treatment. 43,44 Adjustment for such time-varying covariates that are affected by prior treatment may (1) underestimate the true treatment effect (by adjusting away part of the treatment effect) and (2) induce collider-stratification bias when there is a common cause of the time-varying covariates and the outcome. However, inverse probability of treatment weighting in marginal structural models (MSMs) allows unbiased estimation of the treatment effect. 43,44 An additional feature of this approach is that informative censoring can be controlled for using the probability of being uncensored up to a certain time point conditional on measured characteristics included in the model for the treatment. 43 As mentioned in section 7 (Estimation and interpretation of treatment effects), the IPTW approach is very sensitive to observations having probabilities close to 1 and 0 because they will have extreme weights and induce instability in the model. The use of stabilized weights with or without weight trimming has been shown to provide 95 percent confidence intervals that are not only narrower but also have actual coverage rates closer to 95 percent. 38,44 Applications of IPTW approaches using (stabilized) treatment and censoring weights and comparisons with conventional time-varying regression and propensity score methods have been addressed in chapters 3.1 and 3.2. Compared to alternatives such as g-estimation of structural nested models and the parametric g-computation algorithm, marginal structural models resemble standard models, so their application is more intuitive to applied researchers. 43 A detailed discussion of this topic is beyond the scope of this chapter, but can be found elsewhere. 43,44,93

Strengths of Propensity Score Methods
Propensity score methods primarily aim to balance treatment groups with respect to covariate distributions; when such balance is achieved, it is relatively easy to detect and communicate 16 using simple statistics or plots.
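As an example of how balance in a weighted pseudo-population can be communicated with a simple statistic, the following sketch forms the inverse probability of treatment weights discussed earlier in this section (for a point treatment only) and compares weighted covariate means between treatment groups. It is a sketch under assumptions, not the procedure used in chapters 3.1 and 3.2: the function names `iptw_weights` and `weighted_mean_diff` are hypothetical, the propensity scores are assumed to be already estimated and strictly between 0 and 1, stabilization uses the marginal probability of the treatment actually received as numerator, and percentile-based truncation (the `truncate_pct` argument) is just one of several possible ways to limit extreme weights.

```python
import numpy as np


def iptw_weights(ps, treated, estimand="ATE", stabilized=True, truncate_pct=None):
    """Inverse probability of treatment weights for a point treatment."""
    ps = np.asarray(ps, dtype=float)
    t = np.asarray(treated, dtype=float)
    if estimand == "ATE":
        # Treated: 1 / ps; untreated: 1 / (1 - ps).
        w = t / ps + (1.0 - t) / (1.0 - ps)
        if stabilized:
            # Numerator: marginal probability of the treatment received.
            p_t = t.mean()
            w = w * np.where(t == 1.0, p_t, 1.0 - p_t)
    elif estimand == "ATT":
        # Treated: 1; untreated: odds of treatment, ps / (1 - ps).
        w = t + (1.0 - t) * ps / (1.0 - ps)
    else:
        raise ValueError("estimand must be 'ATE' or 'ATT'")
    if truncate_pct is not None:
        # Assumption: truncate at symmetric percentiles, e.g. 1st and 99th.
        lo, hi = np.percentile(w, [truncate_pct, 100.0 - truncate_pct])
        w = np.clip(w, lo, hi)
    return w


def weighted_mean_diff(x, treated, w):
    """Difference in weighted means of x between treated and untreated."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(treated, dtype=bool)
    w = np.asarray(w, dtype=float)
    return float(np.average(x[t], weights=w[t]) - np.average(x[~t], weights=w[~t]))


# Toy data (illustrative only): treatment is more likely when x = 1.
x = np.array([1, 1, 1, 1, 0, 1, 0, 0, 0, 0])
treated = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
ps = np.where(x == 1, 0.8, 0.2)
w = iptw_weights(ps, treated)
print("unweighted difference:", weighted_mean_diff(x, treated, np.ones(len(x))))
print("weighted difference:", weighted_mean_diff(x, treated, w))
```

The same `weighted_mean_diff` function applied to an outcome instead of a covariate gives a simple weighted comparison of outcomes; a variance estimate would still need to account for the weighting, for example through robust variance estimation or bootstrapping as described in the text.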
Similarly, propensity score methods, unlike regression methods, can also warn investigators that, due to inadequate overlap in covariate distributions (i.e., poor common support) between treatment groups, a particular database cannot address the causal question without relying on untrustworthy model-dependent extrapolations. 16,17,64 The investigator might opt for restricting the inference to the group of subjects sufficiently represented in both treatment groups using methods such as nearest neighbor matching with a caliper, which results in excluding subjects in non-overlapping regions of the propensity scores. Although dropping observations is almost anathema to most trialists, it poses less of a problem in observational data. 11

Like randomized experiments, propensity score methods allow for designing a study separately from the data analysis part of the study (i.e., covariate balance can be achieved without any access to the outcome variable) so that the causal inference can be made with minimal model-dependence. 1,16,17,53 It is important to note that if nonparametric preprocessing of the data results in no reduction of model dependence, it is likely that the data contain little information to reliably support causal inferences by any method. This, obviously, would still be useful information, and the conclusion may be correct. 16,

Propensity score methods also provide an efficient way of controlling for covariates or potential confounding variables when the number of outcome events is limited compared to the number of covariates, thereby minimizing the curse of high dimensionality in the data that is typical of pharmacoepidemiologic studies. 1,41,53,64 Previous research has suggested that at least 10 events are required for every covariate included in a regression model. 94,95 Similarly, Cepeda et al. 96 suggested guidelines on when to use propensity score methods to improve the efficiency of treatment effect estimation (i.e., when there are fewer than eight outcome events per included covariate).

Limitations of Propensity Score Methods
Despite the increasing and widespread application of propensity score methods, when addressing causal questions from observational studies, it is crucial to keep in mind that propensity score methods, like regression methods, can only control for measured confounding variables and not unmeasured ones. 1,41 As a result, a propensity score analysis can only be as good as the quality and completeness of the potential confounding variables that are at the disposal of the researcher. The ignorable treatment assignment assumption cannot be tested directly, and only a rich set of covariates can convince the readership that no important confounding variables were missed. Therefore, it is important that investigators provide a detailed account of the variables collected and included in the propensity score model. Furthermore, sensitivity analyses 1,11,97,98 are invaluable tools to assess the plausibility of the ignorability assumption and how violations of it affect the conclusions. 8 An additional limitation of propensity score methods is that they work better in large samples 64 because the distributional balance achieved on measured covariates, as in randomized experiments, is an expected balance. As a result, in smaller studies, some imbalance of covariates is inevitable even when the propensity score model is correctly specified and whichever propensity score method is used. As a consequence, investigators attempting to answer causal questions using observational studies should strive for large datasets without compromising the quality of the data.

Future Prospects and Conclusions
As we introduced in chapter 1, non-randomized studies are useful mainly for drug safety research involving unintended effects, particularly when the adverse event is rare, unexpected, and unpredictable, which means that it is usually not associated with the (contra-)indications for treatment. However, when adverse events are related to the main pharmacological effect of the treatment and are more or less predictable, e.g., type A adverse events, prescribing will be guided by the patient's prognosis, taking into account the potential for adverse effects of the treatment. 99,101,102 This results in systematic differences in prognostic factors between treated and untreated groups, a situation similar to non-randomized studies on intended effects of therapies. 99,101 The availability of large electronic health record databases has increased the focus on evaluating type A adverse events and the comparative effectiveness of medications using real-life data, increasing the potential for confounding 103 by (contra-)indication and the need for statistical methods such as propensity scores to control for confounding.

This chapter described the concept of propensity scores and provided some guidance on how propensity scores can be used in designing and analyzing non-experimental studies quantifying the effect of treatments in a way that mimics randomized experiments. In addition, the other chapters of this thesis addressed important aspects of propensity score methods using simulations, reviews, and empirical studies. We believe that the in-depth and step-by-step description of the propensity score methodology presented in this chapter will help applied researchers conduct propensity score analysis to appropriately answer their causal research questions and communicate the findings to their readers. Based on the experience gained from the thesis, several avenues for future research can be formulated:

We discussed in section 3 (Covariate selection for Propensity Score model) that the goals of variable selection for outcome regression, prediction, and propensity score models differ, achieving optimum balance being the ultimate goal in the latter. Development of variable selection methods that specifically take into account balance of covariates and precision of effect estimates is an important area of future research.

In section 4 (Propensity Score estimation), we summarized alternative approaches to logistic regression for estimation of the propensity scores. Previous studies comparing the different approaches, such as regression and classification trees, neural networks, and logistic regression, have shown that the former two methods are more reliable for propensity score estimation, particularly in case of mild non-linearity and non-additivity; however, comparisons were made with logistic regression models without interaction and square terms, while the other models incorporate such terms in a natural way. Hence, further assessment of the usefulness of logistic regression models in which interaction and higher-order terms are included is required in a broad range of realistic scenarios.

In chapter 2.2, we have shown that the absolute standardized difference is a robust balance measure in terms of bias reduction (also discussed in section 6, Assessment of balance achieved by PS model). However, it is a univariate measure, and pooling covariate-specific standardized differences into one summary measure requires that the strength of the covariates' association with the outcome is taken into account, since balance of strong predictors of both treatment and outcome is more important than that of weaker ones. Although a weighted standardized difference has been proposed 71 and implemented, 71,76 the definition of the weights is not straightforward. Hence, the sensitivity of covariate balance to different weighting approaches should be investigated further in future studies.

In our discussion (sections 5, PS methods to control for confounding, and 7, Estimation and interpretation of treatment effects) we highlighted the different algorithms in propensity score matching and their implications for bias, variance, and interpretation of the estimated treatment effect. In addition, we pointed out that variance estimation is one of the most debated topics in matching. A comparison of several empirical formulas 104 for variance estimation as well as bootstrapping procedures in accounting for uncertainty in the matching and lack of independence between matched subjects needs to be conducted.

Propensity score methods only account for measured confounding and rely on the ignorable treatment assignment assumption given the measured covariates. Since this ignorability assumption cannot be empirically tested using the data, sensitivity analysis to assess the plausibility of this assumption or the impact of hypothetical unmeasured confounding on treatment effect estimates should accompany propensity score analysis. Further simulation as well as empirical studies focusing on the relative performance of such methods will improve their adequate use by researchers.

As an extension of propensity score weighting to the setting of time-varying treatment and time-varying confounding, we demonstrated marginal structural models whose parameters are estimated using inverse probability of treatment weights. In addition, we compared the results with those of conventional time-varying Cox and propensity score methods. Although several studies have demonstrated their robust performance in controlling time-dependent confounding without introducing bias, treatment effect estimates are highly sensitive to the specification of the treatment model as well as to observations with extreme weights. Future research should focus on improving diagnostics for correct model specification as well as methods to address instability of weights, such as weight trimming or truncation.

In conclusion, propensity score methods are invaluable tools for estimating treatment effects from observational data in the most transparent way. They should be regarded neither as a panacea for the deficiencies of observational studies nor as a replacement for model-based adjustments, but as critical tools contributing to their initial designs, 16 and they should be used in combination with model-based adjustment methods to minimize model-dependence. Utilizing the full advantage of the methods requires, in addition to the initial study design, the full specification of all statistical analyses to be performed. In addition, adequate reporting of aspects of the propensity score analysis is as crucial as the analysis itself, since readers depend on the information reported to judge the quality of the analysis and the validity of the results, as do other investigators who would need to replicate the study. Furthermore, critical items relevant to propensity score analysis should be incorporated in guidelines on the conduct and reporting of observational studies, such as the STROBE 92 statement and the ENCePP 91 guide on methodological standards in pharmacoepidemiology, to improve the quality of the conduct and reporting of propensity score-based studies.

REFERENCES

1. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika 1983; 70:
2. Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol 1974; 66:
3. Greenland S, Robins JM, Pearl J. Confounding and collapsibility in causal inference. Statist Sci 1999; 14:
4. Dawid AP. Causal inference without counterfactuals. JASA 2000; 95:
5. Pearl J. Causal diagrams for empirical research. Biometrika 1995; 82:
6. Holland PW. Statistics and causal inference. JASA 1986; 81:
7. Rubin DB. Inference and missing data. Biometrika 1976; 63:
8. Stuart EA. Matching methods for causal inference: a review and a look forward. Stat Sci 2010; 25:
9. Cox DR. Planning of experiments. London: Chapman & Hall;
10. Rubin DB. Comment. JASA 1980; 75:
11. Rosenbaum PR. Observational studies. In: Encyclopedia of Statistics in Behavioral Science. Wiley Online Library,
12. Greenland S, Robins JM. Identifiability, exchangeability, and epidemiological confounding. Int J Epidemiol 1986; 15:
13. Joffe MM, Rosenbaum PR. Invited commentary: propensity scores. Am J Epidemiol 1999; 150:
14. D'Agostino RB Jr. Propensity score methods for bias reduction in the comparison of a treatment to a non-randomized control group. Stat Med 1998; 17:
15. D'Agostino RB Jr. Propensity scores in cardiovascular research. Circulation 2007; 115:
16. Rubin DB. On principles for modeling propensity scores in medical research. Pharmacoepidemiol Drug Saf 2004; 13:
17. Rubin DB. The design versus the analysis of observational studies for causal effects: parallels with the design of randomized trials. Stat Med 2007; 26:
18. Brookhart MA, Schneeweiss S, Rothman KJ, et al. Variable selection for propensity score models. Am J Epidemiol 2006; 163:
19. Patrick AR, Schneeweiss S, Brookhart MA, et al. The implications of propensity score variable selection strategies in pharmacoepidemiology: an empirical illustration. Pharmacoepidemiol Drug Saf 2011; 20:
20. Myers JA, Rassen JA, Gagne JJ, et al. Effects of adjusting for instrumental variables on bias and precision of effect estimates. Am J Epidemiol 2011; 174:
21. Pearl J. Invited commentary: understanding bias amplification. Am J Epidemiol 2011; 174:
22. Pearl J. On a class of bias-amplifying variables that endanger effect estimates. In: Grünwald P, Spirtes P, Eds. Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence (UAI 2010). Corvallis, OR: Association for Uncertainty in Artificial Intelligence; 201:
23. Ali MS, Groenwold RH, Pestman WR, et al. Propensity score balance measures in pharmacoepidemiology: a simulation study. Pharmacoepidemiol Drug Saf DOI: / pds
24. Greenland S, Pearl J. Adjustments and their consequences: collapsibility analysis using graphical models. International Statistical Review 2011; 79:

25. Greenland S. Quantifying biases in causal models: classical confounding vs collider-stratification bias. Epidemiology 2003; 14:
26. Rubin DB, Thomas N. Matching using estimated propensity scores: relating theory to practice. Biometrics 1996; 52:
27. Rubin DB. Causal inference through potential outcomes and principal stratification: application to studies with censoring due to death. Statist Sci 2006; 21:
28. Bhattacharya J, Vogt WB. Do instrumental variables belong in propensity scores? (NBER Technical Working Paper no. 343). Cambridge, MA: National Bureau of Economic Research;
29. Wooldridge J. Should instrumental variables be used as matching variables? Unpublished manuscript. East Lansing, MI: Michigan State University
30. Myers JA, Rassen JA, Gagne JJ, et al. respond to "Understanding bias amplification". Am J Epidemiol 2011; 174:
31. Ali MS, Groenwold RH, Belitser SV, et al. Selection of covariates and caliper using balance measures in propensity score matching. Submitted
32. Brooks JM, Ohsfeldt RL. Squeezing the balloon: propensity scores and unmeasured covariate balance. Health Serv Res 2013; 48:
33. Angrist JD, Imbens GW, Rubin DB. Identification of causal effects using instrumental variables. JASA 1996; 91:
34. Hernán MA, Robins JM. Instruments for causal inference: an epidemiologist's dream? Epidemiology 2006; 17:
35. Martens EP, Pestman WR, de Boer A, et al. Instrumental variables: application and limitations. Epidemiology 2006; 17:
36. Uddin M, Groenwold RHH, de Boer A, et al. Performance of instrumental variable methods in cohort and nested case control studies: a simulation study. Pharmacoepidemiol Drug Saf 2014; 23:
37. Ali MS, Groenwold RHH, Belitser SV, et al. Covariate selection and assessment of balance in propensity score analysis in the medical literature: a systematic review. Accepted. J Clin Epidemiol
38. Lee BK, Lessler J, Stuart EA. Weight trimming and propensity score weighting. PLoS One 2011; 6: e
39. Setoguchi S, Schneeweiss S, Brookhart MA, et al. Evaluating uses of data mining techniques in propensity score estimation: a simulation study. Pharmacoepidemiol Drug Saf 2008; 17:
40. Westreich D, Lessler J, Funk MJ. Propensity score estimation: neural networks, support vector machines, decision trees (CART), and meta-classifiers as alternatives to logistic regression. J Clin Epidemiol 2010; 63:
41. Rosenbaum PR, Rubin DB. Reducing bias in observational studies using subclassification on the propensity score. JASA 1984; 79:
42. Czajka JL, Hirabayashi SM, Little RJA, Rubin DB. Projecting from advance data using propensity modeling: an application to income and tax statistics. J Business & Economic Statistics :
43. Hernán MÁ, Brumback B, Robins JM. Marginal structural models to estimate the causal effect of zidovudine on the survival of HIV-positive men. Epidemiology 2000; 11:
44. Robins JM, Hernán MÁ, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology 2000; 11:
45. Stuart EA. Developing practical recommendations for the use of propensity scores: Discussion of "A critical appraisal of propensity score matching in the medical literature between 1996 and 2003" by Peter Austin, Statistics in Medicine. Stat Med 2008; 27:

190 5.1 Application of Propensity Score Methods to Quantify Treatment Effects Austin PC. A critical appraisal of propensity-score matching in the medical literature between 1996 and Stat Med 2008; 27: Smith HL. Matching with multiple controls to estimate treatment effects in observational studies. Sociol Methodol 1997; 27: Ming K, Rosenbaum PR. Substantial gains in bias reduction from matching with a variable number of controls. Biometrics 2000; 56: Ming K, Rosenbaum PR. A note on optimal matching with variable controls using the assignment algorithm. J Comp Graph Stat 2001; 10: Hansen BB. Full matching in an observational study of coaching for the SAT. JASA 2004; 99: Stuart EA, Green KM. Using full matching to estimate causal effects in nonexperimental studies: examining the relationship between adolescent marijuana use and adult outcomes. Dev Psychol 2008; 44: Austin PC. Optimal caliper widths for propensity-score matching when estimating differences in means and differences in proportions in observational studies. Pharm Stat 2011; 10: Ho DE, Imai K, King G, Stuart EA. Matching as nonparametric preprocessing for reducing model dependence in parametric causal inference. Political Analysis 2007; 15: Rosenbaum PR, Rubin DB. Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. Am Stat 1985; 39: Lunt M. Selecting an appropriate caliper can be essential for achieving good balance with propensity score matching. Am J Epidemiol 2014; 179: Austin PC. A comparison of 12 algorithms for matching on the propensity score. Stat. Med 2014; 33: Gu XS, Rosenbaum PR. Comparison of multivariate matching methods: Structures, distances, and algorithms. J Comp Graph Stat 1993; 2: Cochran WG & Rubin DB. Controlling bias in observational studies: a review. The Indian Journal of Statistics, Series A 1973; 35: Austin PC. 
Goodness-of-fit diagnostics for the propensity score model when estimating treatment effects using covariate adjustment with the propensity score. Pharmacoepidemiol Drug Saf 2008; 17: Dehejia R, Wahba S. Propensity score-matching methods for nonexperimental causal studies. Rev Econ Stat 2002; 84: Hill J, Reiter JP. Interval estimation for treatment effects using propensity score matching. Stat Med 2006; 25: Cochran WG. The effectiveness of adjustment by subclassification in removing bias in observational studies. Biometrics 1968; 24: Hullsiek KH & Louis TA. Propensity score modeling strategies for the causal analysis of observational data. Biostatistics 2002; 3: Rubin DB. Estimating causal effects from large data sets using propensity scores. Ann Intern Med 1997; 127: Imbens GW. The role of the propensity score in estimating dose-response functions. Biometrika 2000; 87: Morgan SL & Todd JJ. A diagnostic routine for the detection of consequential heterogeneity of causal effects. Sociol Methodol 2008; 38: Rubin DB. Using propensity scores to help design observational studies: application to the tobacco litigation. Health Serv Outcomes Res 2001; 2:

191 68. Horvitz DG & Thompson DJ. A generalization of sampling without replacement from a finite universe. JASA 1952; 47: Westreich D, Cole SR, Funk MJ, Brookhart MA & Stürmer T. The role of the c-statistic in variable selection for propensity score models. Pharmacoepidemiol Drug Saf 2011; 20: Imai K, King G & Stuart EA. Misunderstandings between experimentalists and observationalists about causal inference. J R Statist Soc A 2008; 171: Belitser SV, Martens EP, Pestman WR, et al. Measuring balance and model selection in propensity score methods. Pharmacoepidemiol Drug Saf 2011; 20: Austin PC. Balance diagnostics for comparing the distribution of baseline covariates between treatment groups in propensity-score matched samples. Stat Med 2009; 28: Austin PC. Assessing balance in measured baseline covariates when using many-to-one matching on the propensity-score. Pharmacoepidemiol Drug Saf 2008; 17: Weitzen S, Lapane KL, Toledano AY, Hume AL, Mor V. Weaknesses of goodness-of-fit tests for evaluating propensity score models: the case of the omitted confounder. Pharmacoepidemiol Drug Saf 2005; 14: Austin PC, Grootendorst P, Anderson GM. A comparison of the ability of different propensity score models to balance measured variables between treated and untreated subjects: a Monte Carlo study. Stat Med 2007; 26: Franklin JM, Rassen JA, Ackermann D, et al. Metrics for covariate balance in cohort studies of causal effects. Stat Med 2014; 33: Iacus SM, King G, Porro G. cem: Software for coarsened exact matching King G, Nielsen R, Coberley C, Pope JE & Wells A. Comparative effectiveness of matching methods for causal inference. Unpublished manuscript 15 (2011). 79. Austin PC. Some methods of propensity score matching had superior performance to others: results of an empirical investigation and Monte Carlo simulations. Biom J 2009; 51: Austin PC. 
Propensity-score matching in the cardiovascular surgery literature from 2004 to 2006: a systematic review and suggestions for improvement. J Thorac Cardiovasc Surg 2007; 134: Austin PC. Statistical criteria for selecting the optimal number of untreated subjects matched to each treated subject when using many-to-one matching on the propensity score. Am J Epidemiol 2010; 172: Austin PC. An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behav Res 2011; 46: Imbens GW. Nonparametric estimation of average treatment effects under exogeneity: A review. Rev Econ Stat 2004; 86: Hill J. Discussion of research using propensity-score matching: Comments on A critical appraisal of propensity-score matching in the medical literature between 1996 and 2003 by Peter Austin, Statistics in Medicine. Stat Med 2008; 27: Robins JM, Rotnitzky A. comment on the Bickel an Kwon article, Inference for semiparametric models: Some questions and an answer-comments Statistica Sinica, 2001,11: Statistica Sinica 2001; 11: Ali MS, Groenwold RHH, Klungel OH. Propensity Score Methods and Unobserved Covariate Imbalance: Comments on Squeezing the Balloon. Health Serv Res 2014; 49: Joffe MM, Ten Have TR, Feldman HI & Kimmel SE. Model selection, confounder control, and marginal structural models. Am Stat 2004; 58: Kurth T, Walker AM, Glynn RJ, et al. Results of multivariable logistic regression, propensity matching, propensity adjustment, and propensity-based weighting under conditions of nonuniform effect. Am J Epidemiol 2006; 163: Application of Propensity Score Methods to Quantify Treatment Effects 191

192 5.1 Application of Propensity Score Methods to Quantify Treatment Effects 89. Weitzen S, Lapane KL, Toledano AY, Hume AL & Mor V. Principles for modeling propensity scores in medical research: a systematic literature review. Pharmacoepidemiol Drug Saf 2004; 13: D Ascenzo F, Cavallero E, Biondi-Zoccai G, et al. Use and misuse of multivariable approaches in interventional cardiology studies on drug-eluting stents: s systematic review. J Interv Cardiol 2012; 25: Jadrijević-Mladar Takač M. ENCePP Guide on Methodological Standards in Pharmacoepidemiology. 3rd PharmSciFair, Pharmaceutical Sciences for the Future of Medicines (2011). 92. von Elm E, Altman DG, Egger M, et al. The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. Prev Med 2007; 45: Cole SR & Hernán MA. Constructing inverse probability weights for marginal structural models. Am J Epidemiol 2008; 168: Peduzzi P, Concato J, Feinstein AR & Holford TR. Importance of events per independent variable in proportional hazards regression analysis II. Accuracy and precision of regression estimates. J Clin Epidemiol 1995; 48: Peduzzi P, Concato J, Kemper E, Holford TR & Feinstein AR. A simulation study of the number of events per variable in logistic regression analysis. J Clin Epidemiol 1996; 49: Cepeda MS, Boston R, Farrar JT & Strom BL. Comparison of logistic regression versus propensity score when the number of events is low and there are multiple confounders. Am J Epidemiol 2003; 158: Rosenbaum PR. Model-based direct adjustment. JASA 1987; 82: Rosenbaum PR. Observational studies. Philadelphia, PA, USA: Springer; Miettinen OS. The need for randomization in the study of intended effects. Stat Med 1983; 2: Stricker BH & Psaty BM. Detection, verification, and quantification of adverse drug reactions. BMJ 2004; 329: Vandenbroucke JP. When are observational studies as credible as randomised trials? 
The Lancet 2004; 363: Vandenbroucke JP. Observational research, randomised trials, and two views of medical science. PLoS Medicine 2008; 5: e Abbing-Karahagopian V, Kurz X, de Vries F, et al. Bridging differences in outcomes of pharmacoepidemiological studies: Design and first results of the PROTECT project. Curr Clin Pharmacol 2014; 9: Schafer JL & Kang J. Average causal effects from nonrandomized studies: a practical guide and simulated example. Psychol Methods 2008; 13:

APPENDICES

Improved control for confounding using propensity scores and instrumental variables?

Improved control for confounding using propensity scores and instrumental variables? Improved control for confounding using propensity scores and instrumental variables? Dr. Olaf H.Klungel Dept. of Pharmacoepidemiology & Clinical Pharmacology, Utrecht Institute of Pharmaceutical Sciences

More information

Propensity Score Methods for Causal Inference with the PSMATCH Procedure

Propensity Score Methods for Causal Inference with the PSMATCH Procedure Paper SAS332-2017 Propensity Score Methods for Causal Inference with the PSMATCH Procedure Yang Yuan, Yiu-Fai Yung, and Maura Stokes, SAS Institute Inc. Abstract In a randomized study, subjects are randomly

More information

Propensity score methods to adjust for confounding in assessing treatment effects: bias and precision

Propensity score methods to adjust for confounding in assessing treatment effects: bias and precision ISPUB.COM The Internet Journal of Epidemiology Volume 7 Number 2 Propensity score methods to adjust for confounding in assessing treatment effects: bias and precision Z Wang Abstract There is an increasing

More information

Application of Propensity Score Models in Observational Studies

Application of Propensity Score Models in Observational Studies Paper 2522-2018 Application of Propensity Score Models in Observational Studies Nikki Carroll, Kaiser Permanente Colorado ABSTRACT Treatment effects from observational studies may be biased as patients

More information

BIOSTATISTICAL METHODS

BIOSTATISTICAL METHODS BIOSTATISTICAL METHODS FOR TRANSLATIONAL & CLINICAL RESEARCH PROPENSITY SCORE Confounding Definition: A situation in which the effect or association between an exposure (a predictor or risk factor) and

More information

Evaluating health management programmes over time: application of propensity score-based weighting to longitudinal datajep_

Evaluating health management programmes over time: application of propensity score-based weighting to longitudinal datajep_ Journal of Evaluation in Clinical Practice ISSN 1356-1294 Evaluating health management programmes over time: application of propensity score-based weighting to longitudinal datajep_1361 180..185 Ariel

More information

An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies

An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies Multivariate Behavioral Research, 46:399 424, 2011 Copyright Taylor & Francis Group, LLC ISSN: 0027-3171 print/1532-7906 online DOI: 10.1080/00273171.2011.568786 An Introduction to Propensity Score Methods

More information

PubH 7405: REGRESSION ANALYSIS. Propensity Score

PubH 7405: REGRESSION ANALYSIS. Propensity Score PubH 7405: REGRESSION ANALYSIS Propensity Score INTRODUCTION: There is a growing interest in using observational (or nonrandomized) studies to estimate the effects of treatments on outcomes. In observational

More information

The role of media entertainment in children s and adolescents ADHD-related behaviors: A reason for concern? Nikkelen, S.W.C.

The role of media entertainment in children s and adolescents ADHD-related behaviors: A reason for concern? Nikkelen, S.W.C. UvA-DARE (Digital Academic Repository) The role of media entertainment in children s and adolescents ADHD-related behaviors: A reason for concern? Nikkelen, S.W.C. Link to publication Citation for published

More information

Propensity scores: what, why and why not?

Propensity scores: what, why and why not? Propensity scores: what, why and why not? Rhian Daniel, Cardiff University @statnav Joint workshop S3RI & Wessex Institute University of Southampton, 22nd March 2018 Rhian Daniel @statnav/propensity scores:

More information

Instrumental variable analysis in randomized trials with noncompliance. for observational pharmacoepidemiologic

Instrumental variable analysis in randomized trials with noncompliance. for observational pharmacoepidemiologic Page 1 of 7 Statistical/Methodological Debate Instrumental variable analysis in randomized trials with noncompliance and observational pharmacoepidemiologic studies RHH Groenwold 1,2*, MJ Uddin 1, KCB

More information

Lecture Outline. Biost 590: Statistical Consulting. Stages of Scientific Studies. Scientific Method

Lecture Outline. Biost 590: Statistical Consulting. Stages of Scientific Studies. Scientific Method Biost 590: Statistical Consulting Statistical Classification of Scientific Studies; Approach to Consulting Lecture Outline Statistical Classification of Scientific Studies Statistical Tasks Approach to

More information

Introduction to Observational Studies. Jane Pinelis

Introduction to Observational Studies. Jane Pinelis Introduction to Observational Studies Jane Pinelis 22 March 2018 Outline Motivating example Observational studies vs. randomized experiments Observational studies: basics Some adjustment strategies Matching

More information

Tobacco control policies and socio-economic inequalities in smoking cessation Bosdriesz, J.R.

Tobacco control policies and socio-economic inequalities in smoking cessation Bosdriesz, J.R. UvA-DARE (Digital Academic Repository) Tobacco control policies and socio-economic inequalities in smoking cessation Bosdriesz, J.R. Link to publication Citation for published version (APA): Bosdriesz,

More information

University of Groningen. Physical activity and cognition in children van der Niet, Anneke Gerarda

University of Groningen. Physical activity and cognition in children van der Niet, Anneke Gerarda University of Groningen Physical activity and cognition in children van der Niet, Anneke Gerarda IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite

More information

Propensity Score Methods to Adjust for Bias in Observational Data SAS HEALTH USERS GROUP APRIL 6, 2018

Propensity Score Methods to Adjust for Bias in Observational Data SAS HEALTH USERS GROUP APRIL 6, 2018 Propensity Score Methods to Adjust for Bias in Observational Data SAS HEALTH USERS GROUP APRIL 6, 2018 Institute Institute for Clinical for Clinical Evaluative Evaluative Sciences Sciences Overview 1.

More information

Chapter 17 Sensitivity Analysis and Model Validation

Chapter 17 Sensitivity Analysis and Model Validation Chapter 17 Sensitivity Analysis and Model Validation Justin D. Salciccioli, Yves Crutain, Matthieu Komorowski and Dominic C. Marshall Learning Objectives Appreciate that all models possess inherent limitations

More information

Peter C. Austin Institute for Clinical Evaluative Sciences and University of Toronto

Peter C. Austin Institute for Clinical Evaluative Sciences and University of Toronto Multivariate Behavioral Research, 46:119 151, 2011 Copyright Taylor & Francis Group, LLC ISSN: 0027-3171 print/1532-7906 online DOI: 10.1080/00273171.2011.540480 A Tutorial and Case Study in Propensity

More information

pharmacoepidemiology and drug safety 2016; 25(Suppl. 1): ORIGINAL REPORT

pharmacoepidemiology and drug safety 2016; 25(Suppl. 1): ORIGINAL REPORT pharmacoepidemiology and drug safety 2016; 25(Suppl. 1): 114 121 Published online in Wiley Online Library (wileyonlinelibrary.com).3864 ORIGINAL REPORT Methodological comparison of marginal structural

More information

Methods to control for confounding - Introduction & Overview - Nicolle M Gatto 18 February 2015

Methods to control for confounding - Introduction & Overview - Nicolle M Gatto 18 February 2015 Methods to control for confounding - Introduction & Overview - Nicolle M Gatto 18 February 2015 Learning Objectives At the end of this confounding control overview, you will be able to: Understand how

More information

University of Groningen. Functional outcome after a spinal fracture Post, Richard Bernardus

University of Groningen. Functional outcome after a spinal fracture Post, Richard Bernardus University of Groningen Functional outcome after a spinal fracture Post, Richard Bernardus IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from

More information

Performance of prior event rate ratio adjustment method in pharmacoepidemiology: a simulation study

Performance of prior event rate ratio adjustment method in pharmacoepidemiology: a simulation study pharmacoepidemiology and drug safety (2014) Published online in Wiley Online Library (wileyonlinelibrary.com).3724 ORIGINAL REPORT Performance of prior event rate ratio adjustment method in pharmacoepidemiology:

More information

Using machine learning to assess covariate balance in matching studies

Using machine learning to assess covariate balance in matching studies bs_bs_banner Journal of Evaluation in Clinical Practice ISSN1365-2753 Using machine learning to assess covariate balance in matching studies Ariel Linden, DrPH 1,2 and Paul R. Yarnold, PhD 3 1 President,

More information

BIOSTATISTICAL METHODS AND RESEARCH DESIGNS. Xihong Lin Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA

BIOSTATISTICAL METHODS AND RESEARCH DESIGNS. Xihong Lin Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA BIOSTATISTICAL METHODS AND RESEARCH DESIGNS Xihong Lin Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA Keywords: Case-control study, Cohort study, Cross-Sectional Study, Generalized

More information

University of Groningen. ADHD & Addiction van Emmerik-van Oortmerssen, Katelijne

University of Groningen. ADHD & Addiction van Emmerik-van Oortmerssen, Katelijne University of Groningen ADHD & Addiction van Emmerik-van Oortmerssen, Katelijne IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please

More information

Chapter 10. Considerations for Statistical Analysis

Chapter 10. Considerations for Statistical Analysis Chapter 10. Considerations for Statistical Analysis Patrick G. Arbogast, Ph.D. (deceased) Kaiser Permanente Northwest, Portland, OR Tyler J. VanderWeele, Ph.D. Harvard School of Public Health, Boston,

More information

Epidemiologic Methods I & II Epidem 201AB Winter & Spring 2002

Epidemiologic Methods I & II Epidem 201AB Winter & Spring 2002 DETAILED COURSE OUTLINE Epidemiologic Methods I & II Epidem 201AB Winter & Spring 2002 Hal Morgenstern, Ph.D. Department of Epidemiology UCLA School of Public Health Page 1 I. THE NATURE OF EPIDEMIOLOGIC

More information

Studies on inflammatory bowel disease and functional gastrointestinal disorders in children and adults Hoekman, D.R.

Studies on inflammatory bowel disease and functional gastrointestinal disorders in children and adults Hoekman, D.R. UvA-DARE (Digital Academic Repository) Studies on inflammatory bowel disease and functional gastrointestinal disorders in children and adults Hoekman, D.R. Link to publication Citation for published version

More information

Supplement 2. Use of Directed Acyclic Graphs (DAGs)

Supplement 2. Use of Directed Acyclic Graphs (DAGs) Supplement 2. Use of Directed Acyclic Graphs (DAGs) Abstract This supplement describes how counterfactual theory is used to define causal effects and the conditions in which observed data can be used to

More information

Using Propensity Score Matching in Clinical Investigations: A Discussion and Illustration

Using Propensity Score Matching in Clinical Investigations: A Discussion and Illustration 208 International Journal of Statistics in Medical Research, 2015, 4, 208-216 Using Propensity Score Matching in Clinical Investigations: A Discussion and Illustration Carrie Hosman 1,* and Hitinder S.

More information

Methods to adjust for confounding

Methods to adjust for confounding Methods to adjust for confounding Propensity scores and instrumental variables Edwin Martens Druk: Universal Press, Veenendaal Omslag: gebaseerd op een ets van Sjoerd Bakker (www.sjoerd-bakker.nl) Bijschrift

More information

The role of the general practitioner in the care for patients with colorectal cancer Brandenbarg, Daan

The role of the general practitioner in the care for patients with colorectal cancer Brandenbarg, Daan University of Groningen The role of the general practitioner in the care for patients with colorectal cancer Brandenbarg, Daan IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's

More information

University of Groningen. Cost and outcome of liver transplantation van der Hilst, Christian

University of Groningen. Cost and outcome of liver transplantation van der Hilst, Christian University of Groningen Cost and outcome of liver transplantation van der Hilst, Christian IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from

More information

University of Groningen. Alcohol septal ablation Liebregts, Max

University of Groningen. Alcohol septal ablation Liebregts, Max University of Groningen Alcohol septal ablation Liebregts, Max IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please check the document

More information

Using Ensemble-Based Methods for Directly Estimating Causal Effects: An Investigation of Tree-Based G-Computation

Using Ensemble-Based Methods for Directly Estimating Causal Effects: An Investigation of Tree-Based G-Computation Institute for Clinical Evaluative Sciences From the SelectedWorks of Peter Austin 2012 Using Ensemble-Based Methods for Directly Estimating Causal Effects: An Investigation of Tree-Based G-Computation

More information

Propensity Score Methods for Estimating Causality in the Absence of Random Assignment: Applications for Child Care Policy Research

Propensity Score Methods for Estimating Causality in the Absence of Random Assignment: Applications for Child Care Policy Research 2012 CCPRC Meeting Methodology Presession Workshop October 23, 2012, 2:00-5:00 p.m. Propensity Score Methods for Estimating Causality in the Absence of Random Assignment: Applications for Child Care Policy

More information

CHAPTER VI RESEARCH METHODOLOGY

CHAPTER VI RESEARCH METHODOLOGY CHAPTER VI RESEARCH METHODOLOGY 6.1 Research Design Research is an organized, systematic, data based, critical, objective, scientific inquiry or investigation into a specific problem, undertaken with the

More information

Confounding by indication developments in matching, and instrumental variable methods. Richard Grieve London School of Hygiene and Tropical Medicine

Confounding by indication developments in matching, and instrumental variable methods. Richard Grieve London School of Hygiene and Tropical Medicine Confounding by indication developments in matching, and instrumental variable methods Richard Grieve London School of Hygiene and Tropical Medicine 1 Outline 1. Causal inference and confounding 2. Genetic

More information

Citation for published version (APA): Zeddies, S. (2015). Novel regulators of megakaryopoiesis: The road less traveled by

Citation for published version (APA): Zeddies, S. (2015). Novel regulators of megakaryopoiesis: The road less traveled by UvA-DARE (Digital Academic Repository) Novel regulators of megakaryopoiesis: The road less traveled by Zeddies, S. Link to publication Citation for published version (APA): Zeddies, S. (2015). Novel regulators

More information

Biostatistics II

Biostatistics II Biostatistics II 514-5509 Course Description: Modern multivariable statistical analysis based on the concept of generalized linear models. Includes linear, logistic, and Poisson regression, survival analysis,

More information

University of Groningen. A geriatric perspective on chronic kidney disease Bos, Harmke Anthonia

University of Groningen. A geriatric perspective on chronic kidney disease Bos, Harmke Anthonia University of Groningen A geriatric perspective on chronic kidney disease Bos, Harmke Anthonia IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from

More information

Cover Page. The handle holds various files of this Leiden University dissertation.

Cover Page. The handle   holds various files of this Leiden University dissertation. Cover Page The handle http://hdl.handle.net/1887/38506 holds various files of this Leiden University dissertation. Author: Nies, Jessica Annemarie Bernadette van Title: Early identification of rheumatoid

More information

REVIEW ARTICLE. A Review of Inferential Statistical Methods Commonly Used in Medicine

REVIEW ARTICLE. A Review of Inferential Statistical Methods Commonly Used in Medicine A Review of Inferential Statistical Methods Commonly Used in Medicine JCD REVIEW ARTICLE A Review of Inferential Statistical Methods Commonly Used in Medicine Kingshuk Bhattacharjee a a Assistant Manager,

More information

UvA-DARE (Digital Academic Repository) Anorectal malformations and hirschsprung disease Witvliet, M.J. Link to publication

UvA-DARE (Digital Academic Repository) Anorectal malformations and hirschsprung disease Witvliet, M.J. Link to publication UvA-DARE (Digital Academic Repository) Anorectal malformations and hirschsprung disease Witvliet, M.J. Link to publication Citation for published version (APA): Witvliet, M. J. (2017). Anorectal malformations

More information

Citation for published version (APA): Casteleijn, N. (2017). ADPKD: Beyond Growth and Decline [Groningen]: Rijksuniversiteit Groningen

Citation for published version (APA): Casteleijn, N. (2017). ADPKD: Beyond Growth and Decline [Groningen]: Rijksuniversiteit Groningen University of Groningen ADPKD Casteleijn, Niek IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please check the document version below.

More information

University of Groningen. Pulmonary arterial hypertension van Albada, Mirjam Ellen

University of Groningen. Pulmonary arterial hypertension van Albada, Mirjam Ellen University of Groningen Pulmonary arterial hypertension van Albada, Mirjam Ellen IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please

More information

THE USE OF NONPARAMETRIC PROPENSITY SCORE ESTIMATION WITH DATA OBTAINED USING A COMPLEX SAMPLING DESIGN

THE USE OF NONPARAMETRIC PROPENSITY SCORE ESTIMATION WITH DATA OBTAINED USING A COMPLEX SAMPLING DESIGN THE USE OF NONPARAMETRIC PROPENSITY SCORE ESTIMATION WITH DATA OBTAINED USING A COMPLEX SAMPLING DESIGN Ji An & Laura M. Stapleton University of Maryland, College Park May, 2016 WHAT DOES A PROPENSITY

More information

UvA-DARE (Digital Academic Repository) The systemic right ventricle van der Bom, T. Link to publication

UvA-DARE (Digital Academic Repository) The systemic right ventricle van der Bom, T. Link to publication UvA-DARE (Digital Academic Repository) The systemic right ventricle van der Bom, T. Link to publication Citation for published version (APA): van der Bom, T. (2014). The systemic right ventricle. General

More information

Combining machine learning and matching techniques to improve causal inference in program evaluation

Combining machine learning and matching techniques to improve causal inference in program evaluation bs_bs_banner Journal of Evaluation in Clinical Practice ISSN1365-2753 Combining machine learning and matching techniques to improve causal inference in program evaluation Ariel Linden DrPH 1,2 and Paul

More information

Enzyme replacement therapy in Fabry disease, towards individualized treatment Arends, M.
