
THE UNIVERSITY OF OKLAHOMA HEALTH SCIENCES CENTER
GRADUATE COLLEGE

A COMPARISON OF STATISTICAL ANALYSIS MODELING APPROACHES FOR STEPPED-WEDGE CLUSTER RANDOMIZED TRIALS THAT INCLUDE MULTILEVEL CLUSTERING, CONFOUNDING BY TIME, AND EFFECT MODIFICATION

A THESIS SUBMITTED TO THE GRADUATE FACULTY in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE

BY LANCE FORD
Oklahoma City, Oklahoma

A COMPARISON OF STATISTICAL ANALYSIS MODELING APPROACHES FOR STEPPED-WEDGE CLUSTER RANDOMIZED TRIALS THAT INCLUDE MULTILEVEL CLUSTERING, CONFOUNDING BY TIME, AND EFFECT MODIFICATION

APPROVED BY:

Julie A. Stoner, Ph.D., Chair
Ann Chou, Ph.D.
Tabitha Garwe, Ph.D., MPH
Daniel Yan Zhao, Ph.D.

THESIS COMMITTEE

COPYRIGHT
By Lance Ford
July 30,

ACKNOWLEDGEMENTS

I would like to express my sincerest appreciation to Dr. Julie Stoner for the guidance, mentorship, and support provided during this project. I greatly appreciate the time spent teaching and inspiring me over the last two years. I want to thank my wonderful committee members, Dr. Yan Daniel Zhao, Dr. Ann Chou, and Dr. Tabitha Garwe, for their service and for providing their expertise during writing and revisions. I highly value the time you have committed to this project. I would also like to acknowledge Dr. Daniel Duffy for serving as the principal investigator on this project and allowing me the opportunity to work on it. Additionally, I am extremely grateful for the time Kimberly Hollabaugh and Dr. David Thompson have spent going above and beyond the required mentoring time while working on projects together in the department. A special thank you to Dung June Dao for being my unofficial peer-mentor; I appreciate your constant encouragement even while studying for your comprehensive exams. Lastly, I would like to thank my friends and family for their unconditional love and support.

ABSTRACT

Introduction: Stepped-wedge design (SWD), a type of cluster randomized trial, is being used more frequently to study the effects of community- and practice-based interventions. SWD is beneficial in cases where the intervention cannot be rolled out all at once, and the intent is to expose all practices to an intervention program. Given that the SWD is a relatively new design, optimal analytic approaches to account for correlation and time effects are not well established and have not been widely applied to data.

Methods: This simulation study is designed based on a statewide intervention program called Healthy Hearts for Oklahoma. In this intervention study, longitudinal performance measures are collected for patients nested within primary care practices that are nested within practice enhancement associates who deliver the intervention program. This thesis project investigates how (1) levels of clustering and modeling approaches impact the statistical bias and efficiency of intervention effect estimates, (2) conclusions differ depending on the extent of extraneous temporal effects among modeling approaches, and (3) estimates vary across modeling approaches when the intervention effect is modified by time under the intervention. Mixed-effects models were used to account for the correlation between and within multiple levels of clustering and among longitudinal measures. Programming simulations were conducted to estimate the bias and precision of estimates from different generalized mixed modeling approaches.

Results: In all scenarios, the correct statistical models yielded unbiased estimates of the model coefficients and standard errors. However, ignoring all levels of clustering in the modeling approach caused the average coefficients for both the intercept and the treatment effect to be underestimated: the estimated intercept coefficient and treatment effect coefficient were both 22% lower than the true parameter values. When confounding by calendar time was introduced, failing to adjust for the confounding effects of time in the modeling approach resulted in an overestimation of the intercept and treatment effect (6% and 31%,

respectively). Calculating one treatment effect for the entire study (as opposed to calculating treatment effects for each period in which clusters are receiving treatment) results in an erroneous estimate of the total treatment effect.

Conclusions: In a study design where multiple levels of clustering are present, failing to account for such clustering can lead to biased intervention effect estimates and an underestimation of the intervention effect standard error. Beyond the problem of biased coefficient (intervention effect) estimates, the underestimation of the standard error can ultimately lead to an increase in false positive errors. For SW-CRTs that extend over long periods of time, it is important to adjust for the potential confounding effect of calendar time; failing to do so can lead to an overestimation of the treatment effect when the confounding factor results in improved outcomes. When a delay in the uptake of the intervention is expected, or when the intervention effect may not be sustained over the entire study period, it is important to include an interaction for time on treatment in the modeling approach. Failing to do so results in an estimate that overestimates the treatment effect at the beginning of the intervention and underestimates it once the intervention effect is fully observed. The simulation study scenarios presented in this thesis were driven by a specific intervention study. Future work will expand upon the simulation study to address additional study scenarios.

TABLE OF CONTENTS

ABSTRACT
LIST OF TABLES
LIST OF FIGURES
CHAPTER I
CHAPTER II
  ABSTRACT
  INTRODUCTION
  METHODS
  RESULTS
  DISCUSSION
CHAPTER III
LIST OF REFERENCES
APPENDICES

LIST OF TABLES

TABLE 1. ASSUMED TREATMENT EFFECT θm FOR EACH TIME ON TREATMENT (m) WITH A 25% INCREASE IN TREATMENT EFFECT AFTER EVERY TWO TIME PERIODS ON TREATMENT
TABLE 2. RESULTS FROM 500 PROGRAMMING SIMULATIONS COMPARING MODELING APPROACHES FOR MULTIPLE LEVELS OF CLUSTERING
TABLE 3. RESULTS FROM 500 PROGRAMMING SIMULATIONS COMPARING MODELING APPROACHES THAT DO AND DO NOT ACCOUNT FOR CONFOUNDING BY CALENDAR TIME
TABLE 4. RESULTS FROM 500 PROGRAMMING SIMULATIONS COMPARING MODELING APPROACHES THAT DO AND DO NOT ACCOUNT FOR TIME ON TREATMENT

LIST OF FIGURES

FIGURE 1. DIAGRAM OF PARALLEL CLUSTER RANDOMIZED TRIAL
FIGURE 2. DIAGRAM OF CROSSOVER CLUSTER RANDOMIZED TRIAL
FIGURE 3. DIAGRAM OF STEPPED-WEDGE CLUSTER RANDOMIZED TRIAL
FIGURE 4. DIAGRAM OF HEALTHY HEARTS FOR OKLAHOMA STUDY DESIGN

CHAPTER I
INTRODUCTION AND LITERATURE REVIEW

LONGITUDINAL DATA ANALYSIS

Biomedical and population health research often focuses on changes in health status over time. Longitudinal studies are designed to capture information about the time course of disease onset, progression, recovery, or cure within participants. In these studies, multiple observations are measured repeatedly on the same subject over time to model disease processes and to identify protective or risk factors. The resulting measurements have great potential to be positively correlated. This correlation is a departure from independence, violating one of the major assumptions of classic regression analysis. Further considerations must be made to account for changes in the outcome measure following exposure to a treatment or intervention over time: in a study investigating an intervention effect, for example, the outcome might differ at the beginning of the intervention period as compared to the middle or end. Methods have been developed to account for the correlated nature of, and the changing outcome over time in, longitudinal data. Linear mixed models (LMM) [1], generalized estimating equations (GEE) [2], and generalized linear mixed models (GLMM) [3] are three common approaches to analyzing longitudinal data.

Fixed effect models

In standard linear regression (as compared to regression modeling for longitudinal data), an outcome measure is observed at only one time point for a given participant. Because longitudinal data are measured over multiple time points, however, change in the outcome over time is likely to occur. Studies designed to accrue longitudinal data often include an intervention, and interventions are very likely to cause an increase or decrease in the outcome measure. One approach used to account for the change in the outcome over time focuses on fitting the mean

model using fixed effects and modeling the variance and covariance using structured or unstructured variance-covariance matrices.

Fixed effect models can be used to assess interactions between time-variant and time-invariant covariates (e.g., how the outcome changes over time [time-variant] between a treatment and control group [time-variant if crossover is allowed, and time-invariant if units do not cross over]). However, time-invariant covariates and their corresponding effects are effectively removed when fitting fixed effect models. Thus, fixed effect models are not desirable when estimating coefficients for time-invariant covariates [4].

Mean model

Below is an example of a mean model for longitudinal data with continuous outcomes [4]:

(1) $Y_{ij} = \beta_1 X_{ij1} + \beta_2 X_{ij2} + \cdots + \beta_p X_{ijp} + e_{ij}$,

where $i$ indexes the subject ($i = 1, \ldots, n$); $j$ indexes occasion ($j = 1, \ldots, m_i$), where $m_i$ is the total number of observations for individual $i$; $Y_{ij}$ is the outcome of interest for the $i$th subject at occasion $j$; $X_{ij1}, \ldots, X_{ijp}$ are independent covariate factors (when a covariate factor is categorical, the corresponding $X_p$ represents an indicator variable for that covariate); commonly $X_{ij1} = 1$ for all $i$ and $j$ to define $\beta_1$ as a fixed intercept term, with $\beta_2, \ldots, \beta_p$ then representing unknown fixed effects for the respective covariates; and $e_{ij}$ represents the individual subject error term. In the simplest specification, the individual error terms are assumed to be independent and identically distributed according to a normal distribution with mean zero and variance $\sigma^2$, i.e., $e_{ij} \sim N(0, \sigma^2)$; with longitudinal data, however, the error terms are expected to be correlated within an individual, which motivates the covariance models described next.

Covariance structure

In linear models that only consider fixed effects (random effects are introduced in a later section), a covariance matrix is used to account for changing variation in the outcome over time. While it is likely that each subject has a different variance and covariance over time from all other

subjects, it is not feasible to calculate a covariance matrix for each subject. Therefore, it is often assumed that there is a common covariance matrix; that is, all subjects in the study have similar variances and covariances over time.

The least restrictive type of covariance structure, called unstructured covariance, allows the variances (matrix diagonal) and covariances (matrix off-diagonal) to vary between each pair of time points for a given set of repeated measures.

(2) $\mathrm{Cov}\!\begin{pmatrix} Y_{i1} \\ Y_{i2} \\ \vdots \\ Y_{in} \end{pmatrix} = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1n} \\ \sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{n1} & \sigma_{n2} & \cdots & \sigma_n^2 \end{pmatrix}$

Equation (2) provides an example of an unstructured covariance matrix with $n$ repeated measures for a given individual. Other covariance structures make more restrictive assumptions about the variances and covariances, such as compound symmetry, Toeplitz, and autoregressive structures. In practice, it is important to consider the trade-off between too little structure and too much structure in an effort to model the true, but unknown, covariance structure. With too little structure, statistical power may be expended estimating too many covariance parameters (there might not be an adequate amount of data, leading to model convergence issues). Imposing too much structure can lead to model misspecification, which ultimately causes incorrect inference about the parameters in the mean model.
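As a concrete illustration of a restrictive structure, the short sketch below (not part of the thesis; the variance of 4.0 and correlation of 0.6 are arbitrary illustrative values) builds a compound-symmetry covariance matrix and draws correlated within-subject errors from it. The empirical covariance across subjects approximates the assumed matrix.

```python
import numpy as np

rng = np.random.default_rng(42)
n_subjects, n_times = 500, 5
sigma2, rho = 4.0, 0.6  # illustrative variance and within-subject correlation

# Compound symmetry: constant variance on the diagonal and a single
# common covariance (sigma2 * rho) at every off-diagonal position
cs_cov = sigma2 * ((1 - rho) * np.eye(n_times) + rho * np.ones((n_times, n_times)))

# Draw correlated within-subject error vectors, one row per subject
errors = rng.multivariate_normal(np.zeros(n_times), cs_cov, size=n_subjects)

# The empirical covariance across subjects approximates the assumed matrix
print(np.round(np.cov(errors, rowvar=False), 2))
```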

Mixed effect models

Other modeling techniques incorporate random effects into the mean model (such as a random intercept or a random effect for a covariate). Including random effects is another approach to accounting for between- and within-subject variation: instead of modeling the covariance structure directly, a random effect is included on the intercept or on one or more of the covariates in the mean model. The following model is a general form for a mixed effect model for continuous outcomes [4]:

(3) $Y_{ij} = \beta_1 X_{ij1} + \beta_2 X_{ij2} + \cdots + \beta_p X_{ijp} + b_{i1} X_{ij1} + b_{i2} X_{ij2} + \cdots + b_{ip} X_{ijp} + e_{ij}$,

where the terms shared with equation (1) are defined as above, and $b_{ip}$ represents random effect $p$ for individual $i$. When $X_{ij1} = 1$ for all $i$ and $j$ (as it is often defined), $b_{i1}$ is the random intercept for a given individual. Random effects are often introduced in vector notation: $b_i$ is an $m \times 1$ vector of $m$ random effects, and these random effects follow a multivariate normal distribution with mean zero and variance $G$, where $G$ is an $m \times m$ matrix composed of the variances and covariances of the random effects. The random effects are independent of the individual subject error term $e_{ij}$ in equation (3). Random effects account for the between-subject variation, while the subject error term accounts for the within-subject variation.

Mixed effect models are advantageous over fixed effect models because of their ability to estimate effects for both time-varying and time-invariant covariates (fixed effect models cannot estimate time-invariant covariate effects). When the assumption underlying the random effects is met (that there is no correlation between the random effects and the independent covariates), these models will produce lower standard errors than fixed effect models; if this assumption is not met, the advantage of lower standard errors cannot be guaranteed. Additionally, mixed effect models are useful when multiple levels of clustering occur in the data. These additional levels of clustering can be accounted for by using nested random effects, which use fewer degrees of freedom than fixed effect approaches [5].
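To make the random-intercept case of equation (3) concrete, the sketch below (the parameter values are arbitrary illustrative assumptions, not values from the thesis) simulates longitudinal data with a subject-level random intercept and fits it with the MixedLM routine in statsmodels:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_subjects, n_times = 100, 4
beta0, beta1 = 10.0, 2.0        # assumed fixed intercept and time slope
sigma_b, sigma_e = 1.5, 1.0     # between- and within-subject standard deviations

subject = np.repeat(np.arange(n_subjects), n_times)
time = np.tile(np.arange(n_times), n_subjects)
b_i = rng.normal(0, sigma_b, n_subjects)[subject]  # random intercept per subject
y = beta0 + beta1 * time + b_i + rng.normal(0, sigma_e, n_subjects * n_times)

df = pd.DataFrame({"y": y, "time": time, "subject": subject})

# Random-intercept mixed model: fixed effects for intercept and time,
# plus a subject-level random intercept (groups=subject)
fit = smf.mixedlm("y ~ time", df, groups=df["subject"]).fit()
print(fit.summary())  # fixed effects near (10, 2); Group Var near sigma_b**2
```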

Considerations for categorical outcomes

When the outcome of interest is not a continuous variable (such as a dichotomous or count outcome), a linear or linear mixed effects model cannot be used. Generalized linear models (GLMs) extend the methods above to outcomes that are not continuous [6]. Two of the most commonly used special cases of GLMs are logistic regression (where the outcome variable is binary) and Poisson regression (where the outcome variable is a count). These cases have link functions that transform the outcome to be linear in the coefficients of the mean model: logistic regression uses the logit link (a log odds transform of the expected outcome), while Poisson regression uses a log link (a log transform of the count or rate outcome) [4]. GLMs follow the general form:

(4) $g(\mu_i) = \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip}$,

where the function $g$ represents the link function that transforms the outcome to be linear in the coefficients of the covariates. This can be further extended to include random effects (GLMMs):

(5) $g(E[Y_{ij} \mid b_i]) = \beta_1 X_{ij1} + \beta_2 X_{ij2} + \cdots + \beta_p X_{ijp} + b_{i1} X_{ij1} + b_{i2} X_{ij2} + \cdots + b_{ip} X_{ijp}$,

where the vector of random effects has assumptions similar to those in equation (3).

Marginal models and mixed effect models

Generalized estimating equations (GEEs) [2] are another modeling approach often used for longitudinal data analysis. GEE models and mixed effect models (LMMs and GLMMs) both account for within-subject correlation; however, the approaches differ in their target of inference and their estimation methods, and there are situations in which GEE estimation may be more favorable than mixed effect modeling and vice versa. The major difference between GEEs and mixed effect models is the level at which inference can be made: GEEs are often preferred for population-average inference, while mixed effect models allow subject-specific inference. The inference target of interest, either subject-level or population-average, often drives the choice of one method over the other. When considering categorical outcomes, GEEs have different estimation methods than GLMMs [3]. Regarding covariance model misspecification, GEE is more robust than GLMM: the estimation technique used for GEE allows a working covariance structure to be specified and then iteratively updated during estimation. Such updating estimation techniques are not used in GLMM estimation.
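As a sketch of the population-average approach (the sample sizes, coefficients, and random-effect standard deviation below are arbitrary assumptions), the following simulates clustered binary outcomes from a conditional logit model with a cluster random effect and then fits a marginal model by GEE with an exchangeable working correlation. The marginal coefficient is attenuated relative to the conditional coefficient of 1.0 used to generate the data, illustrating the population-average versus subject-specific distinction:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n_clusters, n_per = 40, 25
cluster = np.repeat(np.arange(n_clusters), n_per)
x = rng.binomial(1, 0.5, n_clusters)[cluster]   # cluster-level binary exposure
u = rng.normal(0, 0.8, n_clusters)[cluster]     # cluster random effect
p = 1 / (1 + np.exp(-(-0.5 + 1.0 * x + u)))     # conditional model, logit scale
y = rng.binomial(1, p)

df = pd.DataFrame({"y": y, "x": x, "cluster": cluster})

# Marginal (population-average) logistic model via GEE with an
# exchangeable working correlation within clusters
gee = smf.gee("y ~ x", groups="cluster", data=df,
              family=sm.families.Binomial(),
              cov_struct=sm.cov_struct.Exchangeable()).fit()
print(gee.summary())
```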

Methods for estimating GLMMs require an assumption about the common covariance structure. This estimation method is less robust than the estimation techniques for GEE and can potentially produce biased estimates in the corresponding mean model [4].

CLUSTER RANDOMIZED TRIALS

Longitudinal data often arise in randomized controlled trials (RCTs), where multiple measures are recorded for participants over time. That is, the investigators are manipulating or assigning some exposure that is expected to impact the outcome, and subjects are randomized to different treatment arms. The randomization at the beginning of an RCT is an important method for eliminating potential confounders and removing other sources of selection bias.

The subjects, or units of randomization, in RCTs do not have to be individuals. Often it is not feasible to randomize at the individual level because of the method or implementation of the intervention, the social structures of the specific participants to be evaluated, or the need to control for some type of contamination bias. If these issues are of concern, groups of individuals can define clusters that are then randomized as a group; that is, we randomize the cluster rather than the individual. The class of RCTs that randomize at the group level are commonly called cluster randomized trials (CRTs). CRTs can be further classified by other defining characteristics, such as the method of treatment exposure. Parallel and crossover designs are two broad classifications.

Parallel designs

A parallel design refers to a study design in which eligible patients are randomized to either the treatment or control group at the beginning of the study and remain in their assigned intervention arm until the end of the study (Figure 1).

Crossover designs

Crossover longitudinal studies take a variety of forms, but they share a common feature: at some time point in the study, at least one arm of the study switches methods of intervention (Figure 2). One example of a crossover design is one in which eligible groups are

randomized to a control and an intervention arm. At some point in the study, those in the control arm switch to the intervention arm and those in the intervention arm switch to the control arm. The crossover generally happens after the treatment and control programs have been withdrawn and a washout period has passed to avoid carry-over effects. This design assumes that after the washout period the participant has returned to the original baseline status. There are certain interventions whose effects cannot be removed once they have been introduced (e.g., educational training or some behavioral interventions). In this case, the treatment arm often receives the intervention first, and the other arms are randomized to receive the intervention at a later time. An advantage of crossover designs over parallel designs is that each study subject serves as its own control, which can help eliminate potential sources of confounding.

STEPPED-WEDGE DESIGN

Stepped-wedge design (SWD) is a relatively new type of crossover cluster-randomized trial. Under SWD, the initial measurement period is a control period in which data are collected from all clusters before the intervention has started. Then, the timing of intervention initiation is randomized: groups of individuals or participants cross over to the intervention at specified time points. These changes in assignment are referred to as steps in the design. Outcome measures are recorded for all clusters at all time points during the study period, until all clusters have been exposed to the intervention and follow-up data have been collected (Figure 3). The design includes pre-intervention, intervention, and (for some studies) post-intervention sustainability measures for all clusters. SWD is beneficial in cases where the intervention cannot be rolled out all at once, and the intent is to expose all clusters to the intervention program.
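The stair-step exposure pattern of a SWD can be written as a cluster-by-period indicator matrix. The helper below is an illustrative sketch (the function name and arguments are ours, not from any cited paper): each group of clusters crosses over at its assigned step and stays exposed for the remainder of the study.

```python
import numpy as np

def sw_exposure(n_clusters_per_step: int, n_steps: int, n_baseline: int = 1):
    """Stepped-wedge exposure matrix: rows are clusters, columns are periods.

    Entry [j, s] is 1 once cluster j has crossed over to the intervention.
    """
    n_periods = n_baseline + n_steps
    rows = []
    for step in range(n_steps):
        row = np.zeros(n_periods, dtype=int)
        row[n_baseline + step:] = 1  # cross over at this step, stay exposed
        rows.extend([row] * n_clusters_per_step)
    return np.vstack(rows)

# 4 steps, 2 clusters crossing over per step, 1 all-control baseline period
print(sw_exposure(n_clusters_per_step=2, n_steps=4))
```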

Historical context

Increasing utilization of SWD

SWD is being used more frequently to study community- and practice-based interventions. In 2006, Brown and Lilford [7] published results from a systematic review of 12 studies that were classified as SWD trials. These studies were conducted or planned beginning in 1987, and nine of the 12 (75%) were published in the later part of that period. Mdege et al. [8] conducted a systematic review that included 25 trials from 1987 onward; these authors found that half of the studies were conducted between 1987 and 2006 and the other half from 2007 on. Beard et al. [9] reviewed 37 trials from 2010 onward. Shortly after this systematic review, Barker et al. [10] reviewed 42 SWD methodology papers and 102 papers of SWD trials from the published literature available at the time. As indicated by these review papers, the SWD is gaining popularity in the published literature but is still an emerging methodology.

Rationale for SWD

The systematic reviews of the SWD literature found that the most common rationales for utilizing SWD were: logistical limitations (e.g., not being able to roll out the intervention all at once), limited resources (e.g., funding or human resource constraints), ethical considerations (e.g., when it would be unethical to withhold the intervention), and methodological reasons (e.g., the study design allows for underlying temporal change) [7-9].

Power and sample size calculations

Barker et al. reviewed the methods used to conduct sample size and power calculations under a SWD [10]. Of the 42 methodology papers reviewed, 28 (67%) included a detailed examination of power and sample size calculations, and seven papers offered suggestions on the methodology of calculating power and sample size. This review found that the number of steps (time points at which control groups cross over to the intervention arm) in the studies ranged from 2 to 36 with a median of 4. The number of clusters randomized had much more

variation; this number ranged from 2 to 506 with a median of 12. The majority of studies reviewed (64%, n=65) had fewer than 20 clusters.

Analysis methods

Barker et al. found that statistical analysis methods varied greatly among studies [10]. The majority of the studies used GLMMs (59%, n=60), while 17 used GEEs (17%); the remaining 23% used a variety of statistical methods to analyze the data, including generalized linear models, Cox proportional hazards models, analysis of covariance, analysis of variance, paired t-tests, McNemar's test, discourse mapping, and Mann-Whitney U tests (individual percentages were not reported). Of the 102 studies reviewed, 61 (60%) adjusted for exposure time as a potential confounder. The other studies reviewed either failed to adjust for calendar time as a confounder (without reference to exposure time or duration) or found no association between time and the outcome.

Problems with historical methodology

The systematic review by Barker et al. [10] revealed potentially incorrect modeling approaches in the previously published SWD literature. One potential problem is using statistical modeling approaches like GEE or GLMM without giving proper attention to the number of clusters in the study. Without an adequate number of clusters for these modeling approaches, studies might be underpowered or have biased results [10]. Additionally, small numbers of clusters increase the risk that confounders (known or unknown) are distributed unequally between the randomization steps [11]. Until recent publications, the literature surrounding SWD did not address the smallest number of clusters required to ensure unbiased estimation of the intervention effect. In a review of literature from SWD trials, Barker et al. noted that nearly half of the trials reviewed had fewer than 10 clusters [10]. This motivated the simulation studies conducted by Barker et al. in 2017 [12] to determine the minimum number of clusters needed for cross-sectional SWD trials with binary outcomes (as cross-sectional studies with binary outcomes were the most prevalent in

their 2016 review [10]). From the results of these simulations, Barker et al. recommended that cross-sectional SWD trials with binary outcomes and few clusters (fewer than 10) should use GLMMs instead of GEEs. They also recommended that cross-sectional SWD trials with three steps should not randomize fewer than 6 clusters. The GLMM modeling approach produced the most consistent estimators when fewer clusters were used. This awareness of the potential bias that occurs when using a small number of clusters might explain the transition from GEEs to GLMMs in the analysis of data from SWD trials.

The effect of calendar time (i.e., underlying time trends) should also be considered in the analysis of data from SW-CRTs. A common feature of SW-CRTs, as compared to CRTs, is the increased study duration needed to allow all units to cross over to the intervention with sufficient post-intervention follow-up to evaluate meaningful changes in outcome measures [7]. The underlying time trends may be associated with the outcome of interest as well as the intervention effect; that is, calendar time may be a potential confounder. Confounding could arise if there are changes in exposures or factors over time, not resulting from the intervention under study, that are associated with the outcome. An example would be a new public health anti-smoking campaign initiated by a local health department during the mid-point of a SWD study investigating the impact of a smoking cessation program implemented through churches. The intervention assessments would overlap with the community-level health education campaign to a greater extent than the control assessments, resulting in confounding if the community-level campaign impacts smoking behavior. Historically, researchers have failed to account for this potential confounding effect in sample size calculations and data analysis.

It is also important to account for time on treatment (time as an effect modifier of the treatment effect) in the setting of SWD. It may be that the full intervention effect is not observed until the unit has been exposed for at least, say, three months of a 12-month exposure period. As another example, longitudinal studies may incorporate post-intervention assessments to assess the sustainability of the behavior changes or the ability of the unit or

cluster to sustain the intervention program after the study investigators have ended the formal intervention program. The first methodological paper [13] on SWD power calculations and modeling highlighted the importance of accounting for a potential delay in the treatment effect.

Lastly, little research has been conducted to show the importance of accounting for multiple levels of clustering in SW-CRTs. Multiple levels of clustering may arise, for example, if health measures are made on patients who are nested within clinicians who are nested within practices in a practice-based study. Another example would be children nested within classrooms that are nested within schools in a school-based intervention study. Hemming et al. [14] provided an example of the effects of ignoring the hierarchy of clustering on power estimates. They found that ignoring multiple levels of clustering can result in a conservative estimate of power, but caution that this is highly dependent on the amount of correlation within the smallest unit of clustering. In the example of patients nested within clinicians, nested within practices, the correlation within the clusters formed by patients within clinicians would be of greatest concern. Researchers who expect that calendar time may be a potential confounder, that time on treatment may be a potential effect modifier, and who have studies with multiple levels of clustering should plan for a research design and analysis strategy that reflect these factors.

Statistical background for stepped-wedge design

Hussey & Hughes model

In 2006, Hussey and Hughes [13] proposed a statistical model and formulae for power and sample size calculations. Before these contributions, there were no proposed methodologies for analyzing SWD or for conducting sample size and power calculations. These authors aimed to account for the within-cluster correlation that arises by design in SW-CRTs:

(6) $Y_{ijs} = \beta_0 + \beta_s + \theta X_{is} + u_j + e_{ijs}$,

where $i$ indexes individual, $j$ indexes cluster, and $s$ indexes time; $Y_{ijs}$ is the continuous outcome

for individual $i$, in cluster $j$, at time $s$; $e_{ijs} \sim N(0, \sigma_e^2)$ represents the individual error term for individual $i$, in cluster $j$, at time $s$; $X_{is}$ is an indicator variable for exposure to the intervention for individual $i$ at time $s$; $\theta$ represents the intervention effect; $\beta_s$ is a fixed categorical time effect; $\beta_0$ is an intercept term; and $u_j \sim N(0, \sigma_u^2)$ is a cluster-level random effect (used to account for the correlation between individuals within a given cluster).

This statistical model and formula for sample size calculations were the first to be presented in the literature for SWDs. The paper addressed methods to account for the correlation between individuals within a cluster through random effects, considerations for equal and unequal cluster sizes, and the effect of increasing the number of steps. In addition, the authors presented methods for adjusting for confounding of the intervention effect that may occur by calendar time. Lastly, these authors proposed the notion of a partial intervention effect; that is, a delay in the intervention effect that might be seen over the course of a given study. The proposed methods in this article set a solid foundation for the analysis of data from SW-CRTs. However, two strong assumptions were made: 1) only a single treatment term is included in the proposed model, and 2) there is only a fixed effect for time. The use of random effects to account for the within-cluster correlation also limits the inference that can be made in the analysis. Multilevel clustering also tends to occur in SW-CRTs, and this model does not extend to those situations. Lastly, the study does not fully address approaches for modeling binary or count outcomes.
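To make model (6) concrete, the sketch below (with arbitrary illustrative parameter values, not values from any cited study) simulates a small cross-sectional SW-CRT and recovers the intervention effect with a mixed model containing a fixed categorical time effect and a cluster-level random intercept:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n_clusters, n_periods, n_per = 12, 5, 20
theta, sigma_u, sigma_e = 1.0, 0.5, 1.0      # assumed effect and SDs
beta_s = np.linspace(0.0, 0.4, n_periods)    # assumed fixed time effects

# Three clusters cross over at each of the four post-baseline periods
crossover = np.repeat(np.arange(1, n_periods), n_clusters // (n_periods - 1))
u = rng.normal(0, sigma_u, n_clusters)       # cluster random intercepts

rows = []
for j in range(n_clusters):
    for s in range(n_periods):
        treat = int(s >= crossover[j])
        y = beta_s[s] + theta * treat + u[j] + rng.normal(0, sigma_e, n_per)
        rows.append(pd.DataFrame(
            {"y": y, "cluster": j, "period": s, "treat": treat}))
df = pd.concat(rows, ignore_index=True)

# Model (6): fixed categorical time, treatment indicator, cluster random effect
fit = smf.mixedlm("y ~ C(period) + treat", df, groups=df["cluster"]).fit()
print(fit.params["treat"])  # estimate should be near theta = 1.0
```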

Repeated cross-sectional versus cohort SWD

There are two broad approaches for collecting individual-level measurements across the duration of a SW-CRT [15]. One is a repeated cross-sectional design, where different individuals are measured at each time point for a given cluster. An example of this type of study would be one where primary care practices are randomized and a random set of 20 patients is sampled per practice per time point. The second approach, often referred to as a cohort SWD, follows the same individuals across time for a given cluster; that is, the individuals included in a cluster at the beginning of the study are the same individuals in the cluster throughout (allowing for potential dropout). The choice of how individual-level data are collected impacts the considerations that must be made in the sample size calculations and data analysis. The methods proposed by Hussey and Hughes [13] are designed for the repeated cross-sectional design, while Hooper et al. [16] have proposed methods for cohort SW-CRTs.

Methods for data analysis and sample size calculations in cross-sectional SW-CRTs were originally proposed by Hussey et al. [13]. Further work by Hemming et al. and other researchers has expanded the methodology to account for multiple levels of clustering [14]. In 2015, Copas et al. [17] proposed a framework for designing SW-CRTs, which included suggestions for closed and open cohort designs in addition to the cross-sectional SWD. Here, a closed cohort refers to a study in which individuals start the study at the same time and no individuals are added over time; in an open cohort, individuals enter and exit the study at different time points. Cohort designs must account for the within-cluster correlation as well as the correlation that occurs over time among individuals in the same cluster [16]. Hooper et al. [16] reviewed the current methodology for analysis and sample size calculations in cross-sectional designs in CRTs and then extended these approaches to closed cohort designs. While Hooper et al. have proposed methods for calculating sample size in closed cohort CRTs, further work is needed to extend this to open cohort CRTs. Also, the methods proposed are only developed for continuous, normally distributed outcomes. For cohort SW-CRTs, it is also important to consider the intensity of collecting information from patients at each specified time point. In addition, selection bias may be introduced into the study in the form of contamination; that is, patients may be recruited into the study after they already know whether the treatment/intervention is in effect [14].

Power and sample size

Just as the correlation between individuals must be taken into account in the data analysis stage of a SW-CRT, this potential source of correlation must also be considered a priori during power and sample size calculations. For SW-CRTs, one must also consider factors such as the number of steps in the study design, the delay of the treatment effect, and the statistical modeling approach. Using simulation studies, Hussey and Hughes [13] found that the largest power is obtained when each cluster is randomized to the intervention at its own step. They also found that when the number of clusters was fixed but the number of time measurements and randomization steps was varied, increasing the number of clusters randomized to the intervention at each step causes the power to decrease. These authors also explored the effect that a delay in the treatment effect can have on study efficiency: the longer it takes for a treatment effect to be fully present, the more power is impacted, with longer delays leading to smaller power. For these simulations, the risk ratio was fixed, and the delay of the treatment effect was considered for varying values of power and coefficients of variation. The coefficient of variation is defined as the ratio of the standard deviation to the mean ($\sigma / \mu$) and is used to assess variability in relation to the population mean.

Hussey and Hughes also compared power calculations among differing modeling approaches such as LMM, GLMM, and GEE. They found that for equal cluster sizes, the LMM performed best (in terms of power); however, for unequal cluster sizes, the GLMM and GEE models performed better. Hemming et al. [15] used simulation studies to explore the differences in efficiency between parallel CRTs and SW-CRTs, varying the intra-cluster correlation (ICC) across the simulations. When the intra-cluster correlation is small (0.01), the parallel cluster design has power of 0.97, exceeding that of the SW-CRT (with 4 steps, 5 clusters per step).

However, when the ICC is large (0.1), the parallel cluster design has power of only 0.50, falling below that of the SW-CRT (with 4 steps, 5 clusters per step). From this we can conclude that, when comparing a SWD to a parallel cluster design, the SWD will have more power when the intra-cluster correlation is large; conversely, the parallel cluster design will be more powerful when the intra-cluster correlation is small. Sample size calculations in SW-CRTs are also more complicated than in parallel cluster trials. Under certain circumstances discussed by Hemming et al., the sample size calculations need to account for the potential confounding effect of calendar time. This confounding by time often worsens precision and requires an increased sample size to maintain adequate power [15].

Woertman et al. [18] presented formulae for sample size and power calculations for SW-CRTs. These formulae include notation for the design effect, which is used to adjust for clustering and for CRTs is defined as:

(7) $D_{eff} = 1 + (n - 1)\rho$,

where $n$ is the number of subjects and $\rho$ is the intracluster correlation coefficient (which quantifies the correlation within a given cluster). The design effect takes a more complicated form for SWD:

(8) $D_{sw} = \dfrac{1 + \rho(ktn + bn - 1)}{1 + \rho\left(\tfrac{1}{2}ktn + bn - 1\right)} \cdot \dfrac{3(1 - \rho)}{2t\left(k - \tfrac{1}{k}\right)}$,

where $k$ is the number of steps, $b$ is the number of baseline measurements, and $t$ is the number of measurements after each step.
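As a worked sketch of equations (7) and (8), with arbitrary illustrative inputs (not values from the thesis):

```python
def deff_crt(n: int, rho: float) -> float:
    """Design effect for a standard CRT, equation (7): 1 + (n - 1) * rho."""
    return 1 + (n - 1) * rho

def deff_sw(n: int, rho: float, k: int, b: int, t: int) -> float:
    """Stepped-wedge design effect of Woertman et al., equation (8)."""
    numerator = 1 + rho * (k * t * n + b * n - 1)
    denominator = 1 + rho * (0.5 * k * t * n + b * n - 1)
    return (numerator / denominator) * (3 * (1 - rho)) / (2 * t * (k - 1 / k))

# Illustrative inputs: 20 subjects per cluster per period, ICC of 0.05,
# 4 steps, 1 baseline measurement, 1 measurement after each step
print(deff_crt(n=20, rho=0.05))                 # about 1.95
print(deff_sw(n=20, rho=0.05, k=4, b=1, t=1))   # about 0.57
```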

Woertman et al. used data simulations to show how CRTs, specifically SW-CRTs, could potentially reduce the number of patients needed to detect a given intervention effect. These simulations, similar to those mentioned above, accounted for the ICC, the number of individuals per cluster, the number of clusters, and the number of steps. These authors showed that SWD is more efficient in terms of sample size when compared to parallel group designs and ANCOVA designs. Additionally, as the number of steps in a SWD increases, the design becomes more efficient in terms of power and sample size. Drawbacks of SWD in terms of power and sample size include longer study periods, additional costs associated with the longer study duration and with every cluster receiving treatment, and more complex statistical analysis [18]. Woertman et al. also showed that cross-sectional studies have less power than cohort studies, because in cohort studies within-subject as well as within-cluster comparisons can be used to measure the effect of the intervention or treatment. In a later publication, Hemming et al. [19] showed that the loss of precision from fitting more complex statistical models (often required for SW-CRTs as compared to simpler CRTs) has a direct impact on power: a larger sample size will be needed for more complex model extensions. These studies do not account for the potential effect of model misspecification on statistical power; the issue is raised but left for future research.

Multiple levels of clustering

In CRTs, including SWD, situations arise in which there are multiple levels of clustering by the design of the study. For example, patients could be nested within practices, and groups of practices could further be located in different geographic regions, which could be considered another level of clustering. This additional level of clustering introduces another source of correlation that should be accounted for in the analysis. Failing to account for the multiple levels of clustering in the analysis can produce overly precise results, which can lead to incorrect conclusions [14]. Hemming et al. [14] provide a framework for statistical modeling and power calculations for multi-level SW-CRTs. These authors combined the work of Heo et al. [20] on hierarchical CRTs with the Hussey and Hughes model for SW-CRTs, using random effects to account for the additional variation introduced by another level of clustering:

(9) $Y_{ijls} = \beta_0 + \beta_s + \theta X_{is} + u_j + v_{jl} + e_{ijls}$,

where the common terms from equation (6) are defined similarly; $l$ indexes a given group of

clusters; and $v_{jl} \sim N(0, \sigma_v^2)$ is a random effect for the $j$th cluster within the $l$th group of clusters. It is important to remember that this model is presented in the context of cross-sectional SW-CRTs; further considerations are needed to account for cohort designs (where enrolled individuals are followed throughout the entire study, as described in Section 3.2.2).
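A sketch of fitting two nested levels of clustering (a top-level group plus clusters within groups, with arbitrary illustrative parameter values) using the variance-component interface of MixedLM in statsmodels:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n_groups, n_clusters_per, n_per = 10, 4, 15   # e.g., PEAs, practices, patients
sigma_g, sigma_c, sigma_e = 0.7, 0.5, 1.0     # assumed SDs at each level

rows = []
for l in range(n_groups):
    g_effect = rng.normal(0, sigma_g)             # group-level random effect
    for j in range(n_clusters_per):
        c_effect = rng.normal(0, sigma_c)         # cluster-within-group effect
        y = 2.0 + g_effect + c_effect + rng.normal(0, sigma_e, n_per)
        rows.append(pd.DataFrame(
            {"y": y, "group": l, "cluster": f"{l}-{j}"}))
df = pd.concat(rows, ignore_index=True)

# Random intercept for the top-level group plus a nested variance
# component for clusters within each group
fit = smf.mixedlm("y ~ 1", df, groups=df["group"], re_formula="1",
                  vc_formula={"cluster": "0 + C(cluster)"}).fit()
print(fit.summary())  # variance estimates near sigma_g**2 and sigma_c**2
```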

Confounding by calendar time

Some of the practical reasons for implementing SW-CRTs often result in trials extending over longer periods of time as compared to individual RCTs [7]. Because of the increased duration of the study, the outcome of interest may be associated not only with the treatment or intervention but also with extraneous temporal effects, such as other programs, interventions, or exposures that impact outcomes [15]. That is, there is potential for calendar time to be a confounder. Neglecting to account for this source of confounding in the analysis can lead to biased estimates; these estimates could be biased toward or away from the true value depending on the effect of the confounding [15].

In 2017, Hemming et al. [19] reviewed the basic statistical model presented by Hussey and Hughes [13], which does not account for underlying time trends, and extended the model to adjust for underlying secular trends. They conducted simulation studies based on a case study [21] to compare models adjusted for calendar time with unadjusted models. From these simulation studies, they found that failing to account for underlying time trends leads to incorrect inference from the corresponding measure of association, demonstrating the importance of accounting for calendar time in the study analysis. A systematic review of methodology in SW-CRTs [10] found that 60 of the 102 (60%) studies explored the possible confounding effect of calendar time; the remaining 42 studies did not explore this possible confounder (or made no mention of underlying time trends at all). Based on the results of their simulation studies, Hemming et al. conclude that it is imperative that underlying time trends be adjusted for in all SW-CRT analyses.

Effect modification by time on treatment

While external factors over the calendar time of the study may serve as potential confounders, we also want to consider the possibility that internal factors may influence the effectiveness, uptake, or sustainability of the intervention during the time on treatment. During a SW-CRT, the effect of the treatment is not always taken up immediately after the intervention has been delivered, and the change in outcome measures related to the treatment may vary across time on treatment; that is, effect modification by time on treatment may be present. There could be an initial transition period in which the intervention occurs and its uptake is slow. There may also be situations where investigators continue to record outcome measures after an intervention program has ended in order to evaluate the sustainability of the program. Hughes et al. [22] presented the following model as an extension of the earlier Hussey and Hughes model:

(10) $Y_{ijlms} = \beta_0 + \beta_s + L_{ijm}\theta_m + u_j + v_{jl} + e_{ijls}$,

where the common terms from equations (6) and (9) are defined similarly; $m$ indexes time on treatment (with $m = 0$ while a cluster is still in a control period and $m = 1$ at the first time period when the intervention is introduced); $L_{ijm} = 1$ indicates exposure to the intervention at time on treatment $m$; and $\theta_m$ is the intervention effect at time since treatment $m$. This model extension uses time on treatment as the time factor, as opposed to a calendar time by intervention interaction term. With this model, the intervention by time interaction can be estimated at the cluster level without requiring that all clusters be under the control condition or all clusters be under the treatment condition. Hemming et al. [19] conducted simulation studies that proposed using a time by treatment interaction to allow for heterogeneous treatment effects by time. Besides the drawback of only being able to estimate the treatment effect when all clusters are under the control or all are under the treatment, the more complex statistical model also leads to decreased precision.
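A sketch of constructing the time-on-treatment index $m$ (and hence the columns needed for the $\theta_m$ terms) from each cluster's crossover period; the data frame and column names here are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical long-format scaffold: calendar period s and each cluster's
# first exposed period ("start"); three clusters observed over five periods
df = pd.DataFrame({
    "cluster": np.repeat([0, 1, 2], 5),
    "period":  np.tile(np.arange(5), 3),
    "start":   np.repeat([1, 2, 3], 5),
})

# m = 0 during control periods; m = 1 at the first exposed period, and so on
df["m"] = np.maximum(df["period"] - df["start"] + 1, 0)

# In a mixed model, a categorical term such as C(m) (reference m = 0) yields
# a separate estimate theta_m for each duration of exposure, e.g.:
#   smf.mixedlm("y ~ C(period) + C(m)", df, groups=df["cluster"])
print(df.pivot(index="cluster", columns="period", values="m"))
```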

HEALTHY HEARTS FOR OKLAHOMA

Introduction

Healthy Hearts for Oklahoma (H2O) is a statewide intervention and quality improvement program, funded by the Agency for Healthcare Research and Quality (grant 1R18HS, Daniel Duffy, PI). This program works to improve healthy heart outcomes. The program enrolled 263 primary care practices located in rural and urban counties across Oklahoma. Practices eligible for this study were those that provide primary care to adults and have fewer than 11 clinicians at the practice site. The intervention targets primary care physicians and is meant to reinforce best practices in the ABCS measures: encouraging patients at risk of poor heart outcomes to take aspirin (A), assisting with blood pressure (B) and cholesterol (C) management, and providing smoking cessation (S) support. H2O implemented a stepped-wedge cluster randomized trial design.

Study design

Two levels of nesting arise in the H2O study, based on the design of the intervention roll-out. The first level of nesting is defined by patients who receive care at a given practice. A group of practices is then served by a given program enhancement assistant (PEA), which defines the second level of nesting. Figure 4 diagrams the four steps, the roughly 60 practices changing from control to intervention per step, and the ten 3-month outcome measurement time periods of the study. The initial design was for each PEA to be assigned practices for intervention. An individual PEA could not feasibly implement the intervention at all practices at one time, and therefore the timing of intervention implementation was randomized using the SW-CRT design. The original design included 20 PEAs that were to serve 3-4 practices in each Wave. The randomization was stratified by PEA so that each practice in a PEA's catchment area was randomly assigned to one of Waves 1-4, which indicated the start date for intervention.
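A sketch of such a stratified wave randomization (the counts are illustrative assumptions; with roughly 263 practices and 20 PEAs, about 13 practices per PEA is assumed here): within each PEA, that PEA's practices are shuffled and dealt as evenly as possible across Waves 1-4.

```python
import numpy as np

rng = np.random.default_rng(2024)
n_peas, practices_per_pea, n_waves = 20, 13, 4  # ~13 practices per PEA assumed

# Stratified randomization: within each PEA (stratum), shuffle that PEA's
# practices and deal them out across Waves 1-4 as evenly as possible
assignments = {}
for pea in range(n_peas):
    order = rng.permutation(practices_per_pea)
    waves = np.resize(np.arange(1, n_waves + 1), practices_per_pea)
    assignments[pea] = dict(zip(order.tolist(), waves.tolist()))

print(assignments[0])  # practice index -> assigned wave, for the first PEA
```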

The ABCS outcome measures (described above) are defined as dichotomous indicators of adherence to established screening and treatment guidelines.

Study considerations

The study design lends itself to a unique analysis. The analysis must account for (1) the multiple levels of clustering (i.e., patients nested within practices, nested within PEAs) and (2) correlation within and between the clusters of practices over time. Given that the SWD is a relatively new design, optimal data analysis approaches to account for the correlation and time effects are not well established and have not been widely applied to data like that from the H2O study, with a large number of patients (approximately 60 patients per practice per time point) nested within a small number of PEAs (n=20).

OBJECTIVE

In preparation for the final H2O project analysis, which will occur in Fall 2018 after data collection is complete, the objective of this thesis research is to determine the best approach for the statistical analysis of the H2O data. This will be done by conducting simulation studies patterned off the H2O data. The value of these simulation studies is that the parameter values are specified when simulating the data (and the true values are therefore known), so the bias and precision of different modeling approaches can be estimated.
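The logic of such a simulation study can be sketched as follows (the data-generation step is a placeholder; in the actual thesis work it would generate nested H2O-like data and fit each candidate mixed model):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_and_estimate(theta_true: float) -> float:
    """Placeholder for one replicate: generate a data set under known
    parameters, fit a candidate model, and return the estimated effect.
    A noisy stand-in estimator is used here purely for illustration."""
    return theta_true + rng.normal(0, 0.1)

theta_true, n_sims = 1.0, 500  # 500 replicates, matching the thesis tables
estimates = np.array([simulate_and_estimate(theta_true) for _ in range(n_sims)])

bias = estimates.mean() - theta_true   # average estimate minus the true value
empirical_se = estimates.std(ddof=1)   # spread of estimates across replicates
print(f"bias = {bias:.4f}, empirical SE = {empirical_se:.4f}")
```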

AIMS

Aim 1: Investigate how the levels of clustering and modeling approaches impact intervention effect estimates.

Importance: Failing to account for multiple levels of clustering in the analysis can cause overly precise results that lead to incorrect conclusions.

Hypothesis: We expect that, if we ignore the levels of clustering, we may make incorrect conclusions in which we declare the intervention to be effective when it is not (i.e., a false positive result).

Aim 2: Investigate how conclusions differ depending on the extent of extraneous temporal effects (i.e., confounding by calendar time) among modeling approaches.

Importance: Neglecting to account for extraneous temporal effects in the analysis might cause our estimates to be biased.

Hypothesis: We expect that, if we do not account for extraneous temporal effects, we would have estimates that are biased away from or toward the null, depending on the confounding.

Aim 3: Investigate modification by time under the intervention (i.e., is there an initial transition period where orientation occurs and uptake of the intervention is slow, and is the intervention effect maintained after withdrawal of the study program staff) and compare estimates among different modeling approaches.

Importance: Accounting for modification by time under intervention allows us to determine whether the intervention was effective immediately after orientation, whether the effect of intervention uptake was delayed, and whether the effect of the intervention persisted after orientation and engagement with program staff have ended.

Hypothesis: We expect that models that do not include the time by intervention interaction will result in biased intervention effect estimates when the intervention effect, in fact, varies over time.

CHAPTER II
A COMPARISON OF STATISTICAL ANALYSIS MODELING APPROACHES FOR STEPPED-WEDGE CLUSTER RANDOMIZED TRIALS IN PRACTICE-BASED RESEARCH

Lance Ford 1, Ann Chou 2, Daniel Zhao 1, Tabitha Garwe 1, Daniel Duffy 3, Julie A. Stoner 1

Author Affiliations:
1 Department of Biostatistics and Epidemiology, University of Oklahoma Health Sciences Center, Oklahoma City, Oklahoma
2 Department of Family and Preventive Medicine, College of Medicine, University of Oklahoma Health Sciences Center, Oklahoma City, Oklahoma
3 School of Community Medicine, University of Oklahoma-Tulsa, Tulsa, Oklahoma

ABSTRACT

Introduction: Stepped-wedge design (SWD), a type of cluster randomized trial, is being used more frequently to study community- and practice-based interventions. SWD is beneficial in cases where the intervention cannot be rolled out all at once, and the intent is to expose all communities or practices to an intervention program. Given that the SWD is a relatively new design, optimal analytic approaches to account for correlation and time effects are not well established and have not been widely applied to data from practice-based research settings.

Methods: This simulation study is designed based on a statewide intervention program called Healthy Hearts for Oklahoma. This study investigates how (1) levels of clustering and modeling approaches impact the statistical efficiency of intervention effect estimates, (2) conclusions differ depending on the extent of extraneous temporal effects among modeling approaches, and (3) estimates vary across modeling approaches when the intervention effect is modified by time under the intervention. Mixed-effects models were used to account for the correlation between and within multiple levels of clustering; in this study, patients nested within practices nested within intervention program staff. Programming simulations were conducted to estimate the bias and precision of estimates from different modeling approaches.

Results: In all scenarios, the correct statistical models yielded unbiased estimates of the model coefficients and standard errors. However, ignoring all levels of clustering in the modeling approach caused the average coefficients for both the intercept and the treatment effect to be underestimated: the estimated intercept coefficient and treatment effect coefficient were both 22% lower than the true parameter values. When confounding by calendar time was introduced, failing to adjust for the confounding effects of time in the modeling approach resulted in an overestimation of the intercept and treatment effect (6% and 31%, respectively). Calculating one treatment effect for the entire study (as opposed to calculating treatment effects for each period in which clusters are receiving treatment) results in an erroneous estimate of the total treatment effect.

Conclusions: In a study design where multiple levels of clustering are present, failing to account for such clustering can lead to an underestimation of the model parameters and standard errors. This underestimation of the standard error can ultimately lead to an increase in false positive errors. For SW-CRTs that extend over long periods of time, it is important to adjust for the potential confounding effect of calendar time; failing to do so can lead to an overestimation of the treatment effect. When a delay in the uptake of the intervention is expected, it is important to include an interaction for time on treatment in the modeling approach. Failing to do so results in an estimate that overestimates the treatment effect at the beginning of the intervention and underestimates the effect once the intervention effect is fully observed.

INTRODUCTION

Cluster randomized trials

Longitudinal data often arise in randomized controlled trials (RCTs), where multiple measures are recorded for participants over time. That is, the investigators are manipulating or assigning some exposure that is expected to impact the outcome, and subjects are randomized to different treatment arms. The randomization at the beginning of an RCT is an important method for eliminating potential confounders and removing other sources of selection bias. The subjects, or units of randomization, in RCTs do not have to be individuals. Often it is not feasible to randomize at the individual level because of the method or implementation of the intervention, the social structures of the specific participants to be evaluated, or the need to control for some type of contamination bias. If these issues are of concern, groups of individuals can define clusters that are then randomized as a group; that is, we randomize the cluster rather than the individual. The class of RCTs that randomize at the group level are commonly called cluster randomized trials (CRTs).

Stepped-wedge cluster randomized trials

Stepped-wedge design (SWD) is a relatively new type of crossover cluster-randomized trial (one in which at least one arm of the study switches exposure or intervention status during the study period). Under SWD, the initial measurement period is a control period in which data are collected from all clusters before the intervention has started. Then, the timing of intervention initiation is randomized: groups of individuals or participants cross over to the intervention at specified time points. These changes in assignment are referred to as steps in the design. Outcome measures are recorded for all clusters at all time points during the study period, until all clusters have been exposed to the intervention and follow-up data have been collected. The design includes pre-intervention, intervention, and (for some studies) post-intervention sustainability measures for all clusters. SWD is beneficial in cases where the intervention cannot be rolled out all at once, and the intent is to expose all clusters to the intervention program.

SWD is being used more frequently to study community- and practice-based interventions. In 2006, Brown and Lilford [7] published results from a systematic review of 12 studies that were classified as SWD trials. Years later, and with other systematic reviews in between [8-9], Barker et al. [10] reviewed 42 SWD methodology papers and 102 papers of SWD trials from the published literature. The systematic reviews of the SWD literature found that the most common rationales for utilizing SWD were: logistical limitations (e.g., not being able to roll out the intervention all at once), limited resources (e.g., funding or human resource constraints), ethical considerations (e.g., when it would be unethical to withhold the intervention), and methodological reasons (e.g., the study design allows for underlying temporal change) [7-9].

Barker et al. found that statistical analysis methods varied greatly among studies [10]. The majority of the studies used generalized linear mixed models (GLMMs; 59%, n=60) while 17 used generalized estimating equations (GEEs; 17%); the remaining 23% used a variety of

statistical methods to analyze the data, including generalized linear models, Cox proportional hazards models, analysis of covariance, analysis of variance, paired t-tests, McNemar's test, discourse mapping, and Mann-Whitney U tests (individual percentages were not reported). Previous researchers reported the use of random effects in mixed effect models to account for the between-cluster variation in SWD.

The effect of calendar time (i.e., underlying time trends) should also be considered in the analysis of data from SWD. A common feature of SW-CRTs, as compared to CRTs, is the increased study duration [7]. Underlying time trends may be associated with the outcome of interest as well as the intervention effect; that is, calendar time may be a potential confounder. In a previous review of SW-CRT studies, only 61 of 102 (60%) studies adjusted for calendar time as a potential confounder [10]. The other studies reviewed either failed to adjust for calendar time as a confounder (without reference to exposure time or duration) or found no association between time and the outcome. Neglecting to account for extraneous temporal effects in the analysis might result in biased coefficient estimates, with bias away from or toward the null depending on the confounding.

It is also important to account for time on treatment (time as an effect modifier of the treatment) in the setting of SWD. It may be that the full intervention effect is not observed until the unit has been exposed for at least, say, three months of a 12-month exposure period. Accounting for modification by time under intervention allows us to determine whether the intervention was effective immediately after orientation, whether the effect of intervention uptake was delayed, and whether the effect of the intervention persisted after orientation and engagement with intervention program staff have ended. We hypothesize that models that do not include the time by intervention interaction will result in biased intervention effect estimates when the intervention effect, in fact, varies over time.

Lastly, little research has been conducted to show the importance of accounting for multiple levels of clustering in such SW-CRTs. Multiple levels of clustering may arise, for

example, if health measures are made on patients who are nested within clinicians who are nested within practices in a practice-based study. Failing to account for multiple levels of clustering in the analysis can produce overly precise results, leading us to declare the intervention effective when it is not (i.e., a false positive result).

Patterning the data on parameters from the Healthy Hearts for Oklahoma study (presented in the next section), we conducted a series of simulation studies to understand the effects of having multiple levels of clustering, calendar time as a potential confounder, and time on treatment as a potential effect modifier in the study design. The value of these simulation studies is that the true parameter value is specified when simulating the data; therefore, the bias and precision of different modeling approaches can be evaluated.

Healthy Hearts for Oklahoma Study

Healthy Hearts for Oklahoma (H2O) is a statewide intervention and quality improvement program, funded by the Agency for Health Care Research and Quality (1R18HS , Daniel Duffy, PI), that works to reduce the risk for cardiovascular disease. The program enrolled 263 primary care practices across Oklahoma, located in rural and urban counties. Practices eligible for this study were those that provide primary care to adults and have fewer than 11 clinicians at the practice site. The intervention targets primary care physicians and is meant to reinforce best practices in ABCS measures: encouraging patients at risk of poor heart outcomes to take aspirin (A), assisting with blood pressure (B) and cholesterol (C) management, and providing smoking cessation (S) support. H2O implemented a stepped-wedge cluster randomized trial design. The design of the intervention roll-out in the H2O study gives rise to two levels of nesting. The first level of nesting is defined by patients who receive care at a given practice. A group of practices is then served by a given program enhancement assistant (PEA), which defines the second level of nesting.

The following diagram shows the four steps, the roughly 60 practices changing from control to intervention per step, and the ten 3-month outcome measurement time periods of the study (Figure 4). The initial design was for each PEA to be assigned practices for intervention. An individual PEA could not feasibly implement the intervention at all practices at one time; therefore, the timing of intervention implementation was randomized using the SW-CRT design. The original design included 20 PEAs that were to serve 3-4 practices in each wave. The randomization was stratified by PEA so that each practice in the catchment area was randomly assigned to one of Waves 1-4, which indicated the start date for the intervention. The ABCS outcome measures (described above) are defined as dichotomous indicators of adherence to established screening and treatment guidelines.

The study design lends itself to a unique analysis, which must account for (1) the multiple levels of clustering (i.e., patients nested within practices nested within PEAs) and (2) correlation within and between the clusters of practices over time. Given that the SWD is a relatively new design, optimal data analysis approaches to account for the correlation and time effects are not well established and have not been widely applied to data like that from the H2O study, with a large number of patients (approximately 60 patients per practice per time point) nested within a small number of PEAs (n = 20).

METHODS

Simulation Studies/Assumptions

Simulation studies were conducted to compare modeling approaches under the scenarios defined below. For all simulation studies, data were generated according to parameters from the Healthy Hearts for Oklahoma (H2O) study. These parameters include the number of practices, number of patients per practice, number of program enhancement assistants (PEAs), number of practices served per PEA, number of time periods, baseline

proportion of the outcome of interest, and the estimated treatment effect on the proportion adherent to healthy heart guidelines based on ABCS measures.

Hemming et al. [14] proposed the following statistical model as an extension of the Hussey and Hughes model [13] to account for multiple levels of clustering in a SW-CRT:

(1) Y_{ijls} = β_0 + β_s + θX_{is} + u_j + v_{jl} + e_{ijls},

where i indexes patient, j indexes practice, l indexes PEA, and s indexes time; Y_{ijls} is the continuous outcome for patient i, in practice j, under PEA l, at time s; e_{ijls} ~ N(0, σ_e²) is the individual error term for patient i, in practice j, under PEA l, at time s; X_{is} is an indicator variable for exposure of individual i to the intervention at time s; θ represents the intervention effect; β_s is a fixed categorical time effect; β_0 is an intercept term; u_j ~ N(0, σ_u²) is a cluster-level random effect (used to account for the correlation between individuals within a given cluster); and v_{jl} ~ N(0, σ_v²) is a random effect for the jth practice within the lth PEA. The scenarios defined below use variations of this model to generate data for the given set of simulations. This model was designed for, and is applied to, data that come from a cross-sectional SWD; that is, different individuals are measured at each time point for a given cluster.

Generalized linear mixed models (GLMMs) are used to model the data generated for each aim. As the outcomes of interest are binary (1 = adherence to the guideline; 0 = non-adherence), we use the logit link function (i.e., a log-odds transformation) extension of the GLMM. We simulate 500 data sets for each aim using methods presented by Horton and Kleinman [23]. Random effects are used in the model to account for the correlation introduced by the nesting of patients within practices and the nesting of practices within PEAs. These random effects are normally distributed with mean zero and variance σ².

We used preliminary data from H2O to calculate the standard deviation of the outcome between practices and the standard deviation of the outcome between PEAs. Given these standard deviation values, the correlation among practices is greater than the correlation among

PEAs. The standard deviation of the outcome between practices (σ_u = 1.23) and the standard deviation of the outcome between PEAs (σ_v = 0.18) are used to generate distributions for the random effects. That is, for all simulations below, u_j ~ N(0, 1.23² = 1.513) and v_{jl} ~ N(0, 0.18² = 0.0324). We assume 60 patients per practice, 60 practices per step in the SWD, and 20 PEAs that serve 12 practices each. Based on the H2O study, we structure the data such that there are 4 steps and 10 time points. For all sets of simulations, the baseline proportion of the outcome of interest is assumed to be 0.67; since 0.67 = e^{β_0}/(1 + e^{β_0}), the specified value is β_0 = ln(0.67/0.33) ≈ 0.708. The treatment effect is assumed to be different for each set of simulations, so the treatment effect for each scenario is described in its corresponding section.

Multiple levels of clustering

The Healthy Hearts for Oklahoma study has multiple levels of clustering by design. These levels of hierarchy are defined as patients nested within practices and practices nested within PEAs. Data are generated as described in the previous section to pattern the data structure of the H2O study. Model (2) below is used to generate the data for this set of simulations. The treatment proportion is assumed to be 0.78, which implies that e^{β_0+θ}/(1 + e^{β_0+θ}) = 0.78. Thus, the specified value is θ = ln(0.78/0.22) - β_0 ≈ 1.266 - 0.708 = 0.558. β_0, u_j, and v_{jl} are simulated as defined above. Models following the forms described in Scenarios 1, 2, and 3 are then fit to the simulated data.

Scenario 1: Random practice, random PEA. This model accounts for the additional correlation imposed by the multiple levels of nesting by including the random effect for practice (u_j) and the random effect for practice within PEA (v_{jl}):

(2) ln[P(Y_{ijls} = 1 | X_{is}, u_j, v_{jl}) / P(Y_{ijls} = 0 | X_{is}, u_j, v_{jl})] = β_0 + θX_{is} + u_j + v_{jl}
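To make the data-generation step concrete, the following R sketch simulates one cross-sectional data set under model (2) with the parameter values above. This is a minimal illustration rather than the thesis code from Appendix E; the object names, the wave-assignment scheme, and the crossover schedule (wave w crossing over at period 2w) are our own assumptions.

# Minimal sketch: one simulated data set under model (2).
# Assumed layout: 20 PEAs x 12 practices = 240 practices, 60 patients
# per practice per period, 10 periods, 4 steps of 60 practices each.
set.seed(2019)

n_pea <- 20; n_prac_per_pea <- 12; n_pat <- 60; n_time <- 10
beta0 <- qlogis(0.67)              # intercept on the log-odds scale (~0.708)
theta <- qlogis(0.78) - beta0      # treatment effect (~0.558)

practices <- data.frame(
  practice = 1:(n_pea * n_prac_per_pea),
  pea      = rep(1:n_pea, each = n_prac_per_pea),
  # random wave assignment, 60 practices per wave (an illustrative scheme;
  # the H2O randomization was stratified by PEA)
  wave     = sample(rep(1:4, length.out = n_pea * n_prac_per_pea))
)
practices$u <- rnorm(nrow(practices), 0, 1.23)   # practice random effect
pea_v <- rnorm(n_pea, 0, 0.18)                   # PEA random effect

dat <- merge(
  expand.grid(practice = practices$practice, time = 1:n_time,
              patient = 1:n_pat),
  practices
)
# a practice is exposed once its wave has crossed over; here we assume
# wave w crosses over at period 2w, leaving period 1 as an all-control period
dat$x <- as.integer(dat$time >= 2 * dat$wave)
eta <- beta0 + theta * dat$x + dat$u + pea_v[dat$pea]
dat$y <- rbinom(nrow(dat), 1, plogis(eta))       # binary adherence outcome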

Scenario 2: Random practice. The following model removes the random effect of practice within PEA (v_{jl}):

(3) ln[P(Y_{ijs} = 1 | X_{is}, u_j) / P(Y_{ijs} = 0 | X_{is}, u_j)] = β_0 + θX_{is} + u_j

Scenario 3: Naïve model. This model removes both the random effect of practice and the random effect of practice within PEA. It is referred to as the naïve model because it fails to account for any of the additional correlation that arises from the multiple levels of clustering in the study:

(4) ln[P(Y_{is} = 1 | X_{is}) / P(Y_{is} = 0 | X_{is})] = β_0 + θX_{is}

Confounding by calendar time

To account for potential confounding effects of calendar time under the intervention, we use the following model to generate data with an underlying confounding effect of 30%. That is, of the total observed change in the outcome over the course of the intervention, we simulate data such that 30% of the observed change is due to an extraneous time effect β_s. The idea is that, had we never introduced the intervention, some external factors (whether or not we can account for them) would still have been associated with an increase from the given baseline proportion. We introduce this confounding effect to all clusters after the third time period. Introducing confounding after this time period helps balance the total time of confounding experienced by clusters under control and clusters under intervention. Model (5) below is used to generate the data for this set of simulations. β_0, u_j, and v_{jl} are simulated as defined above. The parameter value specified for β_s is equal to 30% of the treatment effect used in the previous set of simulations (θ ≈ 0.558). Therefore, β_s = 0.30 × 0.558 ≈ 0.167, and the treatment effect makes up the remaining 70%; thus, in this set of simulations, θ = 0.70 × 0.558 ≈ 0.390. Models following the forms described in Scenarios 1 and 2 are then fit to the simulated data.
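As an implementation note, the scenario models for the clustering aim can be fit directly to a simulated data set such as dat above; a sketch using lme4::glmer is shown below (lme4 is our choice of fitting package, not one mandated by the thesis). The confounding models that follow use the same pattern, with factor(time) added as a fixed effect in the time-adjusted model.

library(lme4)

# Scenario 1: random practice nested within random PEA
m1 <- glmer(y ~ x + (1 | pea/practice), data = dat, family = binomial)

# Scenario 2: random practice only
m2 <- glmer(y ~ x + (1 | practice), data = dat, family = binomial)

# Scenario 3: naive model ignoring all clustering
m3 <- glm(y ~ x, data = dat, family = binomial)

# point estimate and standard error of the treatment effect from each fit
est <- function(fit) summary(fit)$coefficients["x", 1:2]
rbind(scenario1 = est(m1), scenario2 = est(m2), naive = est(m3))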

Scenario 1: Confounding by calendar time. We generate the data for this set of simulations to represent a confounding effect of calendar time under the intervention:

(5) ln[P(Y_{ijls} = 1 | X_{is}, u_j, v_{jl}) / P(Y_{ijls} = 0 | X_{is}, u_j, v_{jl})] = β_0 + β_s + θX_{is} + u_j + v_{jl}

Scenario 2: Naïve model. We then fit the following model to the data generated above to determine the effects of failing to account for confounding in our analysis (by removing β_s):

(6) ln[P(Y_{ijls} = 1 | X_{is}, u_j, v_{jl}) / P(Y_{ijls} = 0 | X_{is}, u_j, v_{jl})] = β_0 + θX_{is} + u_j + v_{jl}

Effect modification by time on treatment

To simulate the scenario in which the intervention effect is delayed by time under intervention, we assume that the partial treatment effect increases by 25% after every two time periods. We then generate data using the following model:

(7) ln[P(Y_{ijls} = 1 | L_{ijm}, u_j, v_{jl}) / P(Y_{ijls} = 0 | L_{ijm}, u_j, v_{jl})] = β_0 + L_{ijm}θ_m + u_j + v_{jl},

where m indexes time on treatment (m = 0 while a cluster is still in a control period; m = 1 at the first time period when the intervention is introduced); L_{ijm} = 1 indicates that patient i in practice j has been under the intervention for m time periods; and θ_m is the intervention effect at time on treatment m. Table 1 presents the assumed treatment effect (θ_m) for each time on treatment (m), with the increase in treatment effect after every two time periods.

To understand the effects of ignoring effect modification by time on treatment, we then fit the following naïve model to the data generated above. This model does not estimate a separate treatment effect at each time on treatment, but rather a single average treatment effect over the course of the study:

(8) ln[P(Y_{ijls} = 1 | X_{is}, u_j, v_{jl}) / P(Y_{ijls} = 0 | X_{is}, u_j, v_{jl})] = β_0 + β_s + θX_{is} + u_j + v_{jl}

Model Performance

Each set of simulation scenarios compares modeling approaches in terms of bias, average estimated standard error, standard deviation of the coefficients, coverage probability,

and mean square error (MSE) for the independent variables that are shared across the modeling approaches. Bias is defined as the difference between the true parameter value and the average of the estimated values from the simulation studies. The average standard error is calculated as (1/n) Σ_{i=1}^{n} SE_i, where n is the number of simulations and SE_i is the estimated standard error from the ith simulation. The standard error coefficient is the standard deviation of the distribution of the estimated coefficients. Coverage probability is defined as the proportion of times the 95% confidence interval covers the true parameter value. MSE is the sum of the bias squared and the standard error coefficient squared.

RESULTS

Multiple levels of clustering

From the simulation results presented in Table 2, we see that if we ignore all levels of clustering in our study, the average coefficients for both the intercept (baseline proportion) and the treatment effect are underestimated. The estimated intercept coefficient is 22% lower than the true parameter value; similarly, the estimated treatment effect is underestimated by 22%. For measures of bias and MSE, the reference model (random practice and random PEA effects included) did not differ from the model that accounts for only one level of clustering (random effects for practice, but no random effects for PEA). The average standard error, standard error coefficient, and coverage probabilities were equivalent between these models for the estimation of the treatment effect. However, the reference model does have a larger average standard error and coverage probability than the single-level clustering model for the intercept estimates.

The coverage probabilities drop drastically for estimation of the intercept and treatment effects when no levels of correlation are accounted for in the model. In our simulations, the 95% confidence interval contained the true intercept value only 3% of the time and only 1% of the

time for the treatment effect. The average standard error and the standard error coefficient of the intercept estimation notably decrease when all levels of correlation are ignored. The decreasing trend in average standard error is also observed for estimation of the treatment effect, but to a lesser degree. The standard error coefficient changes direction and increases for the treatment effect. MSE increases as levels of clustering are ignored.

Confounding by calendar time

As seen in the results from 500 simulations in Table 3, failing to adjust for the confounding effects of time resulted in an overestimation of the intercept and treatment effect (6% and 31%, respectively). That is, the bias of the estimates increases if time is not accounted for in the model. The average standard error and standard error coefficient measures are relatively comparable between the two modeling approaches for the estimation of the intercept and treatment effect. However, we do observe a slight decrease in the average standard error and standard error coefficient of the treatment effect when moving from the time-adjusted model to the model unadjusted for time.

The coverage probabilities decrease when time is not included in the model. The 95% confidence intervals produced for the treatment effect never contain the true parameter value across the 500 simulation iterations. When time is adjusted for in the model, the coverage probabilities for the estimation of the intercept and treatment effects are lower than 0.95. We see an increase in MSE when we fail to adjust for time in the modeling approach. This difference in MSE between the time-adjusted and unadjusted models is larger for the estimation of the treatment effect.

Effect modification by time on treatment

Failing to account for time on treatment when there was a true delayed treatment effect resulted in an overestimation of the baseline response proportion (Table 4). Calculating one treatment effect for the entire study (as opposed to calculating treatment effects for each period

when there are clusters receiving treatment) results in a misleading estimate of the overall treatment effect. The average standard error, standard error coefficient, coverage probability, and MSE measures for the estimation of the intercept are all comparable. The average standard errors and standard error coefficients for the estimation of the treatment effects are lower in the model that does not consider effect modification by time on treatment than all corresponding measures from the interaction model. Conversely, the MSE is greater in the no-interaction model than all of the MSE measures from the interaction model.

DISCUSSION

Multiple levels of clustering

Removing all clustering. These simulation studies show that ignoring all levels of clustering in the modeling approach can lead to an underestimation of the intercept and treatment coefficients. Removing all levels of clustering also resulted in lower average standard error measures. Underestimating the average standard error leads to lower confidence interval coverage, which is what we found in these simulations.

Removing partial clustering. In the context of our study, we found that accounting for only one level of clustering had no impact on the bias, standard error coefficient, or MSE measures of the estimates. However, we expect that this may not always hold true. As levels of clustering were removed from the modeling approach, we saw the average standard error, and thus the coverage probabilities, decrease for the intercept and treatment effect.

Confounding by calendar time

Estimation of intercept. In our simulation studies, we found that failing to adjust for calendar time as a potential confounder leads to an overestimation of the intercept by 6%. For the estimation of the intercept, we found minimal differences between the time-adjusted and unadjusted modeling approaches in terms of average standard error, standard error coefficient,

coverage probability, and MSE. Failing to adjust for calendar time in the modeling approach impacted the bias of the estimate, but not the precision.

Estimation of treatment effect. In our simulations, failing to account for confounding by calendar time leads to an overestimate of the effect of the intervention. In practice, using a model unadjusted for calendar time would attribute extraneous temporal effects to the treatment (if such effects exist). The differences in the average standard error between the time-adjusted and time-unadjusted models could also be of concern. This underestimation of the standard error could lead to lower coverage probabilities and, further, to false positive errors. However, in this scenario, the bias of the treatment estimate is the driver of the zero coverage probability.

Estimation of treatment, intercept, and time effects. When time is adjusted for in our model, the coverage probabilities for the estimation of the intercept, treatment, and time effects are lower than what we might expect (95% coverage). This may be because we have little information for the practices before the confounding effect is introduced (before the third time period). However, if the confounding effect were introduced at a later time period, we would have fewer, or no, control (non-intervention) time periods with the confounding effect. With no control time periods experiencing the confounding effect, we would be unable to separate the confounding effect from the treatment effect.

Effect modification by time on treatment

Estimation of intercept. Failing to account for a possible delay in the treatment effect can lead to an overestimation of the baseline proportion when there is, in fact, a delay. That is, failing to account for effect modification by time on treatment introduces bias into the estimation of the intercept. When estimating the intercept, measures such as average standard error, standard error coefficient, coverage probability, and MSE are not largely different when comparing the interaction model to the non-interaction model.

Estimation of treatment effect. When there is a delay in the uptake of treatment, we have shown that failing to account for this delay can result in misleading and over-simplified

conclusions about the effectiveness of the treatment. By ignoring the interaction in the model, the treatment effect is overestimated at earlier times on treatment and underestimated at later periods once the treatment has reached its full effect.

Limitations

The assumptions of the simulation studies include a fixed number of patients per practice, practices per step, and practices served by a given PEA. To produce a data structure that more closely represents the data collected in the H2O study, we could vary these fixed parameters. We used generalized linear mixed models to account for the correlation added by the multiple levels of clustering. Including random effects in this modeling approach limits our inferences to the individual practice or individual PEA level. In this study, it is of particular interest to understand variation at the practice and/or PEA level. However, to make inferences at the population level, other methods such as generalized estimating equations could be considered.

The biggest limitation of the first simulation study scenario (multiple levels of clustering) relates back to the assumptions of a constant number of patients per practice and a constant number of practices served by each PEA. Additionally, it would be of interest to understand how the effects of removing levels of clustering from the model change at different levels of correlation. We expect that, as we remove levels of clustering at lower levels of correlation, the underestimation of the estimate and of the average standard error will also decrease.

For the second set of simulation studies (confounding by calendar time), we did not explore different magnitudes of confounding. That is, we only considered a confounding effect by calendar time of 30%; we would also like to explore how the measures of bias and precision are impacted as confounding decreases (e.g., at 20%, 15%, and 10% confounding). In these simulations, we only considered a scenario where the confounding started after the third time period. We did not consider scenarios where confounding started at later or earlier time periods.

Additionally, we only considered a scenario where the extraneous temporal effect was positively associated with the outcome of interest. Further simulations could be done to examine the effects if the potential confounding by calendar time were associated with a decrease in the outcome of interest.

There are other scenarios that could be simulated for the last set of simulation studies (effect modification by time on treatment). We simulated the scenario where the treatment is expected to have a delayed effect; that is, there is a slow uptake of the intervention, and over time the treatment effect is fully observed. For the type of intervention delivered in the H2O study, it is also reasonable to assume that there is a delay in the uptake of the intervention but that, once the full effect is observed, it starts to taper off. In the context of this study, the delivered intervention included a large involvement of program staff. Once these program staff have left a given practice, it would be reasonable to assume that there is less motivation to continue using the skills learned from the intervention and, as a result, the treatment effect diminishes.

Conclusion

In a study design where multiple levels of clustering are present, failing to account for such clustering can lead to an underestimation of the model parameters and of the standard error. This underestimation of the standard error can ultimately lead to an increase in false positive errors. It is important to account for multi-level clustering.

SW-CRTs tend to extend over a longer period of time than more traditional types of CRTs. In this case, it is important to adjust for the potential confounding effect of calendar time. Failing to do so can lead to an overestimation of the treatment effect; that is, the effectiveness of the study intervention might be over-reported rather than attributed to extraneous effects that occur over the course of the study.

When it is expected that there might be a delay in the uptake of the intervention, it is important to include an interaction for time on treatment in the modeling approach. Failing to do so results in an estimate that overestimates the treatment effect at the

beginning of the intervention and underestimates it once the intervention effect is fully observed. A single treatment effect calculated when effect modification by time on treatment is suspected leads to an estimate that is misleading and overly simplifies the underlying treatment effect.

Funding Sources

Agency for Health Care Research and Quality (1R18HS , Daniel Duffy, PI)

Table 1. Assumed treatment effect θ_m for each time on treatment (m), with a 25% increase in treatment effect after every two time periods on treatment (the θ_m values correspond to these partial effects applied to the full treatment effect).

Time on treatment, m:       1    2    3    4    5    6    7     8     9
Partial treatment effect:  25%  25%  50%  50%  75%  75%  100%  100%  100%
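The schedule in Table 1 is straightforward to encode; a short R sketch follows, where full_theta is our own name for the full treatment effect and is assumed here to equal the effect used in the first set of simulations.

# Partial treatment effect by time on treatment (m = 1, ..., 9):
# the effect rises 25% after every two periods and is capped at 100%.
partial <- c(0.25, 0.25, 0.50, 0.50, 0.75, 0.75, 1.00, 1.00, 1.00)
full_theta <- qlogis(0.78) - qlogis(0.67)   # assumed full effect (~0.558)
theta_m <- c(0, partial * full_theta)       # theta_0 = 0 in control periods
names(theta_m) <- paste0("m=", 0:9)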

Table 2. Results from 500 programming simulations comparing modeling approaches for multiple levels of clustering. For both the intercept (true value β_0 ≈ 0.708) and the treatment effect (true value θ ≈ 0.558), the table reports the average coefficient, bias, average standard error, standard error coefficient, coverage probability, and MSE under three modeling approaches: random practice and random PEA (30 simulations failed to converge), random practice only (no convergence failures), and the naïve model.

Abbreviations: MSE = mean square error
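The performance measures reported in Tables 2-4 reduce to a few lines of R once the coefficient estimates and standard errors from the converged simulations are collected; a sketch follows, with function and argument names of our own choosing.

# est: vector of estimated coefficients from the converged simulations;
# se: their estimated standard errors; truth: the true parameter value.
performance <- function(est, se, truth) {
  bias    <- mean(est) - truth                # average estimate minus truth
  avg_se  <- mean(se)                         # average estimated std. error
  sd_coef <- sd(est)                          # empirical SD of the estimates
  covered <- mean(est - 1.96 * se <= truth &
                  truth <= est + 1.96 * se)   # 95% CI coverage of the truth
  mse     <- bias^2 + sd_coef^2               # bias squared plus variance
  c(bias = bias, avg_se = avg_se, sd_coef = sd_coef,
    coverage = covered, mse = mse)
}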

Table 3. Results from 500 programming simulations comparing modeling approaches that do and do not account for confounding by calendar time. For the intercept (true value β_0 ≈ 0.708), the treatment effect (true value θ ≈ 0.390), and the time effect (true value β_s ≈ 0.167; not estimated in the unadjusted model), the table reports the average coefficient, bias, average standard error, standard error coefficient, coverage probability, and MSE for the time-adjusted model (11 simulations failed to converge) and the model unadjusted for time.

Abbreviations: MSE = mean square error

Table 4. Results from 500 programming simulations comparing modeling approaches that do and do not account for time on treatment. For the intercept (true value β_0 ≈ 0.708) and the treatment effects θ_1 through θ_9 (Table 1), the table reports the average coefficient, bias, average standard error, standard error coefficient, coverage probability, and MSE for the no-interaction model and the interaction model; 37 simulations failed to converge.

Abbreviations: MSE = mean square error

CHAPTER III

DISCUSSION

Clinical Significance of the Study

The simulation studies presented in the manuscript above will help inform the modeling strategies for the final data analysis of the H2O project. The value of simulation studies is that we know the true parameter values and can, therefore, estimate the bias and precision of different modeling approaches. These sets of simulation studies demonstrate the importance of accounting for multiple levels of clustering, confounding by calendar time, and effect modification by time on treatment during modeling.

Statistical Findings

Aim 1

Removing all clustering. These simulation studies show that ignoring all levels of clustering in the modeling approach can lead to an underestimation of the intercept and treatment coefficients. Removing all levels of clustering also resulted in lower average standard error measures. Underestimating the average standard error leads to lower confidence interval coverage, which is what we found in these simulations.

Removing partial clustering. In the context of our study, we found that accounting for only one level of clustering had no impact on the bias, standard error coefficient, or MSE measures of the estimates. However, we expect that this may not always hold true. As levels of clustering were removed from the modeling approach, we saw the average standard error, and thus the coverage probabilities, decrease for the intercept and treatment effect.

Aim 2

Estimation of intercept. In our simulation studies, we found that failing to adjust for calendar time as a potential confounder leads to an overestimation of the intercept by 6%. For the estimation of the intercept, we found minimal differences between the time-adjusted and unadjusted modeling approaches in terms of average standard error, standard error coefficient,

coverage probability, and MSE. Failing to adjust for calendar time in the modeling approach impacted the bias of the estimate, but not the precision.

Estimation of treatment effect. In our simulations, failing to account for confounding by calendar time leads to an overestimate of the effect of the intervention. In practice, using a model unadjusted for calendar time would attribute extraneous temporal effects to the treatment (if such effects exist). The differences in the average standard error between the time-adjusted and time-unadjusted models could also be of concern. This underestimation of the standard error could lead to lower coverage probabilities and, further, to false positive errors. However, in this scenario, the bias of the treatment estimate is the driver of the zero coverage probability.

Estimation of treatment, intercept, and time effects. When time is adjusted for in our model, the coverage probabilities for the estimation of the intercept, treatment, and time effects are lower than what we might expect (95% coverage). This may be because we have little information for the practices before the confounding effect is introduced (before the third time period). However, if the confounding effect were introduced at a later time period, we would have fewer, or no, control (non-intervention) time periods with the confounding effect. With no control time periods experiencing the confounding effect, we would be unable to separate the confounding effect from the treatment effect.

Aim 3

Estimation of intercept. Failing to account for a possible delay in the treatment effect can lead to an overestimation of the baseline proportion when there is, in fact, a delay. That is, failing to account for effect modification by time on treatment introduces bias into the estimation of the intercept. When estimating the intercept, measures such as average standard error, standard error coefficient, coverage probability, and MSE are not largely different when comparing the interaction model to the non-interaction model.

Estimation of treatment effect. When there is a delay in the uptake of treatment, we have shown that failing to account for this delay can result in misleading and overly simplified conclusions about the effectiveness of the treatment. By ignoring the interaction in the model, the treatment effect is overestimated at earlier times on treatment and underestimated at later periods once the treatment has reached its full effect.

Comparison with Previous Studies

Multiple levels of clustering. Hemming et al. present a general framework for multiple levels of clustering in SWD [14]. This framework extends power calculations from single-level studies to SWD studies with multiple levels of clustering, and the study goes on to show the relationship between the ICC and power for these newly presented formulas. However, the study by Hemming et al. is one of the few to present methodology in the setting of multiple levels of clustering. After reviewing the current literature, no published simulation studies were found to which we can compare our results in terms of bias and measures of precision.

Confounding by calendar time. A previous study presented a case study to compare modeling approaches among SWD trials with repeated cross-sectional samples (as was assumed for our simulation studies) [19]. The authors compared a model that adjusted for calendar time to a naïve model that failed to do so. When the researchers failed to account for the effects of calendar time, the odds ratio of the treatment effect was 1.11 (95% CI: 0.95, 1.30); once they adjusted for time in the modeling approach, the estimated odds ratio was 0.78 (95% CI: 0.55, 1.12). While we should be cautious about comparing the results of our studies (due to study limitations and concerns about comparability), we found similarly that failing to adjust for calendar time in the modeling approach (when extraneous time effects are assumed to be present) can result in an overestimation of the treatment effect. The authors of this comparison study conclude by suggesting that further research, using simulation studies, should be conducted to better understand the consequences of model misspecification on measures of bias and precision.

A study by Barker et al. [12] also demonstrated through simulation studies that failing to account for calendar time (if there are assumed to be extraneous time effects) leads to biased estimates. These authors presented positive confounding by calendar time of approximately 20%. They ran several scenarios, but if we select the scenario that most closely matches our study (36 clusters per step, 3 steps, 50 individuals per cell, a GLMM approach with random effects, and a baseline proportion equal to 0.2), we arrive at the similar result that failing to account for (positive) confounding by time in the modeling approach can lead to an overestimation of the treatment effect.

Effect modification by time on treatment. Hemming et al. [19] present a case study that introduces an interaction of time by treatment in their modeling approach. This study is limited to using only time periods where all clusters are under control or all under intervention. However, the results of this study were uninformative, as a decrease in precision due to the complex statistical model resulted in wide confidence intervals. After reviewing the current literature, no published simulation studies were found to which we can compare our results in terms of bias and measures of precision.

Limitations

The assumptions of the simulation studies include a fixed number of patients per practice, practices per step, and practices served by a given PEA. To produce a data structure that more closely represents the data collected in the H2O study, we could vary these fixed parameters. We used generalized linear mixed models to account for the correlation added by the multiple levels of clustering. Including random effects in this modeling approach limits our inferences to the individual practice or individual PEA level. In this study, it is of particular interest to understand variation at the practice and/or PEA level. However, to make inferences at the population level, other methods such as generalized estimating equations could be considered.

The biggest limitation of the first simulation study scenario (multiple levels of clustering) relates back to the assumptions of a constant number of patients per practice and a constant number of practices served by each PEA. Additionally, it would be of interest to understand how the effects of removing levels of clustering from the model change at different levels of correlation. We expect that, as we remove levels of clustering at lower levels of correlation, the underestimation of the estimate and of the average standard error will also decrease.

For the second set of simulation studies (confounding by calendar time), we did not explore different magnitudes of confounding. That is, we only considered a confounding effect by calendar time of 30%; we would also like to explore how the measures of bias and precision are impacted as confounding decreases (e.g., at 20%, 15%, and 10% confounding). In these simulations, we only considered a scenario where the confounding started after the third time period. We did not consider scenarios where confounding started at later or earlier time periods. Additionally, we only considered a scenario where the extraneous temporal effect was positively associated with the outcome of interest. Further simulations could be done to examine the effects if the potential confounding by calendar time were associated with a decrease in the outcome of interest.

There are other scenarios that could be simulated for the last set of simulation studies (effect modification by time on treatment). We simulated the scenario where the treatment is expected to have a delayed effect; that is, there is a slow uptake of the intervention, and over time the treatment effect is fully observed. For the type of intervention delivered in the H2O study, it is also reasonable to assume that there is a delay in the uptake of the intervention but that, once the full effect is observed, it starts to taper off. In the context of this study, the delivered intervention included a large involvement of program staff. Once these program staff have left a given practice, it would be reasonable to assume that there is less motivation to continue using the skills learned from the intervention.

Further Research

While simulated data rarely capture the idiosyncrasies of real data, our simulation studies would be strengthened by loosening some of the assumptions made in our study. Further research could include varying the number of patients per practice. We assumed that there would be 60 patients per practice, but because of the rural locations of some of the practices, the amount of missing observations, the type of electronic health record used, or other factors, this number is not feasible for all practices. In future work, we would also like to incorporate the actual number of practices per step observed in the H2O study, to examine the implications, if any, of unequal step sizes on the modeling approaches.

We also assumed that the number of practices served by each PEA was constant over the duration of the study. In future research, we would like to explore the possibility of uneven numbers of practices per PEA. In practice, some PEAs might have been more proactive in the intervention process (this could be due to their workload as influenced by the number of other practices they serve, the size of the practices they serve, how far they must commute to the practices they serve, and so on). This random variation among the PEAs was included in our original simulations through random effects, but it would also be of interest to explore the impact of PEAs serving different numbers of practices. That is, what if a PEA that was highly active (and effective) in the practices served fewer practices, and a PEA that was less active (and less effective) served a larger number of practices?

The use of GLMMs in our simulation studies, and further in the analysis of the real H2O data, limits the level of inference that can be made. Further research could expand the types of modeling approaches used to generalized estimating equations or different multilevel modeling approaches.

Concerning the first aim of our study, we would like to extend the scenarios under which we consider the multilevel clustering. This set of simulations could be extended to include

different levels of correlation introduced by the random effects of practice and PEA. We would like to see the impact of lower levels of correlation on the measures of bias and precision. Additional scenarios include those where the practices have higher correlation and the PEAs lower correlation, or vice versa. The results of varying the number of patients per practice, having unequal numbers of practices per step, and introducing a variable number of practices per PEA would be of particular interest for this study objective.

For extraneous effects that occur by calendar time (our second study objective), we would like to simulate different scenarios of confounding. Further research could include varying degrees of confounding. One scenario would be to determine how the measures of bias and precision change when there is only half of the confounding that we originally introduced. It would also be of interest to shift the starting time of the confounding: these scenarios could introduce the confounding effects in earlier or later time periods. We would also like to consider a scenario where the extraneous effects are inversely associated with the outcome, that is, a situation with negative confounding. We can think of scenarios where, over the course of calendar time, different factors could decrease the proportion of time that physicians comply with ABCS measures. These factors could include (but are not limited to) the time physicians have with patients or a disruption in the workplace atmosphere.

Our third study objective focuses on the implications of heterogeneous treatment effects over time. In future work, we would like to simulate scenarios where the uptake of treatment is delayed but then tapers off. Alternatively, a scenario where the intervention has a negative effect over time could be considered. Additionally, it would be of interest to consider a situation where the treatment effect is delayed, but at a different gradient than what was originally proposed.

Summary

In a study design where multiple levels of clustering are present, failing to account for such clustering can lead to an underestimation of the model parameters and of the standard error. This underestimation of the standard error can ultimately lead to an increase in false positive errors. It is important to account for multi-level clustering.

SW-CRTs tend to extend over a longer period of time than more traditional types of CRTs. In this case, it is important to adjust for the potential confounding effect of calendar time. Failing to do so can lead to an overestimation of the treatment effect; that is, the effectiveness of the study intervention might be over-reported rather than attributed to the extraneous effects that occur over the course of the study.

When it is expected that there might be a delay in the uptake of the intervention, it is important to include an interaction for time on treatment in the modeling approach. Failing to do so results in an estimate that overestimates the treatment effect at the beginning of the intervention and underestimates it once the intervention effect is fully observed. A single treatment effect calculated when effect modification by time on treatment is suspected leads to an estimate that is misleading and an over-simplification of the underlying treatment effect.

LIST OF REFERENCES

[1] Verbeke, G. (1997). Linear mixed models for longitudinal data. In Linear mixed models in practice. Springer.
[2] Hardin, J. W., & Hilbe, J. M. (2002). Generalized estimating equations. Chapman and Hall/CRC.
[3] McCulloch, C. E., & Neuhaus, J. M. (2001). Generalized linear mixed models. Wiley Online Library.
[4] Fitzmaurice, G. M., Laird, N. M., & Ware, J. H. (2012). Applied longitudinal analysis (Vol. 998). John Wiley & Sons.
[5] McNeish, D., & Kelley, K. (2018). Fixed effects models versus mixed effects models for clustered data: Reviewing the approaches, disentangling the differences, and making recommendations. Psychological Methods.
[6] McCullagh, P. (1984). Generalized linear models. European Journal of Operational Research, 16(3).
[7] Brown, C. A., & Lilford, R. J. (2006). The stepped wedge trial design: a systematic review. BMC Med Res Methodol, 6, 54.
[8] Mdege, N. D., Man, M. S., Taylor nee Brown, C. A., & Torgerson, D. J. (2011). Systematic review of stepped wedge cluster randomized trials shows that design is particularly used to evaluate interventions during routine implementation. J Clin Epidemiol, 64(9).
[9] Beard, E., Lewis, J. J., Copas, A., et al. (2015). Stepped wedge randomised controlled trials: systematic review of studies published between 2010 and 2014. Trials, 16, 353.
[10] Barker, D., McElduff, P., D'Este, C., & Campbell, M. J. (2016). Stepped wedge cluster randomised trials: a review of the statistical methodology used and available. BMC Med Res Methodol, 16, 69.
[11] Taljaard, M., Teerenstra, S., Ivers, N. M., & Fergusson, D. A. (2016). Substantial risks associated with few clusters in cluster randomized and stepped wedge designs. Clinical Trials, 13(4).
[12] Barker, D., D'Este, C., Campbell, M. J., & McElduff, P. (2017). Minimum number of clusters and comparison of analysis methods for cross sectional stepped wedge cluster randomised trials with binary outcomes: A simulation study. Trials, 18(1), 119.
[13] Hussey, M. A., & Hughes, J. P. (2007). Design and analysis of stepped wedge cluster randomized trials. Contemp Clin Trials, 28(2).
[14] Hemming, K., Lilford, R., & Girling, A. J. (2015). Stepped-wedge cluster randomised controlled trials: a generic framework including parallel and multiple-level designs. Stat Med, 34(2).
[15] Hemming, K., Haines, T. P., Chilton, P. J., Girling, A. J., & Lilford, R. J. (2015). The stepped wedge cluster randomised trial: rationale, design, analysis, and reporting. BMJ, 350, h391.

[16] Hooper, R., Teerenstra, S., Hoop, E., & Eldridge, S. (2016). Sample size calculation for stepped wedge and other longitudinal cluster randomised trials. Statistics in Medicine, 35(26).
[17] Copas, A. J., Lewis, J. J., Thompson, J. A., Davey, C., Baio, G., & Hargreaves, J. R. (2015). Designing a stepped wedge trial: three main designs, carry-over effects and randomisation approaches. Trials, 16(1), 352.
[18] Woertman, W., de Hoop, E., Moerbeek, M., Zuidema, S. U., Gerritsen, D. L., & Teerenstra, S. (2013). Stepped wedge designs could reduce the required sample size in cluster randomized trials. J Clin Epidemiol, 66(7).
[19] Hemming, K., Taljaard, M., & Forbes, A. (2017). Analysis of cluster randomised stepped wedge trials with repeated cross-sectional samples. Trials, 18(1), 101.
[20] Heo, M., & Leon, A. C. (2008). Statistical power and sample size requirements for three level hierarchical cluster randomized trials. Biometrics, 64(4).
[21] Kenyon, S., Dann, S., Hope, L., et al. (2017). Evaluation of a bespoke training to increase uptake by midwifery teams of NICE Guidance for membrane sweeping to reduce induction of labour: a stepped wedge cluster randomised design. Trials, 18(1), 357.
[22] Hughes, J. P., Granston, T. S., & Heagerty, P. J. (2015). Current issues in the design and analysis of stepped wedge trials. Contemporary Clinical Trials, 45.
[23] Horton, N. J., & Kleinman, K. (2015). Using R and RStudio for data management, statistical analysis, and graphics. CRC Press.

APPENDICES

Appendix A: Diagram of Parallel Cluster Randomized Trial
Appendix B: Diagram of Crossover Cluster Randomized Trial
Appendix C: Diagram of Stepped-Wedge Cluster Randomized Trial
Appendix D: Diagram of Healthy Hearts for Oklahoma (H2O) Study Design
Appendix E: R code

APPENDIX A: Figure 1. Diagram of Parallel Cluster Randomized Trial

APPENDIX B: Figure 2. Diagram of Crossover Cluster Randomized Trial

APPENDIX C: Figure 3. Diagram of Stepped-Wedge Cluster Randomized Trial

APPENDIX D: Figure 4. Diagram of Healthy Hearts for Oklahoma Study Design


More information

Data Analysis in Practice-Based Research. Stephen Zyzanski, PhD Department of Family Medicine Case Western Reserve University School of Medicine

Data Analysis in Practice-Based Research. Stephen Zyzanski, PhD Department of Family Medicine Case Western Reserve University School of Medicine Data Analysis in Practice-Based Research Stephen Zyzanski, PhD Department of Family Medicine Case Western Reserve University School of Medicine Multilevel Data Statistical analyses that fail to recognize

More information

(C) Jamalludin Ab Rahman

(C) Jamalludin Ab Rahman SPSS Note The GLM Multivariate procedure is based on the General Linear Model procedure, in which factors and covariates are assumed to have a linear relationship to the dependent variable. Factors. Categorical

More information

POST GRADUATE DIPLOMA IN BIOETHICS (PGDBE) Term-End Examination June, 2016 MHS-014 : RESEARCH METHODOLOGY

POST GRADUATE DIPLOMA IN BIOETHICS (PGDBE) Term-End Examination June, 2016 MHS-014 : RESEARCH METHODOLOGY No. of Printed Pages : 12 MHS-014 POST GRADUATE DIPLOMA IN BIOETHICS (PGDBE) Term-End Examination June, 2016 MHS-014 : RESEARCH METHODOLOGY Time : 2 hours Maximum Marks : 70 PART A Attempt all questions.

More information

PubH 7405: REGRESSION ANALYSIS. Propensity Score

PubH 7405: REGRESSION ANALYSIS. Propensity Score PubH 7405: REGRESSION ANALYSIS Propensity Score INTRODUCTION: There is a growing interest in using observational (or nonrandomized) studies to estimate the effects of treatments on outcomes. In observational

More information

Multilevel IRT for group-level diagnosis. Chanho Park Daniel M. Bolt. University of Wisconsin-Madison

Multilevel IRT for group-level diagnosis. Chanho Park Daniel M. Bolt. University of Wisconsin-Madison Group-Level Diagnosis 1 N.B. Please do not cite or distribute. Multilevel IRT for group-level diagnosis Chanho Park Daniel M. Bolt University of Wisconsin-Madison Paper presented at the annual meeting

More information

Bias in regression coefficient estimates when assumptions for handling missing data are violated: a simulation study

Bias in regression coefficient estimates when assumptions for handling missing data are violated: a simulation study STATISTICAL METHODS Epidemiology Biostatistics and Public Health - 2016, Volume 13, Number 1 Bias in regression coefficient estimates when assumptions for handling missing data are violated: a simulation

More information

Certificate Courses in Biostatistics

Certificate Courses in Biostatistics Certificate Courses in Biostatistics Term I : September December 2015 Term II : Term III : January March 2016 April June 2016 Course Code Module Unit Term BIOS5001 Introduction to Biostatistics 3 I BIOS5005

More information

Palo. Alto Medical WHAT IS. combining segmented. Regression and. intervention, then. receiving the. over. The GLIMMIX. change of a. follow.

Palo. Alto Medical WHAT IS. combining segmented. Regression and. intervention, then. receiving the. over. The GLIMMIX. change of a. follow. Regression and Stepped Wedge Designs Eric C. Wong, Po-Han Foundation Researchh Institute, Palo Alto, CA ABSTRACT Impact evaluation often equires assessing the impact of a new policy, intervention, product,

More information

SCHOOL OF MATHEMATICS AND STATISTICS

SCHOOL OF MATHEMATICS AND STATISTICS Data provided: Tables of distributions MAS603 SCHOOL OF MATHEMATICS AND STATISTICS Further Clinical Trials Spring Semester 014 015 hours Candidates may bring to the examination a calculator which conforms

More information

9. Interpret a Confidence level: "To say that we are 95% confident is shorthand for..

9. Interpret a Confidence level: To say that we are 95% confident is shorthand for.. Mrs. Daniel AP Stats Chapter 8 Guided Reading 8.1 Confidence Intervals: The Basics 1. A point estimator is a statistic that 2. The value of the point estimator statistic is called a and it is our "best

More information

Introduction to Multilevel Models for Longitudinal and Repeated Measures Data

Introduction to Multilevel Models for Longitudinal and Repeated Measures Data Introduction to Multilevel Models for Longitudinal and Repeated Measures Data Today s Class: Features of longitudinal data Features of longitudinal models What can MLM do for you? What to expect in this

More information

Describe what is meant by a placebo Contrast the double-blind procedure with the single-blind procedure Review the structure for organizing a memo

Describe what is meant by a placebo Contrast the double-blind procedure with the single-blind procedure Review the structure for organizing a memo Business Statistics The following was provided by Dr. Suzanne Delaney, and is a comprehensive review of Business Statistics. The workshop instructor will provide relevant examples during the Skills Assessment

More information

MEA DISCUSSION PAPERS

MEA DISCUSSION PAPERS Inference Problems under a Special Form of Heteroskedasticity Helmut Farbmacher, Heinrich Kögel 03-2015 MEA DISCUSSION PAPERS mea Amalienstr. 33_D-80799 Munich_Phone+49 89 38602-355_Fax +49 89 38602-390_www.mea.mpisoc.mpg.de

More information

OLS Regression with Clustered Data

OLS Regression with Clustered Data OLS Regression with Clustered Data Analyzing Clustered Data with OLS Regression: The Effect of a Hierarchical Data Structure Daniel M. McNeish University of Maryland, College Park A previous study by Mundfrom

More information

Analysis of Environmental Data Conceptual Foundations: En viro n m e n tal Data

Analysis of Environmental Data Conceptual Foundations: En viro n m e n tal Data Analysis of Environmental Data Conceptual Foundations: En viro n m e n tal Data 1. Purpose of data collection...................................................... 2 2. Samples and populations.......................................................

More information

DRAFT (Final) Concept Paper On choosing appropriate estimands and defining sensitivity analyses in confirmatory clinical trials

DRAFT (Final) Concept Paper On choosing appropriate estimands and defining sensitivity analyses in confirmatory clinical trials DRAFT (Final) Concept Paper On choosing appropriate estimands and defining sensitivity analyses in confirmatory clinical trials EFSPI Comments Page General Priority (H/M/L) Comment The concept to develop

More information

Data and Statistics 101: Key Concepts in the Collection, Analysis, and Application of Child Welfare Data

Data and Statistics 101: Key Concepts in the Collection, Analysis, and Application of Child Welfare Data TECHNICAL REPORT Data and Statistics 101: Key Concepts in the Collection, Analysis, and Application of Child Welfare Data CONTENTS Executive Summary...1 Introduction...2 Overview of Data Analysis Concepts...2

More information

Citation for published version (APA): Ebbes, P. (2004). Latent instrumental variables: a new approach to solve for endogeneity s.n.

Citation for published version (APA): Ebbes, P. (2004). Latent instrumental variables: a new approach to solve for endogeneity s.n. University of Groningen Latent instrumental variables Ebbes, P. IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please check the document

More information

Downloaded from:

Downloaded from: Arnup, SJ; Forbes, AB; Kahan, BC; Morgan, KE; McKenzie, JE (2016) The quality of reporting in cluster randomised crossover trials: proposal for reporting items and an assessment of reporting quality. Trials,

More information

Introduction to Multilevel Models for Longitudinal and Repeated Measures Data

Introduction to Multilevel Models for Longitudinal and Repeated Measures Data Introduction to Multilevel Models for Longitudinal and Repeated Measures Data Today s Class: Features of longitudinal data Features of longitudinal models What can MLM do for you? What to expect in this

More information

EPI 200C Final, June 4 th, 2009 This exam includes 24 questions.

EPI 200C Final, June 4 th, 2009 This exam includes 24 questions. Greenland/Arah, Epi 200C Sp 2000 1 of 6 EPI 200C Final, June 4 th, 2009 This exam includes 24 questions. INSTRUCTIONS: Write all answers on the answer sheets supplied; PRINT YOUR NAME and STUDENT ID NUMBER

More information

CS08 SHAUN TREWEEK UNIVERSITY OF ABERDEEN

CS08 SHAUN TREWEEK UNIVERSITY OF ABERDEEN IMPROVING THE EVIDENCE BASE FOR TRIAL PROCESS DECISION-MAKING: STUDIES WITHIN A TRIAL (SWAT) AND WHY WE SHOULD BE COORDINATING AND DOING MORE OF THEM SHAUN TREWEEK UNIVERSITY OF ABERDEEN The problem The

More information

Beyond the intention-to treat effect: Per-protocol effects in randomized trials

Beyond the intention-to treat effect: Per-protocol effects in randomized trials Beyond the intention-to treat effect: Per-protocol effects in randomized trials Miguel Hernán DEPARTMENTS OF EPIDEMIOLOGY AND BIOSTATISTICS Intention-to-treat analysis (estimator) estimates intention-to-treat

More information

Chapter 1: Exploring Data

Chapter 1: Exploring Data Chapter 1: Exploring Data Key Vocabulary:! individual! variable! frequency table! relative frequency table! distribution! pie chart! bar graph! two-way table! marginal distributions! conditional distributions!

More information

Business Statistics Probability

Business Statistics Probability Business Statistics The following was provided by Dr. Suzanne Delaney, and is a comprehensive review of Business Statistics. The workshop instructor will provide relevant examples during the Skills Assessment

More information

Design and Analysis Plan Quantitative Synthesis of Federally-Funded Teen Pregnancy Prevention Programs HHS Contract #HHSP I 5/2/2016

Design and Analysis Plan Quantitative Synthesis of Federally-Funded Teen Pregnancy Prevention Programs HHS Contract #HHSP I 5/2/2016 Design and Analysis Plan Quantitative Synthesis of Federally-Funded Teen Pregnancy Prevention Programs HHS Contract #HHSP233201500069I 5/2/2016 Overview The goal of the meta-analysis is to assess the effects

More information

The SAGE Encyclopedia of Educational Research, Measurement, and Evaluation Multivariate Analysis of Variance

The SAGE Encyclopedia of Educational Research, Measurement, and Evaluation Multivariate Analysis of Variance The SAGE Encyclopedia of Educational Research, Measurement, Multivariate Analysis of Variance Contributors: David W. Stockburger Edited by: Bruce B. Frey Book Title: Chapter Title: "Multivariate Analysis

More information

Bayesian Joint Modelling of Benefit and Risk in Drug Development

Bayesian Joint Modelling of Benefit and Risk in Drug Development Bayesian Joint Modelling of Benefit and Risk in Drug Development EFSPI/PSDM Safety Statistics Meeting Leiden 2017 Disclosure is an employee and shareholder of GSK Data presented is based on human research

More information

Chapter 21 Multilevel Propensity Score Methods for Estimating Causal Effects: A Latent Class Modeling Strategy

Chapter 21 Multilevel Propensity Score Methods for Estimating Causal Effects: A Latent Class Modeling Strategy Chapter 21 Multilevel Propensity Score Methods for Estimating Causal Effects: A Latent Class Modeling Strategy Jee-Seon Kim and Peter M. Steiner Abstract Despite their appeal, randomized experiments cannot

More information

A Case Study: Two-sample categorical data

A Case Study: Two-sample categorical data A Case Study: Two-sample categorical data Patrick Breheny January 31 Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 1/43 Introduction Model specification Continuous vs. mixture priors Choice

More information

Meta-analysis using HLM 1. Running head: META-ANALYSIS FOR SINGLE-CASE INTERVENTION DESIGNS

Meta-analysis using HLM 1. Running head: META-ANALYSIS FOR SINGLE-CASE INTERVENTION DESIGNS Meta-analysis using HLM 1 Running head: META-ANALYSIS FOR SINGLE-CASE INTERVENTION DESIGNS Comparing Two Meta-Analysis Approaches for Single Subject Design: Hierarchical Linear Model Perspective Rafa Kasim

More information

Methodological aspects of non-inferiority and equivalence trials

Methodological aspects of non-inferiority and equivalence trials Methodological aspects of non-inferiority and equivalence trials Department of Clinical Epidemiology Leiden University Medical Center Capita Selecta 09-12-2014 1 Why RCTs Early 1900: Evidence based on

More information

Certificate Program in Practice-Based. Research Methods. PBRN Methods: Clustered Designs. Session 8 - January 26, 2017

Certificate Program in Practice-Based. Research Methods. PBRN Methods: Clustered Designs. Session 8 - January 26, 2017 Certificate Program in Practice-Based L. Miriam Dickinson, PhD Professor, University of Colorado School of Medicine Department of Family Medicine Research Methods PBRN Methods: Clustered Designs Session

More information

Regression Discontinuity Analysis

Regression Discontinuity Analysis Regression Discontinuity Analysis A researcher wants to determine whether tutoring underachieving middle school students improves their math grades. Another wonders whether providing financial aid to low-income

More information

Experimental and Quasi-Experimental designs

Experimental and Quasi-Experimental designs External Validity Internal Validity NSG 687 Experimental and Quasi-Experimental designs True experimental designs are characterized by three "criteria for causality." These are: 1) The cause (independent

More information

Chapter 02. Basic Research Methodology

Chapter 02. Basic Research Methodology Chapter 02 Basic Research Methodology Definition RESEARCH Research is a quest for knowledge through diligent search or investigation or experimentation aimed at the discovery and interpretation of new

More information

Analysis of Rheumatoid Arthritis Data using Logistic Regression and Penalized Approach

Analysis of Rheumatoid Arthritis Data using Logistic Regression and Penalized Approach University of South Florida Scholar Commons Graduate Theses and Dissertations Graduate School November 2015 Analysis of Rheumatoid Arthritis Data using Logistic Regression and Penalized Approach Wei Chen

More information

An Introduction to Bayesian Statistics

An Introduction to Bayesian Statistics An Introduction to Bayesian Statistics Robert Weiss Department of Biostatistics UCLA Fielding School of Public Health robweiss@ucla.edu Sept 2015 Robert Weiss (UCLA) An Introduction to Bayesian Statistics

More information

Lecture 14: Adjusting for between- and within-cluster covariates in the analysis of clustered data May 14, 2009

Lecture 14: Adjusting for between- and within-cluster covariates in the analysis of clustered data May 14, 2009 Measurement, Design, and Analytic Techniques in Mental Health and Behavioral Sciences p. 1/3 Measurement, Design, and Analytic Techniques in Mental Health and Behavioral Sciences Lecture 14: Adjusting

More information

Statistical Power Sampling Design and sample Size Determination

Statistical Power Sampling Design and sample Size Determination Statistical Power Sampling Design and sample Size Determination Deo-Gracias HOUNDOLO Impact Evaluation Specialist dhoundolo@3ieimpact.org Outline 1. Sampling basics 2. What do evaluators do? 3. Statistical

More information

Empirical assessment of univariate and bivariate meta-analyses for comparing the accuracy of diagnostic tests

Empirical assessment of univariate and bivariate meta-analyses for comparing the accuracy of diagnostic tests Empirical assessment of univariate and bivariate meta-analyses for comparing the accuracy of diagnostic tests Yemisi Takwoingi, Richard Riley and Jon Deeks Outline Rationale Methods Findings Summary Motivating

More information

Current Directions in Mediation Analysis David P. MacKinnon 1 and Amanda J. Fairchild 2

Current Directions in Mediation Analysis David P. MacKinnon 1 and Amanda J. Fairchild 2 CURRENT DIRECTIONS IN PSYCHOLOGICAL SCIENCE Current Directions in Mediation Analysis David P. MacKinnon 1 and Amanda J. Fairchild 2 1 Arizona State University and 2 University of South Carolina ABSTRACT

More information

Challenges of Observational and Retrospective Studies

Challenges of Observational and Retrospective Studies Challenges of Observational and Retrospective Studies Kyoungmi Kim, Ph.D. March 8, 2017 This seminar is jointly supported by the following NIH-funded centers: Background There are several methods in which

More information

CHAPTER NINE DATA ANALYSIS / EVALUATING QUALITY (VALIDITY) OF BETWEEN GROUP EXPERIMENTS

CHAPTER NINE DATA ANALYSIS / EVALUATING QUALITY (VALIDITY) OF BETWEEN GROUP EXPERIMENTS CHAPTER NINE DATA ANALYSIS / EVALUATING QUALITY (VALIDITY) OF BETWEEN GROUP EXPERIMENTS Chapter Objectives: Understand Null Hypothesis Significance Testing (NHST) Understand statistical significance and

More information

Missing data. Patrick Breheny. April 23. Introduction Missing response data Missing covariate data

Missing data. Patrick Breheny. April 23. Introduction Missing response data Missing covariate data Missing data Patrick Breheny April 3 Patrick Breheny BST 71: Bayesian Modeling in Biostatistics 1/39 Our final topic for the semester is missing data Missing data is very common in practice, and can occur

More information

Protocol Development: The Guiding Light of Any Clinical Study

Protocol Development: The Guiding Light of Any Clinical Study Protocol Development: The Guiding Light of Any Clinical Study Susan G. Fisher, Ph.D. Chair, Department of Clinical Sciences 1 Introduction Importance/ relevance/ gaps in knowledge Specific purpose of the

More information

Chapter Three: Sampling Methods

Chapter Three: Sampling Methods Chapter Three: Sampling Methods The idea of this chapter is to make sure that you address sampling issues - even though you may be conducting an action research project and your sample is "defined" by

More information

Authors Introduction to Pragmatic Trials. and Stepped Wedge Designs. Sandra Eldridge

Authors Introduction to Pragmatic Trials. and Stepped Wedge Designs. Sandra Eldridge Title 2 nd KCE TRIALS SYMPOSIUM 28 th November 2017 Authors Introduction to Pragmatic Trials Including Centre for Cluster Care & Randomised Public Health Trials Blizard Institute and Stepped Wedge Designs

More information

BIOSTATISTICAL METHODS

BIOSTATISTICAL METHODS BIOSTATISTICAL METHODS FOR TRANSLATIONAL & CLINICAL RESEARCH PROPENSITY SCORE Confounding Definition: A situation in which the effect or association between an exposure (a predictor or risk factor) and

More information

The Late Pretest Problem in Randomized Control Trials of Education Interventions

The Late Pretest Problem in Randomized Control Trials of Education Interventions The Late Pretest Problem in Randomized Control Trials of Education Interventions Peter Z. Schochet ACF Methods Conference, September 2012 In Journal of Educational and Behavioral Statistics, August 2010,

More information

CHL 5225 H Advanced Statistical Methods for Clinical Trials. CHL 5225 H The Language of Clinical Trials

CHL 5225 H Advanced Statistical Methods for Clinical Trials. CHL 5225 H The Language of Clinical Trials CHL 5225 H Advanced Statistical Methods for Clinical Trials Two sources for course material 1. Electronic blackboard required readings 2. www.andywillan.com/chl5225h code of conduct course outline schedule

More information

Longitudinal and Hierarchical Analytic Strategies for OAI Data

Longitudinal and Hierarchical Analytic Strategies for OAI Data Longitudinal and Hierarchical Analytic Strategies for OAI Data Charles E. McCulloch, Division of Biostatistics, Dept of Epidemiology and Biostatistics, UCSF OARSI Montreal September 10, 2009 Outline 1.

More information

Introduction to diagnostic accuracy meta-analysis. Yemisi Takwoingi October 2015

Introduction to diagnostic accuracy meta-analysis. Yemisi Takwoingi October 2015 Introduction to diagnostic accuracy meta-analysis Yemisi Takwoingi October 2015 Learning objectives To appreciate the concept underlying DTA meta-analytic approaches To know the Moses-Littenberg SROC method

More information

Describe what is meant by a placebo Contrast the double-blind procedure with the single-blind procedure Review the structure for organizing a memo

Describe what is meant by a placebo Contrast the double-blind procedure with the single-blind procedure Review the structure for organizing a memo Please note the page numbers listed for the Lind book may vary by a page or two depending on which version of the textbook you have. Readings: Lind 1 11 (with emphasis on chapters 10, 11) Please note chapter

More information

Lisa Yelland. BMa&CompSc (Hons)

Lisa Yelland. BMa&CompSc (Hons) Statistical Issues Associated with the Analysis of Binary Outcomes in Randomised Controlled Trials when the Effect Measure of Interest is the Relative Risk Lisa Yelland BMa&CompSc (Hons) Discipline of

More information

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida Vol. 2 (1), pp. 22-39, Jan, 2015 http://www.ijate.net e-issn: 2148-7456 IJATE A Comparison of Logistic Regression Models for Dif Detection in Polytomous Items: The Effect of Small Sample Sizes and Non-Normality

More information

Tutorial 3: MANOVA. Pekka Malo 30E00500 Quantitative Empirical Research Spring 2016

Tutorial 3: MANOVA. Pekka Malo 30E00500 Quantitative Empirical Research Spring 2016 Tutorial 3: Pekka Malo 30E00500 Quantitative Empirical Research Spring 2016 Step 1: Research design Adequacy of sample size Choice of dependent variables Choice of independent variables (treatment effects)

More information

Ecological Statistics

Ecological Statistics A Primer of Ecological Statistics Second Edition Nicholas J. Gotelli University of Vermont Aaron M. Ellison Harvard Forest Sinauer Associates, Inc. Publishers Sunderland, Massachusetts U.S.A. Brief Contents

More information

WATCHMAN PROTECT AF Study Rev. 6

WATCHMAN PROTECT AF Study Rev. 6 WATCHMAN PROTECT AF Study Rev. 6 Protocol Synopsis Title WATCHMAN Left Atrial Appendage System for Embolic PROTECTion in Patients with Atrial Fibrillation (PROTECT AF) Sponsor Atritech/Boston Scientific

More information

Jinhui Ma 1,2,3, Parminder Raina 1,2, Joseph Beyene 1 and Lehana Thabane 1,3,4,5*

Jinhui Ma 1,2,3, Parminder Raina 1,2, Joseph Beyene 1 and Lehana Thabane 1,3,4,5* Ma et al. BMC Medical Research Methodology 2013, 13:9 RESEARCH ARTICLE Open Access Comparison of population-averaged and cluster-specific models for the analysis of cluster randomized trials with missing

More information

Impact and adjustment of selection bias. in the assessment of measurement equivalence

Impact and adjustment of selection bias. in the assessment of measurement equivalence Impact and adjustment of selection bias in the assessment of measurement equivalence Thomas Klausch, Joop Hox,& Barry Schouten Working Paper, Utrecht, December 2012 Corresponding author: Thomas Klausch,

More information

Donna L. Coffman Joint Prevention Methodology Seminar

Donna L. Coffman Joint Prevention Methodology Seminar Donna L. Coffman Joint Prevention Methodology Seminar The purpose of this talk is to illustrate how to obtain propensity scores in multilevel data and use these to strengthen causal inferences about mediation.

More information

IAPT: Regression. Regression analyses

IAPT: Regression. Regression analyses Regression analyses IAPT: Regression Regression is the rather strange name given to a set of methods for predicting one variable from another. The data shown in Table 1 and come from a student project

More information

HPS301 Exam Notes- Contents

HPS301 Exam Notes- Contents HPS301 Exam Notes- Contents Week 1 Research Design: What characterises different approaches 1 Experimental Design 1 Key Features 1 Criteria for establishing causality 2 Validity Internal Validity 2 Threats

More information

2.75: 84% 2.5: 80% 2.25: 78% 2: 74% 1.75: 70% 1.5: 66% 1.25: 64% 1.0: 60% 0.5: 50% 0.25: 25% 0: 0%

2.75: 84% 2.5: 80% 2.25: 78% 2: 74% 1.75: 70% 1.5: 66% 1.25: 64% 1.0: 60% 0.5: 50% 0.25: 25% 0: 0% Capstone Test (will consist of FOUR quizzes and the FINAL test grade will be an average of the four quizzes). Capstone #1: Review of Chapters 1-3 Capstone #2: Review of Chapter 4 Capstone #3: Review of

More information