Large simple randomized trials and therapeutic decisions

Adam La Caze

November 3, 2011

Contents

1 Background
2 Large simple randomized trials
  2.1 The argument for large simple trials
  2.2 The assumptions regarding Type S errors are unmotivated
  2.3 Accurate estimation of effects is important for therapeutic decisions
3 Alternative approaches
  3.1 Rothwell's rules
  3.2 Model-based approaches

1 Background

Large simple randomized trials

Salim Yusuf, Rory Collins and Richard Peto (1984) argue that large and simple randomized trials are the best way to detect small to modest benefits in important endpoints such as death. Simple trials have:
- Simple inclusion and exclusion criteria
- A simple intervention
- A focus on a single important endpoint, e.g. mortality
- Minimal prognostic data collection

Yusuf et al. argue that such trials are reliable and clinically relevant. Not all randomized trials are simple randomized trials, but there are numerous examples of large and simple trials; Yusuf et al. have conducted (and continue to conduct) many. (See, for instance, the Population Health Research Institute and the POISE study (2008).)

I focus on the arguments given for large simple randomized trials and the limitations of these arguments, especially in terms of therapeutic decisions. Much of my focus is negative, but I will sketch an alternative approach (after all, whatever the limitations of large simple randomized trials, we would be stuck with them if they provided the only feasible approach). I don't suggest that large simple trials are not clinically relevant; on the contrary, they can answer an important therapeutic question (a population question) quite well. Rather, I suggest (i) that large simple trials don't do as well on certain (individual) therapeutic decisions as Yusuf et al. argue, and (ii), somewhat more speculatively, that alternative methods can do better on these individual therapeutic decisions.

Examples

Example 1 (ISIS-2). ISIS-2 (1988) randomized 17,187 patients suffering acute myocardial infarction in 16 countries to treatment with either streptokinase, aspirin, streptokinase and aspirin, or placebo. ISIS-2 resolved important clinical controversies about the use of aspirin and thrombolytics in patients suffering an acute myocardial infarction. It changed practice and, when compared to what was then standard practice, it has saved lives. Peto et al. (1995, 25) cite a survey showing that routine use of aspirin in acute coronary care went from under 10% in 1987 to over 90% in 1989.

Example 2 (POISE).
POISE (2008) randomized 8351 patients undergoing non-cardiac surgery to peri-operative metoprolol or placebo.
- Perioperative metoprolol reduced the primary endpoint (cardiovascular death, non-fatal myocardial infarction, non-fatal cardiac arrest): 5.8% vs 6.9%, p=0.0399
- But it increased total mortality: 3.1% vs 2.3%, p=0.0317; and stroke: 1.0% vs 0.5%, p=0.0053

Example 3 (POISE-2). POISE-2 is currently enrolling patients undergoing non-cardiac surgery to test the effects of aspirin (or placebo) and clonidine (or placebo) on mortality and non-fatal myocardial infarction.
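The reported POISE p-values can be roughly reproduced from the published percentages with a simple two-proportion z-test. This is only a sketch: the per-arm sizes and event counts below are back-calculated approximations (the trial's own analysis used different methods and exact counts).

```python
import math

def two_prop_p(x1, n1, x2, n2):
    """Two-sided p-value for a difference in two proportions
    (normal approximation with a pooled variance estimate)."""
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = abs(x1 / n1 - x2 / n2) / se
    return math.erfc(z / math.sqrt(2))

# Approximate per-arm sizes and event counts back-calculated from the
# reported percentages (8351 patients split roughly evenly, 5.8% vs 6.9%)
n = 8351 // 2
p = two_prop_p(round(0.058 * n), n, round(0.069 * n), n)
print(f"primary endpoint: p ≈ {p:.3f}")  # close to the reported p=0.0399
```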
The function of randomized trials

1. Regulatory approval for the marketing of medicines: are the benefits of the drug likely to outweigh the harms in a population of patients?
2. Informing therapeutic decisions: are the benefits of the drug likely to outweigh the harms in an individual patient?

The subgroup problem

In the hope of individualising therapy, it is tempting to match the characteristics of the individual to a relevant group of patients within the trial. But the statistical properties of subgroup analyses are often (very) poor.

Brookes et al. (2001) conducted a simulation study to quantify the risks of false-positive and false-negative results in subgroup analyses:
- When there was no overall effect from treatment, 7–26% of trials showed at least one subgroup analysis with a statistically significant result.
- When there was an overall effect from treatment, only one of the two subgroup analyses gave statistically significant results in 41–66% of trials.

There are statistical approaches to improve inferences about subgroup effects, but none resolve the problem: formal tests of subgroup interactions reduce the false positives (but most trials are underpowered to assess them).

Responses to the subgroup problem

Alvan Feinstein (1984, p. 421) made the following comment in discussion of Yusuf et al. (1984):

The main problem, it seems to me, is again the question of whether we are evaluating two treatments or are we evaluating treatments for the care of patients? The different kinds of patient that are being lumped together into these heterogeneous pastiches under the name of the same disease or under the name of the same therapeutic agents may produce results with excellent statistical ability to compare two treatments, but will be relatively worthless when people try to use the consequences in practice. What is required is a degree of humility in the face of an issue for which there is no statistical or clinical solution. [...]

The development of randomised clinical trials since Mackenzie's time has provided a much sounder basis for making decisions about abstract patients and, if representative samples of patients are included in the trials, for deciding if the overall effect on population health of a treatment is beneficial or harmful. Randomised trials have not, however, answered the question of which individuals actually benefit from medical interventions. This, surely, is the key issue in clinical research for the next millennium. (Smith and Egger, 1998)

Rothwell (2007c, 142) comments on the logic of the argument that what matters is overall benefits and harms:

The need for reliable data on risks and benefits in subgroups and individuals is greatest for potentially harmful interventions, such as warfarin or carotid endarterectomy, which are of overall benefit but which kill or disable a significant proportion of patients. Yet, evidence-based guidelines usually recommend these treatments in all cases similar to those in the relevant RCTs. In considering this approach, it is useful to draw an analogy with the criminal justice system. Suppose that research showed that individuals charged by the police with certain crimes were usually guilty. Few would argue that they should therefore be sentenced without trial. Automatic sentencing would, on average, do more good than harm, with most criminals correctly convicted, but any avoidable miscarriages of justice are widely regarded as unacceptable. In contrast, relatively high rates of treatment-related death or disability ("miscarriages of treatment") are tolerated by the medical scientific community precisely on the basis that, on average, treatment will do more good than harm.
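The subgroup problem that Brookes et al. quantify can be reproduced in miniature. The sketch below makes illustrative assumptions (a binary outcome, two pre-defined subgroups, a naive per-subgroup test at the 5% level): it simulates null trials, in which treatment has no effect anywhere, and counts how often at least one subgroup analysis is nonetheless "significant". With two subgroup analyses the rate is about 10%, consistent with the lower end of the 7–26% range Brookes et al. report (their higher figures involve more subgroup analyses).

```python
import math
import random

def two_prop_p(x1, n1, x2, n2):
    """Two-sided two-proportion z-test p-value (pooled normal approximation)."""
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    z = abs(x1 / n1 - x2 / n2) / se
    return math.erfc(z / math.sqrt(2))

random.seed(1)
n_sims, n_per_cell, p_event = 2000, 100, 0.3  # illustrative values
hits = 0
for _ in range(n_sims):
    sig = False
    for _subgroup in range(2):  # e.g. men and women, analysed separately
        # Treatment has no effect: both arms share the same event rate
        x_trt = sum(random.random() < p_event for _ in range(n_per_cell))
        x_ctl = sum(random.random() < p_event for _ in range(n_per_cell))
        if two_prop_p(x_trt, n_per_cell, x_ctl, n_per_cell) < 0.05:
            sig = True
    hits += sig
print(f"null trials with a 'significant' subgroup: {hits / n_sims:.1%}")
```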
Model-based drug development

Model-based drug development uses mathematical models to account for and predict variation in pharmacological, pharmacokinetic and pharmacodynamic relationships over time:

Dose → (PK) → Exposure; Exposure → (PD) → Biomarker response

Sheiner (1997) argued that a key source of inefficiency in drug development was the insufficient use of the right kind of methodological tools in the learning phases of drug development. Additional relationships that are modelled include: Target → Drug activity, and Biomarker response → Clinical outcome.
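The dose → exposure → response chain can be made concrete with the simplest possible models. This is only a sketch: the one-compartment bolus PK model and Emax PD model, with invented parameter values, stand in for whatever models a real development program would use.

```python
import math

def concentration(t, dose, volume, clearance):
    """One-compartment PK after an IV bolus: C(t) = (dose/V) * exp(-(CL/V) * t)."""
    return (dose / volume) * math.exp(-(clearance / volume) * t)

def biomarker_response(c, emax, ec50):
    """Emax PD model linking exposure (concentration) to biomarker response."""
    return emax * c / (ec50 + c)

# Chain the two: dose -> (PK) -> exposure -> (PD) -> biomarker response
# (all parameter values below are invented for illustration)
c = concentration(t=2.0, dose=100.0, volume=50.0, clearance=5.0)
r = biomarker_response(c, emax=1.0, ec50=0.5)
print(f"concentration at t=2: {c:.3f}; biomarker response: {r:.3f}")
```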
2 Large simple randomized trials

2.1 The argument for large simple trials

Outline of the argument in Yusuf et al. (1984):
1. Effects of an intervention on an important endpoint (e.g. death) are likely to be modest
2. Well-conducted large randomized trials are more reliable in testing the modest effects of an intervention than observational studies
3. To be feasible, large trials have to be simple
4. Large simple trials are clinically relevant

Simple = broad and practical enrolment; little prognostic data collected; focus on a single endpoint
Relevant = relevance of the overall effect cf. subgroups

Well-conducted randomised trials have to be large to reliably test for modest effects:

This can be illustrated by a hypothetical trial that is actually quite inadequate... in which a 20 per cent reduction in mortality is supposed to be detected among 2000 patients (1000 treated and 1000 not).... Even if exactly this difference were observed, however, it would not be conventionally significant (P = 0.1).... (Yusuf et al., 1984, 412)

Hence, "reliable" is construed within the context of a frequentist hypothesis test.

Large simple trials are clinically relevant:

A key principle underlying the argument that clinical trials can be simple and yet provide medically relevant conclusions involves careful distinction between quantitative interactions and qualitative interactions. (Yusuf et al., 1984, 413)
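Yusuf et al.'s hypothetical trial above can be checked directly. Assuming, for illustration, a 10% control-arm mortality (the baseline rate is not given in the quote), a 20% relative reduction means roughly 80 vs 100 deaths per 1000, which a two-proportion z-test leaves short of conventional significance:

```python
import math

def two_prop_p(x1, n1, x2, n2):
    """Two-sided two-proportion z-test (pooled normal approximation)."""
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = abs(x1 / n1 - x2 / n2) / se
    return math.erfc(z / math.sqrt(2))

# 1000 patients per arm; assumed 10% control mortality, 20% relative reduction
p = two_prop_p(80, 1000, 100, 1000)
print(f"p ≈ {p:.2f}")  # not conventionally significant, as Yusuf et al. note
```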
Quantitative and qualitative interactions

- Quantitative interactions: different magnitude, same direction
- Qualitative interactions: different direction, i.e. benefit in one subgroup, harm in another

Andrew Gelman (see, for instance, Gelman and Tuerlinckx, 2000) calls the errors that can arise from undetected interactions Type M (magnitude) errors and Type S (sign) errors.

Unanticipated qualitative interactions (Type S errors):

... unanticipated qualitative interactions (whereby treatment is of substantial benefit among one recognizable category of patients in a trial and not among another) are probably extremely rare, even though in retrospective subgroup analysis they may seem extremely common. Our expectation is not that all qualitative interactions are unlikely, but merely that unanticipated qualitative interactions are unlikely... (Yusuf et al., 1984, 413)

Qualitative interactions are either (i) unanticipated and unlikely, or (ii) anticipated and incorporated into the specification of the trial.

Shorter Yusuf et al. 1984:
1. Conduct large simple trials because frequentist statistical approaches require them for identifying small to modest effect sizes
2. Individual therapeutic decisions can be based on the overall trial results because:
   (a) Frequentist statistical analyses of subgroup data are unreliable or infeasible
   (b) Unanticipated Type S errors are rare
   (c) Anticipated Type S errors are avoided in trial design
   (d) Type M errors are common but unimportant to decisions

Quotes representing this view:
The treatment that is appropriate for one patient may be inappropriate for another. Ideally, therefore, what is wanted is not only an answer to the question "Is this treatment helpful on average for a wide range of patients?", but also an answer to the question "For which recognisable categories of patient is this treatment helpful?" This ideal is, however, difficult to attain, for the direct use of clinical trial results in particular subgroups of patients is surprisingly unreliable. (Peto et al., 1995, p. 35)

There are two main remedies for this unavoidable conflict between the reliable subgroup-specific conclusions that doctors want and the unreliable findings that direct subgroup analyses can usually offer. But, the extent to which these remedies are helpful in particular instances is one on which informed judgements differ. The first is to emphasise chiefly the overall results for particular outcomes as a guide (or at least a context for speculation) as to the qualitative results in various specific subgroups of patients, and to give proportionally less weight to the actual results in that subgroup than to extrapolation of the overall results. The second is to be influenced, in discussing the likely effects on mortality in specific subgroups, not only by the mortality in these subgroups, but also by the analyses of recurrence-free survival or some other surrogate outcome. (Peto et al., 1995, p. 35)

2.2 Problems with the argument for clinical relevance

Clinically important differences in response are the norm. Rothwell (2007c) provides numerous examples where clinically important differences in treatment effect arise:
- Heterogeneity related to risk (risk of treatment and risk without treatment)
- Heterogeneity related to pathophysiology
- Heterogeneity related to stage of disease and timing of intervention
- Heterogeneity related to comorbidity

Not all of these sources of heterogeneity are known to occur in specific cases and so cannot be anticipated.
And even when a difference in treatment effect can be anticipated, it can't always be incorporated when specifying the trial. Examples:
- Risk: difference in absolute risk without treatment (hypertension and stroke)
- Pathophysiology: genetic variation (response to treatment); aspirin in cardiovascular disease (CRP)
- Timing: thrombolytics; lipid agents with differing LDL levels
- Comorbidity: thiazides and beta-blockers in hypertensive patients with diabetes

Type S errors are to be expected

Progress in clinical science continuously identifies groups of patients who are particularly benefited or harmed by a therapy.

Example 4 (ISIS-2). Since ISIS-2, much of the progress in the use of thrombolytic therapy has been in being able to identify groups of patients who respond differently based on their ECG (e.g. ST-segment depression) and area of infarct.

Therapeutic decisions depend on effect sizes

Therapeutic decisions in individuals require weighing likely benefits against potential risks: it is not just the overall direction of the effect; estimating the magnitude is critical for decisions. Effect sizes vary in different patient subgroups, hence the risk of Type M errors is high.

Example 5 (Antihypertensives). Table 1 compares two antihypertensive trials (MRC and STOP). MRC was conducted on relatively young (otherwise healthy) patients. STOP was conducted in elderly patients with multiple comorbidities.

Table 1: Rothwell (2007c, 141)

  Trial    RR      95% CI       ARR     95% CI
  MRC      0.55    0.40–0.75    0.12    0.06–1.17
  STOP     0.53    0.33–0.86    1.45    0.45–2.45
  Heterogeneity:   p=0.90               p=0.009

Despite providing effects in the same direction, estimates of effect size separate two treatments that provide benefits overall. And estimates are critical when weighing up the risks and benefits.

Yusuf et al. agree that quantitative interactions in subgroups are common (i.e. the possibility of Type M errors), but don't seem to recognise the importance of this point for therapeutic decision makers. The importance of accurately estimating treatment effects undermines the clinical relevance of large simple trials.
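Why Type M errors matter for decisions can be seen in the arithmetic of absolute risk: the same relative risk implies very different absolute benefit (and number needed to treat) at different baseline risks. A minimal sketch, using hypothetical baseline risks and a relative risk of roughly 0.55, in the range reported for the antihypertensive trials above:

```python
def absolute_risk_reduction(baseline_risk, relative_risk):
    """ARR: baseline risk minus the risk under treatment (baseline * RR)."""
    return baseline_risk * (1 - relative_risk)

rr = 0.55  # roughly the relative risk reported in MRC and STOP
for baseline in (0.01, 0.05, 0.20):  # hypothetical baseline (untreated) risks
    arr = absolute_risk_reduction(baseline, rr)
    nnt = 1 / arr  # number needed to treat to prevent one event
    print(f"baseline risk {baseline:.0%}: ARR {arr:.4f}, NNT {nnt:.0f}")
```

The same relative effect corresponds to an NNT of about 11 at 20% baseline risk, but over 200 at 1% baseline risk: a difference that can easily tip an individual decision once treatment harms are weighed in.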
Errors based on subgroup interactions are to be expected

Errors are particularly difficult to identify when:
- The intervention is new
- Trials recruit heterogeneous patient populations (i.e. large simple trials)
- The endpoint has multiple causes, e.g. death: variation in effect sizes for non-fatal MI or bleeding rates may create Type S errors for mortality

Weaker Yusuf et al. 1984:
1. Conduct large simple trials because frequentist statistical approaches require them for identifying small to modest effect sizes
2. Individual therapeutic decisions can be based on the overall trial results when the following assumptions hold:
   (a) An unanticipated Type S error is considered unlikely
   (b) All anticipated Type S errors are avoided in trial design
   (c) Type M errors of a clinically important magnitude are considered unlikely

3 Alternative approaches

3.1 Rothwell's rules

Rothwell's rules (2007a)

Trial design:
- Subgroup analyses should be defined a priori and limited to a small number, focussing on the primary endpoint. Direction and magnitude should be predicted (and reported).
- Obtain expert clinical input in design
- Test with a formal subgroup interaction test

Analysis and reporting:
- Adjust for multiple subgroup analyses
- Report as absolute and relative risks
- Check for comparability of prognostic factors in subgroups

Interpretation:
- Ignore statistical significance of the effect of treatment in individual subgroups
- Reproducibility is the best test for subgroup-treatment effects
- The false-negative rate for formal subgroup interaction tests is high (due to lack of power)

Rules for subgroup analysis (Rothwell, 2007a, Panel 11.2): Rothwell provides rules for approaching subgroup analyses for trial design, analysis and reporting, and interpretation. Essentially, the rules provide best practice given the subgroup problem and the use of Neyman-Pearson hypothesis testing. They focus solely on anticipated subgroup interactions, and provide advice for setting up trials to better identify subgroups that respond differently.

3.2 Model-based approaches

Model-based drug development

Key criticisms of traditional drug development approaches (i.e. hypothesis testing in early drug development):
1. Inefficient use of information (both prior information and data collected during early studies)
2. Provides a dichotomous answer when what is needed is an understanding of the relationship between dose and exposure (pharmacokinetic models) and exposure and response

Model-based drug development incorporates a range of approaches, including modelling, simulation, and adaptive designs.
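The formal subgroup interaction test recommended in Rothwell's rules above can be sketched as a comparison of log odds ratios across two subgroups. The 2x2 tables and the simple z-test below are illustrative assumptions; real analyses would more often fit an interaction term in a regression model.

```python
import math

def log_or_se(a, b, c, d):
    """Log odds ratio and its standard error from a 2x2 table
    (a/b = events/non-events on treatment, c/d = events/non-events on control)."""
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return log_or, se

def interaction_p(table1, table2):
    """Two-sided p-value for a treatment-by-subgroup interaction: tests
    whether the log odds ratios in the two subgroups differ."""
    l1, s1 = log_or_se(*table1)
    l2, s2 = log_or_se(*table2)
    z = abs(l1 - l2) / math.sqrt(s1 ** 2 + s2 ** 2)
    return math.erfc(z / math.sqrt(2))

# Made-up tables with similar odds ratios in both subgroups: the
# interaction test (correctly) finds no evidence of a subgroup effect
p = interaction_p((30, 170, 45, 155), (28, 172, 44, 156))
print(f"interaction p = {p:.2f}")
```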
Figure 1: Sheiner (1997, 276)

Model-based drug development

1. Build a model based on understanding of the target phenomena and the available data
   - A range of different models are used. They may be placed in a hierarchy and make predictions about populations, groups, individuals or observations.
   - Models include population parameters (clearance, volume of distribution), experimentally controllable variables (dose), and independent variables (time).
   - Models often incorporate stochastic variability (within-subject variability, measurement error).
2. Model criticism using the data collected: test assumptions; select from a family of models
3. Conduct analyses

Extending model-based approaches

Large-scale simple (confirmatory) trials are open to the same criticisms made of traditional early-stage drug trials: inefficient use of information, and dichotomised results cf. understanding variation.

Model-based approaches can make some therapeutic decisions tractable: what key prognostic factors influence outcomes? And by how much?
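A concrete instance of such a model, of the kind used in the Parkinson's disease example below, combines a linear disease-progression term with an Emax drug-effect term. This sketch treats drug concentration as a constant for simplicity, and all parameter values are invented for illustration:

```python
def disease_status(t, s0, alpha, beta, ce, emax, ec50):
    """Disease-progression model of the form
    S(t) = S(0) + (alpha + beta*Ce(t))*t + Emax*Ce(t)/(EC50 + Ce(t)).
    alpha is the natural progression rate; beta changes the slope
    (a disease-modifying effect); the Emax term is an offset to the
    disease state (a symptomatic effect). Here ce is treated as a
    constant concentration for simplicity."""
    return s0 + (alpha + beta * ce) * t + emax * ce / (ec50 + ce)

# Invented values: with no drug on board (ce=0) the model reduces to
# linear progression from the baseline state
untreated = disease_status(t=12, s0=20.0, alpha=1.5, beta=0.0, ce=0.0,
                           emax=-5.0, ec50=2.0)
print(f"untreated disease state at 12 months: {untreated}")
```

The point of the decomposition is that a symptomatic effect (the Emax offset) and a disease-modifying effect (the change in slope via beta) leave different signatures in longitudinal data, which is what lets the modelling approach separate them.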
Example 6 (Holford and Nutt (2011)). Standard descriptive analyses are unable to identify disease-modifying effects of treatments of Parkinson's disease: it is difficult to separate disease progression, symptomatic effects and disease-modifying effects. Holford and Nutt (2011) use a modelling approach to better assess disease-modifying effects from three large Parkinson's disease trials:

  S(t) = S(0) + (α + β·C_e(t))·t + E_max·C_e(t) / (EC_50 + C_e(t))

where S(t) is the disease state at time t; S(0) is the disease state at time 0 (without treatment); α is the rate of disease progression; β represents disease-modifying effects; C_e(t) is the plasma concentration of the drug at time t; E_max is the maximum pharmacological effect; and EC_50 is the plasma concentration of the drug producing half the maximum drug effect.

Figure 2: Holford and Nutt (2011)

Concluding remarks
Questions/criticisms for the model-based approach
- What is the respective role of theory and data in specifying and evaluating the model?
- What role (if any) can model-based approaches play in confirming efficacy/effectiveness?
- Large simple randomized trials versus large randomized trials

References

Brookes, S., Whitely, E., Peters, T., Mulheran, P. A., Egger, M., and Davey Smith, G. (2001). Subgroup analyses in randomised controlled trials: quantifying the risks of false-positives and false-negatives. Health Technology Assessment, 5(33):1–58.

Feinstein, A. R. (1984). Why do we need some large, simple randomized trials? Discussion. In Yusuf et al. (1984), pages 421–422.

Gelman, A. and Tuerlinckx, F. (2000). Type S error rates for classical and Bayesian single and multiple comparison procedures. Computational Statistics, 15(3):373–390.

ISIS-2 Collaborative Group (1988). Randomised trial of intravenous streptokinase, oral aspirin, both, or neither among 17,187 cases of suspected acute myocardial infarction: ISIS-2. The Lancet, 332(8607):349–360.

Holford, N. H. G. and Nutt, J. G. (2011). Interpreting the results of Parkinson's disease clinical trials: Time for a change. Movement Disorders, 26(4):569–577.

Peto, R., Collins, R., and Gray, R. (1995). Large-scale randomized evidence: Large, simple trials and overviews of trials. Journal of Clinical Epidemiology, 48(1):23–40.

POISE Study Group (2008). Effects of extended-release metoprolol succinate in patients undergoing non-cardiac surgery (POISE trial): a randomised controlled trial. The Lancet, 371(9627):1839–1847.

Rothwell, P. M. (2007a). Reliable estimation and interpretation of the effects of treatment in subgroups. In Rothwell (2007b).

Rothwell, P. M., editor (2007b). Treating Individuals: From randomised trials to personalised medicine. Elsevier, Philadelphia.

Rothwell, P. M. (2007c). When should we expect clinically important differences in response to treatment? In Rothwell (2007b).

Sheiner, L. B. (1997). Learning versus confirming in clinical drug development. Clinical Pharmacology & Therapeutics, 61(3):275–291.
Smith, G. D. and Egger, M. (1998). Incommunicable knowledge? Interpreting and applying the results of clinical trials and meta-analyses. Journal of Clinical Epidemiology, 51(4):289–295.

Yusuf, S., Collins, R., and Peto, R. (1984). Why do we need some large, simple randomized trials? Statistics in Medicine, 3:409–420.