Sample Size and Screening Size Trade Off in the Presence of Subgroups with Different Expected Treatment Effects


Kyle D. Rudser, Edward Bendert, Joseph S. Koopmeiners

Division of Biostatistics, School of Public Health, University of Minnesota, 420 Delaware St. SE, Minneapolis, Minnesota, 55455, U.S.A.
Statistics Collaborative, 1625 Massachusetts Ave., NW; Suite 600, Washington, DC 20036, U.S.A.

Abstract

Statistical study design considerations typically focus on sample size, power, and a single population treatment effect given a fixed significance level (generally 0.05). Eligibility criteria are formulated to select the patient population of interest, for which the magnitude of the treatment effect is expected to hold. In some instances researchers may expect there to be subgroups such that the treatment is expected to have the largest effect in one group while the others exhibit an attenuated effect. Identification of these subgroups can be based on a clinical decision rule, e.g., a biomarker cutoff, but may not be precise, i.e., sensitivity and specificity are not simultaneously at 100%. In the context of subgroups with different expected treatment effects, screening procedures may be adjusted to different levels of sensitivity and specificity to detect those patients in the subgroup with the greatest expected treatment effect. As a result, depending on the corresponding positive predictive value, the required sample size, power, and/or the treatment effect expected to hold for the study will change. We evaluate the impact on the design operating characteristics of power and sample size, and illustrate scenarios where overall trial duration may be shortened.
Keywords: Sample size; Clinical trial design; Biomarker; Heterogeneous effects; Operating characteristics

1 Introduction

In a typical study design, e.g., for a new therapy in a particular patient population of interest, considerations revolve around the treatment effect to detect (i.e., the alternative), the variability of the statistic used, and power for a given level of type I error (usually set at 0.05). Logistically, one needs to balance the sample size, treatment effect, and power. Researchers need enough patients to detect a clinically meaningful difference between treatment groups with a predetermined level of power, in a reasonable time frame. Formulas expressing the relationships between these parameters can be derived based on a single treatment effect for the overall group.

When conducting a randomized controlled trial, many measurements are used as eligibility criteria. In some instances there may be two distinct groups of patients screened: one subgroup where patients are expected to exhibit the greatest treatment effect (the "optimal" group) and another where patients are expected to have an attenuated treatment effect (the "suboptimal" group). A number of current medical conditions or general health measurements taken on the patients may be usable in a medical decision rule to suggest, based on prior studies, whether they will respond well to the treatment at hand. One such measurement may be a biomarker: a scientific measurement used as a predictor or indicator of a biological state, e.g., C-reactive protein in blood as a predictor of cardiovascular disease. Patients in the suboptimal group are expected to have an attenuated treatment effect relative to the optimal group, but are also under consideration for enrollment, either intentionally or inadvertently due to imprecise enrollment criteria.

When the optimal and suboptimal groups can be distinguished precisely and immediately, we may choose to conduct a trial in only the optimal group. Although the sample size for the trial will be minimized, the length of time to screen enough patients to complete enrollment may be prohibitive, and restricted inclusion may lessen generalizability. When the enrolled group comprises a mixture of optimal and suboptimal populations, we may choose to evaluate 1) the effect in the combined population, 2) the effect in either subpopulation (with control of the experimentwise type I error), 3) the effect in both subpopulations separately, i.e., the alternative is an effective treatment in both, or 4) a differential effect, i.e., an interaction. Each of these has different advantages, disadvantages, and underlying goals. The primary focus here is on evaluating the effect in the combined population when screening criteria may be imprecise in identifying the optimal and suboptimal groups.
In this setting, there is a tradeoff in the design of the study between having a smaller group of patients with a large treatment effect (restricted enrollment criteria) or a larger group of patients with a small treatment effect (broader enrollment criteria). Investigators want to include as many patients from the optimal group as possible to minimize the sample size required to detect a significant difference between the treatment groups. However, investigators may also be faced with a low prevalence of the optimal group, to the degree that the time needed to enroll the required number may be logistically unrealistic. In this case, researchers may be able to move forward more quickly by broadening enrollment criteria, thereby enrolling more patients from the suboptimal group. Enrolling patients from the suboptimal group would result in lower power for the study, because some of the enrolled patients have an attenuated treatment effect and the overall expected effect is therefore smaller. To maintain power, the sample size would need to be increased to counter the smaller expected overall treatment effect.

1.1 Overview

We examine the relationships between sample size, power, treatment effect (design alternative), and duration, as well as the sensitivity and specificity of different screening procedures. We also illustrate scenarios where overall trial duration may be shortened. To do this, we first define design parameters and introduce notation in section 2, then explain simulation methods in section 3. We then illustrate the impact attenuation has on power while holding sample size constant in section 3.1, and evaluate the additional sample size required to maintain power in section 3.2. In section 3.3 we look at trial duration by examining the number of patients needed to screen in order to enroll the necessary number of patients. Lastly, we present considerations in the context of a mean-variance relationship in section 4 and conclude with a discussion in section 5.

2 Design Parameters and Notation

Trial operating characteristics of power and sample size can be calculated based on the following sample size formula (Emerson 2003):

$$N = \frac{\delta_{\alpha,\beta}^2\, V}{\Delta^2}, \tag{1}$$

where $V$ denotes the variability contributed by each sampling unit to the statistic and $\delta_{\alpha,\beta}$ represents a function of the critical values for testing hypotheses with type I error $\alpha$ and type II error $\beta$. The symbol $\Delta$ denotes the alternative for which the study has power $1-\beta$, which often represents the expected overall treatment effect for the study. This general formula determines the sample size for each arm in a trial; without loss of generality, we focus on two-arm study designs here. For calculations based on a normally distributed statistic, $\delta_{\alpha,\beta} = z_{1-\alpha/2} + z_{1-\beta}$, as will be the case for the results included here. A comment on the use of other quantiles, e.g., those of a t distribution, is included later. The overall expected treatment effect will vary depending on the performance of the screening procedure for the trial.
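As a quick check of equation 1, the sketch below computes the per-arm sample size for a normally distributed statistic. It also gives the two rearrangements used later: the power achieved at a fixed $N$, and the alternative $\Delta$ detectable with a given power. It uses only Python's standard library (`statistics.NormalDist`). The function names are our own, and $V = 1$ mirrors the unit variance assumed in the simulations of section 3.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal, provides cdf() and inv_cdf()


def delta_ab(alpha: float, beta: float) -> float:
    """delta_{alpha,beta} = z_{1-alpha/2} + z_{1-beta} for a normal statistic."""
    return Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(1 - beta)


def sample_size(delta: float, V: float, alpha: float = 0.05, beta: float = 0.2) -> float:
    """Equation 1: N = delta_{alpha,beta}^2 * V / Delta^2 (per arm, not rounded up)."""
    return delta_ab(alpha, beta) ** 2 * V / delta ** 2


def power(n: float, delta: float, V: float, alpha: float = 0.05) -> float:
    """Equation 1 solved for 1 - beta (the negligible opposite tail is ignored)."""
    return Z.cdf(delta * sqrt(n / V) - Z.inv_cdf(1 - alpha / 2))


def detectable_effect(n: float, V: float, alpha: float = 0.05, beta: float = 0.2) -> float:
    """Equation 1 solved for Delta: the alternative with power 1 - beta at size n."""
    return delta_ab(alpha, beta) * sqrt(V / n)


print(round(sample_size(0.4, 1.0), 2))       # per-arm N for Delta = 0.4 at 80% power
print(round(detectable_effect(50, 1.0), 3))  # Delta with 80% power when N = 50
```

The detectable effect for $N = 50$ comes out at about 0.396, which is the $\Delta_1$ quoted for the fixed-sample-size comparisons in section 3.1.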
Most often, screening procedures are some form of physiologically based decision rule used to differentiate between subgroups of the patient population being screened for enrollment, e.g., based on inclusion and exclusion criteria. In the setting examined here, we have optimal and suboptimal groups of patients that we identify, perhaps imprecisely, through our screening procedure, e.g., through a biomarker with a corresponding sensitivity and specificity. For ease of discussion, the remainder of this manuscript will use "biomarker" to denote any medical decision rule that differentiates between optimal and suboptimal groups for trial enrollment.

Let $F_{op}$ denote the distribution of the biomarker in the optimal group. We expect the treatment to have the greatest effect among these patients. Ideally, these patients will be plentiful, because then we could conduct our trial quickly: our screening procedure would enroll patients at a high rate. Fewer patients would also be needed to demonstrate a significant effect if these patients were enrolled exclusively, because they are expected to have a larger treatment effect than those of the suboptimal group. In an analogous fashion, let $F_{so}$ denote the distribution of the biomarker in the suboptimal group. Then $F_{all}$, the distribution of the biomarker in the entire population, is the mixture

$$F_{all} = (\text{prevalence})\,F_{op} + (1 - \text{prevalence})\,F_{so}. \tag{2}$$

The choice of biomarker cut point that determines who is enrolled in the trial and who is not is an important decision. Let the screening cut point of the biomarker be denoted by $c$. Without loss of generality, suppose patients with biomarker values greater than $c$ are enrolled while those with values less than $c$ are not. For each value of $c$, let $\gamma = F_{all}(c)$ denote the proportion of the overall population below the cut point, so that $c = c(\gamma) = F_{all}^{-1}(\gamma)$. For example, if $\gamma = 0.9$ and $F_{all}^{-1}(0.9) = 2.71$, then 10% of the biomarker values fall above 2.71, and these 10% of patients would be enrolled in the trial. In these terms, $1 - \gamma$ is the proportion of screened individuals enrolled, and $\gamma$ is the proportion of screened individuals not enrolled.
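The mixture in equation 2 has no closed-form quantile, so $c(\gamma) = F_{all}^{-1}(\gamma)$ has to be found numerically. A minimal sketch follows. It assumes the normal distributions of the hypothetical Figure 1 example described below: optimal N(2, 1), suboptimal N(0, 1), and prevalence 0.4. Bisection is our own choice of root finder, not necessarily the authors'. Under these assumptions the 0.9 quantile comes out near 2.69, close to the illustrative value of 2.71 used above.

```python
from statistics import NormalDist

# Assumed example (Figure 1 settings): suboptimal ~ N(0, 1), optimal ~ N(2, 1).
F_OP = NormalDist(mu=2.0, sigma=1.0)
F_SO = NormalDist(mu=0.0, sigma=1.0)
PREVALENCE = 0.4


def F_all(x: float) -> float:
    """Equation 2: mixture CDF of the biomarker in the screened population."""
    return PREVALENCE * F_OP.cdf(x) + (1 - PREVALENCE) * F_SO.cdf(x)


def cut_point(gamma: float, lo: float = -20.0, hi: float = 20.0, tol: float = 1e-10) -> float:
    """c(gamma) = F_all^{-1}(gamma), found by bisection since F_all is increasing."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if F_all(mid) < gamma:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


c90 = cut_point(0.9)
print(f"c(0.9) = {c90:.3f}; proportion enrolled = {1 - F_all(c90):.3f}")
```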
For the purposes of discussion here, we presume all patients meeting enrollment criteria are enrolled, e.g., consent is obtained beforehand. Each $c(\gamma)$ has a corresponding sensitivity and specificity, and therefore a corresponding positive predictive value that depends on the prevalence of the optimal group. Note that sensitivity and specificity refer to how well the trial screening procedure discriminates between patients in the optimal and suboptimal groups. Sensitivity is the probability a patient is enrolled given they are from the optimal group, and specificity is the probability a patient is not enrolled given they are from the suboptimal group:

$$\text{sensitivity}(\gamma) = \int_{c(\gamma)}^{\infty} f_{op}(t)\,dt = 1 - F_{op}(c(\gamma)), \tag{3}$$

$$\text{specificity}(\gamma) = \int_{-\infty}^{c(\gamma)} f_{so}(t)\,dt = F_{so}(c(\gamma)), \tag{4}$$

where $f_{op}$ and $f_{so}$ are the densities corresponding to $F_{op}$ and $F_{so}$. The positive predictive value (PPV) of our screening procedure is then the probability that a patient is from the optimal group, given they were enrolled:

$$\mathrm{PPV}(\gamma) = \frac{\text{sensitivity}(\gamma)\,\text{prevalence}}{\text{sensitivity}(\gamma)\,\text{prevalence} + (1 - \text{specificity}(\gamma))(1 - \text{prevalence})}. \tag{5}$$

The resulting expected treatment effect for the enrolled study population is

$$\Delta(\gamma) = \mathrm{PPV}(\gamma)\,\Delta_1 + (1 - \mathrm{PPV}(\gamma))\,a\,\Delta_1, \tag{6}$$

where the treatment effect in the optimal group is denoted $\Delta_1$ and the treatment effect in the suboptimal group is attenuated by a factor of $a$. This also influences the variance:

$$V(\gamma) = \mathrm{PPV}(\gamma)\,\sigma_1^2 + (1 - \mathrm{PPV}(\gamma))\,\sigma_a^2 + \mathrm{PPV}(\gamma)(1 - \mathrm{PPV}(\gamma))(1 - a)^2\Delta_1^2, \tag{7}$$

where $\sigma_1^2$ and $\sigma_a^2$ denote the variability of a sampling unit from the optimal and suboptimal groups, respectively.

An example of possible distributions for the optimal group, the suboptimal group, and the overall population is presented in Figure 1. The overall distribution represents the mixture of the optimal and suboptimal groups that is screened for the trial. In this hypothetical example, the distributions were arbitrarily set as N(0, 1) and N(2, 1) for the suboptimal and optimal groups, respectively, with a prevalence of 40%. Again, the degree of attenuation between the optimal and suboptimal groups is denoted by a scalar value $a$, where lower values of $a$ indicate a larger degree of attenuation. Values of $a < 0$ represent a harmful treatment effect in the suboptimal group, while $a > 1$ would reflect a scenario where the suboptimal group has a greater treatment effect than the optimal group. Therefore, we restrict attention to values between 0 and 1. For example, if $a = 0.5$, we would expect the treatment effect in the suboptimal group to be half that in the optimal group, and if $a = 0$, we would expect no treatment effect in the suboptimal group.

Based on these parameters we can calculate the sample size (or power) from equation 1 and characterize both as functions of $\gamma$. We can also calculate the number of patients needed to screen in order to successfully enroll the required number of patients:

$$\mathrm{ScN}(\gamma) = \frac{N(\gamma)}{1 - \gamma} = \frac{(z_{1-\alpha/2} + z_{1-\beta})^2\, V(\gamma)}{[\Delta(\gamma)]^2\,(1 - \gamma)}, \tag{8}$$

where $\Delta(\gamma)$ and $V(\gamma)$ are as defined in equations 6 and 7 above.

3 Simulations

Throughout the simulations, $\alpha$ was set to 0.05 and the treatment effect for the optimal group was set to 0.4. The variance was set to 1 and is initially independent of the mean; scenarios where this may not be the case are discussed further in section 4. The overall treatment effect expected for the trial, $\Delta(\gamma)$, depends on the inclusion-criteria cutoff and is a weighted average of the optimal and suboptimal treatment effects, as seen in equation 6. For evaluating trial duration, the distributions for the optimal and suboptimal groups, $F_{op}$ and $F_{so}$, were taken to be normal with variance 1 and different locations. Features of the normal distribution are not central to any of the calculations, and the choice of distribution family is arbitrary. Instead, the distributions represent the degree to which the optimal and suboptimal groups overlap and how far apart their centers are. For the PPV, the key quantities are the prevalence (initially set to 0.4 and then varied) and $1 - F_{op}(c(\gamma))$ relative to $1 - F_{so}(c(\gamma))$. We presume without loss of generality that higher values of the biomarker are indicative of patients from the optimal group.
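Equations 3 to 8 chain together directly. The sketch below evaluates them for the simulation settings just described: optimal N(2, 1), suboptimal N(0, 1), prevalence 0.4, $\Delta_1 = 0.4$, unit variances, and $\alpha = 0.05$. The 80% power level, the bisection root finder, and the function names are our assumptions. At $\gamma = 0$ everyone is enrolled, so sensitivity is 1, specificity is 0, and the PPV equals the prevalence.

```python
from statistics import NormalDist

Z = NormalDist()
F_OP, F_SO = NormalDist(2.0, 1.0), NormalDist(0.0, 1.0)  # biomarker distributions
PREV, DELTA1 = 0.4, 0.4          # prevalence and optimal-group treatment effect
SIGMA1_SQ = SIGMA_A_SQ = 1.0     # unit variance in both groups
ALPHA, BETA = 0.05, 0.20         # two-sided alpha; 80% power is an assumption
K = (Z.inv_cdf(1 - ALPHA / 2) + Z.inv_cdf(1 - BETA)) ** 2


def F_all(x: float) -> float:
    """Equation 2: biomarker CDF in the screened population."""
    return PREV * F_OP.cdf(x) + (1 - PREV) * F_SO.cdf(x)


def cut_point(gamma: float, lo: float = -20.0, hi: float = 20.0) -> float:
    """c(gamma) = F_all^{-1}(gamma) by bisection."""
    for _ in range(100):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if F_all(mid) < gamma else (lo, mid)
    return (lo + hi) / 2


def design(gamma: float, a: float) -> dict:
    """Equations 3-8 for quantile 0 <= gamma < 1 and attenuation factor a."""
    if gamma == 0.0:                  # c = -infinity: everyone screened is enrolled
        sens, spec = 1.0, 0.0
    else:
        c = cut_point(gamma)
        sens = 1 - F_OP.cdf(c)        # equation 3
        spec = F_SO.cdf(c)            # equation 4
    ppv = sens * PREV / (sens * PREV + (1 - spec) * (1 - PREV))   # equation 5
    delta = ppv * DELTA1 + (1 - ppv) * a * DELTA1                  # equation 6
    V = (ppv * SIGMA1_SQ + (1 - ppv) * SIGMA_A_SQ
         + ppv * (1 - ppv) * (1 - a) ** 2 * DELTA1 ** 2)           # equation 7
    n = K * V / delta ** 2                                         # equation 1
    return {"sens": sens, "spec": spec, "ppv": ppv, "delta": delta,
            "V": V, "N": n, "ScN": n / (1 - gamma)}               # equation 8


for g in (0.0, 0.3, 0.5, 0.8):
    d = design(g, a=0.1)
    print(f"gamma={g:.1f}  PPV={d['ppv']:.3f}  Delta={d['delta']:.3f}  "
          f"N={d['N']:.0f}  ScN={d['ScN']:.0f}")
```

With $a = 0.1$, the number to screen is about 239 at $\gamma = 0$. It drops below that value at intermediate $\gamma$ and rises steeply again by $\gamma = 0.8$, the pattern examined in section 3.3.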
For each value of $\gamma$ (or equivalently $c$) we can determine the corresponding sensitivity and specificity (equations 3 and 4), the positive predictive value given the prevalence (equation 5), and the study effect size $\Delta(\gamma)$ (equation 6). After specifying the type I error $\alpha$ and type II error $\beta$, we can also calculate the required sample size (equation 1) and subsequently the number of patients needed to screen in order to enroll that sample size (equation 8). In summary, in addition to $\alpha$ and $\beta$, the calculation of sample size and number of patients needed to screen involves the following variables:

1. Prevalence of the optimal subgroup
2. $\gamma$ (hence $c$; influenced by $F_{op}$, $F_{so}$, and the prevalence of the optimal group)
3. Sensitivity of the screening procedure to identify the optimal subgroup (dependent on $\gamma$ and $F_{op}$)
4. Specificity of the screening procedure to identify the suboptimal subgroup (dependent on $\gamma$ and $F_{so}$)
5. Effect size in the optimal group
6. Degree of attenuation in the suboptimal group relative to the optimal group

3.1 Power for Fixed Sample Size

When the sample size is held fixed, power depends on the proportions of enrolled individuals who come from the optimal and suboptimal groups, and it decreases for any $a < 1$. Figure 2 displays the power across the full range of PPV for varying degrees of attenuation. It uses a fixed sample size of $N = 50$, which gives 80% power in the absence of attenuation ($\Delta_1 \approx 0.396$). As expected, greater attenuation causes a larger decrease in power to detect a treatment effect of $\Delta(\gamma)$. The region of particular interest in this type of graph is where the PPV is greater than the prevalence. Restricting attention to this region amounts to presuming that the screening criteria used to identify the optimal population are at least as good as flipping a coin (which has PPV = prevalence). The lower portion of Figure 2 summarizes the loss in power for high positive predictive values between 0.75 and 1.0 and shows that a meaningful loss can occur even when the PPV is not far from 1.0. For a positive predictive value of 0.8 and an attenuation factor of 0.4, power is 11.05 percentage points lower than with no attenuation. This is a substantial loss in power. The last column reflects no loss in power for any level of attenuation when the positive predictive value is equal to 1, i.e., when only those from the optimal group are enrolled. In analogous cases with power set to 90% and 97.5%, the loss in power is 9.07 and 4.72 percentage points, respectively, for a positive predictive value of 0.8 and an attenuation factor of 0.4 (data not shown).
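The power losses quoted in this subsection can be checked directly from equations 1, 6 and 7. The sketch below is our own. It assumes unit variances in both groups and chooses $\Delta_1$ so that $N = 50$ gives the target power without attenuation, then computes the absolute loss in power at a given PPV and attenuation factor. For PPV = 0.8 and $a = 0.4$ it reproduces losses of roughly 0.1105, 0.0907 and 0.0472 at 80%, 90% and 97.5% power.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()
ALPHA, N = 0.05, 50                  # fixed per-arm sample size from the text
Z_A = Z.inv_cdf(1 - ALPHA / 2)


def power_at(ppv: float, a: float, delta1: float, n: int = N,
             s1_sq: float = 1.0, sa_sq: float = 1.0) -> float:
    """Power for the enrolled-population effect (eq. 6) and variance (eq. 7) at size n."""
    delta = ppv * delta1 + (1 - ppv) * a * delta1
    V = ppv * s1_sq + (1 - ppv) * sa_sq + ppv * (1 - ppv) * (1 - a) ** 2 * delta1 ** 2
    return Z.cdf(delta * sqrt(n / V) - Z_A)


def power_loss(ppv: float, a: float, target_power: float) -> float:
    """Absolute power lost relative to no attenuation (a = 1) at the same PPV."""
    # Delta_1 chosen so that n = N gives target_power when there is no attenuation.
    delta1 = (Z_A + Z.inv_cdf(target_power)) / sqrt(N)
    return power_at(ppv, 1.0, delta1) - power_at(ppv, a, delta1)


for target in (0.80, 0.90, 0.975):
    print(f"target {target:.3f}: loss at PPV=0.8, a=0.4 is {power_loss(0.8, 0.4, target):.4f}")
```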
3.2 Sample Size for Fixed Power

To avoid a loss in power for the expected treatment effect, the sample size needs to be adjusted (increased for any $a < 1$) depending on the proportion of enrolled individuals who come from the suboptimal group. Figure 3 presents the sample size modification needed to maintain power over varying levels of attenuation and the full range of positive predictive values. As the positive predictive value increases, the ratio of sample sizes necessarily converges to 1. This is because larger positive predictive values correspond to a larger proportion of the enrolled patients being from the optimal group, so fewer additional patients are required to maintain power. The graph also shows that as the degree of attenuation increases ($a$ becomes smaller), the number of patients required to maintain power increases dramatically. For example, if the positive predictive value is 0.5, i.e., half of enrolled patients come from the optimal group, and $a = 0$, we will need roughly four times as many patients to maintain power. This is expected, because in this scenario the alternative for the trial is half as large as when there is no attenuation. As equation 1 shows, sample size is proportional to the inverse square of the treatment effect, so roughly four times the sample size is required to maintain power for a treatment effect half as big. The lower portion of Figure 3 highlights the sample size inflation factor needed to maintain power over a range of high PPV between 0.75 and 1.0. The PPV of a screening procedure does not need to be very far from 1.0 before meaningful repercussions are observed. For the situation where the attenuation factor is 0.4 and the PPV is 0.80, we would need to enroll 29% more patients than if there were no attenuation at all. This is a significant increase and reflects the strong effect an attenuated group can have on the number of patients required to maintain power.
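The inflation factors discussed above follow from equations 1, 6 and 7, and their closed form is the ratio in equation 9 below. The sketch computes that ratio together with equation 6 written in terms of PPV, using our own function names and assuming $\Delta_1 = 0.4$ and unit variances. The squared-effect factor $1/(\mathrm{PPV} + (1-\mathrm{PPV})a)^2$ alone gives exactly 4 for PPV = 0.5, $a = 0$ and about 1.29 for PPV = 0.8, $a = 0.4$. The mixture-variance term in the numerator adds a little on top, giving about 4.16 and 1.30.

```python
def inflation_ratio(ppv: float, a: float, delta1: float = 0.4,
                    s1_sq: float = 1.0, sa_sq: float = 1.0) -> float:
    """Equation 9: N with attenuation a divided by N with no attenuation, same PPV."""
    base_var = ppv * s1_sq + (1 - ppv) * sa_sq
    num = base_var + ppv * (1 - ppv) * (1 - a) ** 2 * delta1 ** 2
    return num / (base_var * (ppv + (1 - ppv) * a) ** 2)


def enrolled_effect(ppv: float, a: float, delta1: float = 0.4) -> float:
    """Equation 6 written in terms of PPV."""
    return ppv * delta1 + (1 - ppv) * a * delta1


print(f"{inflation_ratio(0.5, 0.0):.2f}")                # PPV 0.5, no suboptimal effect
print(f"{inflation_ratio(0.8, 0.4):.3f}")                # PPV 0.8, a = 0.4
print(f"{inflation_ratio(0.8, 0.4, delta1=0.0):.3f}")    # squared-effect factor only
print(f"{enrolled_effect(0.2, 0.2):.3f}")                # expected effect, PPV = a = 0.2
```

Because $\delta_{\alpha,\beta}$ cancels in the ratio, the same numbers apply at any power level. The last line gives the 0.14 expected effect used as an example of a design that may no longer be clinically relevant.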
The ratio of the sample size with attenuation to the sample size without attenuation is the same for any level of power and is equal to

$$\frac{\mathrm{PPV}\,\sigma_1^2 + (1 - \mathrm{PPV})\,\sigma_a^2 + \mathrm{PPV}(1 - \mathrm{PPV})(1 - a)^2\Delta_1^2}{\left(\mathrm{PPV}\,\sigma_1^2 + (1 - \mathrm{PPV})\,\sigma_a^2\right)\left(\mathrm{PPV} + (1 - \mathrm{PPV})\,a\right)^2}. \tag{9}$$

Each of the tabled cells in Figure 3 has an associated expected treatment effect for the resulting enrolled study population. A high degree of attenuation combined with a low positive predictive value may result in an expected treatment effect that is no longer clinically relevant. For example, if the positive predictive value is 0.2 and the attenuation factor is also 0.2, the expected treatment effect for the study will be only 0.14, given a treatment effect of 0.4 in the optimal group. Such an effect may no longer constitute a clinically meaningful difference, in which case the trial would not be a viable design. This potential result needs to be kept in mind when planning a trial and considering the inclusion of subgroups with an attenuated effect.

3.3 Trial Duration

The past two subsections aimed to quantify the influence an attenuated treatment effect has on the trial operating characteristics of power and sample size. Because these depend on the performance of the screening procedure, we presented the results across a range of possible positive predictive values. Operationally, control over the positive predictive value of our screening procedure is exercised through the cutoff value $c$, or equivalently $\gamma$ (the proportion of the screening population with a biomarker value below $c$). Changing this cutoff value induces changes in the sensitivity and specificity of our screening procedure. Up to this point we have focused on the required number of enrolled patients and not the number required to screen (a surrogate for the time required to complete a trial). It may, however, be the case that patients in the optimal group are scarce and choosing a cutoff with high positive predictive value is prohibitive, making broader enrollment criteria necessary. In that case, broader criteria will result in enrolling a larger proportion of screened patients, potentially reducing the time required to complete study enrollment. At the same time, more patients will need to be enrolled to maintain power for the expected treatment effect, which counteracts the shortening of the time required to complete the study. For larger values of $\gamma$ (and necessarily $c$), the criteria for entering the study are more stringent: the sensitivity decreases while the specificity increases. That is, as the requirement to enter the study is raised, fewer patients in the optimal group will be enrolled, but there is also the corresponding effect of avoiding those in the suboptimal group.
The positive predictive value increases monotonically as the quantile $\gamma$ increases, with its minimum equal to the prevalence (at $\gamma = 0$). The overall expected effect size for the study is as expressed in equation 6, except that the positive predictive value is now written as a function of $\gamma$. As $\gamma$ increases, the specificity increases, making our sample increasingly pure. This causes the expected effect size to increase toward 0.4, the treatment effect in the optimal group (Figure 4(a)). As a result, the required sample size depends on the value of $\gamma$, with lower values of $\gamma$ requiring larger samples when $a < 1$ (Figure 4(b)). Closely related to the sample size in Figure 4(b) is the number of patients needed to screen in order to enroll a given sample size (equation 8, Figure 5).

As the attenuation factor $a$ decreases from 1 to 0, there comes a point where the minimum number of patients needed to screen is no longer at $\gamma = 0$, i.e., including everyone screened. To find this value of $a$, we take the partial derivative of equation 8 with respect to $\gamma$, holding all other values fixed, which yields an expression in terms of the attenuation factor $a$. We then set this derivative equal to zero and find the maximum value of $a$ such that a solution exists to the following equation:

$$\frac{\partial}{\partial \gamma}\left[\frac{(z_{1-\alpha/2} + z_{1-\beta})^2\, V(\gamma)}{[\Delta(\gamma)]^2\,(1 - \gamma)}\right] = 0. \tag{10}$$

For the example highlighted here, a solution to this equation exists for values of $a$ between 0 and 0.295 (evaluated numerically). This means that if we expect the treatment effect in the suboptimal group to be at least 29.5% of the treatment effect in the optimal group, enrolling all patients regardless of their biomarker value will result in the fewest patients screened. However, if the treatment effect in the suboptimal group is suspected to be less than 29.5% of that in the optimal group, investigators can minimize the number screened by restricting the biomarker values allowed to enter the study. This can save calendar time by requiring fewer patients to be screened.

Figure 6 displays a heatmap across $\gamma$ and $a$, where the shading corresponds to the number of patients needed to screen in order to enroll the number of patients necessary to maintain power for the expected treatment effect of the study. The distributions of the optimal and suboptimal groups, the prevalence, and $\Delta_1$ remain the same as in previous sections. When $a$ is 0.1, there is a value $\gamma > 0$ at which the number of patients required to be screened is minimized. This can be seen by following a horizontal line at $a = 0.1$. At ($\gamma = 0$, $a = 0.1$) the trial would require approximately 250 patients to be screened.
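Rather than solving equation 10 analytically, a quick numerical check is to evaluate $\mathrm{ScN}(\gamma)$ on a grid and locate its minimum. The sketch below does this for the running example: optimal N(2, 1), suboptimal N(0, 1), prevalence 0.4, $\Delta_1 = 0.4$, unit variances, and $\alpha = 0.05$. The 80% power level, the grid spacing, and all names are our assumptions. For $a = 0.1$ the minimum falls at an interior $\gamma$, whereas for $a = 0.5$ (above the 0.295 bound) it stays at $\gamma = 0$.

```python
from statistics import NormalDist

Z = NormalDist()
F_OP, F_SO = NormalDist(2.0, 1.0), NormalDist(0.0, 1.0)
PREV, DELTA1 = 0.4, 0.4
K = (Z.inv_cdf(0.975) + Z.inv_cdf(0.80)) ** 2  # alpha = 0.05, assumed 80% power


def F_all(x: float) -> float:
    return PREV * F_OP.cdf(x) + (1 - PREV) * F_SO.cdf(x)


def cut_point(gamma: float, lo: float = -20.0, hi: float = 20.0) -> float:
    for _ in range(100):  # bisection on the increasing mixture CDF
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if F_all(mid) < gamma else (lo, mid)
    return (lo + hi) / 2


def screen_n(gamma: float, a: float) -> float:
    """Equation 8 with unit variances in both groups."""
    if gamma == 0.0:
        sens, spec = 1.0, 0.0
    else:
        c = cut_point(gamma)
        sens, spec = 1 - F_OP.cdf(c), F_SO.cdf(c)
    ppv = sens * PREV / (sens * PREV + (1 - spec) * (1 - PREV))
    delta = DELTA1 * (ppv + (1 - ppv) * a)
    V = 1.0 + ppv * (1 - ppv) * (1 - a) ** 2 * DELTA1 ** 2
    return K * V / (delta ** 2 * (1 - gamma))


def best_gamma(a: float, step: float = 0.01, gamma_max: float = 0.95) -> float:
    """Grid search for the gamma minimizing ScN (a stand-in for solving equation 10)."""
    grid = [i * step for i in range(int(round(gamma_max / step)) + 1)]
    return min(grid, key=lambda g: screen_n(g, a))


for a in (0.1, 0.5):
    g = best_gamma(a)
    print(f"a={a}: ScN(0)={screen_n(0.0, a):.0f}, min ScN={screen_n(g, a):.0f} at gamma={g:.2f}")
```

A grid search is cruder than solving equation 10, but it also catches interior minima that are not stationary points of a smooth derivative check.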
As $\gamma$ increases, the required number to screen drops at first, down to approximately 175 patients, before rising again. This is a 30% decrease in the number of patients to screen (and hence in the time to complete enrollment) achieved simply by adjusting the eligibility criteria. The plot also shows that any value of $\gamma$ above approximately 0.8 would be a poor choice of biomarker cut point, because a very large number of patients would need to be screened to obtain the necessary sample size. Equation 8 shows why: the expression for the number of patients needed to screen has $1 - \gamma$ in the denominator. Thus, for $\gamma = 0.8$, on average 5 people need to be screened for every person enrolled. If $\gamma$ is increased further to 0.9 (halving $1 - \gamma$), the number of patients needed to screen doubles. This explains the rapid increase in the number of patients needed to screen as $\gamma$ approaches 1.

When considering changes to enrollment criteria, researchers need to be cognizant of the resulting expected treatment effect for the study. Presumably there is a minimum scientifically relevant treatment difference. The characteristics of the screening procedure, combined with the attenuated treatment effect of the suboptimal group, may result in an expected treatment effect that falls below it. The two solid black lines in Figure 6 are contours for two alternatives: the outermost line marks the points where $\Delta(\gamma) = 0.25$, and the innermost line marks the points where $\Delta(\gamma) = 0.2$. For example, if the minimal scientifically relevant effect size is 0.2, then the inner line bounds the combinations of $a$ and $\gamma$ that correspond to a scientifically viable study design. One may subsequently calculate the range of valid $c(\gamma)$ to examine the support of the biomarker distribution, which may serve as a basis for the generalizability of the patient population in the trial.

The series of diagrams in Figure 7 illustrates the scenario where the centers of the optimal and suboptimal distributions move increasingly far apart, i.e., their underlying distributions overlap less. This represents a decision rule (e.g., biomarker) with better discriminatory performance between the optimal and suboptimal groups. In this situation it was still assumed that the prevalence is 0.4 and $\Delta_1 = 0.4$, and the variance of the biomarker distributions was held constant at 1.
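The separation scenario just described can be explored numerically through the PPV alone, since equation 6 depends on $\gamma$ only through $\mathrm{PPV}(\gamma)$. The sketch below is our own. It keeps the suboptimal group at N(0, 1) and the prevalence at 0.4, and moves the center of the optimal group; the specific shifts are chosen only for illustration. As the shift grows, the PPV approaches 1 for any $\gamma > 0.6 = 1 - \text{prevalence}$, where the attenuation factor no longer matters. For $\gamma < 0.6$ it approaches $\text{prevalence}/(1-\gamma)$, because every optimal patient is then already enrolled.

```python
from statistics import NormalDist

PREV = 0.4                     # prevalence of the optimal group
F_SO = NormalDist(0.0, 1.0)    # suboptimal biomarker distribution, held fixed


def ppv_at(gamma: float, shift: float) -> float:
    """PPV (equation 5) at quantile gamma when the optimal group is N(shift, 1)."""
    f_op = NormalDist(shift, 1.0)

    def f_all(x: float) -> float:  # equation 2
        return PREV * f_op.cdf(x) + (1 - PREV) * F_SO.cdf(x)

    lo, hi = -30.0, 30.0
    for _ in range(100):  # bisection for c(gamma) = F_all^{-1}(gamma)
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if f_all(mid) < gamma else (lo, mid)
    c = (lo + hi) / 2
    sens, spec = 1 - f_op.cdf(c), F_SO.cdf(c)  # equations 3 and 4
    return sens * PREV / (sens * PREV + (1 - spec) * (1 - PREV))


for shift in (1.0, 2.0, 4.0, 10.0):
    row = ", ".join(f"{ppv_at(g, shift):.3f}" for g in (0.3, 0.5, 0.6, 0.7))
    print(f"shift {shift:4.1f}: PPV at gamma = 0.3, 0.5, 0.6, 0.7 -> {row}")
```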
Figures 7(a):7(d) show the corresponding number of patients needed to screen (Figure 7(b) is identical to Figure 5), while Figures 7(e):7(h) illustrate the underlying distributions. As the distance increases to the point of essentially complete separation (Figures 7(d) and 7(h)), the contours for different values of the attenuation factor $a$ converge at $\gamma = 0.6$ and overlap each other for $\gamma > 0.6$. This is a result of the prevalence being 0.4. Once the underlying distributions are separated, any value of $\gamma > 1 - \text{prevalence}$ results in only patients from the optimal group being enrolled into the trial, so the value of $a$ has no impact for those values of $\gamma$. For values of $\gamma < 0.6$, the number of patients needed to screen varies more dramatically across the $a$ contours than it does when the distributions overlap. In the complete separation example, once $\gamma$ decreases far enough that $c(\gamma)$ falls below the lower bound of the support for the optimal group, any additional patients allowed to enroll come only from the suboptimal group. There is no chance of also including additional patients from the optimal group, since they have all already been captured. The attenuation is therefore more dramatic, and more patients are needed to compensate.

While Figures 7(a):7(h) examine how increasing the separation of the distributions affects the number of patients to screen, Figures 8(a):8(d) examine changes in prevalence, for values of 0.2, 0.4, 0.6, and 0.8 respectively. Figures 8(e):8(h) show the underlying distributions in each of these scenarios. When prevalence is low and attenuation for the suboptimal group is dramatic, e.g., $a = 0$, the number of patients needed to recruit increases dramatically (Figure 8(a)). However, as the prevalence increases, the impact of different attenuation factors decreases. Taken together, Figures 7 and 8 make clear that the range of values of $a$ for which the minimum number to screen occurs at $\gamma > 0$ depends on two things: the prevalence of the optimal group, and the degree of dissociation between the optimal and suboptimal biomarker distributions. The bounds on $a$ have been evaluated numerically. They are also influenced by the power, although to a lesser degree (Figure 9). The largest value of $a$ for which the minimum number to screen occurs at $\gamma > 0$ arises when the biomarker distributions are completely separated and the optimal group has higher prevalence. Figure 9(b) presents these values across a range of prevalence from 5% to 95%. It illustrates that the power level of interest will affect the result, but not to a great degree, particularly for low prevalence.

4 Mean-Variance Relationship

To this point, we have considered scenarios in the absence of a mean-variance relationship.
We now comment on the impact such a relationship can have on the design considerations discussed above. In the setting of a mean-variance relationship, the variance depends on $\delta$. As such, we may adjust equation 1 as follows:

$$N(\delta) = \frac{\left(z_{1-\alpha/2}\sqrt{V_0(\delta)} + z_{1-\beta}\sqrt{V_1(\delta)}\right)^2}{\delta^2} \qquad (11)$$

where $V_0(\delta)$ and $V_1(\delta)$ denote the variance, as a function of $\delta$, under the null and alternative hypotheses respectively. It is useful to distinguish between a mean-variance relationship and heteroskedasticity. We recognize that a treatment may have no impact on the summary measure of interest (e.g., the mean) yet still change the variability of our estimate of that summary measure. With heteroskedasticity, the variance of the summary measure for the mixture of patients from the optimal and suboptimal groups need not equal the variance among those from the optimal group alone. We leave this aside for the discussion here, as the principles are the same. Focus will be entirely on $V_1(\delta)$, because under the null the optimal and suboptimal groups have the same magnitude of effect, i.e., a treatment effect of zero.

Based on equation 11, the effect depends on the direction of the mean-variance relationship. If the variance is a decreasing function of the mean, then the results shown in the previous sections will be magnified. As the degree of attenuation grows, the numerator increases ($V_1(\delta^*) > V_1(\delta_1)$) while the denominator decreases ($\delta^* < \delta_1$). If instead the variance is an increasing function of the mean, then there is a trade-off. An example is a Poisson distribution, where a higher mean corresponds to a higher variance. In that case the impact, relative to having no attenuated effect, will be muted to the degree that the attenuated variability offsets the attenuated magnitude of effect.

4.1 Binomial Proportions

In the case of a difference of two proportions, equation 11 uses $\delta = p_1 - p_0$, $V_0 = 2p_0(1-p_0)$, and $V_1(\delta) = p_0(1-p_0) + p_1(1-p_1)$. In this setting, $V_1(\delta)$ can be increasing or decreasing in $\delta$, depending on $p_0$ and $p_1$. For example, if $p_0 = 0.1$ and $p_1 = 0.3$, attenuation from a suboptimal group will cause the variance to decrease. Alternatively, for $p_1 = 0.9$ and $p_0 = 0.7$, attenuation will cause the variance to increase.
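The two numerical examples above can be checked directly with equation 11. The sketch below uses the text's $V_0$ and $V_1(\delta)$ and takes $\delta_1 = 0.2$ from the examples ($p_0 = 0.1, p_1 = 0.3$ and $p_0 = 0.7, p_1 = 0.9$). The attenuation factor $a = 0.5$ is a hypothetical choice for illustration. The sketch reports a ratio of sample sizes, so it does not matter whether equation 11's $N$ counts patients per arm or in total. Function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist().inv_cdf  # standard normal quantile function


def v1(p0, delta):
    """Alternative-hypothesis variance V1(delta) = p0(1-p0) + p1(1-p1), with p1 = p0 + delta."""
    p1 = p0 + delta
    return p0 * (1 - p0) + p1 * (1 - p1)


def n_eq11(p0, delta, alpha=0.05, power=0.80):
    """Equation 11 with V0 = 2 p0 (1 - p0) and V1(delta) as above."""
    v0 = 2 * p0 * (1 - p0)
    num = Z(1 - alpha / 2) * sqrt(v0) + Z(power) * sqrt(v1(p0, delta))
    return (num / delta) ** 2


def inflation(p0, delta1, a):
    """Sample size at the attenuated effect a*delta1 relative to the size at delta1."""
    return n_eq11(p0, a * delta1) / n_eq11(p0, delta1)


if __name__ == "__main__":
    for p0 in (0.1, 0.7):  # the two examples from the text, both with delta1 = 0.2
        print(f"p0 = {p0}: V1 full = {v1(p0, 0.2):.3f}, V1 at a = 0.5 = {v1(p0, 0.1):.3f}, "
              f"N inflation = {inflation(p0, 0.2, 0.5):.2f} (constant variance: 4.00)")
```

For $p_0 = 0.1$ the inflation comes out below the constant-variance value $1/a^2 = 4$, because attenuation lowers $V_1$. For $p_0 = 0.7$ it comes out above 4, because attenuation raises $V_1$. These correspond to the muted and magnified cases described above.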
The direction of the change follows from the variance of a Bernoulli random variable, $p(1-p)$, being maximized at $p = 0.5$. Instead of a difference of two proportions, investigators may be interested in an odds ratio or relative risk. Here too, depending on the values of the proportions the investigators use in planning their study, the variance may not change in the same direction in all scenarios. Overall, if the variance decreases with larger alternatives (increases with larger attenuation), the effect of attenuation described in previous sections will be magnified. If instead the variance increases with larger alternatives (decreases with larger attenuation), the attenuated effect and the reduced variance partially offset each other. How far the variance decreases along with the magnitude of effect then determines how much the effect described in previous sections is diminished.

5 Discussion

In the presence of a patient population with subgroups that have different treatment effects, eligibility criteria and screening procedures designed to target one group over the others influence the sample size, power, difference to detect, and number of patients needed to screen. It is important that researchers know how many patients they need to recruit: too few patients will lead to a lack of evidence, and too many will be inefficient. We evaluated the impact on the design operating characteristics of power and sample size, and illustrated scenarios where overall trial duration may be shortened. Allowing patients with a suboptimal treatment effect to enroll will lower the power to detect a particular alternative for a given sample size, and higher attenuation results in a larger loss in power. To remedy this, a larger sample size will be required. Furthermore, the minimum number of patients to screen can be evaluated, and it has been shown not to be trivial. It depends on the degree of attenuation in the suboptimal group and on the positive predictive value of the screening procedures. For scenarios where the subgroup is expected to experience a moderately attenuated treatment effect, the number of patients required to screen was smallest when all screened patients were enrolled. This was not the case for larger degrees of attenuation. Results will depend on the underlying distributions of the subgroups and the prevalence of the optimal group. As such, evaluation for each specific context is warranted and may be conducted following the outline presented here. As discussed in sections 3.2 and 3.3, care is needed to ensure that enrolling a larger proportion of patients with an expected attenuated effect does not produce a study designed for an effect that is not scientifically meaningful.
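The loss in power at a fixed sample size, and the sample size increase needed to recover it, can be sketched with the normal approximation. This sketch assumes a common variance under the null and alternative, $\delta^*/\delta_1 = \mathrm{PPV} + a(1-\mathrm{PPV})$, and a design with 80% power when $\delta^* = \delta_1$. It is not claimed to reproduce the tabled values in Figures 2 and 3. Function names are illustrative.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal N(0, 1)


def effect_fraction(ppv, a):
    """ASSUMED: delta*/delta1 = PPV + a (1 - PPV), the share of the full effect expected."""
    return ppv + a * (1 - ppv)


def power_fixed_n(ppv, a, alpha=0.05, design_power=0.80):
    """Power at the N that gives design_power when delta* = delta1
    (common variance, normal approximation, lower rejection tail ignored)."""
    z_a = STD.inv_cdf(1 - alpha / 2)
    z_b = STD.inv_cdf(design_power)
    # With N and the variance fixed, the standardized shift scales linearly with delta*.
    return STD.cdf(effect_fraction(ppv, a) * (z_a + z_b) - z_a)


def n_ratio(ppv, a):
    """Sample size needed at delta* relative to delta1: (delta1 / delta*)^2."""
    return 1.0 / effect_fraction(ppv, a) ** 2


if __name__ == "__main__":
    for a in (0.8, 0.5, 0.0):
        loss = 100 * (0.80 - power_fixed_n(0.75, a))
        print(f"PPV = 0.75, a = {a}: power loss = {loss:.1f} points, "
              f"N ratio = {n_ratio(0.75, a):.2f}")
```

Under these assumptions the required sample size grows as $(\delta_1/\delta^*)^2$, so halving the expected effect quadruples $N$.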
It should be noted that the sample size formula used throughout is quite general and pertains to any normally distributed statistic, e.g., estimates of log odds ratios or regression coefficients (Emerson 2003). As is often the case in analyses, and hence in sample size and power estimates, asymptotic arguments justify the use of such formulas without relying on normally distributed data. At times, some may choose to use other distributional quantiles, e.g., t quantiles. This will lead to quantitative differences from the results presented here in some places (e.g., the loss in power will be of a different magnitude), but qualitatively the results would be the same.

There is a tradeoff between two options. Allowing more patients into the trial comes at the cost of a diminished expected treatment effect. Allowing only those with the greatest treatment effect permits a smaller sample size. Changing the patient population may also affect generalizability and/or the safety profile. The risk-benefit ratio cannot be ignored as part of the study design process, especially for patients who are likely to experience an attenuated treatment effect. Even if the safety profile is identical for the optimal and suboptimal groups, the risk-benefit trade-off is necessarily different, because the suboptimal group is expected to have an attenuated effect. As such, it may no longer be ethically appropriate for these patients to be enrolled. This may hold either absolutely, or relative to another available treatment that is potentially more efficacious and thereby offers a better risk-benefit profile. These considerations need to be evaluated on a trial-by-trial basis.

For a study design with a survival endpoint, statistical information is generally tied to the number of events observed rather than the number of patients. In this case, the tradeoff between the number of patients enrolled and calendar time has an additional aspect, because statistical information is not obtained immediately when an individual enrolls. If many patients were enrolled immediately, this may affect our ability to assess time-varying effects, such as whether the magnitude of effect is greater early on or after some delay. This survival endpoint scenario is more complex and warrants further investigation. The focus of discussion here has been on fixed sample designs.
We recognize that economic and/or ethical influences will often shape the design of trials and in some cases demand sequential analysis to meet those needs. Examples include group sequential designs (Jennison and Turnbull 2000; Emerson, Kittelson, and Gillen 2007) and adaptive designs (Tsiatis and Mehta 2003; Jennison and Turnbull 2006). Such studies are often powered for the minimal clinically meaningful alternative and, by design, stop early when the observed effect at an interim analysis is sufficiently large. For the scenario presented here, with optimal and suboptimal subgroups, the powered alternative would be unchanged, since an attenuated expected treatment effect does not change what is minimally clinically meaningful. The trial operating characteristics, however, will differ. At an interim analysis where a trial among only optimal-group patients would stop early for efficacy (or not stop for futility), a trial that includes a larger proportion from the suboptimal group may not stop for efficacy (or may stop for futility) because of the attenuated treatment effect at that time. This also deserves further consideration, but it is beyond the scope of this paper.

References

Emerson, S. S. (2003). S+SeqTrial technical overview. Technical Report, Insightful Corporation, Seattle, Washington.
Emerson, S. S., J. M. Kittelson, and D. L. Gillen (2007). Frequentist evaluation of group sequential clinical trial designs. Statistics in Medicine 26, 5047-5080.
Jennison, C. and B. W. Turnbull (2000). Group Sequential Methods With Applications to Clinical Trials. CRC Press.
Jennison, C. and B. W. Turnbull (2006). Adaptive and non-adaptive group sequential tests. Biometrika 93 (1), 1-21.
Tsiatis, A. A. and C. R. Mehta (2003). On the inefficiency of the adaptive design for monitoring clinical trials. Biometrika 90, 367-378.

[Figure 1 plot: Distribution of Continuous Biomarker; probability versus c for the overall population, Optimal Group ~ N(2,1), and Suboptimal Group ~ N(0,1).]
Figure 1: Example of possible distributions of a biomarker in a patient population with optimal and suboptimal groups. Prevalence of the optimal group here is 40%.

[Figure 2 plot: power versus PPV, one curve per attenuation factor, including a = 1.0.]
Figure 2: Impact of attenuation on power for fixed sample size (N = 50). Tabled values below the plot represent the difference in power (percentage points) from no attenuation (a = 1, power = 80%) for PPV between 0.75 and 1.0. Rows correspond to increasing degrees of attenuation (a < 1).

PPV       0.75   0.80   0.85   0.90   0.95   1.00
Curve 1   4.20   3.32   2.47   1.63   0.81   0.00
Curve 2   8.93   7.02   5.16   3.38   1.65   0.00
Curve 3  14.13  11.05   8.08   5.24   2.54   0.00
Curve 4  19.72  15.39  11.21   7.23   3.48   0.00
Curve 5  25.58  19.99  14.54   9.33   4.46   0.00

[Figure 3 plot: sample size ratio (2 to 10) versus PPV, one curve per attenuation factor, including a = 1.0.]
Figure 3: Ratio of sample sizes required to maintain power (relative to no attenuation: a = 1.0) for the expected treatment effect in the study, depending on the degree of attenuation and the positive predictive value (PPV) of screening procedures. $\delta_1$ and prevalence are 0.4. Tabled values below the plot represent the ratios for PPV between 0.75 and 1.0. Rows correspond to increasing degrees of attenuation (a < 1).

PPV       0.75   0.80   0.85   0.90   0.95   1.00
Curve 1   1.11   1.09   1.06   1.04   1.02   1.00
Curve 2   1.24   1.19   1.14   1.09   1.04   1.00
Curve 3   1.40   1.30   1.22   1.14   1.07   1.00
Curve 4   1.59   1.44   1.31   1.19   1.09   1.00
Curve 5   1.83   1.60   1.41   1.25   1.12   1.00

[Figure 4 plot: (a) $\delta^*$ versus $\gamma$; (b) N versus $\gamma$; one curve per attenuation factor, including a = 1.0.]
Figure 4: Impact of attenuation on $\delta^*$ and on the sample size (N) needed to overcome enrollment of patients with an attenuated treatment effect and keep power constant at 80% across values of $\gamma$. For the example here $F_{op} \sim N(2, 1)$, $F_{so} \sim N(0, 1)$, prevalence = 0.4, and $\delta_1 = 0.4$.

[Figure 5 plot: number of patients needed to screen (ScN, 0 to 350) versus $\gamma$, one curve per attenuation factor, including a = 1.0.]
Figure 5: Impact of attenuation on the number of patients needed to screen, with power equal to 80% and with $F_{op} \sim N(2, 1)$, $F_{so} \sim N(0, 1)$, prevalence = 0.4, and $\delta_1 = 0.4$.

[Figure 6 plot: elevation map of the number of patients needed to screen (0 to 300) over a (0 to 1) and $\gamma$, with contours at $\delta^* = 0.25$ and $\delta^* = 0.20$.]
Figure 6: Elevation map depicting the number of patients needed to screen while varying a and $\gamma$, with $F_{op} \sim N(2, 1)$, $F_{so} \sim N(0, 1)$, prevalence = 0.4, and $\delta_1 = 0.4$. The horizontal line marks the value of a that separates two regions. For all values above the line, the minimum number of patients to screen occurs at $\gamma = 0$. For all values below it, the minimum occurs at $\gamma > 0$.

[Figure 7 plot: panels (a) to (d) show ScN (0 to 350) versus $\gamma$ for optimal group mean = 1, 2, 3, and 10; panels (e) to (h) show the corresponding optimal and suboptimal group marker densities versus c.]
Figure 7: Impact of dissociation of marker distributions between optimal and suboptimal groups on the number of patients needed to screen (ScN). Number of patients needed to screen is in subfigures 7(a):7(d) with corresponding plots of marker densities in subfigures 7(e):7(h).

[Figure 8 plot: panels (a) to (d) show ScN versus $\gamma$ for optimal group prevalence = 0.2, 0.4, 0.6, and 0.8; panels (e) to (h) show the corresponding optimal and suboptimal group marker densities versus c.]
Figure 8: Impact of prevalence of the optimal group on the number of patients needed to screen (ScN). Number of patients needed to screen is in subfigures 8(a):8(d) with corresponding plots of marker densities in subfigures 8(e):8(h). * Note that plot (a) needed a larger range for the y-axis (0 to 600).

[Figure 9 plot: (a) Separation of Subgroup Biomarker Distributions: a versus mean of optimal distribution (1 to 8) for 5%, 40%, 60%, and 95% prevalence; (b) Largest Attenuation Factor Across Prevalence of Optimal Distribution: a versus prevalence of optimal group for 80%, 90%, and 97.5% power.]
Figure 9: Impact of the degree of separation of subgroup biomarker distributions and of prevalence, for power of 80%, 90%, and 97.5%, on the values of the attenuation factor a for which the minimum number of screened patients is not at $\gamma = 0$.