A Bayesian approach to sample size determination for studies designed to evaluate continuous medical tests


Baylor Health Care System
From the SelectedWorks of Dunlei Cheng

A Bayesian approach to sample size determination for studies designed to evaluate continuous medical tests

Dunlei Cheng, Baylor Health Care System
Adam J. Branscum, University of Kentucky
James Stamey, Baylor University

Available at:

A Bayesian approach to sample size determination for studies designed to evaluate continuous medical tests

DUNLEI CHENG 1, ADAM J. BRANSCUM 2, and JAMES D. STAMEY 3

1 Institute for Health Care Research and Improvement, Baylor Health Care System, Dallas, TX 75206, USA
2 Departments of Biostatistics, Statistics, and Epidemiology, University of Kentucky, Lexington, KY 40536, USA
3 Department of Statistical Science, Baylor University, Waco, TX 76798, USA

Correspondence to: Dunlei Cheng, Institute for Health Care Research and Improvement, Baylor Health Care System, 8080 N. Central Expressway, Suite 500, Dallas, TX 75206, dunleic@baylorhealth.edu

Abstract

We develop a Bayesian approach to sample size and power calculations for cross-sectional studies that are designed to evaluate and compare continuous medical tests. For studies that involve one test or two conditionally independent or dependent tests, we present methods that are applicable when the true disease status of sampled individuals will be available and when it will not. Within a hypothesis testing framework, we consider the goal of demonstrating that a medical test has area under the receiver operating characteristic (ROC) curve that exceeds a minimum acceptable level or another relevant threshold, and the goals of establishing the superiority or equivalence of one test relative to another. A Bayesian average power criterion is used to determine a sample size that will yield high posterior probability, on average, of a future study correctly deciding in favor of these goals. The impacts on Bayesian average power of prior distributions, the proportion of diseased subjects in the study, and correlation among tests are investigated through simulation. The computational algorithm we develop involves simulating multiple data sets that are fit with Bayesian models using Gibbs sampling, and is executed by using WinBUGS in tandem with R.

Key Words: Diagnostic test, ROC curve, power calculations, simulation

1. Introduction

Medical tests are used to accurately classify individuals into one of several groups. In the two-group classification problem that we consider here, one or two tests are used to distinguish between two groups of individuals, which for ease of discussion we will refer to as a diseased (D) group and a non-diseased (ND) group. One phase in the development of a new medical test involves characterizing the test's ability to accurately discern D from ND individuals in the target population. The accuracy of a continuous test can be quantified by first defining a cutoff threshold, c, for a positive test, and then estimating the sensitivity, η(c), and specificity, θ(c), of the test at that cutoff. The parameter η(c) denotes the probability of a diseased individual having a positive test result at cutoff c, and θ(c) is the probability of a non-diseased individual having a

negative result. Without loss of generality we adopt the usual convention that test scores (y) are expected to be larger for the D group, so that η(c) = Pr(y > c | D) and θ(c) = Pr(y < c | ND).

Instead of focusing inference on a single cutoff value, an alternative approach to evaluating the accuracy of continuous tests, one that avoids the loss of information that comes from dichotomization, involves estimating the receiver operating characteristic (ROC) curve. The ROC curve is the plot of a test's true positive fraction (sensitivity) versus its false positive fraction (1 − specificity) across all possible cutoff thresholds. Thus, the ROC curve is obtained by plotting the pairs (1 − θ(c), η(c)) for all values of c. The area under the ROC curve (AUC) is a summary index that measures the overall accuracy of a test, reflecting with equal weight the test's ability to distinguish between subjects with and without a medical condition. The value of AUC typically ranges from 0.5 (for a useless diagnostic procedure that classifies disease status in a purely random fashion) to 1 (for tests that have perfect classification accuracy).
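To make the ROC construction described above concrete, the following short R sketch sweeps a grid of cutoffs c and plots the pairs (1 − θ(c), η(c)); the two normal score distributions used here are arbitrary illustrative choices rather than values taken from this paper.

# Trace an ROC curve by sweeping the cutoff c and plotting (1 - theta(c), eta(c)).
# The two normal score distributions below are purely illustrative.
cutoffs <- seq(-4, 7, length.out = 200)
eta   <- 1 - pnorm(cutoffs, mean = 2, sd = 1.4)  # sensitivity: Pr(y > c | diseased)
theta <- pnorm(cutoffs, mean = 0, sd = 1)        # specificity: Pr(y < c | non-diseased)
plot(1 - theta, eta, type = "l",
     xlab = "1 - specificity (false positive fraction)",
     ylab = "sensitivity (true positive fraction)")
abline(0, 1, lty = 2)  # diagonal reference line corresponding to AUC = 0.5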

In this paper, we treat AUC as the focal parameter for evaluating and comparing continuous medical tests when true disease status is known and when it is not, and we develop a simulation-based procedure for sample size estimation and power calculations in these contexts. We emphasize at the outset that although our focus is on the use of medical tests to classify health status, and our notation and terminology are consistent with biomedical applications, the methods presented in this paper apply more broadly. For instance, the methods we develop here can aid in sample size selection for investigating any general continuous classification procedure.

The remainder of the paper is organized as follows. Common goals of test accuracy studies and some background on ROC analysis are outlined in Section 2. Section 3 details the Bayesian models that we use in our sample size determination procedure. In Section 4 we discuss the Bayesian average power criterion used in our computational algorithm. Results from simulations are presented in Section 5, and concluding remarks are given in Section 6.

2. Goals and Background

In designing a study that will measure and/or compare test performance, an appropriate sample size that will ensure adequate statistical power without overextending limited resources is needed. We consider study designs that involve either a single medical test, or two conditionally independent or correlated tests. The possible goals of test accuracy studies are numerous. We focus on three common goals and note that many other cases can be handled with slight modifications of the ideas and methods presented here. We assume that a new and/or a standard test are under investigation, but in general the study could involve any tests. The goals include establishing that a continuous test has at least a certain desired level of accuracy, establishing that one test has superior accuracy over another test, and establishing that two tests are (practically) equivalent in terms of accuracy. Specifically, we consider power calculations to (i) verify that the AUC of a newly developed continuous medical test exceeds some threshold, (ii) verify that the AUC of a newly developed medical test is greater than that of a standard test, and (iii) verify that the AUC of a new diagnostic procedure is equivalent to that of a standard classifier. These three objectives were also the focus of

Branscum, Johnson, and Gardner (2007), who determined required sample sizes for estimating the sensitivity and specificity of binary tests. We do not discuss ordinal tests here; we refer the reader to Wang and Gatsonis (2008) for a Bayesian treatment of multi-test, multi-reader ordinal ROC analysis, including methods for sample size determination in that context.

We may represent the above three goals as hypothesis tests regarding AUC. Let the subscripts N and S denote the new and standard tests, and let AUC₀, λ, and ε represent pre-determined positive constants. In case (i), we formulate the hypothesis

H: AUC_N > AUC₀.

The hypotheses of test superiority (case ii) and equivalence (case iii) are written as

H: AUC_N − AUC_S > λ and H: |AUC_N − AUC_S| < ε,

respectively.

With respect to sample size determination, we assume that each proposed hypothesis is true and that a future study is being designed to test the hypothesis. A sample size is selected that ensures, on average, that the posterior probability of the hypothesis H is high when in fact H is true. In addition to being able to test in a single framework many different types of hypotheses that reflect many different study goals, we also build into our all-purpose sample size determination procedure the ability to accommodate a key complicating issue in medical test evaluation, namely handling data from sampled individuals whose true disease status is unknown. Information on true disease status will be missing when a

perfect reference test (also called a gold-standard test) does not exist or cannot be applied without unacceptable consequences. The methods developed here for sample size calculations in the gold-standard (GS) setting are a special case of those for the non-gold-standard (NGS) setting. Much of the literature on ROC analysis assumes that the true disease status is known for each subject; however, research on the development of NGS ROC analysis has recently increased (some examples include Branscum et al., 2008; Albert, 2007; Choi et al., 2006; Erkanli et al., 2006; Zhou, Castelluccio, and Zhou, 2005). Often, imperfect tests are used when GS tests are either too expensive or invasive, or do not exist with (near) perfect accuracy.

Obuchowski (1998) reviewed methods for sample size calculations for ROC curves and functionals of them, including procedures for a single diagnostic test, for comparing two diagnostic tests, and for multi-reader ROC analysis. Obuchowski and McClish (1997) studied sample size requirements when the ROC curve is only considered over a specific range of false positive values (partial AUC) rather than the full area under the ROC curve. Sample size estimation for clustered ROC curves was addressed in Obuchowski (1997), whose method incorporated a cluster design effect. All three of these papers assumed that a GS test was available to ascertain the true medical condition of each subject.

The present study addresses the issue of sample size and power estimation for ROC analysis via the Bayesian paradigm. A primary advantage of the Bayesian approach is the allowance for uncertainty in parameter values in the planning stages of the

experiment, as opposed to the use of plug-in values. The particular criterion we use is referred to by Wang and Gelfand (2002) as Bayesian power.

3. Bayesian Models

3.1. One Gold-Standard Test

Let TS_i^D (i = 1, …, n₁) and TS_j^ND (j = 1, …, n₂) denote scores of a new test obtained from a random sample of n (n = n₁ + n₂) individuals who have a disease or are disease-free, respectively. We suppose that a GS test has been used to identify each individual's true disease status before the application of the new test. We further assume that TS_i^D and TS_j^ND are both normally distributed or could be modeled with normal families after an appropriate transformation is applied, namely

TS_i^D ~ N(μ_D, σ²_D), i = 1, …, n₁,
TS_j^ND ~ N(μ_ND, σ²_ND), j = 1, …, n₂, (1)

where μ_D and μ_ND denote the means, and σ²_D and σ²_ND denote the variances, of the distributions of the measurements from the diseased and non-diseased populations, respectively. The AUC for a diagnostic test under this two-group normal model is given by

AUC = Φ( (μ_D − μ_ND) / √(σ²_D + σ²_ND) ), (2)

where Φ(·) is the c.d.f. of the standard normal distribution.
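As a quick numerical illustration of equation (2), the R sketch below evaluates the binormal AUC at a single set of planning values and then propagates uncertainty by drawing the parameters from uniform distributions similar to the sampling priors used later in Section 5.1; the specific numbers are illustrative only.

# Binormal AUC of equation (2)
binormal_auc <- function(mu_d, mu_nd, var_d, var_nd) {
  pnorm((mu_d - mu_nd) / sqrt(var_d + var_nd))
}

# A single plug-in planning value: mu_D = 3, mu_ND = 0, var_D = 2, var_ND = 1
binormal_auc(3, 0, 2, 1)  # approximately 0.958

# Propagating sampling-prior uncertainty instead of using one plug-in value
set.seed(1)
auc_draws <- binormal_auc(runif(5000, 2.5, 3.5), runif(5000, -0.5, 0.5),
                          runif(5000, 1.8, 2.2), runif(5000, 0.8, 1.2))
quantile(auc_draws, c(0.025, 0.5, 0.975))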

The completion of the Bayesian model requires a prior distribution for (μ_D, μ_ND, σ²_D, σ²_ND). We assume prior independence of the component parameters. We employ a Bayesian sample size determination method as described in Wang and Gelfand (2002). Their simulation-based approach requires the user to construct two sets of distributions. One set, called sampling or design priors, is used to simulate parameter values that are then used to generate multiple data sets, and the other set, called fitting or analysis priors, is used as priors for data analysis on the simulated data sets in the conventional way. By using sampling priors, the method accounts for uncertainty about the true values of parameters in the model, in contrast to using fixed planning estimates as is commonly done in frequentist sample size determination. Sampling priors contain substantive information, which can be extracted from historical data, based on expert experience and knowledge, or constructed using a combination of the two. One approach that has been previously used in practice places uniform distributions with relatively narrow intervals around the elicited planning estimate (Wang and Gelfand, 2002; Cheng, Stamey, and Branscum, 2009). For instance, suppose a value for μ_D of 2.5 is elicited. In the present Bayesian framework, uncertainty could be accounted for by generating μ_D from, say, a uniform(1.5, 3.5) distribution, or a normal distribution with mean 2.5 and an appropriate standard deviation. The fitting priors for μ_D and μ_ND in the data analysis portion of the sample size procedure are generally relatively flat normal distributions centered at zero. Additional details about these two sets of distributions are provided in Section 5.

3.2. One Non-Gold-Standard Test

When it is difficult or impossible to have a definitive diagnosis for tested individuals, the model in Section 3.1 needs to be modified to account for unknown disease status. We introduce a latent disease indicator variable Z_k, k = 1, …, n, where Z_k = 1 if the kth sampled individual is diseased and Z_k = 0 otherwise, and where n is the total number of subjects enrolled into the study. If individuals are sampled randomly from a large population that has disease prevalence π, then

Z_k ~ Bernoulli(π), k = 1, …, n. (3)

The data are modeled according to the mixture

TS_k ~ π f(· | μ_D, σ²_D) + (1 − π) g(· | μ_ND, σ²_ND), (4)

where f(· | μ_D, σ²_D) and g(· | μ_ND, σ²_ND) are the p.d.f.s of the N(μ_D, σ²_D) and N(μ_ND, σ²_ND) distributions, respectively. The procedure used to construct priors for μ_D, μ_ND, σ²_D, and σ²_ND is analogous to that previously described for the GS case. In addition, it is necessary to assign an informative fitting prior to π, and we furthermore incorporate the constraint μ_D > μ_ND to ensure identifiability (Choi et al., 2006). A beta prior distribution is often reasonable for a prevalence parameter (e.g., Johnson, Gastwirth, and Pearson, 2001; Joseph, Gyorkos, and Coupal, 1995, and many others). For the sampling prior, we use a uniform distribution over an elicited range, for instance π ~ uniform(0.3, 0.5).
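A minimal R sketch of simulating one data set under the latent-class mixture in (3) and (4) is given below; the prior ranges echo those mentioned above and in Section 5.1, while the function and object names are ours, not the authors'.

# Simulate one data set for the one-test, no-gold-standard setting of (3)-(4).
# Prior ranges follow Sections 3.2 and 5.1; all names are illustrative.
simulate_ngs_data <- function(n) {
  prev   <- runif(1, 0.3, 0.5)    # sampling prior for the prevalence pi
  mu_d   <- runif(1, 2.5, 3.5)
  mu_nd  <- runif(1, -0.5, 0.5)
  var_d  <- runif(1, 1.8, 2.2)
  var_nd <- runif(1, 0.8, 1.2)
  z  <- rbinom(n, 1, prev)        # latent disease indicators Z_k
  ts <- rnorm(n,
              mean = ifelse(z == 1, mu_d, mu_nd),
              sd   = sqrt(ifelse(z == 1, var_d, var_nd)))
  list(ts = ts, z = z)            # only ts would be visible to the analyst
}

set.seed(2)
str(simulate_ngs_data(100))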

3.3. Two Gold-Standard Tests

In the two-test scenario, both the standard and new tests will be applied to all sampled subjects. In the GS case, the medical condition is known for each subject, with n₁ subjects having the disease and n₂ subjects being disease-free. We assume an independent, two-group bivariate normal model for the future (possibly transformed) data, as in Choi et al. (2006). Here, TS_i^D and TS_j^ND denote vectors that contain scores from the standard and new tests for the ith diseased and jth non-diseased subjects, respectively, which are modeled as

TS_i^D = (TS_{S,i}^D, TS_{N,i}^D)′ ~ N₂(μ^D, Σ_D), i = 1, …, n₁,
TS_j^ND = (TS_{S,j}^ND, TS_{N,j}^ND)′ ~ N₂(μ^ND, Σ_ND), j = 1, …, n₂.

Using obvious notation for the component parameters, the AUCs of the standard and new diagnostic tests are given by

AUC_S = Φ( (μ_{S,D} − μ_{S,ND}) / √(σ²_{S,D} + σ²_{S,ND}) ) and AUC_N = Φ( (μ_{N,D} − μ_{N,ND}) / √(σ²_{N,D} + σ²_{N,ND}) ). (5)

The decision regarding superiority of the new test, or equivalence of test accuracy, is determined by the magnitude of AUC_N − AUC_S. With two GS tests, sampling and fitting priors are assigned to the following parameters: the four means μ_{S,D}, μ_{N,D}, μ_{S,ND}, μ_{N,ND}; the four variances σ²_{S,D}, σ²_{N,D}, σ²_{S,ND}, σ²_{N,ND}; and the two correlations ρ_D and ρ_ND between the standard and new tests in the diseased and non-diseased groups. Since in most situations with dependent tests the new and standard tests are positively correlated, the lower bounds of ρ_D and ρ_ND are assumed to be no smaller than 0. Uniform and beta distributions are used as sampling and fitting priors for ρ_D and ρ_ND in our study.

3.4. Two Non-Gold-Standard Tests

In the two-test, NGS scenario, we again use the latent Bernoulli variables Z_k, k = 1, …, n, with parameter π, just as was done in the one NGS test case. The kth measurement is assumed to follow a mixture of two bivariate normal distributions,

TS_k ~ π N₂(μ^D, Σ_D) + (1 − π) N₂(μ^ND, Σ_ND). (6)

Prior construction for π is done in the same manner as discussed in Section 3.2. In order to avoid problems with identifiability in this mixture model, we add the constraints μ_{S,D} > μ_{S,ND} and μ_{N,D} > μ_{N,ND}, and place an informative prior on π. All the other prior densities are as described in the two GS test case.

4. Bayesian Power Criterion and Simulation Algorithm

We apply the Bayesian power criterion proposed by Wang and Gelfand (2002) within a hypothesis testing framework. For case (i), this criterion selects a combination of n₁ and n₂ (GS test), or n (NGS test), so that the posterior probability, averaged over potential future data sets, that the AUC of the new diagnostic test exceeds some benchmark, AUC₀, is sufficiently high when in fact the AUC is expected to be greater than AUC₀. Specifically, the average power criterion in this setting is

E{Pr(AUC_N > AUC₀ | TS_m^D, TS_m^ND)} ≥ 1 − β,

where TS_m^D and TS_m^ND denote test scores from future study data associated with diseased and non-diseased subjects. Typical values for β are 0.05, 0.10, and 0.20, and the value of AUC₀ is problem-specific, but for accurate tests the values 0.85, 0.9, or 0.95 can be used.

Similar expressions in the two-test situation can be formulated for superiority and equivalence studies:

E{Pr(AUC_N − AUC_S > λ | TS_{N,m}^D, TS_{N,m}^ND, TS_{S,m}^D, TS_{S,m}^ND)} ≥ 1 − β,
E{Pr(|AUC_N − AUC_S| < ε | TS_{N,m}^D, TS_{N,m}^ND, TS_{S,m}^D, TS_{S,m}^ND)} ≥ 1 − β,

where the potential future data sets, (TS_{N,m}^D, TS_{N,m}^ND, TS_{S,m}^D, TS_{S,m}^ND), represent a composition of m new and standard test scores on subjects with or without disease. Choices of λ include, for instance, 0.1, 0.15, and 0.2, and ε is chosen to be a small constant, such as 0.05.

Wang and Gelfand (2002) argue that sampling priors should be informative, whereas fitting priors for data analysis can be less informative or even diffuse. In our study, the sampling priors for all parameters are uniform distributions centered on the most likely value and with a small range. However, identifiability issues prevent the use of diffuse fitting priors for all parameters with NGS data, so π, ρ_D, and ρ_ND are each assigned a beta fitting prior that has the same mean as the corresponding uniform sampling prior, but with slightly larger variance. The fitting and sampling priors for the mean parameters follow the order constraints outlined in Sections 3.1 and 3.3. Choi et al. (2006) handled this by constraining the normal sampling models over a parameter space in which the mean test score for the diseased population is greater than the mean for the non-diseased population. We found that computation using this truncated normal model breaks down for certain data sets in the simulation framework required for our sample size procedure. Therefore, as a practical alternative, we instead mitigated the lack of identifiability by modeling the means for the diseased and non-diseased populations with prior distributions that have non-overlapping 99% intervals. The variance parameters are given somewhat diffuse inverse gamma fitting priors.

The following algorithm can be used to compute Bayesian power for the one GS test setting of case (i). Similar algorithms can be used for cases (ii) and (iii), and for studies that will involve NGS data and/or other goals.

1. Specify AUC₀, β, and G pairs of sample size combinations (n₁, n₂).

2. For l = 1, …, B Monte Carlo iterations, at each sample size combination:

(i) Generate values of μ_D, μ_ND, σ²_D, and σ²_ND from their sampling priors.

(ii) Simulate data TS_{l,i}^D, i = 1, …, n₁, and TS_{l,j}^ND, j = 1, …, n₂, according to formula (1).

(iii) To each simulated data set generated in step (ii), fit the two-group normal model and approximate the posterior distribution of AUC_l as defined in equation (2). A Gibbs sampler can be employed at this step.

(iv) For the lth simulated data set, calculate the posterior probability that AUC_l exceeds AUC₀. This posterior probability is computed as the proportion of Monte Carlo (MC) iterates sampled from the posterior of AUC_l that are greater than AUC₀. An approximation to Pr(AUC_l > AUC₀ | TS_l^D, TS_l^ND) is obtained as

p_l = (1/M) Σ_{t=1}^{M} I(AUC_l^t > AUC₀),

where I(·) denotes the indicator function, M denotes the number of MC iterates, and AUC_l^t denotes the tth MC iterate generated from the posterior of AUC_l.

(v) Calculate the Bayesian average power for each sample size combination, which is obtained as (1/B) Σ_{l=1}^{B} p_l.

3. Fit a curve or surface through the G Bayesian power values and find an adequate sample size combination for the desired power.
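The following is a minimal, self-contained R sketch of this algorithm for the one GS test setting. The authors' implementation calls WinBUGS from R; here the Gibbs sampler for the two-group normal model is coded directly, the fitting and sampling priors mirror Section 5.1, and all function names and default values are our own illustrative choices. Repeating the final call over a grid of (n₁, n₂) combinations and smoothing the resulting power values corresponds to step 3.

# Bayesian average power for one gold-standard test (case i): a bare-bones sketch.
binormal_auc <- function(mu_d, mu_nd, var_d, var_nd)
  pnorm((mu_d - mu_nd) / sqrt(var_d + var_nd))

# Gibbs sampler for one group: y ~ N(mu, sig2), mu ~ N(m0, s0^2), sig2 ~ IG(a, b)
gibbs_group <- function(y, m0, s0, a, b, iters, burn) {
  n <- length(y); mu <- mean(y); sig2 <- var(y)
  keep_mu <- keep_sig2 <- numeric(iters)
  for (t in 1:(burn + iters)) {
    prec <- n / sig2 + 1 / s0^2
    mu   <- rnorm(1, (sum(y) / sig2 + m0 / s0^2) / prec, sqrt(1 / prec))
    sig2 <- 1 / rgamma(1, shape = a + n / 2, rate = b + 0.5 * sum((y - mu)^2))
    if (t > burn) { keep_mu[t - burn] <- mu; keep_sig2[t - burn] <- sig2 }
  }
  list(mu = keep_mu, sig2 = keep_sig2)
}

# Steps 1-2 of the algorithm for a single sample size combination (n1, n2)
bayes_avg_power <- function(n1, n2, auc0 = 0.9, B = 100, iters = 2000, burn = 500) {
  p <- numeric(B)
  for (l in 1:B) {
    # (i) draw parameter values from the sampling priors
    mu_d  <- runif(1, 2.5, 3.5);  mu_nd  <- runif(1, -0.5, 0.5)
    var_d <- runif(1, 1.8, 2.2);  var_nd <- runif(1, 0.8, 1.2)
    # (ii) simulate a future data set from model (1)
    y_d  <- rnorm(n1, mu_d,  sqrt(var_d))
    y_nd <- rnorm(n2, mu_nd, sqrt(var_nd))
    # (iii) fit the two-group normal model under the fitting priors
    fit_d  <- gibbs_group(y_d,  m0 = 3, s0 = 0.58, a = 0.5, b = 0.5, iters, burn)
    fit_nd <- gibbs_group(y_nd, m0 = 0, s0 = 0.58, a = 0.5, b = 0.5, iters, burn)
    auc_post <- binormal_auc(fit_d$mu, fit_nd$mu, fit_d$sig2, fit_nd$sig2)
    # (iv) posterior probability that AUC exceeds the benchmark AUC_0
    p[l] <- mean(auc_post > auc0)
  }
  mean(p)  # (v) Bayesian average power at this sample size combination
}

set.seed(3)
bayes_avg_power(30, 30, B = 50)  # modest B only to keep the example quick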

In the case of one NGS test, steps (i)-(iii) need to be altered. In step (i), the latent variables Z_k need to be generated after obtaining a value of π from its sampling prior; then, based on expression (4), data are simulated in step (ii). The approximation of the posterior distribution of AUC_l requires the fitting prior of π in step (iii).

For studies that involve two GS tests, values of μ_{S,D}, μ_{N,D}, μ_{S,ND}, μ_{N,ND}, σ²_{S,D}, σ²_{N,D}, σ²_{S,ND}, σ²_{N,ND}, ρ_D, and ρ_ND need to be generated in step (i) via sampling priors. In the next step, we simulate scores from the standard test, namely TS_{S,i}^D and TS_{S,j}^ND, using formula (1). Then, values of the new test are simulated conditional on the data from the standard test, i.e., we generate TS_{N,i}^D | TS_{S,i}^D and TS_{N,j}^ND | TS_{S,j}^ND. In step (iii), the joint posterior distribution of both AUCs is obtained from a two-group bivariate normal analysis. In the next step, if the goal of the study is to demonstrate the superiority of the new diagnostic test relative to the standard test, calculate the posterior probability Pr(AUC_{N,l} − AUC_{S,l} > λ | TS_{N,l}^D, TS_{N,l}^ND, TS_{S,l}^D, TS_{S,l}^ND) using

p_l = (1/M) Σ_{t=1}^{M} I(AUC_{N,l}^t − AUC_{S,l}^t > λ).

If the study objective is to demonstrate equivalence, the value

p_l = (1/M) Σ_{t=1}^{M} I(|AUC_{N,l}^t − AUC_{S,l}^t| < ε)

is an MC approximation to Pr(|AUC_{N,l} − AUC_{S,l}| < ε | TS_{N,l}^D, TS_{N,l}^ND, TS_{S,l}^D, TS_{S,l}^ND). In both cases, the average power is obtained as the mean of the p_l's. When the disease condition is not identified in the case of two tests, data are simulated and analyzed after incorporating the Bernoulli random variables Z_k.
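The two-step simulation of paired scores just described (draw the standard-test score first, then the new-test score given the standard one) can be sketched in R as follows; the parameter values loosely echo the diseased-group entries of Table 1, and the function name is ours.

# Simulate paired (standard, new) scores for one group: draw TS_S, then TS_N | TS_S
# from the bivariate normal model. Parameter values are illustrative only.
r_paired_scores <- function(n, mu_s, mu_n, var_s, var_n, rho) {
  ts_s <- rnorm(n, mu_s, sqrt(var_s))
  cond_mean <- mu_n + rho * sqrt(var_n / var_s) * (ts_s - mu_s)
  cond_var  <- var_n * (1 - rho^2)
  ts_n <- rnorm(n, cond_mean, sqrt(cond_var))
  cbind(ts_s = ts_s, ts_n = ts_n)
}

set.seed(4)
head(r_paired_scores(5, mu_s = 1, mu_n = 3, var_s = 2, var_n = 2, rho = 0.5))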

All computations in these algorithms can be carried out using R, and posterior distributions can be approximated using the WinBUGS software. Computer code for some examples is available online.

5. Illustrations

We consider sample size calculations in a variety of scenarios. We first investigate the impact of AUC₀ and λ on the required sample size. In the one-test and two-test settings, we compare Bayesian average power when disease status is available to the case where this information is not ascertained. The simulations also assess the influence of sampling priors on average power. The issue of the fraction of non-diseased versus diseased subjects is also studied. Moreover, we discuss the influence of ρ_D and ρ_ND on sample size when two tests are applied to each subject. These simulations are all performed for case (i) or (ii). In addition, we calculate the required sample size under the proposed Bayesian method and under a standard frequentist method in the one GS test setting.

All simulations in this section use 1000 data sets generated as described in Section 4, with 5000 posterior iterates after a 1000-iteration burn-in used for posterior approximations with each simulated data set. Computational times for the one GS, one NGS, and two GS cases are relatively fast, with more time required for the two NGS case. For example, using a Dell PC with a 2.66 GHz processor and 3.5 GB of RAM, the run time of a simulation with 1000 data sets is less than 30 minutes for one GS test, about 70 minutes for one NGS test, 2.5 hours for two GS tests, and 6.5 hours for two NGS tests with a total sample size of 100.

5.1 Impact of AUC₀ and λ on Sample Size

We first illustrate how the average Bayesian power varies over a range of sample size combinations for different values of AUC₀ and λ. Three benchmark values were selected, namely AUC₀ ∈ {0.85, 0.9, 0.95} and λ ∈ {0.1, 0.15, 0.2}. For both one and two GS tests, the sample size combinations used are (n₁, n₂) = (10, 10), (20, 20), …, (100, 100). When medical conditions are unknown, π is given a uniform(0.4, 0.6) sampling prior and a beta(5, 5) fitting prior, with both priors reflecting a 50-50 chance of each individual having the disease. The total sample sizes used in the NGS scenario are n = 20, 40, …, 200, matching the totals used in the GS cases.

In the simulations involving a single test, we used a uniform(2.5, 3.5) distribution as the sampling prior for μ_D and a uniform(-0.5, 0.5) as the sampling prior for μ_ND, so that the average test score in the disease group is larger than that in the disease-free group. Uniform(1.8, 2.2) and uniform(0.8, 1.2) distributions were used as the sampling priors of σ²_D and σ²_ND, respectively, which conforms to the convention that test values tend to fluctuate more widely in the disease group. For the fitting priors of μ_D and μ_ND, a normal(3, 0.58) and a normal(0, 0.58) distribution were used, respectively, for the diseased and non-diseased populations. The lower endpoint of the central 99% interval of the normal(3, 0.58) is 1.506, and the upper endpoint of the central 99% interval of the normal(0, 0.58) is 1.494; therefore, the fitting priors for μ_D and μ_ND have minimal overlap with each other. The fitting priors for σ²_D and σ²_ND were relatively non-informative IG(0.5, 0.5) distributions, reflecting the prior belief that σ²_ND is smaller, as is commonly seen in practice.

Figure 1 contains three cubic polynomial spline fits to the Bayesian power estimates at (n₁, n₂) = (10, 10), (20, 20), …, (100, 100), corresponding to the different values of AUC₀. The size of AUC₀ appreciably alters the magnitude of the average Bayesian power. When AUC₀ = 0.85, the smallest Bayesian power is 0.97, with only 10 subjects in each group, and the power is almost 1 (0.996) with a total sample size of 200. However, when AUC₀ is raised to 0.95, the largest Bayesian power never exceeds 0.68 even after recruiting as many as 100 subjects in each group. The average Bayesian power for AUC₀ = 0.9 ranged from 0.768 to 0.943 across the selected range of sample sizes.

Figure 1 about here

The corresponding Bayesian power curves for the one NGS test are also plotted in Figure 1. Clearly, a study involving individuals with unknown disease status will be less powerful than a study in which medical conditions are known. For example, using a third-order polynomial regression fit, at least 130 subjects are required to achieve a Bayesian power of 0.8 with AUC₀ = 0.9 when one NGS test is used. However, to obtain the same power with the same AUC₀, the sample size is reduced to 30 if one GS test is applied.

When the number of GS tests increases to two, six more parameters are needed to complete the model. Table 1 categorizes the sampling and fitting priors used for the parameters in the two-test GS simulations. The sampling prior means of the standard-test means μ_{S,ND} and μ_{S,D} are 0 and 1, and those of the new-test means μ_{N,ND} and μ_{N,D} are 0 and 3. The fitting priors for μ_{S,ND} and μ_{S,D}, and for μ_{N,ND} and μ_{N,D}, are formulated so that there is no overlap among their 99% intervals. Both the

sampling and fitting priors of ρ_D and ρ_ND are informative, and each has a prior mean of 0.5.

Table 1 about here

Figure 2 demonstrates the impact of the effect size (λ) for superiority on the average Bayesian power. With λ = 0.15, a decrease in power of approximately 5 to 10 per cent is seen at each sample size compared with λ = 0.1. The larger effect size of λ = 0.2 further reduces the average power by about 30 to 43 per cent at every sample size in comparison with λ = 0.1.

Figure 2 about here

Figure 2 also shows the corresponding three splines for the two NGS test case. A comparison of the GS and NGS curves demonstrates that unknown disease status requires a much larger sample size in this example. When λ = 0.1, a power of 0.95 requires only 30 subjects in the two GS test scenario. However, unidentified disease status adds more than 100 additional subjects to obtain the same power.

5.2 Impact of Sampling Priors

We next investigate changes in sample size that result from different sampling priors for the mean or variance parameters. We consider GS scenarios with AUC₀ = 0.9 for the

one-test case and λ = 0.15 for two tests; results for the latter are placed on a supplementary website in order to save space. In the one-test case, three different sets of sampling priors for μ_D and σ²_D are compared. One set follows those discussed in Section 5.1, with μ_D ~ uniform(2.5, 3.5) and σ²_D ~ uniform(1.8, 2.2). The second set uses the same uniform distribution for σ²_D but increases the prior mean of μ_D from 3 to 3.5. The third set gives σ²_D a more precise sampling prior, a uniform(1.4, 1.8), and uses the same prior on μ_D as in the first set of sampling priors.

The results presented in Table 2 show that either raising the prior mean of μ_D or reducing the prior mean of σ²_D boosts the average power, which was expected since both will increase the separation between the distributions of test scores for the diseased and non-diseased populations. With the first set of sampling priors, to achieve power of at least 0.9 a study needs to enroll more than 80 individuals. However, when the sampling prior of σ²_D changes to the uniform(1.4, 1.8), only 36 total subjects are required to obtain a 90% level of average power. With a uniform(3, 4) sampling prior on μ_D, the sample size decreases from 80 to 30 for a power of 0.9. Comparing the power values at each sample size shows that, in this example, a 16.67% increase in the prior mean of μ_D raises the average power by a greater amount than a 20% increase in the prior precision of σ²_D.

Table 2 about here

The results summarizing the impact of mean and variance sampling priors for the two GS test scenario can be found online.

5.3 Impact of Prevalence and the Ratio of Diseased-to-Non-Diseased Subjects

The simulations in the previous two sections were based on a population prevalence of 50%, so the ratio of diseased to non-diseased sampled individuals was 1. A balanced allocation of subjects is not to be expected when the prevalence differs from 0.5. We therefore study the influence of an imbalanced design on sample size determination for ROC analysis. We fix the total sample size at 100 for the one GS and NGS cases. The sample size combinations for the GS scenario are (n₁, n₂) = (10, 90), (20, 80), …, (80, 20), (90, 10). For unknown disease status, we investigate changes in average power when the prior mean of π ranges from 0.2 to 0.8 in increments of 0.1. In both cases, AUC₀ was set equal to 0.9.

The average Bayesian power does not reach its maximum when there are equal numbers of subjects in the diseased and disease-free groups, as demonstrated in Table 3. In fact, with a ratio of n₁ to n₂ of 2.33 (70 vs. 30), we obtained the largest average power of 0.917 in this example. The average power decreased by only 0.001 when the sample size ratio was 1.5. However, further increases of the ratio to 4 and 9 reduce the Bayesian power by about 1 and 4 per cent, respectively, compared with the balanced design. Table 3 also shows that the power continues to decrease as the ratio between n₁ and n₂ approaches 0. In this simulation, a larger value of average power is achieved if more diseased subjects, rather than more disease-free subjects, are recruited.

Table 3 about here

The website mentioned in Section 5.2 also presents results regarding prevalence prior changes for a single NGS test.

5.4 Impact of Correlation

Here we evaluate how correlation among tests affects the sample size requirement in ROC studies. We compare four scenarios involving two GS tests with λ equal to 0.15. The first scenario uses the sampling and fitting priors of ρ_D and ρ_ND listed in Table 1. In the second scenario, we increase the prior mean of ρ_D from 0.5 to 0.8 for both data simulation and data analysis. In the third scenario, we change the sampling and fitting prior means of ρ_ND to 0.8. The last scenario considered assumes conditional independence of the two tests by setting ρ_D and ρ_ND equal to 0. Two tests that have a different biological basis are often viewed as conditionally independent (see, for example, Branscum, Gardner, and Johnson, 2005).

Based on Table 4, a total of at least 60 subjects is needed to achieve a power of 0.9 to detect a 0.15 difference in AUC when the prior means of both ρ_D and ρ_ND are 0.5. Our simulations show that ρ_D has a greater impact on Bayesian power than ρ_ND. When the sampling prior mean of ρ_D increases from 0.5 to 0.8, the average Bayesian power increases by about 0.015 across all sample sizes. However, the same amount of change in ρ_ND leads to an increase in Bayesian power of only 0.006, less than half the size of the power increase seen with the same amount of change in

ρ_D. When ρ_D and ρ_ND are both equal to 0, more subjects are required in order to maintain the same level of power.

Table 4 about here

5.5 Impact of Accounting for Uncertainty

We investigate the impact of the degree of uncertainty in sampling and fitting priors on sample size requirements for one GS test with AUC₀ = 0.9. The following scenarios are considered: (1) the same sampling and informative fitting priors as used in Section 5.1; (2) the same sampling priors as used in Section 5.1, but diffuse proper fitting priors that approximate Jeffreys' prior for this model, namely normal distributions with mean 0 and large variance for the means and IG(0.1, 0.1) for the variances; (3 and 4) sampling priors are not used; instead, data are simulated using fixed parameter values (μ_D = 3, μ_ND = 0, σ²_D = 2, and σ²_ND = 1 for case 3, and μ_D = 3.5, μ_ND = -0.5, σ²_D = 2, and σ²_ND = 1 for case 4).

Table 5 reports average power for these four cases when 10 to 100 diseased and non-diseased individuals are sampled. Comparing informative to diffuse fitting priors (case 1 versus case 2), the difference in average power does not exceed 0.02 with a total of 140 or more subjects, and the difference remains relatively low (0.058) with only 10 subjects per group. The similarities in power for cases 1 and 2 support the use of diffuse fitting priors as the default in one GS test studies. In case 3, fixed inputs instead of sampling priors were used to simulate data sets; specifically, the means of the sampling priors from case 1 were used as fixed input values. The average power is consistently higher (by between 0.7% and 6.1%) for case 3 compared to case 1. In case 4, we shifted the value of the

mean for the diseased population up by 0.5 and shifted the mean for the non-diseased population down by 0.5, relative to the fixed values used in case 3. This increased the average power substantially, which highlights the sensitivity of the method to fixed input values and the need to select them carefully when they are used.

Table 5 about here

5.6 Traditional Methods for Sample Size Calculations

The review article by Obuchowski (1998) and the paper by Obuchowski and McClish (1997) contain sample size formulas for measures of medical test performance. We consider one GS test that has a true AUC of 0.9, with the goal of testing null hypothesis values of 0.8 or 0.85. The effect sizes are therefore 0.1 and 0.05, respectively. The parameter values from cases 3 and 4 in Section 5.5 are used to determine fixed inputs (using formula T1 in Obuchowski, 1998) for frequentist sample size computation under each effect size, yielding four total scenarios. Taking averages over the four scenarios, we calculated that 23 subjects in each group are needed to reach 80% power, 30 subjects per group to reach 90% power, and 38 subjects per group for 95% power. The proposed Bayesian approach, using the sampling and fitting priors outlined in case 1 of Section 5.5, gives sample sizes of 14 subjects in each group for 80% power, 40 subjects per group for 90% power, and 110 subjects per group for 95% power. In this setting, accounting for uncertainty in parameter values through sampling priors leads to larger required sample sizes at 90% power, and to more than double the number of individuals needed at 95% power. Similar qualitative findings were reported by Branscum, Johnson, and Gardner (2007) for binary tests.
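For readers who want a reproducible point of comparison, the R sketch below carries out a frequentist sample size calculation of the same general type, but it uses the Hanley and McNeil (1982) variance approximation rather than the exact inputs of formula T1 in Obuchowski (1998), so its output is not expected to match the numbers quoted above.

# One-test frequentist sample size for H0: AUC = auc0 vs AUC = auc1, one-sided
# level-alpha test, equal group sizes, Hanley-McNeil variance approximation.
hm_var <- function(a) {        # large-n factor: Var(AUC-hat) is roughly hm_var(a)/n
  q1 <- a / (2 - a)
  q2 <- 2 * a^2 / (1 + a)
  q1 + q2 - 2 * a^2
}
n_per_group <- function(auc0, auc1, alpha = 0.05, power = 0.9) {
  za <- qnorm(1 - alpha); zb <- qnorm(power)
  ceiling((za * sqrt(hm_var(auc0)) + zb * sqrt(hm_var(auc1)))^2 / (auc1 - auc0)^2)
}
n_per_group(0.85, 0.9)   # effect size 0.05
n_per_group(0.80, 0.9)   # effect size 0.10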

6. Conclusions

We present a Bayesian approach to sample size and power calculations for ROC studies designed to measure and compare the performance of medical tests. The criterion adopted for this problem is Bayesian average power, which can be applied to several common study designs involving a single test or two tests, both with and without gold-standard information. Through simulation studies we illustrated the impact on the required sample size of the effect size, the ratio of the number of diseased to non-diseased subjects enrolled in a study, disease prevalence, and correlation among tests. The simulation study emphasizes the importance of incorporating prior information, especially at the data simulation step and with NGS tests.

Further research in this area may extend the Bayesian framework to sample size estimation for designs with clustered data. Power and sample size calculations for studies involving three or more tests or repeated testing are also worth further investigation. Methods designed specifically for ordinal tests without a gold standard are also needed.

Acknowledgments

We thank two anonymous referees for their helpful suggestions, which resulted in an improved manuscript.

References

Albert, P.S. (2007). Random effects modeling approaches for estimating ROC curves from repeated ordinal tests without a gold standard. Biometrics 63.

Branscum, A.J., Gardner, I.A., and Johnson, W.O. (2005). Estimation of diagnostic test sensitivity and specificity through Bayesian modeling. Preventive Veterinary Medicine 68.

Branscum, A.J., Johnson, W.O., and Gardner, I.A. (2007). Sample size calculations for studies designed to evaluate diagnostic test accuracy. Journal of Agricultural, Biological, and Environmental Statistics 16.

Branscum, A.J., Johnson, W.O., Hanson, T.E., and Gardner, I.A. (2008). Bayesian semiparametric ROC curve estimation and disease diagnosis. Statistics in Medicine 27.

Cheng, D., Stamey, J.D., and Branscum, A.J. (2009). Bayesian approach to average power calculation for binary regression with misclassified outcomes. Statistics in Medicine, DOI: 10.1002/sim.355.

Choi, Y.-K., Johnson, W.O., Collins, M.T., and Gardner, I.A. (2006). Bayesian inferences for receiver operating characteristic curves in the absence of a gold standard. Journal of Agricultural, Biological, and Environmental Statistics 11, 210-229.

Erkanli, A., Sung, M., Costello, E.J., and Angold, A. (2006). Bayesian semi-parametric ROC analysis. Statistics in Medicine 25.

Johnson, W.O., Gastwirth, J.L., and Pearson, L.M. (2001). Screening without a gold standard: the Hui-Walter paradigm revisited. American Journal of Epidemiology 153.

Joseph, L., Gyorkos, T.W., and Coupal, L. (1995). Bayesian estimation of disease prevalence and the parameters of diagnostic tests in the absence of a gold standard. American Journal of Epidemiology 141.

Liu, J.-P., Ma, M.-C., Wu, C.-Y., and Tai, J.-Y. (2006). Tests of equivalence and non-inferiority for diagnostic accuracy based on the paired areas under ROC curves. Statistics in Medicine 25.

Obuchowski, N.A. (1997). Nonparametric analysis of clustered ROC curve data. Biometrics 53.

Obuchowski, N.A. (1998). Sample size calculations in studies of test accuracy. Statistical Methods in Medical Research 7.

Obuchowski, N.A. (2006). An ROC-type measure of diagnostic accuracy when the gold standard is continuous-scale. Statistics in Medicine 25.

Obuchowski, N.A. and McClish, D.K. (1997). Sample size determination for diagnostic accuracy studies involving binormal ROC curve indices. Statistics in Medicine 16.

Wang, F. and Gelfand, A.E. (2002). A simulation-based approach to Bayesian sample size determination for performance under a given model and for separating models. Statistical Science 17.

Wang, F. and Gatsonis, C.A. (2008). Hierarchical models for ROC curve summary measures: Design and analysis of multi-reader, multi-modality studies of medical tests. Statistics in Medicine 27.

Zhou, X.-H., Castelluccio, P., and Zhou, C. (2005). Nonparametric estimation of ROC curves in the absence of a gold standard. Biometrics 61.

Table 1: Sampling and fitting priors for the two GS test scenario

Parameter     Sampling Prior        Fitting Prior
μ_{S,ND}      uniform(-0.5, 0.5)    normal(0, 0.19)
μ_{S,D}       uniform(0.75, 1.25)   normal(1, 0.19)
μ_{N,ND}      uniform(-0.5, 0.5)    normal(0, 0.58)
μ_{N,D}       uniform(2.5, 3.5)     normal(3, 0.58)
σ²_{S,ND}     uniform(0.8, 1.2)     IG(0.5, 0.5)
σ²_{S,D}      uniform(1.8, 2.2)     IG(0.5, 0.5)
σ²_{N,ND}     uniform(0.8, 1.2)     IG(0.5, 0.5)
σ²_{N,D}      uniform(1.8, 2.2)     IG(0.5, 0.5)
ρ_D           uniform(0.4, 0.6)     beta(5, 5)
ρ_ND          uniform(0.4, 0.6)     beta(5, 5)

Table 2: Bayesian power with one GS test when AUC₀ = 0.9, under three different combinations of sampling priors for μ_D and σ²_D: (1) μ_D ~ uniform(2.5, 3.5) and σ²_D ~ uniform(1.8, 2.2), (2) μ_D ~ uniform(3, 4) and σ²_D ~ uniform(1.8, 2.2), and (3) μ_D ~ uniform(2.5, 3.5) and σ²_D ~ uniform(1.4, 1.8).

Sample Size    Power 1    Power 2    Power 3
(10, 10)
(20, 20)
(30, 30)
(40, 40)
(50, 50)
(60, 60)
(70, 70)
(80, 80)
(90, 90)
(100, 100)

Table 3: Bayesian power with one GS test when AUC₀ = 0.9

(n₁, n₂)    Bayesian Power
(10, 90)    0.83
(20, 80)    0.878
(30, 70)    0.896
(40, 60)    0.9
(50, 50)    0.91
(60, 40)    0.916
(70, 30)    0.917
(80, 20)    0.91
(90, 10)    0.876

Table 4: Bayesian power with two GS tests when λ = 0.15, under four different correlation structures: (1) sampling priors ρ_D, ρ_ND ~ uniform(0.4, 0.6) and fitting priors ρ_D, ρ_ND ~ beta(5, 5); (2) ρ_ND same as in (1), with sampling prior ρ_D ~ uniform(0.7, 0.9) and fitting prior ρ_D ~ beta(4, 1); (3) sampling prior ρ_ND ~ uniform(0.7, 0.9) and fitting prior ρ_ND ~ beta(4, 1), with ρ_D same as in (1); and (4) ρ_D = ρ_ND = 0 in both data simulation and analysis.

Sample Size    Power 1    Power 2    Power 3    Power 4
(10, 10)
(20, 20)
(30, 30)
(40, 40)
(50, 50)
(60, 60)
(70, 70)
(80, 80)
(90, 90)
(100, 100)

Table 5: Bayesian power under informative (1) and diffuse (2) fitting priors, and when sampling priors are replaced with fixed point estimates (cases 3 and 4).

Sample Size    Ave. power (case 1)    Ave. power (case 2)    Ave. power (case 3)    Ave. power (case 4)
(10, 10)
(20, 20)
(30, 30)
(40, 40)
(50, 50)
(60, 60)
(70, 70)
(80, 80)
(90, 90)
(100, 100)

Figure 1: Bayesian power curves for the one-test scenario. Bold lines represent one GS test and thin lines one NGS test. The three line types correspond to AUC₀ = 0.85, 0.9, and 0.95.

Figure 2: Bayesian power curves for the two-test scenario. Bold lines represent two GS tests and thin lines two NGS tests. The three line types correspond to λ = 0.1, 0.15, and 0.2.


More information

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida Vol. 2 (1), pp. 22-39, Jan, 2015 http://www.ijate.net e-issn: 2148-7456 IJATE A Comparison of Logistic Regression Models for Dif Detection in Polytomous Items: The Effect of Small Sample Sizes and Non-Normality

More information

Bayesian Dose Escalation Study Design with Consideration of Late Onset Toxicity. Li Liu, Glen Laird, Lei Gao Biostatistics Sanofi

Bayesian Dose Escalation Study Design with Consideration of Late Onset Toxicity. Li Liu, Glen Laird, Lei Gao Biostatistics Sanofi Bayesian Dose Escalation Study Design with Consideration of Late Onset Toxicity Li Liu, Glen Laird, Lei Gao Biostatistics Sanofi 1 Outline Introduction Methods EWOC EWOC-PH Modifications to account for

More information

Bayes Theorem Application: Estimating Outcomes in Terms of Probability

Bayes Theorem Application: Estimating Outcomes in Terms of Probability Bayes Theorem Application: Estimating Outcomes in Terms of Probability The better the estimates, the better the outcomes. It s true in engineering and in just about everything else. Decisions and judgments

More information

BIOSTATISTICAL METHODS

BIOSTATISTICAL METHODS BIOSTATISTICAL METHODS FOR TRANSLATIONAL & CLINICAL RESEARCH Designs on Micro Scale: DESIGNING CLINICAL RESEARCH THE ANATOMY & PHYSIOLOGY OF CLINICAL RESEARCH We form or evaluate a research or research

More information

MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES OBJECTIVES

MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES OBJECTIVES 24 MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES In the previous chapter, simple linear regression was used when you have one independent variable and one dependent variable. This chapter

More information

Unequal Numbers of Judges per Subject

Unequal Numbers of Judges per Subject The Reliability of Dichotomous Judgments: Unequal Numbers of Judges per Subject Joseph L. Fleiss Columbia University and New York State Psychiatric Institute Jack Cuzick Columbia University Consider a

More information

Bayesian Latent Subgroup Design for Basket Trials

Bayesian Latent Subgroup Design for Basket Trials Bayesian Latent Subgroup Design for Basket Trials Yiyi Chu Department of Biostatistics The University of Texas School of Public Health July 30, 2017 Outline Introduction Bayesian latent subgroup (BLAST)

More information

Decision Making in Confirmatory Multipopulation Tailoring Trials

Decision Making in Confirmatory Multipopulation Tailoring Trials Biopharmaceutical Applied Statistics Symposium (BASS) XX 6-Nov-2013, Orlando, FL Decision Making in Confirmatory Multipopulation Tailoring Trials Brian A. Millen, Ph.D. Acknowledgments Alex Dmitrienko

More information

Small-area estimation of mental illness prevalence for schools

Small-area estimation of mental illness prevalence for schools Small-area estimation of mental illness prevalence for schools Fan Li 1 Alan Zaslavsky 2 1 Department of Statistical Science Duke University 2 Department of Health Care Policy Harvard Medical School March

More information

Detection of Unknown Confounders. by Bayesian Confirmatory Factor Analysis

Detection of Unknown Confounders. by Bayesian Confirmatory Factor Analysis Advanced Studies in Medical Sciences, Vol. 1, 2013, no. 3, 143-156 HIKARI Ltd, www.m-hikari.com Detection of Unknown Confounders by Bayesian Confirmatory Factor Analysis Emil Kupek Department of Public

More information

A COMPARISON OF IMPUTATION METHODS FOR MISSING DATA IN A MULTI-CENTER RANDOMIZED CLINICAL TRIAL: THE IMPACT STUDY

A COMPARISON OF IMPUTATION METHODS FOR MISSING DATA IN A MULTI-CENTER RANDOMIZED CLINICAL TRIAL: THE IMPACT STUDY A COMPARISON OF IMPUTATION METHODS FOR MISSING DATA IN A MULTI-CENTER RANDOMIZED CLINICAL TRIAL: THE IMPACT STUDY Lingqi Tang 1, Thomas R. Belin 2, and Juwon Song 2 1 Center for Health Services Research,

More information

Sheila Barron Statistics Outreach Center 2/8/2011

Sheila Barron Statistics Outreach Center 2/8/2011 Sheila Barron Statistics Outreach Center 2/8/2011 What is Power? When conducting a research study using a statistical hypothesis test, power is the probability of getting statistical significance when

More information

Bayesian Statistics Estimation of a Single Mean and Variance MCMC Diagnostics and Missing Data

Bayesian Statistics Estimation of a Single Mean and Variance MCMC Diagnostics and Missing Data Bayesian Statistics Estimation of a Single Mean and Variance MCMC Diagnostics and Missing Data Michael Anderson, PhD Hélène Carabin, DVM, PhD Department of Biostatistics and Epidemiology The University

More information

MCAS Equating Research Report: An Investigation of FCIP-1, FCIP-2, and Stocking and. Lord Equating Methods 1,2

MCAS Equating Research Report: An Investigation of FCIP-1, FCIP-2, and Stocking and. Lord Equating Methods 1,2 MCAS Equating Research Report: An Investigation of FCIP-1, FCIP-2, and Stocking and Lord Equating Methods 1,2 Lisa A. Keller, Ronald K. Hambleton, Pauline Parker, Jenna Copella University of Massachusetts

More information

Introduction to diagnostic accuracy meta-analysis. Yemisi Takwoingi October 2015

Introduction to diagnostic accuracy meta-analysis. Yemisi Takwoingi October 2015 Introduction to diagnostic accuracy meta-analysis Yemisi Takwoingi October 2015 Learning objectives To appreciate the concept underlying DTA meta-analytic approaches To know the Moses-Littenberg SROC method

More information

For general queries, contact

For general queries, contact Much of the work in Bayesian econometrics has focused on showing the value of Bayesian methods for parametric models (see, for example, Geweke (2005), Koop (2003), Li and Tobias (2011), and Rossi, Allenby,

More information

Case Studies in Bayesian Augmented Control Design. Nathan Enas Ji Lin Eli Lilly and Company

Case Studies in Bayesian Augmented Control Design. Nathan Enas Ji Lin Eli Lilly and Company Case Studies in Bayesian Augmented Control Design Nathan Enas Ji Lin Eli Lilly and Company Outline Drivers for innovation in Phase II designs Case Study #1 Pancreatic cancer Study design Analysis Learning

More information

Numerical Integration of Bivariate Gaussian Distribution

Numerical Integration of Bivariate Gaussian Distribution Numerical Integration of Bivariate Gaussian Distribution S. H. Derakhshan and C. V. Deutsch The bivariate normal distribution arises in many geostatistical applications as most geostatistical techniques

More information

Small Group Presentations

Small Group Presentations Admin Assignment 1 due next Tuesday at 3pm in the Psychology course centre. Matrix Quiz during the first hour of next lecture. Assignment 2 due 13 May at 10am. I will upload and distribute these at the

More information

Journal of Political Economy, Vol. 93, No. 2 (Apr., 1985)

Journal of Political Economy, Vol. 93, No. 2 (Apr., 1985) Confirmations and Contradictions Journal of Political Economy, Vol. 93, No. 2 (Apr., 1985) Estimates of the Deterrent Effect of Capital Punishment: The Importance of the Researcher's Prior Beliefs Walter

More information

PSYCHOLOGY 300B (A01) One-sample t test. n = d = ρ 1 ρ 0 δ = d (n 1) d

PSYCHOLOGY 300B (A01) One-sample t test. n = d = ρ 1 ρ 0 δ = d (n 1) d PSYCHOLOGY 300B (A01) Assignment 3 January 4, 019 σ M = σ N z = M µ σ M d = M 1 M s p d = µ 1 µ 0 σ M = µ +σ M (z) Independent-samples t test One-sample t test n = δ δ = d n d d = µ 1 µ σ δ = d n n = δ

More information

Comprehensive evaluation of CRM, mtpi, and 3+3 relative to a benchmark

Comprehensive evaluation of CRM, mtpi, and 3+3 relative to a benchmark Comprehensive evaluation of CRM, mtpi, and 3+3 relative to a benchmark Bethany Jablonski Horton, Ph.D. University of Virginia April 16, 2015 Bethany Jablonski Horton, Ph.D. (UVA) Early Phase Dose Finding

More information

Objectives. Quantifying the quality of hypothesis tests. Type I and II errors. Power of a test. Cautions about significance tests

Objectives. Quantifying the quality of hypothesis tests. Type I and II errors. Power of a test. Cautions about significance tests Objectives Quantifying the quality of hypothesis tests Type I and II errors Power of a test Cautions about significance tests Designing Experiments based on power Evaluating a testing procedure The testing

More information

Outlier Analysis. Lijun Zhang

Outlier Analysis. Lijun Zhang Outlier Analysis Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Extreme Value Analysis Probabilistic Models Clustering for Outlier Detection Distance-Based Outlier Detection Density-Based

More information

Psychology Research Process

Psychology Research Process Psychology Research Process Logical Processes Induction Observation/Association/Using Correlation Trying to assess, through observation of a large group/sample, what is associated with what? Examples:

More information

Sawtooth Software. MaxDiff Analysis: Simple Counting, Individual-Level Logit, and HB RESEARCH PAPER SERIES. Bryan Orme, Sawtooth Software, Inc.

Sawtooth Software. MaxDiff Analysis: Simple Counting, Individual-Level Logit, and HB RESEARCH PAPER SERIES. Bryan Orme, Sawtooth Software, Inc. Sawtooth Software RESEARCH PAPER SERIES MaxDiff Analysis: Simple Counting, Individual-Level Logit, and HB Bryan Orme, Sawtooth Software, Inc. Copyright 009, Sawtooth Software, Inc. 530 W. Fir St. Sequim,

More information

Bayesian Estimation of a Meta-analysis model using Gibbs sampler

Bayesian Estimation of a Meta-analysis model using Gibbs sampler University of Wollongong Research Online Applied Statistics Education and Research Collaboration (ASEARC) - Conference Papers Faculty of Engineering and Information Sciences 2012 Bayesian Estimation of

More information

A Proposal for the Validation of Control Banding Using Bayesian Decision Analysis Techniques

A Proposal for the Validation of Control Banding Using Bayesian Decision Analysis Techniques A Proposal for the Validation of Control Banding Using Bayesian Decision Analysis Techniques Paul Hewett Exposure Assessment Solutions, Inc. Morgantown, WV John Mulhausen 3M Company St. Paul, MN Perry

More information

Type and quantity of data needed for an early estimate of transmissibility when an infectious disease emerges

Type and quantity of data needed for an early estimate of transmissibility when an infectious disease emerges Research articles Type and quantity of data needed for an early estimate of transmissibility when an infectious disease emerges N G Becker (Niels.Becker@anu.edu.au) 1, D Wang 1, M Clements 1 1. National

More information

Bayesian Mediation Analysis

Bayesian Mediation Analysis Psychological Methods 2009, Vol. 14, No. 4, 301 322 2009 American Psychological Association 1082-989X/09/$12.00 DOI: 10.1037/a0016972 Bayesian Mediation Analysis Ying Yuan The University of Texas M. D.

More information

This is a repository copy of Practical guide to sample size calculations: superiority trials.

This is a repository copy of Practical guide to sample size calculations: superiority trials. This is a repository copy of Practical guide to sample size calculations: superiority trials. White Rose Research Online URL for this paper: http://eprints.whiterose.ac.uk/97114/ Version: Accepted Version

More information

Outline of Part III. SISCR 2016, Module 7, Part III. SISCR Module 7 Part III: Comparing Two Risk Models

Outline of Part III. SISCR 2016, Module 7, Part III. SISCR Module 7 Part III: Comparing Two Risk Models SISCR Module 7 Part III: Comparing Two Risk Models Kathleen Kerr, Ph.D. Associate Professor Department of Biostatistics University of Washington Outline of Part III 1. How to compare two risk models 2.

More information

Sampling Weights, Model Misspecification and Informative Sampling: A Simulation Study

Sampling Weights, Model Misspecification and Informative Sampling: A Simulation Study Sampling Weights, Model Misspecification and Informative Sampling: A Simulation Study Marianne (Marnie) Bertolet Department of Statistics Carnegie Mellon University Abstract Linear mixed-effects (LME)

More information

Bayesian Models for Combining Data Across Subjects and Studies in Predictive fmri Data Analysis

Bayesian Models for Combining Data Across Subjects and Studies in Predictive fmri Data Analysis Bayesian Models for Combining Data Across Subjects and Studies in Predictive fmri Data Analysis Thesis Proposal Indrayana Rustandi April 3, 2007 Outline Motivation and Thesis Preliminary results: Hierarchical

More information

School of Population and Public Health SPPH 503 Epidemiologic methods II January to April 2019

School of Population and Public Health SPPH 503 Epidemiologic methods II January to April 2019 School of Population and Public Health SPPH 503 Epidemiologic methods II January to April 2019 Time: Tuesday, 1330 1630 Location: School of Population and Public Health, UBC Course description Students

More information

STATISTICAL METHODS FOR DIAGNOSTIC TESTING: AN ILLUSTRATION USING A NEW METHOD FOR CANCER DETECTION XIN SUN. PhD, Kansas State University, 2012

STATISTICAL METHODS FOR DIAGNOSTIC TESTING: AN ILLUSTRATION USING A NEW METHOD FOR CANCER DETECTION XIN SUN. PhD, Kansas State University, 2012 STATISTICAL METHODS FOR DIAGNOSTIC TESTING: AN ILLUSTRATION USING A NEW METHOD FOR CANCER DETECTION by XIN SUN PhD, Kansas State University, 2012 A THESIS Submitted in partial fulfillment of the requirements

More information

Section on Survey Research Methods JSM 2009

Section on Survey Research Methods JSM 2009 Missing Data and Complex Samples: The Impact of Listwise Deletion vs. Subpopulation Analysis on Statistical Bias and Hypothesis Test Results when Data are MCAR and MAR Bethany A. Bell, Jeffrey D. Kromrey

More information

Psychology Research Process

Psychology Research Process Psychology Research Process Logical Processes Induction Observation/Association/Using Correlation Trying to assess, through observation of a large group/sample, what is associated with what? Examples:

More information

Module Overview. What is a Marker? Part 1 Overview

Module Overview. What is a Marker? Part 1 Overview SISCR Module 7 Part I: Introduction Basic Concepts for Binary Classification Tools and Continuous Biomarkers Kathleen Kerr, Ph.D. Associate Professor Department of Biostatistics University of Washington

More information

SISCR Module 7 Part I: Introduction Basic Concepts for Binary Biomarkers (Classifiers) and Continuous Biomarkers

SISCR Module 7 Part I: Introduction Basic Concepts for Binary Biomarkers (Classifiers) and Continuous Biomarkers SISCR Module 7 Part I: Introduction Basic Concepts for Binary Biomarkers (Classifiers) and Continuous Biomarkers Kathleen Kerr, Ph.D. Associate Professor Department of Biostatistics University of Washington

More information

A Critique of Two Methods for Assessing the Nutrient Adequacy of Diets

A Critique of Two Methods for Assessing the Nutrient Adequacy of Diets CARD Working Papers CARD Reports and Working Papers 6-1991 A Critique of Two Methods for Assessing the Nutrient Adequacy of Diets Helen H. Jensen Iowa State University, hhjensen@iastate.edu Sarah M. Nusser

More information

Regression Discontinuity Designs: An Approach to Causal Inference Using Observational Data

Regression Discontinuity Designs: An Approach to Causal Inference Using Observational Data Regression Discontinuity Designs: An Approach to Causal Inference Using Observational Data Aidan O Keeffe Department of Statistical Science University College London 18th September 2014 Aidan O Keeffe

More information

Measurement Error in Nonlinear Models

Measurement Error in Nonlinear Models Measurement Error in Nonlinear Models R.J. CARROLL Professor of Statistics Texas A&M University, USA D. RUPPERT Professor of Operations Research and Industrial Engineering Cornell University, USA and L.A.

More information

Model calibration and Bayesian methods for probabilistic projections

Model calibration and Bayesian methods for probabilistic projections ETH Zurich Reto Knutti Model calibration and Bayesian methods for probabilistic projections Reto Knutti, IAC ETH Toy model Model: obs = linear trend + noise(variance, spectrum) 1) Short term predictability,

More information

Assessing the diagnostic accuracy of a sequence of tests

Assessing the diagnostic accuracy of a sequence of tests Biostatistics (2003), 4, 3,pp. 341 351 Printed in Great Britain Assessing the diagnostic accuracy of a sequence of tests MARY LOU THOMPSON Department of Biostatistics, Box 357232, University of Washington,

More information

Binary Diagnostic Tests Two Independent Samples

Binary Diagnostic Tests Two Independent Samples Chapter 537 Binary Diagnostic Tests Two Independent Samples Introduction An important task in diagnostic medicine is to measure the accuracy of two diagnostic tests. This can be done by comparing summary

More information