Introduction. We can make a prediction about Y i based on X i by setting a threshold value T, and predicting Y i = 1 when X i > T.


Ethelbert Wilcox
5 years ago

Diagnostic Tests

Introduction

Suppose we have a quantitative measurement $X_i$ on experimental or observed units $i = 1, \ldots, n$, and a characteristic $Y_i = 0$ or $Y_i = 1$ (e.g. case/control status). The measurement $X_i$ is thought to be related to the characteristic $Y_i$ in the sense that units with higher $X_i$ values are more likely to have $Y_i = 1$.

We can make a prediction about $Y_i$ based on $X_i$ by setting a threshold value $T$, and predicting $Y_i = 1$ when $X_i > T$. This is called a diagnostic test.

Applications of diagnostic testing

Cancer detection: The amount or concentration of a protein $X_i$ in serum obtained from person $i$ may be used to predict whether the person has a particular form of cancer.

Credit scoring: A person's credit score at the time that he or she receives a loan may be used to predict whether the loan is repaid on time.

Labeling conventions

The labeling of outcome categories as 1 or 0 is arbitrary in principle. For example, we could label cancer as 1 and non-cancer as 0, or vice versa. But in practice, label 1 is typically used for the rarer category, or the category that would require some action or intervention. Label 0 usually denotes a default category that requires no action.

Depending on the situation, it may be that either larger values of $X$ or smaller values of $X$ are associated with higher probabilities that $Y_i = 1$. In the latter case we can work with $-X_i$, or use prediction rules of the form $X_i < T$ rather than $X_i > T$.

Diagnostic testing terminology

A diagnostic test is a balance between two types of successful predictions and two types of errors.

Successful predictions:
True positive: a situation in which $X_i > T$ and $Y_i = 1$, for example when a person with cancer is predicted to have cancer.
True negative: a situation in which $X_i \le T$ and $Y_i = 0$, for example when a cancer-free person is predicted to be cancer-free.

Errors:
False positive: a situation in which $X_i > T$ but $Y_i = 0$, for example when a person is predicted to have cancer but actually does not.
False negative: a situation in which $X_i \le T$ but $Y_i = 1$, for example when a person is predicted to be cancer-free but actually has cancer.

Marginal categories

The actual status of a unit is positive or negative:
Positive: everyone with $Y_i = 1$ (all true positives and false negatives). The proportion of positives is often called the prevalence.
Negative: everyone with $Y_i = 0$ (all false positives and true negatives).

The predicted status of a unit is called positive or called negative:
Called positive: everyone with $X_i > T$ (all true positives and false positives).
Called negative: everyone with $X_i \le T$ (all true negatives and false negatives).

The relationships among all these terms are summarized as follows:

                 Called positive    Called negative
Positive         True positive      False negative
Negative         False positive     True negative

Sensitivity and specificity

A common way to evaluate a diagnostic test is in terms of sensitivity and specificity.

Sensitivity: the proportion of positive units that are called positive; the population value is $P(X_i > T \mid Y_i = 1)$.
Specificity: the proportion of negative units that are called negative; the population value is $P(X_i \le T \mid Y_i = 0)$.

Since sensitivity and specificity are calculated conditionally on case/control status ($Y_i$), they can be estimated using either a population sample or a case/control sample.

1 - specificity is called the false positive rate (FPR).
1 - sensitivity is called the false negative rate (FNR).
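As a minimal sketch of how these two proportions could be computed from data, the R code below applies the rule "call positive when X exceeds the threshold" and tabulates the result against Y. The helper name `diag_summary`, the argument name `thresh`, and the small data set are made up for illustration and do not come from the slides.

```r
# Hypothetical helper: classify units with the rule x > thresh and
# compare the calls with the true status y (0/1).
diag_summary <- function(x, y, thresh) {
  called <- as.integer(x > thresh)     # 1 = called positive, 0 = called negative
  tp <- sum(called == 1 & y == 1)      # true positives
  fn <- sum(called == 0 & y == 1)      # false negatives
  fp <- sum(called == 1 & y == 0)      # false positives
  tn <- sum(called == 0 & y == 0)      # true negatives
  c(sensitivity = tp / (tp + fn), specificity = tn / (tn + fp))
}

# Made-up illustrative data (an assumption, not from the slides).
x <- c(0.2, 1.5, 2.1, 0.7, 3.0, 1.1, 0.4, 2.6)
y <- c(0,   0,   1,   0,   1,   1,   0,   1)
res <- diag_summary(x, y, thresh = 1.0)
print(res)
```

Because each proportion is computed within one status group, the same function applies to a population sample or to a case/control sample.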

Example: Suppose we have a biomarker $X_i$ for colon cancer such that 75% of people with colon cancer have $X_i > T$ and 5% of people without colon cancer have $X_i > T$. Thus the sensitivity is 75% and the specificity is 100% - 5% = 95%. We then screen 1000 people from a population with 15% colon cancer prevalence, so that we expect 150 positives and 850 negatives. We should expect the following results:

                 Called positive         Called negative
Positive         0.75 × 150 = 112.5      0.25 × 150 = 37.5
Negative         0.05 × 850 = 42.5       0.95 × 850 = 807.5

The overall error rate is 80/1000 = 8%, and there is a rough balance between false positives and false negatives. Most of the people who have colon cancer are detected.

Example: Now suppose we are screening for pancreatic cancer with a prevalence of 0.5% using tests with the same sensitivity and specificity. Among 1000 people we expect 5 positives and 995 negatives, so we expect to get:

                 Called positive         Called negative
Positive         0.75 × 5 = 3.75         0.25 × 5 = 1.25
Negative         0.05 × 995 = 49.75      0.95 × 995 = 945.25

The overall error rate improves to 51/1000, or about 5%. The errors overwhelmingly consist of cancer-free false positives. Note that we could get an error rate of 0.5% by predicting everybody to be cancer-free.
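The expected counts in the two screening examples follow from the prevalence, sensitivity, and specificity alone, so they can be reproduced with a few lines of R. This is only a sketch; the function names `expected_table` and `error_rate` are hypothetical.

```r
# Hypothetical helper: expected 2x2 table when screening n people.
expected_table <- function(n, prevalence, sensitivity, specificity) {
  n_pos <- n * prevalence          # expected number with Y = 1
  n_neg <- n * (1 - prevalence)    # expected number with Y = 0
  matrix(c(sensitivity * n_pos,       (1 - sensitivity) * n_pos,
           (1 - specificity) * n_neg, specificity * n_neg),
         nrow = 2, byrow = TRUE,
         dimnames = list(c("Positive", "Negative"),
                         c("Called positive", "Called negative")))
}

# Overall error rate: (false negatives + false positives) / total.
error_rate <- function(tab) {
  (tab["Positive", "Called negative"] + tab["Negative", "Called positive"]) / sum(tab)
}

colon    <- expected_table(1000, prevalence = 0.15,  sensitivity = 0.75, specificity = 0.95)
pancreas <- expected_table(1000, prevalence = 0.005, sensitivity = 0.75, specificity = 0.95)
print(colon);    print(error_rate(colon))
print(pancreas); print(error_rate(pancreas))
```

Changing only the prevalence argument shows how the same test yields a very different mix of errors.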

Sensitivity and specificity for normal populations

Suppose that $X \mid Y = 0$ is normal with mean $\mu_0$ and standard deviation $\sigma_0$, and $X \mid Y = 1$ is normal with mean $\mu_1$ and standard deviation $\sigma_1$. Then

$$
\begin{aligned}
\text{Sensitivity} &= P(X > T \mid Y = 1) \\
&= P\big((X - \mu_1)/\sigma_1 > (T - \mu_1)/\sigma_1 \mid Y = 1\big) \\
&= P\big(Z > (T - \mu_1)/\sigma_1\big) \\
&= 1 - P\big(Z \le (T - \mu_1)/\sigma_1\big),
\end{aligned}
$$

where $Z$ is standard normal. $P(Z \le \cdot)$ can be obtained from a normal probability table.

Exercise: Derive a similar formula for specificity.
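In R, the standard normal probability $P(Z \le z)$ is available as `pnorm(z)`, so the sensitivity formula can be evaluated without a table. The function name `normal_sensitivity` and the parameter values below are illustrative assumptions.

```r
# Sensitivity under a normal model for X | Y = 1:
# 1 - P(Z <= (thresh - mu1) / sigma1).
normal_sensitivity <- function(thresh, mu1, sigma1) {
  1 - pnorm((thresh - mu1) / sigma1)
}

# Hypothetical values: threshold placed exactly at the mean of the Y = 1 group.
s_at_mean <- normal_sensitivity(thresh = 1, mu1 = 1, sigma1 = 1)
print(s_at_mean)

# The same quantity using pnorm's own mean/sd arguments, as a cross-check.
s_direct <- pnorm(0.5, mean = 1, sd = 2, lower.tail = FALSE)
print(c(normal_sensitivity(0.5, 1, 2), s_direct))
```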

Positive and negative predictive values

Another way to evaluate a diagnostic test is based on the positive and negative predictive values.

Positive predictive value (PPV): the proportion of units called positive that are positive; the population value is $P(Y_i = 1 \mid X_i > T)$.
Negative predictive value (NPV): the proportion of units called negative that are negative; the population value is $P(Y_i = 0 \mid X_i \le T)$.

1 - PPV is called the false discovery rate: the proportion of called positives that are negative.

Relationships between sensitivity, specificity, positive predictive value, and negative predictive value

If we know the prevalence, we can use Bayes' theorem to convert between sensitivity/specificity and positive/negative predictive values. For example:

$$P(Y_i = 1 \mid X_i > T) = P(X_i > T \mid Y_i = 1)\, P(Y_i = 1) / P(X_i > T)$$

PPV = sensitivity × prevalence / P(positive call)

Exercise: Derive a similar relationship for NPV.

Note: If prevalence / P(positive call) is approximately 1, then the PPV and sensitivity are similar.

Note: PPV depends on prevalence, so it cannot be estimated from a case/control sample unless we have an independent estimate of the prevalence.

Example: The probability of being called positive in the colon cancer example above is 0.75 × 0.15 + 0.05 × 0.85 = 0.155. Thus the positive predictive value is 0.1125/0.155 ≈ 0.73.

Exercise: show that the negative predictive value for the colon cancer example is approximately 0.96.

Example: For the pancreatic cancer example the probability of being called positive is 0.75 × 0.005 + 0.05 × 0.995 = 0.0535, or about 0.05, so the positive predictive value is 0.00375/0.0535 ≈ 0.07.

Exercise: show that the negative predictive value for the pancreatic cancer example is approximately 0.999.

Note that pancreatic cancer screening looks easier than colon cancer screening based on overall error rate (5% versus 8%), but the PPV reveals that the pancreatic cancer test produces a high fraction of false positives.
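The Bayes' theorem conversion can be written as a short R function that takes prevalence, sensitivity, and specificity and returns the predictive values. This is a sketch under the assumption that all three inputs are known exactly; the name `predictive_values` is hypothetical.

```r
# Hypothetical helper: convert (prevalence, sensitivity, specificity)
# into P(positive call), PPV, and NPV via Bayes' theorem.
predictive_values <- function(prevalence, sensitivity, specificity) {
  p_call_pos <- sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
  ppv <- sensitivity * prevalence / p_call_pos
  npv <- specificity * (1 - prevalence) / (1 - p_call_pos)
  c(p_call_pos = p_call_pos, ppv = ppv, npv = npv)
}

colon_pv    <- predictive_values(0.15,  0.75, 0.95)   # colon cancer example
pancreas_pv <- predictive_values(0.005, 0.75, 0.95)   # pancreatic cancer example
print(round(rbind(colon = colon_pv, pancreas = pancreas_pv), 4))
```

Holding sensitivity and specificity fixed and lowering the prevalence drives the PPV down sharply, which is the contrast the two examples illustrate.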

Which cancer is truly easier to detect? It depends on the follow-up:

Suppose that for colon cancer there is a secondary test that can quickly and safely differentiate the 113 true positives from the 43 false positives, and there is a treatment that substantially helps 50% of people whose colon cancer is detected at screening. Then the 43 false positives only need to go through the inconvenience and stress of a secondary test, and half of the 113 true positives have substantially improved outcomes.

Suppose that for pancreatic cancer the only way to confirm the disease is by an invasive procedure that has a 10% rate of serious complications, and therapy only improves the outcome for 20% of people with the disease. Then about 5 healthy people (10% of the roughly 50 false positives) are expected to suffer serious complications in order to identify about 4 people with pancreatic cancer, of whom fewer than one on average will benefit from treatment.

Note: the numbers used for the colon and pancreatic cancer examples are made up, but are roughly realistic.

ROC curves

Suppose we want to evaluate how much information a measurement $X_i$ contains about a characteristic $Y_i$, but we don't yet want to fix a specific threshold value $T$. A graphical approach is to plot sensitivity on the vertical axis against 1 - specificity on the horizontal axis for all possible values of $T$.

[Figure: two panels plotting sensitivity against 1 - specificity, with curves labeled Red, Blue, and Green.]

The following facts constrain a plot of sensitivity against 1 - specificity:

As $T$ decreases, the sensitivity is non-decreasing.
As $T$ decreases, the specificity is non-increasing, so 1 - specificity is non-decreasing.
When $T = +\infty$, the sensitivity and 1 - specificity are both 0.
When $T = -\infty$, the sensitivity and 1 - specificity are both 1.

ROC curves

A plot of sensitivity against 1 - specificity is called a receiver operating characteristic curve, or ROC curve. Due to the constraints discussed above, a ROC curve is a non-decreasing path from (0, 0) to (1, 1).
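A sample ROC curve can be traced by sweeping the threshold from $+\infty$ down to $-\infty$ and recording the sample sensitivity and 1 - specificity at each value. The sketch below assumes the rule "call positive when X exceeds the threshold"; the function name `sample_roc` and the simulated data are illustrative only.

```r
# Hypothetical helper: sample ROC points for the rule x > threshold.
sample_roc <- function(x, y) {
  # Candidate thresholds, from "call nobody positive" to "call everybody positive".
  thresholds <- c(Inf, sort(unique(x), decreasing = TRUE), -Inf)
  sens <- sapply(thresholds, function(t) mean(x[y == 1] > t))  # sensitivity
  fpr  <- sapply(thresholds, function(t) mean(x[y == 0] > t))  # 1 - specificity
  data.frame(threshold = thresholds, fpr = fpr, sensitivity = sens)
}

# Simulated data (an assumption): X | Y = 0 ~ N(0, 1), X | Y = 1 ~ N(1, 1).
set.seed(4)
y <- rep(c(0, 1), each = 50)
x <- rnorm(100, mean = y, sd = 1)
roc <- sample_roc(x, y)
print(head(roc))

# To draw the curve:
# plot(roc$fpr, roc$sensitivity, type = "s",
#      xlab = "1 - specificity", ylab = "Sensitivity")
```

The first row corresponds to the point (0, 0) and the last row to (1, 1), with both coordinates non-decreasing in between.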

Reading and interpreting ROC curves

If $X$ contains no information about $Y$, the sensitivity is $P(X > T \mid Y = 1) = P(X > T)$, and the specificity is $P(X \le T \mid Y = 0) = P(X \le T)$. Therefore 1 - specificity is $P(X > T)$, so the ROC curve is a plot of $P(X > T)$ against $P(X > T)$: a diagonal line from (0, 0) to (1, 1). Note that in this case sensitivity = 1 - specificity, or sensitivity + specificity = 1.

If $X$ is perfectly informative about $Y$, then there exists a point $T$ such that $P(X > T \mid Y = 1) = 1$ and $P(X \le T \mid Y = 0) = 1$. We can always determine the value of $Y$ based on whether $X$ is greater than or less than $T$. In this case the ROC curve is a path from (0, 0) to (0, 1) to (1, 1).

If $X$ is partially informative about $Y$, then for at least some values of $T$, sensitivity + specificity > 1, so the ROC curve is sometimes above the diagonal. The more it lies above the diagonal, the better.

If the ROC curve is usually or always below the diagonal, the relationship between $X$ and $Y$ is inverted, and we should be using $-X$ rather than $X$ to form our predictions.

Graphs of population ROC curves

The following plots show population ROC curves (right side) together with the population densities (left side) of $X$ values in the $Y = 0$ group (orange) and in the $Y = 1$ group (blue).

[Figures: for each scenario, a density plot of X (left) paired with the population ROC curve of sensitivity against 1 - specificity (right); the AUC labels that survive are 0.56 and 0.92.]

Area under the curve (AUC)

The ROC curve always lies in the unit box $[0, 1] \times [0, 1]$.

In the most favorable situation for prediction, the ROC curve consists of the left and top edges of the box, so the area under the ROC curve is 1 (the area of the whole box).

In the least favorable situation for prediction, $X$ and $Y$ are independent, the ROC curve follows the diagonal from (0, 0) to (1, 1), and the area under the ROC curve is 1/2.

In general, the area under the ROC curve (AUC) can be used as an overall measure of the information in $X$ about $Y$. The AUC can fall anywhere between 0 and 1, but if the correct orientation of $X$ is known the AUC will fall between 1/2 and 1. Higher AUC values correspond to a greater amount of information in $X$ about $Y$.

Sampling interpretation of the AUC

Suppose a positive unit ($Y_i = 1$) and a negative unit ($Y_j = 0$) are selected at random. The population AUC is the probability that $X_i > X_j$.

The sample AUC is also known as the Mann-Whitney statistic, and can be equivalently calculated as

$$\frac{\sum_{i,j} I(X_i > X_j,\ Y_i = 1,\ Y_j = 0)}{\sum_{i,j} I(Y_i = 1,\ Y_j = 0)}.$$

The AUC can be calculated in R as follows, where x1 and x0 hold the X values for the Y = 1 and Y = 0 groups:

    wilcox.test(x1,x0)$statistic/(length(x1)*length(x0))
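The pairwise formula and the `wilcox.test` shortcut can be compared directly. The sketch below uses simulated continuous data, so there are no ties; the function names `auc_pairs` and `auc_wilcox` and the data are assumptions made for illustration.

```r
# Sample AUC as the fraction of (positive, negative) pairs with x1 > x0.
auc_pairs <- function(x1, x0) {
  mean(outer(x1, x0, ">"))
}

# Sample AUC from the Mann-Whitney statistic reported by wilcox.test.
auc_wilcox <- function(x1, x0) {
  unname(wilcox.test(x1, x0)$statistic) / (length(x1) * length(x0))
}

# Simulated data (an assumption): 30 units per group.
set.seed(1)
x0 <- rnorm(30, mean = 0, sd = 1)   # Y = 0 group
x1 <- rnorm(30, mean = 1, sd = 1)   # Y = 1 group
a_pairs  <- auc_pairs(x1, x0)
a_wilcox <- auc_wilcox(x1, x0)
print(c(pairs = a_pairs, wilcox = a_wilcox))
```

With tied values the two calculations can differ, because `wilcox.test` counts each tied pair as one half while the strict inequality counts it as zero.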

Inference

Sensitivity, specificity, PPV, and NPV are proportions. For example, the population sensitivity is

$$p = P(X > T \mid Y = 1),$$

which we estimate as

$$\hat p = \sum_i I(X_i > T)\, Y_i \Big/ \sum_i Y_i.$$

If $\sum_i Y_i$ is fixed (as in a case/control study), $\hat p$ is a simple average. In this case, it is unbiased and has variance

$$\mathrm{var}(\hat p) = p(1 - p)/n_1, \quad \text{where } n_1 = \sum_i Y_i.$$

If the data are from a random sample of size $N$, then $n_1$ is random. In this case, the estimate of sensitivity is still unbiased, but the variance is larger than in a case/control study. The conditional variance is:

$$\mathrm{var}(\hat p \mid n_1) = p(1 - p)/n_1$$

Using the law of total variance, we get

$$\mathrm{var}(\hat p) = \mathrm{var}\, E(\hat p \mid n_1) + E\, \mathrm{var}(\hat p \mid n_1) = 0 + p(1 - p) \sum_{n=1}^{N} P(n_1 = n)/n.$$

Since $n_1$ is the number of cases out of a total sample size of $N$, and each sampled unit has a fixed probability $q$ of being a case, $n_1$ has a binomial distribution

$$P(n_1 = n) = \binom{N}{n} q^n (1 - q)^{N - n}.$$

How does $\sum_{n=1}^{N} P(n_1 = n)/n$ relate to $1/n_1$? This tells us about the efficiency of a case/control study compared to a random population sample for estimating the sensitivity.

These plots compare standard errors for the two types of sampling when the total sample size is 50.

[Figure: left panel, standard error against P(Y=1) for case/control and population sampling; right panel, difference in standard errors against P(Y=1).]

The difference could be important if P(Y = 1) < 0.2 or so.
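The comparison can be sketched numerically with the binomial probabilities from `dbinom`. Two assumptions are made here that the slides do not state: the sensitivity is taken to be p = 0.75, and the case/control design is taken to fix the number of cases at its expected value N × q.

```r
N <- 50                          # total sample size
p <- 0.75                        # assumed sensitivity (hypothetical value)
q <- seq(0.1, 0.9, by = 0.1)     # P(Y = 1)
n <- 1:N                         # possible numbers of cases (n1 >= 1)

# sum_{n=1}^{N} P(n1 = n) / n under the binomial model, for each q.
mean_inv_n1 <- sapply(q, function(qq) sum(dbinom(n, size = N, prob = qq) / n))

se_population   <- sqrt(p * (1 - p) * mean_inv_n1)   # random population sample
se_case_control <- sqrt(p * (1 - p) / (N * q))       # n1 fixed at N * q (assumption)

print(round(cbind(q, se_case_control, se_population,
                  difference = se_population - se_case_control), 4))
```

On this grid the population-sample standard error is never smaller than the case/control one, and the gap is widest at the smallest values of q.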

Sample ROC curves for various sample sizes

ROC curves based on data fluctuate around their mean value and become more accurate as the sample size increases. The following plots show sample ROC curves when $X \mid Y = 0$ is standard normal and $X \mid Y = 1$ is normal with mean 1 and variance 1.

[Figure: four panels of sample ROC curves (sensitivity against 1 - specificity), for sample sizes 25, 50, 100, and 200 per group.]

Inference for the AUC

For most statistics, the standard error of the statistic based on a sample of size $N$ approximately has the form

$$\mathrm{SE} \approx c/\sqrt{N}.$$

When this holds, we can form a log/log plot of SE against sample size and the slope will be $-1/2$:

$$\log \mathrm{SE} \approx \log(c) - \log(N)/2.$$
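The log/log diagnostic can be tried on a statistic whose behavior is known. The sketch below, which is not from the slides, simulates the standard error of the sample mean of standard normal data at several sample sizes and fits a line on the log2 scale; the sample sizes, replication count, and seed are arbitrary choices.

```r
set.seed(3)
sizes <- 2^(4:9)     # sample sizes N = 16, 32, ..., 512
nrep  <- 2000        # simulated data sets per sample size

# Simulated standard error of the sample mean at each sample size.
se <- sapply(sizes, function(N) sd(replicate(nrep, mean(rnorm(N)))))

# Linear fit on the log/log scale; the slope estimates the exponent of N.
fit   <- lm(log2(se) ~ log2(sizes))
slope <- unname(coef(fit)[2])
print(slope)
```

The fitted slope should be close to -1/2 for the sample mean, which gives a baseline against which the slope for another statistic can be compared.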

Here is the plot for AUC:

[Figure: log2 SE(AUC) against log2 total sample size, showing the standard errors and a linear fit in grey.]

The slope of the grey line is -1.16, so this is not a typical statistic. It appears that $\mathrm{SE} \approx c/N$ for the AUC.

There are complicated analytic expressions for the standard error of the AUC, but the bootstrap is also a good approach.

Generalization of the threshold value

A common way to set the threshold value $T$ is to specify a lower bound $\kappa$ on specificity, and set $T$ to the lowest value such that the sample specificity in the training set is greater than $\kappa$.

What is the distribution of threshold values associated with this procedure?

What is the distribution of population specificity values associated with this procedure?
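Both questions can be explored by simulation. The sketch below assumes that $X \mid Y = 0$ is standard normal, treats units with X at or below the threshold as called negative, and accepts a threshold once the sample specificity is at least $\kappa$ (a slight relaxation of "greater than"). The function name `choose_threshold` and the simulation settings are hypothetical.

```r
# Hypothetical helper: lowest observed value of x0 (the Y = 0 training data)
# whose sample specificity, mean(x0 <= t), is at least kappa.
choose_threshold <- function(x0, kappa) {
  candidates <- sort(unique(x0))
  spec <- sapply(candidates, function(t) mean(x0 <= t))
  candidates[which(spec >= kappa)[1]]
}

set.seed(2)
n     <- 50      # training sample size in the Y = 0 group
kappa <- 0.9     # lower bound on specificity
nsim  <- 1000    # number of simulated training sets

# Distribution of the selected threshold over repeated training sets.
thresholds <- replicate(nsim, choose_threshold(rnorm(n), kappa))

# Population specificity P(X <= T | Y = 0) achieved by each selected threshold.
pop_spec <- pnorm(thresholds)

print(summary(thresholds))
print(summary(pop_spec))
print(mean(pop_spec < kappa))   # how often the population specificity falls short
```

Histograms of `thresholds` and `pop_spec` give simulated versions of the two distributions asked about above.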

Distributions of threshold values $T$ when $\kappa = 0.9$, $X \mid Y = 0$ is standard normal and $X \mid Y = 1$ is normal with mean $\mu_1$ and variance 1. Both groups have sample size $n$.

[Figure: four density plots of the threshold, with panels for n = 25, n = 50, n = 50, and n = 100, each labeled with its value of $\mu_1$.]

Population specificities corresponding to the distributions of threshold values on the previous slide.

[Figure: four density plots of the population specificity, with panels for n = 25, n = 50, n = 50, and n = 100, each labeled with its value of $\mu_1$.]

Parametric bootstrap for ROC analysis

Suppose a diagnostic test is available that has an AUC of 0.8. Someone has developed a new test, and wants to show that it is superior to the gold standard.

The following code outlines the parametric bootstrap using normal models for the $X \mid Y = 0$ and $X \mid Y = 1$ populations.

    ## Estimate means and standard deviations for the X | Y=0 and X | Y=1
    ## populations from the observed data x0 and x1.
    m0 = mean(x0)
    s0 = sd(x0)
    m1 = mean(x1)
    s1 = sd(x1)

    nboot = 1000          ## The number of bootstrap samples to use.
    auc = rep(0, nboot)

    for (k in 1:nboot) {
        ## Generate a bootstrap data set (kept separate from the observed data).
        xb0 = rnorm(length(x0), mean=m0, sd=s0)
        xb1 = rnorm(length(x1), mean=m1, sd=s1)

        ## AUC of the bootstrap data set.
        auc[k] = wilcox.test(xb1,xb0)$statistic/(length(xb1)*length(xb0))
    }

    auc = sort(auc)
    lb = auc[0.025*nboot]    ## The lower bound of the CI.
    ub = auc[0.975*nboot]    ## The upper bound of the CI.

The following plots show the observed ROC curve in red, along with 10 ROC curves from parametric bootstrap samples in grey.

[Figure: two panels of sensitivity against 1 - specificity; left, sample size 25 (per group); right, sample size 50 (per group).]

         µ1(σ1)   µ0(σ0)   n    µ̂1(σ̂1)   µ̂0(σ̂0)        AUC    95% CI
Left     1(1)     0(1)     25   (0.97)    -0.09(0.85)   0.80   (0.66, 0.92)
Right    1(1)     0(1)     50   (0.95)    -0.08(0.99)   0.82   (0.75, 0.90)

The bootstrap CI is based on 1000 samples. Based on the CIs, we cannot be confident that these tests are better than the existing test with an AUC of 0.8.

Power and sample size analysis for ROC curves

Suppose we intend to evaluate a diagnostic test based on its AUC, using the parametric bootstrap to construct a 95% confidence interval for the population AUC.

For power analysis purposes, suppose we wish to identify the smallest sample size that gives us 80% power to conclude that the population AUC is greater than 0.6, when the data follow a standard normal population when $Y = 0$, and a normal population with mean 1 and standard deviation 1 when $Y = 1$.

The R code on the following slide uses simulation to estimate the power for a range of sample sizes.

    nrep = 100      ## Number of simulated data sets per sample size (small, to save time).
    nboot = 1000    ## Number of bootstrap samples per data set.
    R = NULL        ## Storage for the results.

    ## Loop over possible sample sizes.
    for (n in c(30,40,50)) {
        m = 0
        for (r in 1:nrep) {
            x1 = rnorm(n, mean=1, sd=1)    ## Actual data.
            x0 = rnorm(n, mean=0, sd=1)    ## Actual data.

            m1 = mean(x1); s1 = sd(x1)     ## Calculate parameters to use for
            m0 = mean(x0); s0 = sd(x0)     ## the parametric bootstrap.

            auc = rep(0, nboot)            ## Calculate AUC values for bootstrap sets.
            for (k in 1:nboot) {
                xb1 = rnorm(n, mean=m1, sd=s1)    ## Bootstrap data.
                xb0 = rnorm(n, mean=m0, sd=s0)    ## Bootstrap data.
                auc[k] = wilcox.test(xb1,xb0)$statistic/(n*n)
            }

            auc = sort(auc)
            ## Check if the lower bound of the interval exceeds 0.6.
            if (auc[0.025*nboot] > 0.6) { m = m+1 }
        }
        R = rbind(R, c(n, m/nrep))    ## Record the result.
    }

Using nrep=100 and nboot=1000 I got the following power results:

[Table: columns n and power, for n = 30, 40, 50.]

Based on these results it seems that a sample size between 40 and 50 per group should be used. Further simulation can pin this down to a single number.

Note: The population AUC for this simulation study is $P(Z \le 1/\sqrt{2}) \approx 0.76$.

Note: We have $\sigma_0 = \sigma_1$ for this analysis, but if there is reason to believe that $\sigma_0 \ne \sigma_1$, additional simulations should be run.

Note: These results are from a simulation with a small nrep value. For a real power analysis a larger nrep should be used.
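For two independent normal populations, the population AUC is $P(X_1 > X_0)$, and since $X_1 - X_0$ is normal with mean $\mu_1 - \mu_0$ and variance $\sigma_0^2 + \sigma_1^2$, it has a closed form. The sketch below evaluates it for the simulation settings used here; the function name `normal_auc` is hypothetical.

```r
# Population AUC when X | Y=0 ~ N(mu0, sigma0^2) and X | Y=1 ~ N(mu1, sigma1^2),
# with the two draws independent: P(X1 > X0) = pnorm((mu1 - mu0) / sqrt(sigma0^2 + sigma1^2)).
normal_auc <- function(mu0, sigma0, mu1, sigma1) {
  pnorm((mu1 - mu0) / sqrt(sigma0^2 + sigma1^2))
}

auc_sim   <- normal_auc(mu0 = 0, sigma0 = 1, mu1 = 1, sigma1 = 1)   # power-analysis setting
auc_equal <- normal_auc(mu0 = 0, sigma0 = 1, mu1 = 0, sigma1 = 1)   # identical populations
print(c(simulation_setting = auc_sim, no_information = auc_equal))
```

The same function can be used to check what population AUC a proposed combination of means and standard deviations implies before running further power simulations.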

Example: PSA and prostate cancer
